Biopolymers

Biopolymer Residues

Living organisms manufacture an assortment of biopolymers: proteins are polymers of amino-acid residues, DNA and RNA are polymers of nucleotide bases, and polysaccharides such as cellulose and starch are polymers of sugars. In a description of these often very large molecules, it is useful to include with each atom information about the monomer it resides in, the chain the monomer is in, the model the chain is in, etc. OEChem‘s OEResidue is a container for just such information, and more.

Properties of OEResidues
Property Type Get Method Set Method
Residue Name char* GetName SetName
Residue Number int GetResidueNumber SetResidueNumber
Insertion Code char GetInsertCode SetInsertCode
Chain ID char GetChainID SetChainID
NMR Model Number int GetModelNumber SetModelNumber
Fragment Number int GetFragmentNumber SetFragmentNumber
Secondary Structure int GetSecondaryStructure SetSecondaryStructure
Heterogen bool IsHetAtom SetHetAtom
Atom Serial Number int GetSerialNumber SetSerialNumber
Alternate Location Code char GetAlternateLocation SetAlternateLocation
Occupancy float GetOccupancy SetOccupancy
Temperature Factor float GetBFactor SetBFactor
../_images/pdb-atom-records.png

PDB format atom records

Note that OEResidue is actually a part of OEChem, it is included in this OEBio chapter because it underlies much of OEBio’s functionality. Table: Properties of OEResidues lists each property along with the Get and Set methods for each. Figure: PDB format atom records shows how these properties are encoded for a structure in PDB format.

The residue number, insertion code, chain ID, model number and fragment number properties describe a hierarchy of structural elements, but do not impose any explicit hierarchy. Because the biological world is inherently messy, the flexibility of a flat representation such as this is very useful. For example, because it does not require a chain to be either above or below a fragment, it can describe both a chain with gaps, composed of multiple covalently bonded fragments and one that is covalently bonded to another chain. And, although residues are in a given sequence order, it does not require the polymer to be linear and can easily handle a glycosylated protein where a chain may have a tree-like, rather than linear, structure.

On the other hand, for some tasks a hierarchical view of biopolymers is more natural and more efficient, despite the limitations of a hierarchy. For more information, see the section A Hierarchy View.

Note that although infrequently used, the insertion code modifies the residue number. When comparing residue numbers it is necessary to compare both the residue number and the insertion code. The best way to avoid errors when comparing OEResidues is to use OESameResidue.

Listing 1: Accessing OEResidue properties

/**********************************************************************
  Copyright (C) 2015 OpenEye Scientific Software, Inc.
***********************************************************************/
//simple residue count
#include <openeye.h>
#include <oesystem.h>
#include <oechem.h>

using namespace OESystem;
using namespace OEChem;

void ResCount(OEMolBase &mol)
{
  if (! OEHasResidues(mol))
    OEPerceiveResidues(mol, OEPreserveResInfo::All);

  int resCt = 0;
  int hetCt = 0;
  OEResidue prevRes;
  for (OEIter<OEAtomBase> atom = mol.GetAtoms(); atom; ++atom)
  {
    OEResidue thisRes = OEAtomGetResidue(atom);
    if (! OESameResidue(prevRes, thisRes))
    {
      ++resCt;
      if (thisRes.IsHetAtom())
      {
        ++hetCt;
      }
      prevRes = thisRes;
    }
  }
  std::cout << "Molecule: " << mol.GetTitle()    << std::endl;
  std::cout << "Residues: " << resCt
                    << " (" << hetCt << " hets)" << std::endl;
}

int main(int argc, char *argv[])
{
  if(argc != 2)
    OEThrow.Usage("%s <mol-infile>", argv[0]);

  oemolistream ims;
  if(! ims.open(argv[1]))
    OEThrow.Fatal("Unable to open %s for reading.",argv[1]);

  OEGraphMol mol;
  while(OEReadMolecule(ims, mol))
  {
    ResCount(mol);
  }
  return 0;
}

It is important to keep in mind that OEResidues are not residues, but instead are information blocks associated with atoms. Several of the properties of an OEResidue such as atom serial number and occupancy are not residue properties at all, but instead are properties of a specific atom. Changing an OEResidue property will not modify anything in the molecule until the OEResidue is saved back into the molecule, and this must be done for each atom for which the property is to be changed.

Listing 2: Updating OEResidue properties

for (OEIter<OEAtomBase> atom = mol.GetAtoms(); atom; ++atom)
{
  OEResidue thisRes = OEAtomGetResidue(atom);
  if (OEGetResidueIndex(thisRes) == OEResidueIndex::MSE)
  {
    thisRes.SetName("MET");          // modify res properties
    thisRes.SetHetAtom(false);
    OEAtomSetResidue(atom, thisRes); // store updated residue

    if (atom->GetAtomicNum() == OEElemNo::Se)
    {
      atom->SetAtomicNum(OEElemNo::S);// fix atom type & name
      atom->SetName(" SD ");
    }
  }
}

The input routines for Protein Data Bank (PDB) files are setup to automatically perceive OEResidue properties but the perception can also be done explicitly, if required, as shown below.

Listing 3: Perceiving OEResidues

if (! OEHasResidues(mol))
{
  OEPerceiveResidues(mol, OEPreserveResInfo::All);
}

A Hierarchy View

At times, it may be more convenient to loop over residues or chains directly, rather than looping over atoms and keeping track as OEResidue properties change. The OEHierView class supports this approach by representing a biopolymer molecule as a hierarchy of the components: OEHierChain, OEHierFragment, and OEHierResidue.

Listing 4: Looping over hierarchical components

/**********************************************************************
  Copyright (C) 2015 OpenEye Scientific Software, Inc.
***********************************************************************/
//residue count
#include <openeye.h>
#include <oesystem.h>
#include <oechem.h>
#include <oebio.h>

using namespace OESystem;
using namespace OEChem;
using namespace OEBio;

static void CalcResCounts(OEMolBase &mol)
{
  OEHierView hv(mol);
  int chainCt = 0;
  int fragCt  = 0;
  int resCt   = 0;
  int watCt   = 0;
  for(OEIter<OEHierChain> chain = hv.GetChains();chain;++chain)
  {
    ++chainCt;
    for(OEIter<OEHierFragment> frag = chain->GetFragments();frag;++frag)
    {
      ++fragCt;
      for(OEIter<OEHierResidue> hres = frag->GetResidues();hres;++hres)
      {
        ++resCt;
        if(OEGetResidueIndex(hres->GetOEResidue()) == OEResidueIndex::HOH)
        {
          ++watCt;
        }
      }
    }
  }
  std::cout << "Molecule : " << mol.GetTitle()      << std::endl;
  std::cout << "Chains   : " << chainCt             << std::endl;
  std::cout << "Fragments: " << fragCt              << std::endl;
  std::cout << "Residues : " << resCt
                     << " (" << watCt << " waters)" << std::endl;
}

int main(int argc, char *argv[])
{
  if(argc != 2)
    OEThrow.Usage("%s <mol-infile>", argv[0]);

  oemolistream ims;
  if(! ims.open(argv[1]))
    OEThrow.Fatal("Unable to open %s for reading",argv[1]);

  OEGraphMol mol;
  while(OEReadMolecule(ims, mol))
  {
    if (! OEHasResidues(mol))
      OEPerceiveResidues(mol, OEPreserveResInfo::All);

    CalcResCounts(mol);
  }
  return 0;
}

Be aware that OEHierView works with what amounts to a snapshot of the molecule, so if changes are made to the original molecule (such as adding or deleting atoms) it may be necessary to rebuild the OEHierView.

The OEHier... classes provide direct access to chains, fragments and residues using common identifiers such as chain id, residue name and residue number.

Listing 5: Selecting a residue

OEHierView hv(mol);
OEHierResidue hres = hv.GetResidue('A', "LEU", 27);
for(OEIter<OEAtomBase> atom = hres.GetAtoms();atom;++atom)
{                                          // only this residue's atoms
  OEResidue res = OEAtomGetResidue(atom);
  std::cout << res.GetSerialNumber() << " " << atom->GetName() << std::endl;
}

Sequence Alignment

Two proteins can be sequence aligned with the function OEGetAlignment which returns the class OESequenceAlignment. Several different amino acid distance matrices are supplied, including OESeqAlignmentMethod::PAM250 [Dayhoff-1978] for closely related proteins, and OESeqAlignmentMethod::BLOSUM62 [Henikoff-1992] & OESeqAlignmentMethod::GONNET [Gonnet-1992] for more distantly related proteins.

Alignments can be written out in standard format using the function OEWriteAlignment.

The function OERMSD will calculate the backbone or C-\(\alpha\) RMSD of previously aligned proteins.

Crystal Symmetry

The space group and unit cell parameters (a, b, c, \(\alpha\), \(\beta\), \(\gamma\)) for an x-ray crystal structure can be accessed and modified using the following OEBio functions:

The function OEExpandCrystalSymmetry will replicate the molecule to fill-in each of the symmetry related molecules in other regions of the crystal lattice, out to a specified radius.

Protein Secondary Structure

OEBio provides functionality for perceiving secondary structural elements in proteins. Perception is performed according to the method of Kabsch and Sander [Kabsch-1983] and identifies helices, beta sheets, and turns based on hydrogen bonding patterns within a protein. The function OEPerceiveSecondaryStructure determines secondary structure for a protein molecule, and stores the assigned secondary structure types in the OEResidue associated with each atom in the protein. Secondary structure assignments are stored as single integer values within OEResidue, into which are packed the type (helix, sheet, etc.), helix/sheet/turn ID numbers, and strand identifiers for individual strands within beta sheets. To help decode the secondary structure information packed in these integers, several functions are provided:

These functions all take an integer argument, which is typically the value returned from OEResidue::GetSecondaryStructure. An additional function, OESecondaryStructurePacked, will do the reverse operation, namely, taking a secondary structure type, ID, and optional strand ID, and packing them into an integer suitable for use with OEResidue::SetSecondaryStructure.

Secondary structure types are defined in the OESecondaryStructure namespace.

A functor is also provided to help identify atoms belonging to particular secondary structure features. OEHasSecondaryStructure takes an integer argument. This integer can be one of the above constants (e.g., to find all atoms belonging to alpha helices); the value retrieved from a call to OEResidue::GetSecondaryStructure (e.g., to find all atoms belonging to the same helix as a particular residue); or a value returned from OESecondaryStructurePacked, which can be used to find atoms from a specific secondary structural element, such as Helix number 4.

Functors

A number of atom predicates are provided in OEBio to make it easy to identify atoms with particular properties:

This last predicate uses PDBAtoms (such as CA) to refer to specific atom types in a Protein Data Bank (PDB) structure.

Misc

The function OEIsStandardProteinResidue indicates whether a given OEHierResidue is a standard amino acid.

The function OEHasBondedResidues indicates whether the atoms in each residue have bonds to the same residue.

Given an OEAtomBase or OEResidue, the function OEGetResidueAtoms will return an iterator over all the atoms in the residue.