Biopolymers¶
Biopolymer Residues¶
Living organisms manufacture an assortment of biopolymers:
proteins are polymers of amino-acid residues,
DNA and RNA are polymers of nucleotide bases,
and polysaccharides such as cellulose and starch
are polymers of sugars.
In a description of these often very large molecules,
it is useful to include with each atom
information about the monomer it resides in, the chain
the monomer is in, the model the chain is in, etc.
OEChem TK’s OEResidue
is a container for just such information, and more.
Property |
Type |
Get Method |
Set Method |
---|---|---|---|
Residue Name |
char* |
||
Residue Number |
int |
||
Insertion Code |
char |
||
Chain ID |
char |
||
NMR Model Number |
int |
||
Fragment Number |
int |
||
Secondary Structure |
int |
||
Heterogen |
bool |
||
Atom Serial Number |
int |
||
Alternate Location Code |
char |
||
Occupancy |
float |
||
Temperature Factor |
float |
Note that OEResidue
is actually a part
of OEChem TK, it is included in this OEBio TK
chapter because it underlies much of OEBio TK’s functionality.
Table: Properties of OEResidues lists
each property along with the Get and Set methods for each.
Figure: PDB format atom records
shows how these properties are encoded for a structure in PDB format.
The residue number, insertion code, chain ID, model number and fragment number properties describe a hierarchy of structural elements, but do not impose any explicit hierarchy. Because the biological world is inherently messy, the flexibility of a flat representation such as this is very useful. For example, because it does not require a chain to be either above or below a fragment, it can describe both a chain with gaps, composed of multiple covalently bonded fragments and one that is covalently bonded to another chain. And, although residues are in a given sequence order, it does not require the polymer to be linear and can easily handle a glycosylated protein where a chain may have a tree-like, rather than linear, structure.
On the other hand, for some tasks a hierarchical view of biopolymers is more natural and more efficient, despite the limitations of a hierarchy. For more information, see the section A Hierarchy View.
Note that although infrequently used, the insertion code modifies
the residue number. When comparing residue numbers it is necessary
to compare both the residue number and the insertion code.
The best way to avoid errors when comparing OEResidues is to
use OESameResidue
.
Listing 1: Accessing OEResidue properties
import sys
from openeye import oechem
def CalcResCounts(mol):
if not oechem.OEHasResidues(mol):
oechem.OEPerceiveResidues(mol, oechem.OEPreserveResInfo_All)
resCt = 0
hetCt = 0
prevRes = oechem.OEResidue()
for atom in mol.GetAtoms():
thisRes = oechem.OEAtomGetResidue(atom)
if not oechem.OESameResidue(prevRes, thisRes):
resCt += 1
if thisRes.IsHetAtom():
hetCt += 1
prevRes = thisRes
print("Molecule: %s" % mol.GetTitle())
print("Residues: %d (%d hets)" % (resCt, hetCt))
def main(argv=[__name__]):
if len(argv) != 2:
oechem.OEThrow.Usage("%s <mol-infile>" % argv[0])
ims = oechem.oemolistream()
if not ims.open(argv[1]):
oechem.OEThrow.Fatal("Unable to open %s for reading" % argv[1])
for mol in ims.GetOEGraphMols():
CalcResCounts(mol)
if __name__ == "__main__":
sys.exit(main(sys.argv))
It is important to keep in mind that
OEResidues
are not residues, but instead are
information blocks associated with atoms.
Several of the properties of an
OEResidue
such as atom serial number and occupancy
are not residue properties at all,
but instead are properties of a specific atom.
Changing an OEResidue
property
will not modify anything in the molecule
until the OEResidue
is saved back into the molecule,
and this must be done for each atom
for which the property is to be changed.
Listing 2: Updating OEResidue properties
for atom in mol.GetAtoms():
thisRes = oechem.OEAtomGetResidue(atom)
if oechem.OEGetResidueIndex(thisRes) == oechem.OEResidueIndex_MSE:
thisRes.SetName("MET") # modify res properties
thisRes.SetHetAtom(False)
oechem.OEAtomSetResidue(atom, thisRes) # store updated residue
if atom.GetAtomicNum() == oechem.OEElemNo_Se:
atom.SetAtomicNum(oechem.OEElemNo_S) # fix atom type & name
atom.SetName(" SD ")
The input routines for Protein Data Bank (PDB) files
are setup to automatically perceive
OEResidue
properties
but the perception can also be done explicitly,
if required, as shown below.
Listing 3: Perceiving OEResidues
if not oechem.OEHasResidues(mol):
oechem.OEPerceiveResidues(mol, oechem.OEPreserveResInfo_All)
Residue Naming Conventions¶
The standard residue name list is ALA, ARG, ASN, ASP, CYS, GLN, GLU, GLY, HIS, ILE, LEU, LYS, MET, PHE, PRO, SER, THR, TRP, TYR, VAL, ASX, GLX, CYX, CYH, HID, HIE or HIP.
GLX and ASX are the Protein Data Bank’s standard names for when it is not clear whether the residue is a GLU or GLN, or an ASP or ASN.
CYH and CYX are ways that groups outside of the PDB describe explicitly reduced and oxidized version of CYS, either cysteine or cystine respectively.
We generally follow the standard-non-standard names from the AMBER force field. For example, we show ASH, CYS, CYH, CYM, GLH, HID, HIE, HIP, LYN, TYM, where H at the end means reduced, N means neutral and M means deprotonated. We will not generate PDB files or rename residues in molecules based on their physical state, but will follow the standard residue names like HIS to cover all states of HIE, HID, HIP. We also respect residue names provided to us.
We discourage the use of UNX, UNK and UNL according to: Worldwide PDB Heterogen Section. They should only be used in instances where the chemical components for which the chemical identity is truly unknown. From the wwPDB: Unknown atoms or ions will be represented as UNX with the chemical formula X1. Unknown ligands are UNL; unknown amino acids are UNK.
A Hierarchy View¶
At times, it may be more convenient to loop over residues or chains directly,
rather than looping over atoms and keeping track as OEResidue
properties change. The OEHierView class supports this approach
by representing a biopolymer molecule as a hierarchy
of the components: OEHierChain
,
OEHierFragment
,
and OEHierResidue
.
Listing 4: Looping over hierarchical components
import sys
from openeye import oechem
def CalcResCounts(mol):
hv = oechem.OEHierView(mol)
chainCt = 0
fragCt = 0
resCt = 0
watCt = 0
for chain in hv.GetChains():
chainCt += 1
for frag in chain.GetFragments():
fragCt += 1
for hres in frag.GetResidues():
resCt += 1
if (oechem.OEGetResidueIndex(hres.GetOEResidue()) ==
oechem.OEResidueIndex_HOH):
watCt += 1
print("Molecule : %s" % mol.GetTitle())
print("Chains : %d" % chainCt)
print("Fragments: %d" % fragCt)
print("Residues : %d (%d waters)" % (resCt, watCt))
def main(argv=[__name__]):
if len(argv) != 2:
oechem.OEThrow.Usage("%s <mol-infile>" % argv[0])
ims = oechem.oemolistream()
if not ims.open(argv[1]):
oechem.OEThrow.Fatal("Unable to open %s for reading" % argv[1])
for mol in ims.GetOEGraphMols():
if not oechem.OEHasResidues(mol):
oechem.OEPerceiveResidues(mol, oechem.OEPreserveResInfo_All)
CalcResCounts(mol)
if __name__ == "__main__":
sys.exit(main(sys.argv))
Be aware that OEHierView
works
with what amounts to a snapshot of the molecule, so if changes are
made to the original molecule (such as adding or deleting atoms)
it may be necessary to rebuild the OEHierView
.
The OEHier… classes provide direct access to chains, fragments and residues using common identifiers such as chain id, residue name and residue number.
Listing 5: Selecting a residue
hv = oechem.OEHierView(mol)
hres = hv.GetResidue("A", "LEU", 27)
for atom in hres.GetAtoms(): # only this residue's atoms
res = oechem.OEAtomGetResidue(atom)
print(res.GetSerialNumber(), atom.GetName())
Sequence Alignment¶
Two proteins can be sequence aligned
with the function OEGetAlignment
which returns the class OESequenceAlignment.
Several different amino acid distance matrices are supplied, including
OESeqAlignmentMethod_PAM250
[Dayhoff-1978] for closely
related proteins, and
OESeqAlignmentMethod_BLOSUM62
[Henikoff-1992] &
OESeqAlignmentMethod_GONNET
[Gonnet-1992] for
more distantly related proteins.
Alignments can be written out in standard format using the function
OEWriteAlignment
.
The function OERMSD
will
calculate the backbone or C-\(\alpha\) RMSD of previously aligned proteins.
Crystal Symmetry¶
The space group and unit cell parameters (a, b, c, \(\alpha\), \(\beta\), \(\gamma\)) for an x-ray crystal structure can be accessed and modified using the following OEBio functions:
The function OEExpandCrystalSymmetry
will
replicate the molecule to fill-in each of the symmetry related
molecules in other regions of the crystal lattice, out to a specified radius.
Protein Secondary Structure¶
OEBio provides functionality for perceiving secondary structural elements
in proteins. Perception is performed according to the method of Kabsch and
Sander [Kabsch-1983] and identifies helices, beta sheets, and turns based on
hydrogen bonding patterns within a protein. The function
OEPerceiveSecondaryStructure
determines secondary structure
for a protein molecule, and stores the assigned secondary structure types in
the OEResidue associated with each atom in the protein.
Secondary structure assignments are stored as single integer values within
OEResidue, into which are packed the type
(helix, sheet, etc.), helix/sheet/turn ID numbers, and strand
identifiers for individual strands within beta sheets. To help decode the
secondary structure information packed in these integers, several functions
are provided:
These functions all take an integer argument, which is typically the value
returned from OEResidue.GetSecondaryStructure
. An
additional function, OESecondaryStructurePacked
, will do the
reverse operation, namely, taking a secondary structure type, ID, and optional
strand ID, and packing them into an integer suitable for use with
OEResidue.SetSecondaryStructure
.
Secondary structure types are defined in the
OESecondaryStructure
namespace.
A functor is also provided to help identify atoms belonging to particular
secondary structure features.
OEHasSecondaryStructure
takes an
integer argument. This integer can be one of the above constants (e.g.,
to find all atoms belonging to alpha helices); the value retrieved from a call
to OEResidue.GetSecondaryStructure
(e.g., to find all atoms
belonging to the same helix as a particular residue); or a value returned
from OESecondaryStructurePacked
, which can be used to find
atoms from a specific secondary structural element, such as Helix number 4.
Functors¶
A number of atom predicates are provided in OEBio to make it easy to identify atoms with particular properties:
This last predicate uses PDBAtoms (such as
CA
)
to refer to specific atom types
in a Protein Data Bank (PDB) structure.
Misc¶
The function OEIsStandardProteinResidue
indicates
whether a given OEHierResidue
is
a standard amino acid.
The function OEHasBondedResidues
indicates
whether the atoms in each residue have bonds to the same residue.
Given an OEAtomBase
or OEResidue
,
the function OEGetResidueAtoms
will return
an iterator over all the atoms in the residue.