Protein preparation is the process of transforming a macromolecular structure into a form more suitable for a computational experiment. The specific set of tasks required depends on the experiment and the source of the structure. Tools for several common prep tasks are described below. For a demonstration of how these can be combined, see the Preparing a Protein example.
Protein structures rarely describe whole molecules. Instead, they are usually missing atoms that could not be resolved well. In addition, many structures contain multiple copies of atoms called alternate (“alt”) locations or alternate conformations that can indicate that part of the structure is moving. As the resolution of X-ray crystal structures improves, alt locations can be expected to become ubiquitous. These conformations are indexed by an ID character such as A or B that is stored in the residue information for that atom (see Biopolymer Residues).
Having extra copies of atoms can interfere with some experiments, such as docking. One way to handle this is to eliminate all but one conformation. By default, OEChem drops all but the A or blank conformations when reading atoms in a PDB format file. To prevent this, the file must be read with the OEIFlavor_PDB_ALTLOC flavor.
Listing 1: Setting the PDB Flavor to Retain All Alts
ims = oemolistream() ims.SetFlavor(OEFormat_PDB, OEIFlavor_PDB_ALTLOC)
Another way to drop all but one conformation is to use the OEAltLocationFactory object (see the Alternate Locations section) and generate, for example, a molecule that retains only the highest occupancy conformation of each location group. This object can be used to iterate through each combination of alt locations (for example, when a group is moving in an active site).
Listing 2: Keeping the Highest Occupancy Alt Locations
alf = OEAltLocationFactory(mol) if alf.GetGroupCount() != 0: alf.MakePrimaryAltMol(mol)
Traditionally, most protein structures have been generated with either no explicit hydrogens or only polar hydrogens. However, having all atoms explicitly represented and positioned to make appropriate hydrogen bonds is sometimes necessary. The function OEPlaceHydrogens does this by adding and placing hydrogens and by “flipping” certain functional groups (e.g., sidechain amides and imidazoles) as required.
Listing 3: Adding and Optimizing the Positions of Hydrogens
if OEPlaceHydrogens(mol): # ...
There are several OEPlaceHydrogensOptions parameters that control the process. For example, it is possible to prevent the bond lengths of any covalent bonds to pre-existing hydrogens from being altered:
Listing 4: Setting Options for Placing Hydrogens
opt = OEPlaceHydrogensOptions() opt.SetStandardizeBondLen(False)
Listing 5: Predicate for Residue Equivalence
class IsSameResidue(OEUnaryAtomPred): def __init__(self, res): OEUnaryAtomPred.__init__(self) self.referenceResidue = res def __call__(self, atom): res = OEAtomGetResidue(atom) return OESameResidue(res, self.referenceResidue) def CreateCopy(self): # __disown__ is required to allow C++ to take ownership of this # object and its memory return IsSameResidue(self.referenceResidue).__disown__()
We can use that predicate to prevent a selected residue from flipping:
Listing 6: Preventing a Residue from Flipping
# given predicate IsSameResidue, residue selectedResidue and molecule mol... opt = OEPlaceHydrogensOptions() opt.SetNoFlipPredicate(IsSameResidue(selectedResidue)) if OEPlaceHydrogens(mol, opt): # selectedResidue will not be flipped...
Using an OEPlaceHydrogensDetails object enables examination of the results, showing the clusters of movable groups explored and the scores. A simple way to get at this information is to display a summary description:
Listing 7: Examining the Details of Placing Hydrogens
# given molecule mol and OEPlaceHydrogensOptions opt... details = OEPlaceHydrogensDetails() if OEPlaceHydrogens(mol, details, opt): print(details.Describe())
The following is a fragment from a details object description:
OEBio 2.1.1: score system MMFF-NIE: flip bias 2.000 87 clusters : 5 flips cluster 0 : score -53.935! CG ASN44(A) : amide: -6.084 (o=-53.93!,f=-8.42!) NZ LYS47(A) : NH3 150deg: -14.865 CG HIS80(A) : +bothHN: -5.628 (o=-53.93!,f=-15.89!) OG1 THR81(A) : OH 65deg: 1.255 CD GLN87(A) :FLIP amide: -4.976 (o=-26.48!,f=-53.93!) CG ASN126(A) : amide: -5.193 (o=-53.93!,f=-35.33!) CG ASN169(A) : amide: -2.230 (o=-53.93!,f=-33.02!) CD GLN203(A) : amide: -1.400 (o=-53.93!,f=-15.53!) CG HIS205(A) : +bothHN: -8.693! (o=-53.93!,f=-6.89!) OG1 THR232(A) : OH 179deg: 9.120 O3B XYP601(A) : OH 280deg: -5.020 O2B XYP601(A) : OH 173deg: 3.917 O4B XYP601(A) : OH 50deg: 1.236 O3 XIF602(A) : OH 29deg: -3.688 cluster 1 : score -85.724! CG HIS85(A) : +bothHN: -7.737! (o=-85.72!,f=-84.85!) OG SER86(A) : OH 241deg: 2.116 OG SER139(A) : OH 185deg: 3.367! OH TYR171(A) : OH 25deg: -3.384 CG ASN172(A) : amide: 1.153 (o=-85.72!,f=-68.75!) NZ LYS179(A) : NH3 188deg: -39.497 ... single 13 : OG1 THR2(A) : OH 151deg: 1.545 single 14 : OG SER25(A) : OH 292deg: 2.046 single 15 : OH TYR29(A) : OH 199deg: -1.262 single 16 : OG SER35(A) : OH 254deg: 1.935 ...
One common step in preparing a macromolecular complex for use as input to a calculation is to separate components based on their functional roles. For example, a SZMAP calculation on a four-chain protein crystal structure might require one of four copies of the ligand to be extracted and the protein-complex associated with that ligand binding site to be separately extracted, ignoring any waters, buffers, etc.
Listing 8 uses the OESplitMolComplex function to separate the input molecule into the ligand for the first binding site, the protein-complex (including cofactors) for that site, the binding site waters, and everything else. In this example, the ligand is identified automatically and the binding site contains components near the ligand. If a ligand cannot be located, the entirety of each protein and everything nearby defines each binding site. If the input molecule is empty or there is some other problem that prevents processing, OESplitMolComplex will return false.
Listing 8: Splitting an Input Molecule
# for input molecule inmol ... lig = OEGraphMol() prot = OEGraphMol() wat = OEGraphMol() other = OEGraphMol() if OESplitMolComplex(lig, prot, wat, other, inmol): # work with the output molecules lig, prot, ...
Listing 8 uses the defaults, so no attempt would be made to recognize covalent ligands. Listing 9 uses an option to turn on the splitting of covalent ligands. There are a large number of options to control the process of splitting a molecule. See OESplitMolComplexOptions for a complete list.
Listing 9: Splitting Covalent Ligands
# given input and output molecules ... opt = OESplitMolComplexOptions() opt.SetSplitCovalent() if OESplitMolComplex(lig, prot, wat, other, inmol, opt): # ...
A related approach uses OEGetMolComplexComponents to return an iterator of molecules. Listing 10 also illustrates providing an explicit ligand residue name that must match for the ligand to be identified.
Listing 10: Iterating Over Split Components
# for input molecule mol and string ligName ... opt = OESplitMolComplexOptions(ligName) for frag in OEGetMolComplexComponents(mol, opt): # ...
OESplitMolComplex accesses separate filters for lig, prot, wat, and other from the OESplitMolComplexOptions object. In OEGetMolComplexComponents, the combination of the ligand, protein, and water filters is used to select the output components. The options method OESplitMolComplexOptions.ResetFilters can be used for basic changes to the filters, such as selecting a different binding site (see Listing 11).
Listing 11: Resetting Filters for All Sites
opt = OESplitMolComplexOptions() allSites = 0 opt.ResetFilters(allSites)
The function OECountMolComplexSites returns the number of binding sites.
Listing 12 shows a more involved case, combining protein-complex and binding-site waters into the protein filter and nothing in the water filter.
Listing 12: Modifying Filters
opt = OESplitMolComplexOptions() p = opt.GetProteinFilter() w = opt.GetWaterFilter() opt.SetProteinFilter(OEOrRoleSet(p, w)) opt.SetWaterFilter( OEMolComplexFilterFactory(OEMolComplexFilterCategory_Nothing))
Listing 13: Explicit Filter When Iterating
# for input molecule mol and options opt ... for l in OEGetMolComplexComponents(mol, opt, opt.GetLigandFilter()): # ...
For more complex workflows, it is useful to explicitly separate the tasks of assigning roles and filtering, if only to avoid the expense of assigning roles multiple times. Listing 14 shows how to assign roles to each fragment in a molecule for later use.
Listing 14: Assigning Roles to Each Mol Fragment
# for input molecule mol and options opt ... frags = OEAtomBondSetVector() if OEGetMolComplexFragments(frags, mol, opt): # ...
In Listing 15, the role information for each fragment is used to determine the number of binding sites and is filtered to produce ligand and protein complex mols.
Listing 15: Filtering Mol Fragments
# for options opt and OEAtomBondSetVector frags produced earlier ... numSites = OECountMolComplexSites(frags) OEThrow.Verbose("sites %d" % numSites) lig = OEGraphMol() l = opt.GetLigandFilter() if not OECombineMolComplexFragments(lig, frags, opt, l): OEThrow.Warning("Unable to combine ligand frags from %s" % mol.GetTitle()) protComplex = OEGraphMol() p = opt.GetProteinFilter() w = opt.GetWaterFilter() if not OECombineMolComplexFragments(protComplex, frags, opt, OEOrRoleSet(p, w)): OEThrow.Warning("Unable to combine complex frags from %s" % mol.GetTitle())
The basic idea behind automatic ligand identification is that the list of compounds that are not ligands (cofactors, buffers, etc.) grows faster than the list of ligands. The OEResidueCategoryData object contained within an OESplitMolComplexOptions is a database of these non-ligands. Although this database is being updated as we run across more cofactors, buffers, etc., it is possible for users to update the database as needed. Listing 16 shows how this is done, involving the Russian nesting doll-like organization of a OEResidueCategoryData within a OEMolComplexCategorizer within a OESplitMolComplexOptions object.
Listing 16: Adding a Cofactor
db = OEResidueCategoryData() db.AddToDB(OEResidueDatabaseCategory_Cofactor, "MTQ") cat = OEMolComplexCategorizer() cat.SetResidueCategoryData(db) opt = OESplitMolComplexOptions() opt.SetCategorizer(cat)
Please help OpenEye maintain and expand the built-in OEResidueCategoryData by informing us of any oversights or mistakes.
Listing 17: Command Line Options
import sys from openeye.oechem import * def main(argv=[__name__]): itf = OEInterface(InterfaceData) OEConfigureSplitMolComplexOptions(itf) if not OEParseCommandLine(itf, argv): OEThrow.Fatal("Unable to interpret command line!") opts = OESplitMolComplexOptions() OESetupSplitMolComplexOptions(opts, itf)