Protein Preparation

Protein preparation is the process of transforming a macromolecular structure into a form more suitable for a computational experiment. The specific set of tasks required depends on the experiment and the source of the structure. Tools for several common prep tasks are described below. For a demonstration of how these can be combined, see the Preparing a Protein example.

Processing Alternate Locations

Protein structures rarely describe whole molecules. Instead, they are usually missing atoms that could not be resolved well. In addition, many structures contain multiple copies of atoms called alternate (“alt”) locations or alternate conformations that can indicate that part of the structure is moving. As the resolution of X-ray crystal structures improves, alt locations can be expected to become ubiquitous. These conformations are indexed by an ID character such as A or B that is stored in the residue information for that atom (see Biopolymer Residues).

Having extra copies of atoms can interfere with some experiments, such as docking. One way to handle this is to eliminate all but one conformation. By default, OEChem drops all but the A or blank conformations when reading atoms in a PDB format file. To prevent this, the file must be read with the OEIFlavor::PDB::ALTLOC flavor.

Listing 1: Setting the PDB Flavor to Retain All Alts

oemolistream ims;
ims.SetFlavor(OEFormat::PDB, OEIFlavor::PDB::ALTLOC);

Another way to drop all but one conformation is to use the OEAltLocationFactory object (see the Alternate Locations section) and generate, for example, a molecule that retains only the highest occupancy conformation of each location group. This object can be used to iterate through each combination of alt locations (for example, when a group is moving in an active site).

Listing 2: Keeping the Highest Occupancy Alt Locations

OEAltLocationFactory alf(mol);
if (alf.GetGroupCount() != 0u)

Adding Hydrogens and Optimizing Hydrogen Bonds

Traditionally, most protein structures have been generated with either no explicit hydrogens or only polar hydrogens. However, having all atoms explicitly represented and positioned to make appropriate hydrogen bonds is sometimes necessary. The function OEPlaceHydrogens does this by adding and placing hydrogens and by “flipping” certain functional groups (e.g., sidechain amides and imidazoles) as required.


Asparagine has been flipped and serine oriented to H-bond with main-chain atoms

Listing 3: Adding and Optimizing the Positions of Hydrogens

if (OEPlaceHydrogens(mol))
  // ...


This is a new tool and some components are incomplete. We hope there is a benefit to making it available while it is still under active development.

There are several OEPlaceHydrogensOptions parameters that control the process. For example, it is possible to prevent the bond lengths of any covalent bonds to pre-existing hydrogens from being altered:

Listing 4: Setting Options for Placing Hydrogens

OEPlaceHydrogensOptions opt;

Using an OEPlaceHydrogensDetails object enables examination of the results, showing the clusters of movable groups explored and the scores. A simple way to get at this information is to display a summary description:

Listing 5: Examining the Details of Placing Hydrogens

// given molecule mol and OEPlaceHydrogensOptions opt...
OEPlaceHydrogensDetails details;
if (OEPlaceHydrogens(mol, details, opt))
  std::cout << details.Describe() << std::endl;

The following is a fragment from a details object description:

OEBio 2.1.1: score system MMFF-NIE: flip bias 2.000
87 clusters : 5 flips
cluster 0  : score -53.935!
             CG  ASN44(A)   :      amide:   -6.084  (o=-53.93!,f=-8.42!)
             NZ  LYS47(A)   : NH3 150deg:  -14.865
             CG  HIS80(A)   :    +bothHN:   -5.628  (o=-53.93!,f=-15.89!)
             OG1 THR81(A)   : OH   65deg:    1.255
             CD  GLN87(A)   :FLIP  amide:   -4.976  (o=-26.48!,f=-53.93!)
             CG  ASN126(A)  :      amide:   -5.193  (o=-53.93!,f=-35.33!)
             CG  ASN169(A)  :      amide:   -2.230  (o=-53.93!,f=-33.02!)
             CD  GLN203(A)  :      amide:   -1.400  (o=-53.93!,f=-15.53!)
             CG  HIS205(A)  :    +bothHN:   -8.693! (o=-53.93!,f=-6.89!)
             OG1 THR232(A)  : OH  179deg:    9.120
             O3B XYP601(A)  : OH  280deg:   -5.020
             O2B XYP601(A)  : OH  173deg:    3.917
             O4B XYP601(A)  : OH   50deg:    1.236
             O3  XIF602(A)  : OH   29deg:   -3.688
cluster 1  : score -85.724!
             CG  HIS85(A)   :    +bothHN:   -7.737! (o=-85.72!,f=-84.85!)
             OG  SER86(A)   : OH  241deg:    2.116
             OG  SER139(A)  : OH  185deg:    3.367!
             OH  TYR171(A)  : OH   25deg:   -3.384
             CG  ASN172(A)  :      amide:    1.153  (o=-85.72!,f=-68.75!)
             NZ  LYS179(A)  : NH3 188deg:  -39.497
single  13 : OG1 THR2(A)    : OH  151deg:    1.545
single  14 : OG  SER25(A)   : OH  292deg:    2.046
single  15 : OH  TYR29(A)   : OH  199deg:   -1.262
single  16 : OG  SER35(A)   : OH  254deg:    1.935

See also

Splitting Macromolecular Complexes

One common step in preparing a macromolecular complex for use as input to a calculation is to separate components based on their functional roles. For example, a SZMAP calculation on a four-chain protein crystal structure might require one of four copies of the ligand to be extracted and the protein-complex associated with that ligand binding site to be separately extracted, ignoring any waters, buffers, etc.

Listing 6 uses the OESplitMolComplex function to separate the input molecule into the ligand for the first binding site, the protein-complex (including cofactors) for that site, the binding site waters, and everything else. In this example, the ligand is identified automatically and the binding site contains components near the ligand. If a ligand cannot be located, the entirety of each protein and everything nearby defines each binding site. If the input molecule is empty or there is some other problem that prevents processing, OESplitMolComplex will return false.

Listing 6: Splitting an Input Molecule

  // for input molecule inmol ...
  OEGraphMol lig, prot, wat, other;

  if (OESplitMolComplex(lig, prot, wat, other, inmol))
    // work with the output molecules lig, prot, ...

Listing 6 uses the defaults, so no attempt would be made to recognize covalent ligands. Listing 7 uses an option to turn on the splitting of covalent ligands. There are a large number of options to control the process of splitting a molecule. See OESplitMolComplexOptions for a complete list.

Listing 7: Splitting Covalent Ligands

  // given input and output molecules ...
  OESplitMolComplexOptions opt;

  if (OESplitMolComplex(lig, prot, wat, other, inmol, opt))
    // ...

A related approach uses OEGetMolComplexComponents to return an iterator of molecules. Listing 8 also illustrates providing an explicit ligand residue name that must match for the ligand to be identified.

Listing 8: Iterating Over Split Components

  // for input molecule mol and string ligName ...
  OESplitMolComplexOptions opt(ligName);

  for (OEIter<OEMolBase> frag
                         = OEGetMolComplexComponents(mol, opt); frag; ++frag)
  // ...

The process used by the OESplitMolComplex and OEGetMolComplexComponents functions involves two steps:

  1. separating components and assigning functional roles
  2. filtering by roles to select the desired components

OESplitMolComplex accesses separate filters for lig, prot, wat, and other from the OESplitMolComplexOptions object. In OEGetMolComplexComponents, the combination of the ligand, protein, and water filters is used to select the output components. The options method OESplitMolComplexOptions::ResetFilters can be used for basic changes to the filters, such as selecting a different binding site (see Listing 9).

Listing 9: Resetting Filters for All Sites

  OESplitMolComplexOptions opt;
  const unsigned allSites = 0u;

The function OECountMolComplexSites returns the number of binding sites.

Listing 10 shows a more involved case, combining protein-complex and binding-site waters into the protein filter and nothing in the water filter.

Listing 10: Modifying Filters

  OESplitMolComplexOptions opt;
  OEOwnedPtr<OEUnaryPredicate<OERoleSet> > p(opt.GetProteinFilter());
  OEOwnedPtr<OEUnaryPredicate<OERoleSet> > w(opt.GetWaterFilter());
  opt.SetProteinFilter(*p || *w);


Listing 11 illustrates using an explicit filter when iterating. OEMolComplexFilter objects can be of arbitrary complexity.

Listing 11: Explicit Filter When Iterating

  // for input molecule mol and options opt ...
  OEOwnedPtr<OEUnaryPredicate<OERoleSet> > ligFilter(opt.GetLigandFilter());
  for (OEIter<OEMolBase> l
                     = OEGetMolComplexComponents(mol, opt, *ligFilter); l; ++l)
  // ...

For more complex workflows, it is useful to explicitly separate the tasks of assigning roles and filtering, if only to avoid the expense of assigning roles multiple times. Listing 12 shows how to assign roles to each fragment in a molecule for later use.

Listing 12: Assigning Roles to Each Mol Fragment

  // for input molecule mol and options opt ...
  std::vector<OEAtomBondSet> frags;

  if (OEGetMolComplexFragments(frags, mol, opt))

In Listing 13, the role information for each fragment is used to determine the number of binding sites and is filtered to produce ligand and protein complex mols.

Listing 13: Filtering Mol Fragments

  // for options opt and vector<OEAtomBondSet> frags produced earlier ...
  const unsigned numSites = OECountMolComplexSites(frags);
  OEThrow.Verbose("sites %d", numSites);

  OEGraphMol lig;
  OEOwnedPtr< OEUnaryPredicate<OERoleSet> > l(opt.GetLigandFilter());
  if (! OECombineMolComplexFragments(lig, frags, opt, *l))
    OEThrow.Warning("Unable to combine ligand frags from %s", mol.GetTitle());

  OEGraphMol protComplex;
  OEOwnedPtr< OEUnaryPredicate<OERoleSet> > p(opt.GetProteinFilter());
  OEOwnedPtr< OEUnaryPredicate<OERoleSet> > w(opt.GetWaterFilter());
  if (! OECombineMolComplexFragments(protComplex,
                                     OEOr<OERoleSet>(*p, *w)))
    OEThrow.Warning("Unable to combine complex frags from %s", mol.GetTitle());

The basic idea behind automatic ligand identification is that the list of compounds that are not ligands (cofactors, buffers, etc.) grows faster than the list of ligands. The OEResidueCategoryData object contained within an OESplitMolComplexOptions is a database of these non-ligands. Although this database is being updated as we run across more cofactors, buffers, etc., it is possible for users to update the database as needed. Listing 14 shows how this is done, involving the Russian nesting doll-like organization of a OEResidueCategoryData within a OEMolComplexCategorizer within a OESplitMolComplexOptions object.

Listing 14: Adding a Cofactor

  OEResidueCategoryData db;
  db.AddToDB(OEResidueDatabaseCategory::Cofactor, "MTQ");

  OEMolComplexCategorizer cat;

  OESplitMolComplexOptions opt;


Please help OpenEye maintain and expand the built-in OEResidueCategoryData by informing us of any oversights or mistakes.

The most important OESplitMolComplexOptions parameters can be controlled from the command line using OEConfigureSplitMolComplexOptions and OESplitMolComplexSetup (see Listing 15).

Listing 15: Command Line Options

#include <openeye.h>

#include <oechem.h>
#include <oebio.h>
#include "SplitMolComplexDoc.itf"

using namespace OESystem;
using namespace OEChem;
using namespace OEBio;

int main(int argc, char** argv)
  OEInterface itf(InterfaceData);

  if(! OEParseCommandLine(itf, argc, argv))
    OEThrow.Fatal("Unable to interpret command line!");

  OESplitMolComplexOptions opts;
  OESetupSplitMolComplexOptions(opts, itf);