Spruce TK is used to process PDB or mmCIF files containing the structures, resulting from
X-ray crystallography, Nuclear Magnetic Resonance (NMR) spectroscopy, or electron microscopy (EM) experiments.
into molecules usable for molecular modeling. Due to the nature of the experiments and data, some processing
is required before use in modeling.
Spruce TK provides an end-to-end preparation workflow using the
OEMakeDesignUnitOptions and associated options classes) are available to
control for the desired behavior. Metadata (see
OEStructureMetadata) can be supplied
to infuse experimental data, like the proper variant sequence and also fix common problems reading
from PDB files (e.g. bond order perception). Spruce has a built-in Chemical Components Dictionary matching 3-letter
residue codes to SMILES, derived from the corresponding one at the RCSB, but curated to fix improper SMILES.
The workflow automatically runs through the following steps to produce
Biological unit extraction (see
If structure is from X-ray, it will also generate packing residues for visualization of crystal contacts
Alternate location assignment or enumeration (see the
Splits the system into components, e.g. identifying ligands, cofactors, excipients.
Hydrogen placement and optimization, including searching ligand tautomer states (see
Assignment of partial charges to the entire system (see
OEInteractionHintContainerencoded for visualization and more
Multiple design units are often produced. This results from multiple biological units generated during step 1, and further expanding that set when enumerating the alternate locations in step 3. The Iridium categorization (see Categorization) should be helpful in choosing which (if not all) of the produced units to use in downstream modeling tasks. We find that the Iridium categories are also helpful for managing modeling expectations, having a quality measure of the input data for modeling.
Included with Spruce TK is a large database of loop templates generated by parsing the structures in the public PDB repository pulled from the RCSB. This database is available from db-downloads. in a platform-independent format. The SPRUCE loop database is large (~13G) and will take time to download with a good internet connection. When the download is complete, move the file to a convenient location, a network drive is not recommended for performance.
To use it during prep (OEMakeDesignUnits), set the path in SetLoopDBFilename method on the
The database can be appended to (updated) using the LoopDB_Builder utility app, that ships with the SPRUCE apps. This would be in the event that the user has a collection of internal/proprietary structures, or if specific structures are released to the public PDB between OpenEye updates of the database that are crucial for a given target that is less well described by the existing templates.
Biological Unit (BU) - the biologically relevant unit for a protein to use in modeling. For example, in Trypsin the BU is a monomer, in HIV Protease the BU is a homo-dimer. See Biological Unit below.
Asymmetric Unit (ASU) - the contents of a PDB or mmCIF file from X-ray contain an asymmetric unit as the output of the experiment. This is sometimes equivalent to the BU, but often requires manipulation to create a correct BU.
Design Unit (DU) - the result of preparation in Spruce is a collection of molecules from a single BU, extracted and ready to use for modeling tasks. See Design Unit below for more details.
Alternate locations (alt locs) - X-ray experiments can often contain results with atoms occupying more than one location. Crystallographers denote this with a measure of the amount of time an atom occupies a given set of coordinates. A well resolved atom will have an occupancy of 1.0, meaning that 100% of the time it is in that location. Sometimes, the atom exists in two positions, alternate locations, and these appear in the input file with single letter designations and with occupancy < 1.0.
In short, the biological unit (BU) is an object that contains the biologically relevant parts of an ASU, which have not been split into various molecular components and are not yet prepped for modeling. For a more detailed explanation of BUs, refer to Introduction to Biological Assemblies and the PDB Archive hosted at the RCSB.
A BU can be constructed from a single PDB from the PDB’s own header remarks, or by using a sequence
alignment to an input reference protein. To extract a BU or set of BUs from a PDB, one should use the helper
function in Spruce TK called
OEExtractBioUnits. The following example shows
how to extract BUs from the PDB remarks:
biounits = oespruce.OEExtractBioUnits(extract_prot, opts)
Using the same function, one can also extract BUs from a PDB using an input reference protein. The following example shows
how to use
OEExtractBioUnits to extract BUs from a given reference:
biounits = oespruce.OEExtractBioUnits(extract_prot, ref_prot, opts)
The design unit (DU) is an object that contains the extracted and prepared parts of a single BU, ready for modeling. The parts include:
Ligand (not always, an apo DU will not contain a ligand)
Packing residues (if any exist near the site)
Excipients (if any exist near the site)
Proton Placement and Optimization¶
Traditionally, most biomolecular structures have been generated with either no
explicit hydrogens or only polar hydrogens. However, having all atoms explicitly
represented and positioned to make appropriate hydrogen bonds is sometimes necessary.
OEPlaceHydrogens does this by adding and
placing hydrogens and by “flipping” certain functional groups
(e.g., sidechain amides and imidazoles) as required for an optimal hydrogen bonding network.
Spruce TK leverages this functionality, but builds on top of this by taking the
ligand protomer and tautomer states into account in the
function. The function identifies the heterogens (ligands, cofactors, etc.) in the structure and
enumerates their states using the
OEGetReasonableTautomers function or
by using user supplied states.
The hydrogen bonding network of the biomolecule is then optimized in the presence of each state
independently and the most favorable state of the complex is chosen based on interaction energies.
If a binding site holds two heterogens, e.g. a ligand and a cofactor and each have multiple tautomer
states, the combinatorial number of states are optimized and evaluated. To do this efficiently
only the binding site(s) are initially optimized. After selecting the appropriate states of each of
the binding sites the remaining parts of the system is optimized keeping these states fixed.
Protein Sidechain Modeling¶
Amino acid sidechains are sometimes missing in protein structures due to low or
missing density from X-ray diffraction crystallography experiments. This can be due
to e.g. high flexibility making assignment of a specific location of the atoms difficult.
However, most molecule simulation software requires a position of these atoms, and the
lack of them can cause incorrect results. Partial sidechains are identified with
OEBuildSidechains these residues are then divided into groups based
on proximity and side-chain orientation. The groups are independently optimized taking into account their local
environment. In each group the sidechains are built out and a standard rotamer library from the
OERotamerLibrary namespace is used. For each side-chain the set of rotamers
are evaluated based on an interaction energy with the nearest environment and the most favorable state is
chosen. In the case where multiple residues are being built in a group, an iterative procedure attempts to
find the optimal energy for the entire set of residues. In the event a sidechain cannot be built due to
poor input geometry causing clashes the entire cluster is skipped. Water locations that block
side-chain rotamers result in those water molecules being deleted (optional) since their placement is most likely an artifact.
Protein Loop Modeling¶
Similarly to missing side-chains above there are sometimes gaps in protein structures, where the position
of the entire amino acid residues could not be resolved based on the experimental data.
OEBuildLoops is able to build missing gaps (loops). The workflow is illustrated
by the figure below. The function identifies missing loops, by comparing the SEQRES section of the
PDB header with the structure using a sequence alignment. The user can optionally input their
own sequence using
OESequenceMetadata if the SEQRES in the header is incorrect or missing.
With the gaps identified, the function loops for templates matching the gap, filters them based on the
ability to fit the gap without clashing, and eventually does an optimization of the top candidates
in the local environment before picking the most favorable conformation.
If multiple results are desired, these can be retrieved using
A protein can be structurally superimposed on to a reference protein structure using
the OESpruce TK.
Proteins can be superimposed with either atomic coordinates in the
OEStructuralSuperposition class, or with secondary structure elements
OEStructuralSuperposition class can superimpose proteins using any of the following four methods:
The global method (default for
OEStructuralSuperposition). This method uses all matching heavy atoms from the reference and fit proteins as the region for performing the superposition calculation (see
The Difference Distance Matrix method. This method calculates the pairwise distance matrix of C-alpha atoms for both the reference and fit proteins, then takes the difference of these two matrices to find the difference distance matrix (DDM). The longest contiguous region of the resulting difference distance matrix (DDM) is used for the structural superposition calculation (see
The weighted-DDM method. This method uses the DDM matrix, and calculates Gaussian weighting factors for all matching C-alpha atoms. These Gaussian weights are used as a weighting function in the superposition calculation (see
The site residue method. This method uses a set of site residues (given as a set of unique residue strings) as a constraint on the protein superposition. Site residues must be set with the
SetSiteResiduesmember function of the
OESecondaryStructureSuperposition class can superimpose proteins using the following method:
The secondary structure superposition method. This method takes two proteins and performs a structural superposition on them based on the shape overlap of the secondary structure elements of the two proteins.
All structural superposition methods in the
class have a corresponding score from the sequence alignment that was used to find the
matching atoms of both proteins.
This score comes from the output of the
where a larger score indicates a better sequence alignment, and scores below a small threshold
(around 200) should be considered a bad alignment.
All structural superposition methods in the
class have a corresponding RMSD value for superposition that can be loosely associated
with the quality of the superposition. The
class does not have an RMSD value, but instead uses the Tanimoto score from the
underlying shape overlap calculation.
How to Correctly Read a PDB File¶
Reading a PDB file correctly for use in subsequent modeling tasks can be challenging.
To correctly read a PDB, one must be aware that PDB header information as well as
information about alternate location codes within the PDB file will be lost unless
a specific combination of PDB-centric
OEIFlavor’s are used.
Furthermore, the protein itself must be processed by
OEAltLocationFactory in order to create a
molecule with all alternate location atoms retained. Spruce handles this for the user,
so they are not handled by this reader. However, for uses other than Spruce the user
needs to either, enumerate the different alternate locations, or pick a specific location.
With that in mind, we recommend reading PDB files for use in OESpruce TK using
the following pattern as shown in ReadProteinFromPDB below:
The mmCIF reader was designed to read all the necessary input by default, and therefore no
OEIFlavor’s are necessary to read them.
Iridium [Warren-2012] is a metric used to estimate the model quality of a structure resulting from an X-ray crystallography experiment. The metric evaluates what we deem essential considerations for using protein-ligand structures in drug discovery, particularly virtual screening.
The metric was developed by aggregating previously published docking validation sets, evaluating each of them by eye and even re-refining some of the data. The high quality structures were then deposited into a new dataset called Iridium HT [Warren-2012].
The metric takes a number of parameters into account and categorizes structures into 4 different categories, grading their trustworthiness to use in modeling.
HT - Highly Trustworthy
MT - Mildly Trustworthy
NT - Not Trustworthy
NA - Not Applicable
The latter category is used when no electron density data is available, or when the structure being evaluated is not coming from an X-ray crystallography experiment. Electron density data is available for at least 85% of the public PDB data and is required for all new depositions. Historical data is also being recovered, so that percentage is only expected to increase.
Iridium is not currently an appropriate metric for electron microscopy (EM) experiments, in part because the contour level, which we use as 1 sigma in X-ray experiments, is determined on a per experiment basis, and also in part due to different diffraction behavior of heavy atoms comparing to the X-ray.
Iridium is not currently an appropriate metric for neutron diffraction experiments, due to the different diffraction behavior for heavy atoms (e.g. sulfur and phosphorus are largely invisible in neutron diffraction data).
Iridium relies on a bound ligand, and is therefore not appropriate for apo structures.
The initial Iridium categorization was done by eye. The rules developed were then formalized to be able to rigorously assess the Iridium category on every X-ray structure. In some cases the original publication is not explicit about threshold values, but the metric below follows the paper in spirit and is approved by the original authors. The outcome of formalizing the rules, resulted in minor category discrepancies on the published dataset, is likely due to the more stringent adherence to the rules.
The criteria to consider include both global and local quality metrics and are as follows (the acronyms in parentheses, are used when spruce logs the Iridium details):
DPI - The diffraction-component precision index or global precision estimate [Cruickshank-1999]
Density coverage of the ligand heavy atoms (LaD)
Density coverage of the active site residue heavy atoms (including co-factors) (ASaD)
Occupancy of ligand heavy atoms (POL)
Occupancy of active site heavy atoms (POAS)
Alternate locations of the ligand (AltConfs)
Alternate locations of the active site residues (AltConfs)
Presence of crystal packing residues near binding site (PackRes)
Presence of excipients near binding site (Excp)
Whether ligand is covalently bound (PossCov)
First, the initial consideration is based on the density coverage of the ligand and the active site residues.
< 0.90 and > 0.50
< 0.95 and > 0.50
Subsequently, a structure can be demoted from HT to MT, if any of the conditions below are true:
DPI > 0.50
Presence of alternate locations of the ligand or the active site residues
Any ligand heavy atoms with an occupancy < 0.90
Any active site heavy atoms with an occupancy < 0.50
Perception of interactions between packing residues and the ligand using
OEPerceiveInteractionOptionswith default settings.
Perception of interactions between any excipient and the ligand using
OEPerceiveInteractionOptionswith default settings.
Perception of covalent interactions of the ligand in the active site using
OEPerceiveInteractionOptionswith default settings.
Furthermore, any structure will be demoted to NT if the Rfree value is larger than 0.45 and the resolution is below 3.5A, as we consider that an irrational Rfree value (IrrRFree).
Not all structures report a DPI, and in such cases it is calculated using