Spruce processes PDB files containing the results of X-ray crystallography into molecule files usable for molecular modeling. Because these files are raw experimental output, some processing is required before use. The following terms are used throughout:


  1. Biological Unit (BU) - the actual, biologically relevant unit of a protein to use in modeling. For example, the BU of trypsin is a monomer, while the BU of HIV protease is a homodimer.
  2. Asymmetric Unit (ASU) - a PDB file from an X-ray experiment contains the asymmetric unit, which is the direct output of the experiment. The ASU is sometimes equivalent to the BU, but often must be manipulated to create a correct BU.
  3. Design Unit (DU) - the result of preparation in Spruce: a collection of molecules from a single BU, extracted and ready to use for modeling tasks. See the Design Unit section for more details.
  4. Alternate locations (alt locs) - X-ray results often contain atoms that occupy more than one location. Crystallographers record this as an occupancy: the fraction of the time an atom occupies a given set of coordinates. A well-resolved atom has an occupancy of 1.0, meaning that it is in that location 100% of the time. When an atom exists in two or more positions (alternate locations), each position appears in the input file with a single-letter designation and an occupancy < 1.0.
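
As a rough illustration of how alternate locations show up in the input, the sketch below pulls the alt loc letter and the occupancy out of a single fixed-width ATOM or HETATM record. It relies only on the standard PDB column layout (alt loc indicator in column 17, occupancy in columns 55-60) and does not use OESpruce TK; the names AltLocInfo, ParseAltLoc, and IsAlternateLocation are hypothetical helpers introduced here for illustration.

```cpp
#include <cstdlib>
#include <string>

// Fields of interest from one fixed-width PDB ATOM/HETATM record.
// (Hypothetical helper type, not part of any OpenEye toolkit.)
struct AltLocInfo
{
  char altLoc;       // column 17; ' ' when the atom has a single position
  double occupancy;  // columns 55-60
};

// Returns true and fills `info` if `line` is an ATOM/HETATM record long
// enough to carry an occupancy field; returns false otherwise.
bool ParseAltLoc(const std::string& line, AltLocInfo& info)
{
  if (line.size() < 60)
    return false;

  const std::string record = line.substr(0, 6);
  if (record != "ATOM  " && record != "HETATM")
    return false;

  info.altLoc = line[16];                                  // column 17 (0-based index 16)
  info.occupancy = std::atof(line.substr(54, 6).c_str());  // columns 55-60
  return true;
}

// True when the record describes one of several alternate positions:
// it carries an alt loc letter and is only partially occupied.
bool IsAlternateLocation(const AltLocInfo& info)
{
  return info.altLoc != ' ' && info.occupancy < 1.0;
}
```

In practice the toolkit handles this bookkeeping when the file is read with the appropriate flavors; the sketch is only meant to make the occupancy and alt loc fields concrete.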

Design Unit

The design unit is an object that contains the extracted and prepared parts of a single BU, ready for modeling. The parts include:

  1. Protein
  2. Ligand (optional; an apo DU does not contain a ligand)
  3. Site residues
  4. Packing residues (if any exist near the site)
  5. Excipients (if any exist near the site)

Protein Superposition

A protein can be structurally superimposed onto a reference protein structure using the OESpruce TK. Proteins can be superimposed using either atomic coordinates, with the OEStructuralSuperposition class, or secondary structure elements, with the OESecondaryStructureSuperposition class.

The OEStructuralSuperposition class can superimpose proteins using any of the following four methods:

  1. The global method (default for OEStructuralSuperposition). This method uses all matching heavy atoms from the reference and fit proteins as the region for performing the superposition calculation (see Global).
  2. The Difference Distance Matrix method. This method calculates the pairwise distance matrix of C-alpha atoms for both the reference and fit proteins, then takes the difference of these two matrices to obtain the difference distance matrix (DDM). The longest contiguous region of the resulting DDM is used for the structural superposition calculation (see DDM).
  3. The weighted-DDM method. This method uses the DDM to calculate Gaussian weighting factors for all matching C-alpha atoms. These weights are then applied in the superposition calculation (see Weighted).
  4. The site residue method. This method uses a set of site residues (given as a set of unique residue strings) as a constraint on the protein superposition. Site residues must be set with the SetSiteResidues member function of the OESuperpositionOptions class (see Site).
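
The DDM construction described above can be sketched in plain C++ as follows. This is a minimal illustration under stated assumptions, not the OESpruce TK implementation: it assumes the C-alpha atoms of the two proteins have already been matched and listed in the same order, it takes the absolute value of each difference, and the Gaussian weighting shown (based on each atom's mean DDM row value, with a user-chosen sigma) is a hypothetical choice, since the exact weighting function is not specified here. All type and function names are introduced for illustration only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y, z; };
using Matrix = std::vector<std::vector<double>>;

// Pairwise Euclidean distances between all C-alpha positions of one protein.
Matrix DistanceMatrix(const std::vector<Point>& ca)
{
  const std::size_t n = ca.size();
  Matrix d(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
    {
      const double dx = ca[i].x - ca[j].x;
      const double dy = ca[i].y - ca[j].y;
      const double dz = ca[i].z - ca[j].z;
      d[i][j] = std::sqrt(dx * dx + dy * dy + dz * dz);
    }
  return d;
}

// Element-wise absolute difference of the two distance matrices (the DDM).
// Assumes `ref` and `fit` hold matched C-alpha atoms in the same order;
// returns an empty matrix if the sizes differ.
Matrix DifferenceDistanceMatrix(const std::vector<Point>& ref,
                                const std::vector<Point>& fit)
{
  if (ref.size() != fit.size())
    return Matrix();

  Matrix ddm = DistanceMatrix(ref);
  const Matrix fitDist = DistanceMatrix(fit);
  for (std::size_t i = 0; i < ddm.size(); ++i)
    for (std::size_t j = 0; j < ddm.size(); ++j)
      ddm[i][j] = std::fabs(ddm[i][j] - fitDist[i][j]);
  return ddm;
}

// Hypothetical weighting: a Gaussian of each atom's mean DDM row value, so
// atoms whose internal distances change little get weights close to 1.
// `sigma` must be positive.
std::vector<double> GaussianWeights(const Matrix& ddm, double sigma)
{
  std::vector<double> weights(ddm.size(), 1.0);
  for (std::size_t i = 0; i < ddm.size(); ++i)
  {
    double sum = 0.0;
    for (double v : ddm[i])
      sum += v;
    const double mean = ddm[i].empty() ? 0.0 : sum / ddm[i].size();
    weights[i] = std::exp(-(mean * mean) / (2.0 * sigma * sigma));
  }
  return weights;
}
```

Because the DDM depends only on internal distances, it is independent of how the two proteins are positioned relative to each other, which is why it can be computed before any superposition is performed.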

The OESecondaryStructureSuperposition class can superimpose proteins using the following method:

  1. The secondary structure superposition method. This method takes two proteins and performs a structural superposition on them based on the shape overlap of the secondary structure elements of the two proteins.


All structural superposition methods in the OEStructuralSuperposition class have a corresponding score from the sequence alignment that was used to find the matching atoms of both proteins. This score comes from the output of the OESequenceAlignment class. A larger score indicates a better sequence alignment, and scores below a threshold of roughly 200 should be treated as indicating a poor alignment.


All structural superposition methods in the OEStructuralSuperposition class have a corresponding RMSD value for superposition that can be loosely associated with the quality of the superposition. The OESecondaryStructureSuperposition class does not have an RMSD value, but instead uses the Tanimoto score from the underlying shape overlap calculation.
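
To make the two quality measures concrete, the sketch below computes an RMSD over matched atom pairs and a shape Tanimoto from overlap volumes. It is a plain C++ illustration rather than toolkit code: it assumes the fit coordinates have already been superimposed on the reference and that matching atoms appear in the same order, and it uses the usual definition of a shape Tanimoto, overlap / (selfRef + selfFit - overlap). The names Coord, Rmsd, and ShapeTanimoto are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Coord { double x, y, z; };

// Root-mean-square deviation over matched atom pairs. Assumes `fit` has
// already been superimposed on `ref` and that both vectors list matching
// atoms in the same order. Returns -1.0 for empty or mismatched input.
double Rmsd(const std::vector<Coord>& ref, const std::vector<Coord>& fit)
{
  if (ref.empty() || ref.size() != fit.size())
    return -1.0;

  double sumSq = 0.0;
  for (std::size_t i = 0; i < ref.size(); ++i)
  {
    const double dx = ref[i].x - fit[i].x;
    const double dy = ref[i].y - fit[i].y;
    const double dz = ref[i].z - fit[i].z;
    sumSq += dx * dx + dy * dy + dz * dz;
  }
  return std::sqrt(sumSq / static_cast<double>(ref.size()));
}

// Shape Tanimoto from overlap volumes: 1.0 for identical shapes in the same
// pose, approaching 0.0 as the overlap vanishes.
double ShapeTanimoto(double selfRef, double selfFit, double overlap)
{
  const double denom = selfRef + selfFit - overlap;
  return denom > 0.0 ? overlap / denom : 0.0;
}
```

Note that the two measures run in opposite directions: a lower RMSD indicates a tighter superposition, whereas a higher Tanimoto indicates better shape overlap.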

How to Correctly Read a PDB File

Reading a PDB file correctly for use in subsequent modeling tasks can be challenging. PDB header information and alternate location codes within the PDB file are lost unless a specific combination of PDB-centric OEIFlavor values is used. Furthermore, the protein itself must be processed by OEAltLocationFactory in order to create a molecule with all alternate location atoms retained. With that in mind, we recommend reading PDB files for use in OESpruce TK with the pattern shown in ReadProteinFromPDB below:

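#include <string>
#include <openeye.h>
#include <oechem.h>
#include <oebio.h>

using namespace OESystem;
using namespace OEChem;
using namespace OEBio;
using std::string;

// Reads the first molecule in pdbFile into mol; returns false on failure.
bool ReadProteinFromPDB(const string& pdbFile, OEChem::OEMolBase& mol)
{
  // Keep PDB header data and alternate location codes while reading.
  unsigned int pdbFlavor = OEIFlavor::PDB::Default
                          | OEIFlavor::PDB::DATA
                          | OEIFlavor::PDB::ALTLOC;
  oemolistream ifs;
  ifs.SetFlavor(OEFormat::PDB, pdbFlavor);

  if (!ifs.open(pdbFile.c_str()))
  {
    OEThrow.Warning("Unable to open %s for reading.", pdbFile.c_str());
    return false;
  }

  OEGraphMol tempMol;
  if (!OEReadMolecule(ifs, tempMol))
  {
    OEThrow.Warning("Unable to read molecule from %s.", pdbFile.c_str());
    return false;
  }

  // Process alternate locations and write the resulting molecule into mol.
  OEAltLocationFactory fact(tempMol);
  mol.Clear();
  fact.MakePrimaryAltMol(mol);
  return true;
}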