Spruce is used to process PDB files containing the results of X-ray crystallography into molecule files usable for molecular modeling. Since these files are actually experimental output, some processing is required before use.
The biological unit (BU) is an object that contains the biologically relevant parts of an asymmetric unit (ASU). At this stage the contents have not yet been split into molecular components or prepared for modeling. BUs can be constructed from a single PDB file in two ways: from the file's own header remarks, or by sequence alignment to an input reference protein.
To extract a BU or set of BUs from a PDB, one should use the helper function in Spruce TK called OEExtractBioUnits. The following example shows how to extract BUs from the PDB remarks:
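To make concrete what "PDB remarks" refers to: in the PDB file format, biological assembly information is stored in REMARK 350 records, which list the chains involved and the BIOMT rotation/translation operators to apply to them. The sketch below is not Spruce TK code and makes no claim about how OEExtractBioUnits works internally. It is a plain-Python illustration, using a hypothetical helper named `parse_biomolecules`, of reading those header records from the text of a PDB file.

```python
def parse_biomolecules(pdb_text):
    """Collect REMARK 350 biological assembly records from PDB text.

    Hypothetical helper for illustration only; not part of Spruce TK.
    Returns {biomolecule_id: {"chains": [...], "operators": {op_id: rows}}},
    where each row is [r1, r2, r3, t] from one BIOMT1/2/3 line.
    """
    biomols = {}
    current = None
    for line in pdb_text.splitlines():
        if not line.startswith("REMARK 350"):
            continue
        body = line[10:].strip()
        if body.startswith("BIOMOLECULE:"):
            # Start of a new assembly definition.
            current = int(body.split(":", 1)[1])
            biomols[current] = {"chains": [], "operators": {}}
        elif current is None:
            continue
        elif "CHAINS:" in body:
            # Matches both "APPLY THE FOLLOWING TO CHAINS:" and "AND CHAINS:".
            chain_text = body.split("CHAINS:", 1)[1]
            biomols[current]["chains"] += [
                c.strip() for c in chain_text.split(",") if c.strip()
            ]
        elif body.startswith("BIOMT"):
            fields = body.split()
            op_id = int(fields[1])
            row = [float(value) for value in fields[2:6]]
            biomols[current]["operators"].setdefault(op_id, []).append(row)
    return biomols
```

A file whose header defines one assembly over chains A and B with a single identity operator would yield one entry with two chains and one three-row operator. Files without REMARK 350 records yield an empty dictionary, which is one situation where a reference-based approach becomes relevant.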
Using the same function, one can also extract BUs from a PDB using an input reference protein. The following example shows how to use OEExtractBioUnits to extract BUs from a given reference:
The design unit (DU) is an object that contains the extracted and prepared parts of a single BU, ready for modeling. The parts include:
A protein can be structurally superimposed onto a reference protein structure using OESpruce TK. Proteins can be superimposed either on atomic coordinates, using the OEStructuralSuperposition class, or on secondary structure elements, using the OESecondaryStructureSuperposition class.
The OEStructuralSuperposition class can superimpose proteins using any of the following four methods:
The OESecondaryStructureSuperposition class can superimpose proteins using the following method:
Every structural superposition method in the OEStructuralSuperposition class reports a score from the sequence alignment that was used to find the matching atoms of the two proteins. This score comes from the output of the OESequenceAlignment class. A larger score indicates a better sequence alignment, and a score below roughly 200 should be treated as a poor alignment.
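To illustrate why a larger alignment score corresponds to a better alignment, here is a minimal global-alignment scorer in plain Python. It is not the OESequenceAlignment implementation: the function name and the match, mismatch, and gap values are assumptions chosen for readability, so the numbers it produces are not on the same scale as the threshold of roughly 200 mentioned above.

```python
def global_alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch style global alignment score.

    Illustrative only: the scoring values are hypothetical and unrelated
    to the scoring scheme used by OESequenceAlignment.
    """
    # Row 0: aligning an empty prefix of seq_a against prefixes of seq_b.
    prev = [j * gap for j in range(len(seq_b) + 1)]
    for i, res_a in enumerate(seq_a, start=1):
        curr = [i * gap]
        for j, res_b in enumerate(seq_b, start=1):
            diagonal = prev[j - 1] + (match if res_a == res_b else mismatch)
            gap_in_b = prev[j] + gap      # res_a aligned to a gap
            gap_in_a = curr[j - 1] + gap  # res_b aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]
```

Identical sequences receive the highest possible score for their length, and each substitution or gap lowers it, which is the behavior a score threshold relies on.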
Every structural superposition method in the OEStructuralSuperposition class also reports an RMSD value for the superposition, which can be loosely associated with the quality of the superposition. The OESecondaryStructureSuperposition class does not report an RMSD value; it uses the Tanimoto score from the underlying shape overlap calculation instead.
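The two quality measures mentioned here can be sketched in a few lines of Python. This is a minimal illustration rather than Spruce TK code: the function names are hypothetical, the RMSD assumes the two coordinate lists are already superimposed and matched atom for atom, and the Tanimoto assumes the conventional shape-overlap definition, which the text above does not spell out.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z).

    Assumes the structures are already superimposed and that entry i of
    coords_a corresponds to entry i of coords_b.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))


def shape_tanimoto(overlap_ab, self_overlap_a, self_overlap_b):
    """Conventional shape Tanimoto: O_AB / (O_AA + O_BB - O_AB).

    Assumed definition for illustration; ranges from 0 (no overlap)
    to 1 (identical shapes).
    """
    return overlap_ab / (self_overlap_a + self_overlap_b - overlap_ab)
```

A lower RMSD and a higher Tanimoto both point toward a closer match, so the two values should not be compared against the same kind of cutoff.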
Reading a PDB file correctly for use in subsequent modeling tasks can be challenging. PDB header information, as well as information about alternate location codes within the file, will be lost unless a specific combination of PDB-centric OEIFlavor values is used. Furthermore, the protein itself must be processed by OEAltLocationFactory to create a molecule with all alternate location atoms retained. With that in mind, we recommend reading PDB files for use in OESpruce TK using the pattern shown in ReadProteinFromPDB below:
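Separately from the ReadProteinFromPDB pattern, the following plain-Python sketch shows where the information at risk actually lives in a PDB file. It does not use OEChem or OESpruce and is not a substitute for the recommended reader. The helper name `scan_pdb` and the short list of header record types it checks are assumptions made for illustration; the column positions follow the fixed-width PDB format, where the alternate location code is column 17 of ATOM and HETATM records.

```python
def scan_pdb(pdb_text):
    """Report header records and alternate-location atoms in PDB text.

    Hypothetical helper for illustration only. Returns a tuple of
    (header_records, alt_locs), where alt_locs maps
    (chain_id, residue_number, atom_name) to the list of altloc codes seen.
    """
    header_types = ("HEADER", "TITLE", "COMPND", "REMARK", "SEQRES")
    header_records = []
    alt_locs = {}
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            alt_code = line[16:17].strip()      # column 17
            if alt_code:
                key = (
                    line[21:22],                # chain ID, column 22
                    int(line[22:26]),           # residue number, columns 23-26
                    line[12:16].strip(),        # atom name, columns 13-16
                )
                alt_locs.setdefault(key, []).append(alt_code)
        elif record in header_types:
            header_records.append(record)
    return header_records, alt_locs
```

Any atom that appears in `alt_locs` has more than one recorded position, and a reader that keeps only one of them silently discards the rest. The same applies to everything collected in `header_records` if the header is skipped on input.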