Tutorial 2: How to prepare a database for faster load speeds

In this tutorial, you will learn how to pre-process a conformer database file for FastROCS TK so that it loads faster with OEShapeDatabase.Open. Load times can be up to 10x faster. The figure below shows the impact for an eMolecules dataset of 14 million conformers.

[Figure: Pre-Processing Performance Impact]
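
To check the speedup on your own data, one option is to time the open call for the original file and for the pre-processed file. The following is a minimal sketch and not part of the FastROCS TK examples: time_call and speedup are hypothetical helpers built only on the standard library, and the time.sleep calls stand in for the real database open so that the block runs without the toolkit.

```python
import time


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, prepped_seconds):
    """How many times faster the pre-processed load is than the baseline load."""
    return baseline_seconds / prepped_seconds


# Stand-ins for opening the original and the pre-processed database.
# In real use, pass a function that opens your file with OEShapeDatabase.Open.
slow = time_call(time.sleep, 0.05)
fast = time_call(time.sleep, 0.005)
print("speedup: %.1fx" % speedup(slow, fast))
```

Repeated opens of the same file may be served from the operating system's file cache, so the first measurement of each file is usually the most representative of a cold start.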

To gain this extra loading performance, you need to use the following functions:

  1. OEPRECompress – this function works on the output molecule stream object allowing the molecules to be stored in a ‘pre-compressed’ format:

    • Writes rotor-offset-compressed molecules in the perfect-rotor-encoding format
    • There is no need to gzip the output file, which means a faster OEMolDatabase.Open.
  2. OEPrepareFastROCSMol – this function works on each OEMol record of the input.oeb:

    • Sets the energy of each conformer to 0.0 to avoid writing it to OEB.
    • Suppresses hydrogens and reorders reference conformers for compression.
    • Pre-calculates color atoms.
    • Pre-calculates self-color and self-shape terms for all conformers.

    Note

    The color terms cached by OEPrepareFastROCSMol are from the OEColorFFType_ImplicitMillsDean color force field. A different color force field can be given as the second argument to override ImplicitMillsDean.
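
    As a hedged illustration of the note above, the sketch below passes a different color force field as the second argument. The wrapper name prepare_with_explicit_color is hypothetical, and the constant OEColorFFType_ExplicitMillsDean is assumed to be the one exposed by OEShape TK; check both against your toolkit version before relying on them.

```python
def prepare_with_explicit_color(mol):
    """Prepare `mol` for FastROCS with a non-default color force field.

    The imports live inside the function so that this sketch can be pasted
    into a module without the OpenEye toolkits being imported at load time.
    """
    from openeye import oefastrocs, oeshape

    # Assumption: OEColorFFType_ExplicitMillsDean is the constant exposed by
    # OEShape TK; as the second argument it overrides ImplicitMillsDean.
    return oefastrocs.OEPrepareFastROCSMol(mol, oeshape.OEColorFFType_ExplicitMillsDean)
```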

In general, calling OEPrepareFastROCSMol and OEPRECompress will result in a smaller OEB file than the default OEB.GZ output from OMEGA.
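
One simple way to confirm the size reduction on your own files is to compare the sizes on disk. This is a minimal standard-library sketch: size_ratio is a hypothetical helper, and the two temporary files with placeholder names only stand in for the OMEGA output and the pre-processed database.

```python
import os
import tempfile


def size_ratio(original_path, prepped_path):
    """Return prepped size divided by original size (below 1.0 means smaller)."""
    return os.path.getsize(prepped_path) / os.path.getsize(original_path)


# Demonstration with two throwaway files standing in for the OMEGA output
# and the pre-processed database.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "omega_output.oeb.gz")  # placeholder name
    prepped = os.path.join(tmp, "prepped_database.oeb")  # placeholder name
    with open(original, "wb") as f:
        f.write(b"\0" * 1000)
    with open(prepped, "wb") as f:
        f.write(b"\0" * 800)
    ratio = size_ratio(original, prepped)
    print("prepped file is %.0f%% of the original size" % (100 * ratio))
```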

Further reduction in file size can be achieved by using an OEMCMolType_HalfFloatCartesian molecule, which stores reference coordinates and torsions as 16-bit floating-point values.
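
To get a feel for what 16-bit storage costs in precision, the sketch below round-trips a few values through IEEE 754 binary16 using the standard struct module. It illustrates half precision in general; that the toolkit's on-disk encoding behaves the same way is an assumption, and the coordinate values are made up.

```python
import struct


def to_half_and_back(value):
    """Round-trip a Python float through IEEE 754 binary16 (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


coords = [0.5, 1.2345, 12.3456, -47.891]  # made-up coordinates in Angstroms
for x in coords:
    y = to_half_and_back(x)
    print("%9.4f -> %9.4f (error %.4f)" % (x, y, abs(x - y)))
```

The rounding error grows with magnitude: binary16 keeps 11 significant bits, so values between 8 and 16 are stored in steps of 1/128, and values between 32 and 64 in steps of 1/32.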

Here is some example code showing how to pre-process a database with OEPrepareFastROCSMol, save it to a pre-compressed format, and reduce the file size by using half precision:

#!/usr/bin/env python
# (C) 2017 OpenEye Scientific Software Inc. All rights reserved.
#
# TERMS FOR USE OF SAMPLE CODE The software below ("Sample Code") is
# provided to current licensees or subscribers of OpenEye products or
# SaaS offerings (each a "Customer").
# Customer is hereby permitted to use, copy, and modify the Sample Code,
# subject to these terms. OpenEye claims no rights to Customer's
# modifications. Modification of Sample Code is at Customer's sole and
# exclusive risk. Sample Code may require Customer to have a then
# current license or subscription to the applicable OpenEye offering.
# THE SAMPLE CODE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED.  OPENEYE DISCLAIMS ALL WARRANTIES, INCLUDING, BUT
# NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
# PARTICULAR PURPOSE AND NONINFRINGEMENT. In no event shall OpenEye be
# liable for any damages or liability in connection with the Sample Code
# or its use.

from __future__ import print_function
import sys, os

oepy = os.path.join(os.path.dirname(__file__), "..", "python")
sys.path.insert(0, os.path.realpath(oepy))

from openeye.oechem import *
from openeye.oefastrocs import * 

def main(argv=[__name__]):
    if len(argv) < 3:
        OEThrow.Usage("%s input.oeb output_prepped_database.oeb" % argv[0])
        return 0

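    # Input mol stream
    ifs = oemolistream()
    if not ifs.open(argv[1]):
        OEThrow.Fatal("Unable to open %s for reading" % argv[1])

    # PRE-compress output mol stream
    ofs = oemolostream()
    OEPRECompress(ofs)
    if not ofs.open(argv[2]):
        OEThrow.Fatal("Unable to open %s for writing" % argv[2])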

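    # Prepare each mol, convert it to half precision & write it to the stream
    for mol in ifs.GetOEMols():
        OEPrepareFastROCSMol(mol)
        halfMol = OEMol(mol, OEMCMolType_HalfFloatCartesian)
        # Write the half-precision copy, not the original mol
        OEWriteMolecule(ofs, halfMol)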

    ofs.close()

if __name__ == '__main__':
    sys.exit(main(sys.argv))

For added convenience, we have created a ShapeDatabasePrep.py example script which can be modified to meet your exact needs:

#!/usr/bin/env python
# (C) 2017 OpenEye Scientific Software Inc. All rights reserved.
#
# TERMS FOR USE OF SAMPLE CODE The software below ("Sample Code") is
# provided to current licensees or subscribers of OpenEye products or
# SaaS offerings (each a "Customer").
# Customer is hereby permitted to use, copy, and modify the Sample Code,
# subject to these terms. OpenEye claims no rights to Customer's
# modifications. Modification of Sample Code is at Customer's sole and
# exclusive risk. Sample Code may require Customer to have a then
# current license or subscription to the applicable OpenEye offering.
# THE SAMPLE CODE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED.  OPENEYE DISCLAIMS ALL WARRANTIES, INCLUDING, BUT
# NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
# PARTICULAR PURPOSE AND NONINFRINGEMENT. In no event shall OpenEye be
# liable for any damages or liability in connection with the Sample Code
# or its use.

# Cache as much as possible on the molecule to improve the performance
# of starting a server from scratch. Also cull to desired number of
# conformers if requested.

import os, sys

oepy = os.path.join(os.path.dirname(__file__), "..", "python")
sys.path.insert(0, os.path.realpath(oepy))

from openeye.oechem import *
from openeye.oeshape import *
from openeye.oefastrocs import *

def TrimConformers(mol, maxConfs):
    for i, conf in enumerate(mol.GetConfs()):
        if i >= maxConfs:
            mol.DeleteConf(conf)

def main(argv=[__name__]):
    if len(argv) != 3 and len(argv) != 4:
        OEThrow.Usage("%s <database.oeb> <prepped_database.oeb> [max_confs]" % argv[0])

    maxConfs = None
    if len(argv) == 4:
        maxConfs = int(argv[3])
        if maxConfs < 1:
            OEThrow.Fatal("Illegal number of conformers requested %d" % maxConfs)

    # input - preserve rotor-offset-compression
    ifs = oemolistream()
    OEPreserveRotCompress(ifs)

    if not ifs.open(argv[1]):
        OEThrow.Fatal("Unable to open %s for reading" % argv[1])

    # output - use PRE-compress for smaller files (no need to .gz the file)
    ofs = oemolostream()
    OEPRECompress(ofs)
    if not ofs.open(argv[2]):
        OEThrow.Fatal("Unable to open %s for writing" % argv[2])

    dots = OEDots(10000, 200, "molecules")
    for mol in ifs.GetOEMols():
        if maxConfs is not None:
            TrimConformers(mol, maxConfs)

        OEPrepareFastROCSMol(mol)
        OEWriteMolecule(ofs, mol)
        dots.Update()
        
    dots.Total()
    ofs.close()

    OEThrow.Info("Indexing %s" % argv[2])
    if not OECreateMolDatabaseIdx(argv[2]):
        OEThrow.Fatal("Failed to index %s" % argv[2])
    
    return 0

if __name__ == '__main__':
    sys.exit(main(sys.argv))