OEShapeDatabase

class OEShapeDatabase

This is the primary class for performing FastROCS TK calculations. It is a very heavy-weight object:

  • consuming many gigabytes of memory
  • managing many host threads
  • managing all GPU interaction

The goal is to abstract away these complexities as much as possible so that code can be written against a single API while the underlying compute engine is improved over time.

Ideally, this class is initialized once per dataset. A fair amount of pre-calculation is done on each molecule and conformer as it is loaded into memory. Some of this cost can be alleviated by pre-calculating and caching, but not all of it, because the balance between caching and recalculation is continually being tuned.

Constructors

OEShapeDatabase(const OEShape::OEColorForceField &cff)
OEShapeDatabase(unsigned int dbtype=OEShapeDatabaseType::Default,
                unsigned int cfftype=OEShape::OEColorFFType::OEDefault)

Create a new OEShapeDatabase for managing conformers and performing FastROCS TK calculations.

Whether the OEShapeDatabase can perform color calculations must be chosen at construction. If “shape only” is chosen by passing OEShapeDatabaseType::Shape, then there can be significant memory and load-time performance improvements. Color atom assignment can be a significant cost during load and increase memory usage by roughly 2x.

A custom OEColorForceField can also be passed to the constructor to completely customize color scoring. Note that when a custom color force field is used, OEShapeDatabaseType::Default is assumed. It is currently not possible to perform “color only” scoring.

AddMol

unsigned int AddMol(const OEChem::OEMCMolBase &mol)

Add a new collection of conformers to this database and return the index used to identify this molecule. This index will start at 0 and monotonically increase by 1 for every multi-conformer molecule added. OEShapeDatabaseScore::GetMolIdx will return this index to be able to map FastROCS scores to molecules added through this mechanism.
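Because the returned indices start at 0 and increase by 1, a table indexed by the returned value stays densely packed. The following is a minimal sketch of that bookkeeping using only the C++ standard library. `RecordTitle` and the title table are hypothetical helpers, and `molIdx` stands in for the value returned by AddMol; nothing here calls the FastROCS TK API.

```cpp
#include <string>
#include <vector>

// Store `title` at position `molIdx`, growing the table if needed.
// molIdx is a stand-in for the index returned by AddMol. This helper is
// hypothetical and is NOT thread-safe: guard it with a lock if AddMol is
// being called from several threads.
inline void RecordTitle(std::vector<std::string> &titles, unsigned int molIdx,
                        const std::string &title)
{
  if (molIdx >= titles.size())
    titles.resize(molIdx + 1);
  titles[molIdx] = title;
}
```

Storing by index rather than appending keeps the table correct even when entries arrive out of order. A score's molecule index can then be used directly as a position in the table.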

Note

Even though this method is not const, it has been made thread-safe so that it can be called from multiple threads. Furthermore, it has been optimized as much as possible to parallelize the pre-calculation this method performs. This makes it very efficient to use multiple threads to load a database file into memory.

GetColorForceFieldType

unsigned int GetColorForceFieldType() const

Return a constant from the OEColorFFType namespace to indicate the color force field used to construct this database. Returns OEColorFFType::Custom if a custom OEColorForceField object was passed to the constructor.

GetColorGridSpacing

float GetColorGridSpacing() const

Return the grid spacing used to calculate color scores. This defaults to 0.5 for good performance. Lower values will yield answers that agree more closely with the Exact analytical calculation, at the expense of performance. Higher values can yield better performance. The default was chosen as a good balance of virtual screening statistical analysis (AUCs) versus raw throughput performance.

GetDatabaseType

unsigned int GetDatabaseType() const

Returns a constant from the OEShapeDatabaseType namespace indicating what type of calculations this database can perform.

GetMaxNumDevices

unsigned int GetMaxNumDevices() const

Returns the maximum number of GPU devices this database will use for calculations. The only way to restrict the GPUs seen by the database is to use the CUDA_VISIBLE_DEVICES environment variable before starting the process.

GetMaxOptIterations

unsigned int GetMaxOptIterations() const

Return the number of optimizer iterations the FastROCS algorithm uses when optimizing the alignment of the database conformer and the query conformer. This currently defaults to 10, a value chosen to produce good virtual screening statistics (AUCs) without spending performance on excessive iterations.

GetNumDevices

unsigned int GetNumDevices() const

Returns the number of GPU devices this database will use for calculation. This will default to all the GPUs that are visible, i.e., the value returned from OEShapeDatabase::GetMaxNumDevices.

GetNumOpenThreads

unsigned int GetNumOpenThreads() const

Return how many CPU threads will be used to read an OEMolDatabase from disk into memory using the OEShapeDatabase::Open method. The default value of 0 means to use as many CPUs as OEGetNumProcessors finds on the system.

GetScores

OESystem::OEIterBase<OEShapeDatabaseScore> *
  GetScores(const OEChem::OEMolBase &query,
            const OEShapeDatabaseOptions &options=OEShapeDatabaseOptions()) const
OESystem::OEIterBase<OEShapeDatabaseScore> *
  GetScores(const OEShape::OEShapeQueryPublic &shapeQry,
            const OEShapeDatabaseOptions &options=OEShapeDatabaseOptions()) const

Return ALL scores of the query against the entire database, subject to the options specified in the OEShapeDatabaseOptions passed to this method. This is useful for larger-scale NxN clustering calculations where all pairs of scores need to be processed.

The query can be either a single-conformer OEMolBase or an OEShapeQueryPublic object read from a .sq file.

Warning

The order of the OEShapeDatabaseScore objects returned by the iterator is non-deterministic and will certainly change from one execution to the next due to the multi-threaded nature of this method. However, the values calculated in each OEShapeDatabaseScore will be the same. Therefore, users should rely on the return values of OEShapeDatabaseScore::GetMolIdx and OEShapeDatabaseScore::GetConfIdx for further processing, not on the position within the iterator.
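One way to make downstream processing reproducible is to copy the indices and scores out of the iterator and order them by molecule and conformer index. The sketch below assumes the values have already been copied into a hypothetical `ScoreRecord` struct. Both `ScoreRecord` and `SortByMolAndConf` are illustrative names, not FastROCS TK types or functions.

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the values copied out of one
// OEShapeDatabaseScore (for example via GetMolIdx and GetConfIdx).
struct ScoreRecord
{
  unsigned int molIdx;
  unsigned int confIdx;
  float score;
};

// Put records into a reproducible order keyed on (molIdx, confIdx), so later
// processing does not depend on the order in which the iterator produced them.
inline void SortByMolAndConf(std::vector<ScoreRecord> &records)
{
  std::sort(records.begin(), records.end(),
            [](const ScoreRecord &a, const ScoreRecord &b) {
              return std::tie(a.molIdx, a.confIdx) <
                     std::tie(b.molIdx, b.confIdx);
            });
}
```

After sorting, two runs over the same database and query should yield records in the same sequence, which simplifies comparing outputs between executions.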

See also

The OEShapeDatabaseOptions class controls many of the parameters to this method, for example, how many conformers per molecule to return.

GetShapeGridSpacing

float GetShapeGridSpacing() const

Return the grid spacing used to calculate shape scores and drive the alignment. This defaults to 0.5 for good performance. Lower values will yield answers that agree more closely with the Exact analytical calculation, at the expense of performance. Higher values can yield better performance. The default was chosen as a good balance of virtual screening statistical analysis (AUCs) versus raw throughput performance.

GetSortedScores

OESystem::OEIterBase<OEShapeDatabaseScore> *
  GetSortedScores(const OEChem::OEMolBase &query, unsigned int limit=0) const
OESystem::OEIterBase<OEShapeDatabaseScore> *
  GetSortedScores(const OEChem::OEMolBase &query,
                  const OEShapeDatabaseOptions &options) const
OESystem::OEIterBase<OEShapeDatabaseScore> *
  GetSortedScores(const OEShape::OEShapeQueryPublic &shapeQry,
                  const OEShapeDatabaseOptions &options=OEShapeDatabaseOptions()) const

Return a hitlist of the query against the database based upon the scoring options of the database and the OEShapeDatabaseOptions passed to this method. The OEShapeDatabaseScore objects are returned in descending order, i.e., the better ‘hits’ come first in the iterator.

The query can be either a single-conformer OEMolBase or an OEShapeQueryPublic object read from a .sq file.

Note

This method is typically used to select only a subset of the results based upon limit or OEShapeDatabaseOptions::SetLimit. It is optimized for rapidly constructing relatively small hitlists. If the entire set of scores for the database is desired, it can be faster to use OEShapeDatabase::GetScores to avoid the sorting operation.
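As a general illustration of why a short hitlist can be cheaper than ordering every score, the sketch below selects the best few values with `std::partial_sort`. It does not describe how FastROCS TK builds hitlists internally. `TopHits` and the plain float scores are hypothetical stand-ins, and the sketch does not model the meaning of `limit=0` in GetSortedScores.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Return the `limit` highest scores in descending order. std::partial_sort
// only orders the first `limit` elements, which is cheaper than a full sort
// when `limit` is much smaller than the number of scores.
inline std::vector<float> TopHits(std::vector<float> scores, std::size_t limit)
{
  limit = std::min(limit, scores.size());
  std::partial_sort(scores.begin(),
                    scores.begin() + static_cast<std::ptrdiff_t>(limit),
                    scores.end(), std::greater<float>());
  scores.resize(limit);
  return scores;
}
```

When every score is needed anyway, the ordering step buys nothing, which matches the advice above to prefer GetScores in that case.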

See also

The OEShapeDatabaseOptions class controls many of the parameters to this method, for example, how many conformers per molecule to return.

NumConfs

unsigned int NumConfs() const

Return the number of conformers the database is currently managing. Useful for getting a ballpark idea of the underlying memory usage.

Note

This value has no relation to the indexes returned by OEShapeDatabase::AddMol, except that this value will always be larger than the last index returned.

Open

bool Open(const OEChem::OEMolDatabase &moldb)
bool Open(const OEChem::OEMolDatabase &moldb,
          OESystem::OEThreadedDots &dots)

Initialize the database using an OEMolDatabase. This is the most efficient way to initialize an OEShapeDatabase, as this method will, by default, launch a thread for each available CPU core and parallelize all the parsing and pre-calculation. The progress of the loading operation can be tracked through a thread-safe OEThreadedDots object. This method will block and return true when the database has been successfully loaded into memory.

Note

The indices returned by OEShapeDatabaseScore::GetMolIdx are guaranteed to map directly into the OEMolDatabase index space. OEMolDatabase::GetMolecule can fail and return no molecule for good reason, e.g., an empty molecule from an SD file. Therefore, the index space used by OEShapeDatabase can have “holes” when initialized from an OEMolDatabase.
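Since the index space can contain holes, any side table keyed on the molecule index needs to tolerate missing entries. The following is a minimal sketch using a `std::map`. `TitleByIndex` and `LookupTitle` are hypothetical names for illustration, and the titles are assumed to have been collected separately from the OEMolDatabase.

```cpp
#include <map>
#include <string>

// Hypothetical lookup table from database index to a molecule title. A map
// keyed by index tolerates holes (indices with no molecule), whereas a
// densely packed list would shift every entry after the first hole.
using TitleByIndex = std::map<unsigned int, std::string>;

// Return the title stored for molIdx, or an empty string if that index is a
// hole in the index space.
inline std::string LookupTitle(const TitleByIndex &titles, unsigned int molIdx)
{
  const auto it = titles.find(molIdx);
  return it == titles.end() ? std::string() : it->second;
}
```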

PrintMemoryUsage

void PrintMemoryUsage(OEPlatform::oeostream &os) const
void PrintMemoryUsage() const

Print out memory usage statistics for this object. This will break down how much memory is being used to pre-cache various parts of the calculation. By default, the output will be written to OEPlatform::oeerr. The output stream can also be passed as an argument. The diagnostic output is meant for human consumption and may change format in future releases.

SetColorGridSpacing

bool SetColorGridSpacing(float gridSpacing)

Set the grid spacing to use for static color scoring.

SetMaxOptIterations

void SetMaxOptIterations(unsigned int maxIter)

Set the number of optimizer iterations to use when optimizing the alignment by shape.

SetNumDevices

void SetNumDevices(unsigned int ndevices)

Set the number of GPU devices this calculation should use. This number should be between 1 and the value returned by OEShapeDatabase::GetMaxNumDevices, inclusive. This method is really only useful for efficiently collecting FastROCS scalability data across multiple GPUs. To restrict OEShapeDatabase to a subset of the GPUs on the machine, set the CUDA_VISIBLE_DEVICES environment variable before the process is launched instead.

SetNumOpenThreads

void SetNumOpenThreads(unsigned int numThrds)

Set how many CPU threads should be created to read an OEMolDatabase from disk into memory using the OEShapeDatabase::Open method. A value of 0 specifies that all CPUs in the system will be used. The value passed to this method should not exceed the value returned by OEGetNumProcessors.
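A caller that takes the thread count from user input may want to normalize it before passing it on. The sketch below is one possible approach: it treats 0 as "all processors" and caps larger requests. Capping is an assumption made for this example, since the text above only says the value should not exceed the processor count. `ResolveOpenThreads` is a hypothetical helper, and `numProcessors` stands in for the value returned by OEGetNumProcessors.

```cpp
#include <algorithm>
#include <cassert>

// Resolve a requested thread count: 0 means "use every processor", and
// larger requests are capped at the processor count (an assumption of this
// sketch). numProcessors is a stand-in for the value of OEGetNumProcessors.
inline unsigned int ResolveOpenThreads(unsigned int requested,
                                       unsigned int numProcessors)
{
  assert(numProcessors > 0);
  if (requested == 0)
    return numProcessors;
  return std::min(requested, numProcessors);
}
```

The resolved value could then be handed to SetNumOpenThreads before calling Open.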

SetShapeGridSpacing

bool SetShapeGridSpacing(float gridSpacing)

Set the grid spacing to use for shape scoring and alignment.