OEShapeDatabase
class OEShapeDatabase
This is the primary class for performing FastROCS TK calculations. It is a very heavy-weight object:
consuming many gigabytes of memory
managing many host threads
managing all GPU interaction
The goal is to hide as much of this complexity as possible behind a single API, so that the underlying compute engine can be improved over time.
Ideally, this class is initialized once per dataset. There is a fair amount of pre-calculation done on each molecule and conformer while being loaded into memory. Some of this can be alleviated by pre-calculating and caching, but not all of it, as there is a balance between caching and recalculation that is always being tuned.
Constructors
OEShapeDatabase(const OEShape::OEColorForceField &cff)
OEShapeDatabase(unsigned int dbtype=OEShapeDatabaseType::Default,
unsigned int cfftype=OEShape::OEColorFFType::OEDefault)
Create a new OEShapeDatabase for managing conformers and performing FastROCS TK calculations.
Whether the OEShapeDatabase can perform color calculations must be
chosen at construction. If “shape only” is chosen by passing
OEShapeDatabaseType.Shape, there can be significant memory and
load-time performance improvements. Color atom assignment can be a
significant cost during load and can increase memory usage by roughly
2x.
A custom OEColorForceField can also be passed to the constructor to
allow color scoring to be completely customized. Note that with a
custom color force field, OEShapeDatabaseType.Default is assumed. It
is currently not possible to perform “color only” scoring.
AddMol
unsigned int AddMol(const OEChem::OEMCMolBase &mol,
const unsigned orient = OEFastROCSOrientation::Inertial)
Add a new collection of conformers to this database and return the
index used to identify this molecule. This index starts at 0 and
increases by 1 for every multi-conformer molecule added.
OEShapeDatabaseScore.GetMolIdx returns this index, so FastROCS scores
can be mapped back to molecules added through this mechanism.
If using the alternative start method OEFastROCSOrientation.AsIs, this
constant must be passed as the final argument to AddMol so that the
database is loaded without pre-processing conformer coordinates. The
final argument is ignored for all other use cases and can be left at
its default, because all molecules are pre-processed identically for
every other starting point.
Warning
Databases must be re-added if using OEFastROCSOrientation.AsIs
Note
Even though this method is not const, it has been made thread-safe so
that it can be called from multiple threads. Furthermore, it has been
optimized as much as possible to parallelize the pre-calculation this
method performs. This makes it very efficient to use multiple threads
to load a database file into memory.
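The indexing contract and the thread-safety note can be sketched in plain Python. The block below does not call FastROCS TK: StandInShapeDatabase is a hypothetical stand-in whose AddMol only hands out consecutive indices under a lock, and the molecule titles are invented. It is meant to illustrate one bookkeeping pattern: when loading from several threads, keep the index that each call returns rather than assuming it matches submission order.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StandInShapeDatabase:
    """Hypothetical stand-in for OEShapeDatabase; only mimics AddMol indexing."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_idx = 0

    def AddMol(self, mol):
        # Hand out 0, 1, 2, ... with one index per added molecule.
        with self._lock:
            idx = self._next_idx
            self._next_idx += 1
        return idx


def load_titles(db, titles, num_threads=4):
    """Add every title from a thread pool and map returned index -> title."""
    def add(title):
        return db.AddMol(title), title

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return dict(pool.map(add, titles))


titles = ["mol-%d" % i for i in range(20)]  # invented example data
idx_to_title = load_titles(StandInShapeDatabase(), titles)
```

Because the mapping is built from the returned indices, it stays correct even if the threads finish in a different order on each run.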
GetColorForceFieldType
unsigned int GetColorForceFieldType() const
Return a constant from the OEColorFFType namespace indicating the
color force field used to construct this database. Returns
OEColorFFType.Custom if a custom OEColorForceField object was passed
to the constructor.
GetColorGridSpacing
float GetColorGridSpacing() const
Return the grid spacing used to calculate color scores. This defaults
to 0.5 for good performance. Lower values yield answers that agree
more closely with the Exact analytical calculation, at the expense of
performance. Higher values can yield better performance. The default
was chosen as a good balance of virtual screening statistical analysis
(AUCs) versus raw throughput performance.
GetDatabaseType
unsigned int GetDatabaseType() const
Returns a constant from the OEShapeDatabaseType namespace indicating
what type of calculations this database can perform.
GetMaxNumDevices
unsigned int GetMaxNumDevices() const
Returns the maximum number of GPU devices this database will use for
calculations. The only way to restrict the GPUs seen by the database
is to set the CUDA_VISIBLE_DEVICES environment variable before
starting the process.
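Since the variable has to be in place before the process starts, one option is to set it in the environment handed to a child process. The following is a minimal sketch using only the Python standard library; the helper name env_with_visible_gpus and the device IDs 0 and 2 are illustrative assumptions, and the child here merely echoes the variable instead of building an OEShapeDatabase.

```python
import os
import subprocess
import sys


def env_with_visible_gpus(device_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in device_ids)
    return env


# Launch a child process that is restricted to GPUs 0 and 2. A real child
# would be the script that constructs the database; this one only prints
# the variable so the effect can be observed.
child_env = env_with_visible_gpus([0, 2])
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=child_env, capture_output=True, text=True, check=True,
)
seen_by_child = child.stdout.strip()
```

The same effect can be had from a shell by exporting the variable before launching the program.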
GetMaxOptIterations
unsigned int GetMaxOptIterations() const
Return the number of optimizer iterations the FastROCS algorithm
should use when optimizing the alignment of the database conformer and
the query conformer. This currently defaults to 10, based upon
analysis of producing good virtual screening statistics (AUCs) without
excessive iterations that would cost performance.
GetNumDevices
unsigned int GetNumDevices() const
Returns the number of GPU devices this database will use for
calculation. This defaults to all the GPUs that are visible, i.e., the
value returned from OEShapeDatabase.GetMaxNumDevices.
GetNumOpenThreads
unsigned int GetNumOpenThreads() const
Return how many CPU threads will be used to read an OEMolDatabase from
disk into memory using the OEShapeDatabase.Open method. The default, a
value of 0, is to use as many CPUs as can be found on the system with
OEGetNumProcessors.
GetScores
OESystem::OEIterBase<OEShapeDatabaseScore> *
GetScores(const OEChem::OEMolBase &query,
const OEShapeDatabaseOptions &options=OEShapeDatabaseOptions()) const
OESystem::OEIterBase<OEShapeDatabaseScore> *
GetScores(const OEShape::OEShapeQueryPublic &shapeQry,
const OEShapeDatabaseOptions &options=OEShapeDatabaseOptions()) const
OESystem::OEIterBase<OEShapeDatabaseScore> *
GetScores(const OEShape::OEShapeQuery &shapeQry,
const OEShapeDatabaseOptions &options=OEShapeDatabaseOptions()) const
Return ALL scores of the query against the entire database
subject to the options specified in the
OEShapeDatabaseOptions passed to this
method. This is useful for performing larger scale NxN
clustering type of calculations where all pairs of scores need
to be processed.
The query can be a single-conformer OEMolBase, an OEShapeQuery, or an
OEShapeQueryPublic object read from a .sq file. OEShapeQuery or
OEShapeQueryPublic objects must contain a molecule. To determine
whether your OEShapeQuery object is FastROCS-friendly, see
OEIsFastROCSShapeQuery in the Shape TK.
Warning
The order of the OEShapeDatabaseScore objects returned by the iterator
is non-deterministic and will certainly change for each execution due
to the multi-threaded nature of this method. However, the values
calculated in each OEShapeDatabaseScore will be the same. Therefore,
users should rely on the return values of
OEShapeDatabaseScore.GetMolIdx and OEShapeDatabaseScore.GetConfIdx to
do further processing, not the location within the iterator.
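One way to follow this advice is to key results by the pair of indices instead of by position. The sketch below uses a hypothetical StandInScore class, not the real OEShapeDatabaseScore: it exposes only the GetMolIdx and GetConfIdx accessors named above, and the tanimoto field and its values are invented for illustration.

```python
from collections import namedtuple


class StandInScore(namedtuple("StandInScore", "mol_idx conf_idx tanimoto")):
    """Hypothetical stand-in exposing the two accessors named above."""

    def GetMolIdx(self):
        return self.mol_idx

    def GetConfIdx(self):
        return self.conf_idx


def index_scores(scores):
    """Map (mol index, conf index) -> score, independent of iterator order."""
    return {(s.GetMolIdx(), s.GetConfIdx()): s for s in scores}


# Two runs that return the same scores in different orders (invented values).
run_a = [
    StandInScore(0, 0, 0.81),
    StandInScore(1, 2, 0.47),
    StandInScore(3, 1, 0.66),
]
run_b = list(reversed(run_a))

by_key_a = index_scores(run_a)
by_key_b = index_scores(run_b)
```

Both runs produce the same dictionary, so downstream code that looks results up by key is unaffected by the iteration order.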
See also
The OEShapeDatabaseOptions class is used to control many of the parameters to this method. For example, how many conformers per molecule to return.
GetShapeGridSpacing
float GetShapeGridSpacing() const
Return the grid spacing used to calculate shape scores and drive the
alignment. This defaults to 0.5 for good performance. Lower values
yield answers that agree more closely with the Exact analytical
calculation, at the expense of performance. Higher values can yield
better performance. The default was chosen as a good balance of
virtual screening statistical analysis (AUCs) versus raw throughput
performance.
GetSortedScores
OESystem::OEIterBase<OEShapeDatabaseScore> *
GetSortedScores(const OEChem::OEMolBase &query, unsigned int limit=0) const
OESystem::OEIterBase<OEShapeDatabaseScore> *
GetSortedScores(const OEChem::OEMolBase &query,
const OEShapeDatabaseOptions &options) const
OESystem::OEIterBase<OEShapeDatabaseScore> *
GetSortedScores(const OEShape::OEShapeQueryPublic &shapeQry,
const OEShapeDatabaseOptions &options=OEShapeDatabaseOptions()) const
OESystem::OEIterBase<OEShapeDatabaseScore> *
GetSortedScores(const OEShape::OEShapeQuery &shapeQry,
const OEShapeDatabaseOptions &options=OEShapeDatabaseOptions()) const
Return a hitlist of the query against the database based upon the
scoring options of the database and the OEShapeDatabaseOptions passed
to this method. The OEShapeDatabaseScore objects are returned in
descending order, i.e., the better ‘hits’ come first in the iterator.
The query can be a single-conformer OEMolBase, an OEShapeQuery, or an
OEShapeQueryPublic object read from a .sq file. OEShapeQuery or
OEShapeQueryPublic objects must contain a molecule. To determine
whether your OEShapeQuery object is FastROCS-friendly, see
OEIsFastROCSShapeQuery in the Shape TK.
Note
This method is typically used to select only a subset of the results
based upon limit or OEShapeDatabaseOptions.SetLimit. It is optimized
for rapidly constructing relatively small hitlists. If the entire set
of scores for the database is desired, it can be faster to use
OEShapeDatabase.GetScores and avoid the sorting operation.
See also
The OEShapeDatabaseOptions class is used to control many of the parameters to this method. For example, how many conformers per molecule to return.
NumConfs
unsigned int NumConfs() const
Return the number of conformers the database is currently managing. Useful for getting a ballpark idea of the underlying memory usage.
Note
This value has no relation to the indices returned by
OEShapeDatabase.AddMol, except that this value will always be larger
than the last index returned.
Open
bool Open(const OEChem::OEMolDatabase &moldb,
const unsigned int orient=OEFastROCSOrientation::Default)
bool Open(const OEChem::OEMolDatabase &moldb,
OESystem::OEThreadedDots &dots,
const unsigned int orient=OEFastROCSOrientation::Default)
Initialize the database using an OEMolDatabase. This is the most efficient way to initialize an OEShapeDatabase, as this method launches a thread for each available CPU core and parallelizes all the parsing and pre-calculation. The progress of the loading operation can be tracked through a thread-safe OEThreadedDots object.
If using the alternative start method OEFastROCSOrientation.AsIs, this
constant must be passed as the final argument to Open so that the
database is loaded without adjusting conformer coordinates. The final
argument can be omitted for all other use cases.
Warning
Databases must be re-opened if using OEFastROCSOrientation.AsIs
This method will block and return true when the database has been
successfully loaded into memory.
Note
The indices returned by OEShapeDatabaseScore.GetMolIdx are guaranteed
to map directly into the OEMolDatabase index space.
OEMolDatabase.GetMolecule can fail and return no molecule for good
reason, e.g., an empty molecule from an SD file. Therefore, the index
space used by OEShapeDatabase can have “holes” when initialized from
an OEMolDatabase.
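A practical consequence of these holes is that per-molecule side data is safer in a mapping keyed by index than in a densely packed list. The sketch below is plain Python with invented data: the records list is a hypothetical stand-in for a molecule file, where the position plays the role of the OEMolDatabase index and None marks an entry that yields no molecule.

```python
def build_title_lookup(records):
    """Map database index -> title, skipping records that failed to load.

    `records` is a hypothetical stand-in for a molecule file: position is
    the database index and None marks an entry that yields no molecule.
    """
    return {idx: title for idx, title in enumerate(records) if title is not None}


# Invented data; index 2 is a hole.
records = ["aspirin", "caffeine", None, "ibuprofen"]
titles_by_idx = build_title_lookup(records)

# Hit indices as they might come back from GetMolIdx(). An index for a
# hole is not expected to appear, since no molecule was loaded there.
hit_indices = [3, 0]
hit_titles = [titles_by_idx[i] for i in hit_indices]
```

Had the titles been stored in a list with the failed record dropped, index 3 would have been out of range and every index after the hole would have pointed at the wrong entry.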
PrintMemoryUsage
void PrintMemoryUsage(OEPlatform::oeostream &os) const
void PrintMemoryUsage() const
Print out memory usage statistics for this object. This breaks down
how much memory is being used to pre-cache various parts of the
calculation. By default, the output is written to OEPlatform.oeerr.
The output stream can also be passed as an argument. The diagnostic
output is meant for human consumption and may change format in future
releases.
SetColorGridSpacing
bool SetColorGridSpacing(float gridSpacing)
Set the grid spacing to use for static color scoring.
SetMaxOptIterations
void SetMaxOptIterations(unsigned int maxIter)
Set the number of optimizer iterations to use when optimizing the alignment by shape.
SetNumDevices
void SetNumDevices(unsigned int ndevices)
Set the number of GPU devices this calculation should use. This number
should be between 1 and OEShapeDatabase.GetMaxNumDevices, inclusive.
This method is mainly useful for efficiently collecting FastROCS
scalability data across multiple GPUs. To restrict OEShapeDatabase to
a subset of the GPUs on the machine, set the CUDA_VISIBLE_DEVICES
environment variable before the process is launched instead.
SetNumOpenThreads
void SetNumOpenThreads(unsigned int numThrds)
Set how many CPU threads should be created to read an OEMolDatabase
from disk into memory using the OEShapeDatabase.Open method. A value
of 0 specifies that all CPUs in the system will be used. The value
passed to this method should not exceed the value returned by
OEGetNumProcessors.
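A caller that takes the thread count from user input may want to validate it before passing it on. The helper below is a hypothetical sketch, not part of FastROCS TK: it treats 0 as "all processors" and clamps larger requests to the processor count, which is one possible way to honor the limit described above. os.cpu_count() is used here as a stand-in for OEGetNumProcessors.

```python
import os


def resolve_open_threads(requested, num_processors=None):
    """Translate a requested thread count into the number actually used.

    0 means "use every processor"; larger requests are clamped so they
    never exceed the processor count. os.cpu_count() stands in for
    OEGetNumProcessors in this sketch.
    """
    if num_processors is None:
        num_processors = os.cpu_count() or 1
    if requested == 0:
        return num_processors
    return min(requested, num_processors)


# Example with an assumed 8-processor machine.
threads_for_default = resolve_open_threads(0, num_processors=8)
threads_for_oversized = resolve_open_threads(16, num_processors=8)
```

Clamping silently is a design choice; rejecting an oversized request with an error would be an equally reasonable alternative.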
SetShapeGridSpacing
bool SetShapeGridSpacing(float gridSpacing)
Set the grid spacing to use for shape scoring and alignment.