OEFastFPDatabase

class OEFastFPDatabase

The OEFastFPDatabase class is designed to perform rapid CUDA-accelerated, in-memory or memory-mapped fingerprint searches using the popcount method. Each OEFastFPDatabase object is associated with a fingerprint type (OEFPTypeBase) that is set when the database is initialized from a pre-generated binary fingerprint file.
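The popcount method mentioned above can be illustrated in plain Python. This is a minimal sketch and not the GraphSim TK implementation: it assumes fingerprints are held as Python integers used as bit vectors, and the helper names popcount and tanimoto are hypothetical.

```python
def popcount(bits: int) -> int:
    """Count the bits that are set in a non-negative integer."""
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity computed from three population counts."""
    common = popcount(a & b)
    total = popcount(a) + popcount(b) - common
    return common / total if total else 0.0


fp_a = 0b10110110  # 5 bits set
fp_b = 0b10010011  # 4 bits set, 3 of them shared with fp_a
print(tanimoto(fp_a, fp_b))
```

The similarity only needs bitwise AND operations and bit counts, which is why this approach maps well onto wide CPU and GPU instructions.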

For CUDA-accelerated fingerprint searching please see the prerequisites for OpenEye’s GPU-accelerated software.

Note

GraphSim TK currently supports the popcount search method only for fingerprints whose size is a multiple of 256. This means that the OEFastFPDatabase class currently does not support:

For the fingerprint types listed above, the original OEFPDatabase class can be utilized.

Note

OEFastFPDatabase gives the same results as OEFPDatabase up to floating-point precision. OEFPDatabase calculates similarity scores in single precision (float), while OEFastFPDatabase uses double precision, so small differences in similarity scores can be observed.
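The size of such precision-related differences can be gauged with a short, library-independent sketch. It assumes IEEE 754 single and double precision and uses a hypothetical helper, to_single, that rounds a Python float (a double) to the nearest single-precision value.

```python
import struct


def to_single(value: float) -> float:
    """Round a double-precision value to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


score_double = 1 / 3  # e.g. 1 common bit out of 3 bits in the union
score_single = to_single(score_double)
difference = abs(score_double - score_single)
print(f"double: {score_double!r}")
print(f"single: {score_single!r}")
print(f"difference: {difference:.2e}")
```

Differences of this magnitude matter mainly when scores are compared for exact equality or lie very close to a cutoff.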

Figure: Schematic representation of the fast fingerprint search process

Constructors

OEFastFPDatabase(const std::string &dbfile,
                 unsigned int memtype=OEFastFPDatabaseMemoryType::Default)

Constructs an OEFastFPDatabase object.

dbfile

The name of the file which contains the fingerprint data. The file has to be generated with the OECreateFastFPDatabaseFile function.

memtype

Defines whether the fingerprints are pre-loaded into GPU-memory, CPU-memory, or memory-mapped during the search process. This value has to be from the OEFastFPDatabaseMemoryType namespace.

Note

If the OEFastFPDatabase object cannot be initialized with the OEFastFPDatabaseMemoryType_CUDA option, the following warning message is issued:

Warning: OEFastFPDatabase::OEFastFPDatabase() : no CUDA-enabled
device available falling back to memory-mapped type!

As a rule of thumb, 1 million fingerprints require 0.6 GB of GPU memory. GPU memory can be queried using the nvidia-smi command from a terminal.
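The rule of thumb above translates into a simple linear estimate. The sketch below is only back-of-the-envelope arithmetic; the function name estimate_gpu_memory_gb is hypothetical, and actual memory use depends on the fingerprint type and size.

```python
GB_PER_MILLION_FINGERPRINTS = 0.6  # rule of thumb quoted in the text


def estimate_gpu_memory_gb(num_fingerprints: int) -> float:
    """Linear estimate of the GPU memory needed to hold the fingerprints."""
    return num_fingerprints / 1_000_000 * GB_PER_MILLION_FINGERPRINTS


for n in (1_000_000, 10_000_000, 50_000_000):
    print(f"{n:>11,d} fingerprints -> ~{estimate_gpu_memory_gb(n):.1f} GB")
```

Comparing the estimate with the free memory reported by nvidia-smi indicates whether a database is likely to fit on the device.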

GetAllScores

OESystem::OEIterBase<float> *GetAllScores(const OEFPDatabaseOptions &opts) const

Performs \(N \times N\) similarity calculations between all pairs of fingerprints stored in the OEFastFPDatabase object. It returns an iterator over the calculated similarity scores. The scores are not sorted, but returned as a flattened square matrix.

For all similarity measures other than Tversky, the returned ‘matrix’ will be symmetrical.
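The flattened matrix and the symmetry remark can be sketched in plain Python. The example assumes a row-major layout, so that the score of pair (i, j) sits at position i * N + j; the source only states that the matrix is flattened, so this layout is an assumption. The helpers popcount, tversky, and score_at are hypothetical, and fingerprints are modeled as Python integers.

```python
def popcount(bits: int) -> int:
    return bin(bits).count("1")


def tversky(a: int, b: int, alpha: float, beta: float) -> float:
    """Tversky similarity; alpha = beta = 1 gives the Tanimoto similarity."""
    common = popcount(a & b)
    only_a = popcount(a & ~b)
    only_b = popcount(b & ~a)
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 0.0


def score_at(flat, n, i, j):
    """Look up pair (i, j) in a row-major flattened n x n matrix."""
    return flat[i * n + j]


fps = [0b1111, 0b0011, 0b0101]
n = len(fps)
tanimoto_flat = [tversky(a, b, 1, 1) for a in fps for b in fps]
tversky_flat = [tversky(a, b, 1, 0) for a in fps for b in fps]

print(score_at(tanimoto_flat, n, 0, 1), score_at(tanimoto_flat, n, 1, 0))
print(score_at(tversky_flat, n, 0, 1), score_at(tversky_flat, n, 1, 0))
```

With alpha = 1 and beta = 0, the score of pair (0, 1) differs from that of pair (1, 0), which is the asymmetry the text refers to.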

opts

The OEFPDatabaseOptions object controls all the parameters that determine the search (i.e. similarity measure parameters). Cutoff and order parameters are ignored as the results are not filtered or sorted.

Note

This operation scales with \(N^2\) in memory and can easily overwhelm the system memory for larger databases. For querying large databases, the OEFastFPDatabase.GetRawScores method, which returns a single row of the similarity matrix, is recommended instead.
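A rough calculation shows how quickly the full matrix grows compared with a single row. The sketch assumes 4 bytes per score for the full matrix (a single-precision float) and 8 bytes per score for a single row (a double); both sizes and the function names are assumptions made for illustration.

```python
BYTES_PER_FLOAT = 4   # assumed size of one score in the full matrix
BYTES_PER_DOUBLE = 8  # assumed size of one score in a single row


def full_matrix_bytes(n: int) -> int:
    """Memory needed to hold all n x n scores at once."""
    return n * n * BYTES_PER_FLOAT


def single_row_bytes(n: int) -> int:
    """Memory needed to hold one row of n scores."""
    return n * BYTES_PER_DOUBLE


for n in (10_000, 100_000, 1_000_000):
    matrix_gb = full_matrix_bytes(n) / 1e9
    row_mb = single_row_bytes(n) / 1e6
    print(f"N = {n:>9,d}: full matrix ~{matrix_gb:,.1f} GB, one row ~{row_mb:.2f} MB")
```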

GetFingerPrint

bool GetFingerPrint(OEFingerPrint& fp, size_t idx) const

Stores the \(idx^{th}\) fingerprint of the database in fp.

Warning

This function returns false if the fingerprint index is not identical to the corresponding molecule index. This can occur if the fingerprint binary file was generated by a multi-threaded process. If direct access to fingerprints through the OEFastFPDatabase.GetFingerPrint method is required, the fingerprint file should be generated in single-threaded mode. This can be done by calling SetNumProcessors(1) on the options class used for creating the binary file.

GetFPTypeBase

const OEFPTypeBase *GetFPTypeBase() const

Returns the fingerprint type of the OEFastFPDatabase object. An OEFastFPDatabase object can only store fingerprints with identical types.

GetHistogram

OEFPHistogram *GetHistogram(const OEFPDatabaseOptions &opts,
                            const size_t nrbins=200u) const

Performs similarity calculations between all pairs of fingerprints stored in the OEFastFPDatabase object. It returns the histogram over the calculated similarity scores in an OEFPHistogram object.

opts

The OEFPDatabaseOptions object contains the settings available to control the search (i.e. similarity measure, \(\alpha\) and \(\beta\) parameters for Tversky similarity). Cutoff and order parameters are ignored as the results are not filtered or sorted.

nrbins

Number of bins in the returned OEFPHistogram object.

Note

For all similarity measures other than Tversky, the histogram only contains the upper-triangular similarity scores (excluding the diagonal). In the case of the asymmetric Tversky similarity measure, the histogram of the whole \(N \times N\) matrix is returned.
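The upper-triangular counting for symmetric measures can be sketched as follows. This is an illustration in plain Python, not the library's implementation: it assumes equal-width bins over the range 0 to 1, with a score of exactly 1.0 placed in the last bin, and the helper names are hypothetical.

```python
from itertools import combinations


def popcount(bits: int) -> int:
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    common = popcount(a & b)
    total = popcount(a) + popcount(b) - common
    return common / total if total else 0.0


def upper_triangle_histogram(fps, nrbins):
    """Histogram over pairs (i, j) with i < j, so the diagonal is excluded."""
    counts = [0] * nrbins
    for a, b in combinations(fps, 2):
        score = tanimoto(a, b)
        bin_idx = min(int(score * nrbins), nrbins - 1)
        counts[bin_idx] += 1
    return counts


fps = [0b1111, 0b0011, 0b0101, 0b1000]
counts = upper_triangle_histogram(fps, nrbins=5)
print(counts)
```

For N fingerprints the histogram holds N(N-1)/2 scores, and only the bin counters need to be kept in memory rather than the scores themselves.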

Note

When the OEFastFPDatabase object is initialized with OEFastFPDatabaseMemoryType_CUDA, nrbins is limited to at most 1024.

Hint

This method calculates similarities identically to OEFastFPDatabase.GetAllScores but is not bound by the system memory. It can be used to quickly obtain statistics on larger databases.

GetMemoryType

unsigned int GetMemoryType() const

Returns the memory type of the fingerprint database. The return value is taken from the OEFastFPDatabaseMemoryType namespace.

GetMemoryTypeString

std::string GetMemoryTypeString() const

Returns the string representation of the memory type of the fingerprint database.

GetMoleculeIndex

size_t GetMoleculeIndex(const size_t fpidx) const

Returns the molecule index that corresponds to the fingerprint index.

Note

When building fingerprint databases using OECreateFastFPDatabaseOptions, the molecule index is always the same as the fingerprint index. However, there are private database building APIs that allow a molecule index to be specified for each fingerprint. The OEFastFPDatabase.GetMoleculeIndex method makes it possible to handle these private databases.

GetRawScores

OESystem::OEIterBase<double> *GetRawScores(const size_t fpidx,
                                           const OEFPDatabaseOptions &opts) const
OESystem::OEIterBase<double> *GetRawScores(const OEFingerPrint &fp,
                                           const OEFPDatabaseOptions &opts) const
OESystem::OEIterBase<double> *GetRawScores(const OEChem::OEMolBase &mol,
                                           const OEFPDatabaseOptions &opts) const

Performs similarity calculations between a molecule or a fingerprint and the fingerprints stored in the OEFastFPDatabase object. It returns an iterator over the calculated similarity scores. The scores are not sorted, but returned in the same order as the database. The number of elements in the returned iterator is equal to the number of fingerprints in the database.

fpidx

If the method is called with an integer index, the query fingerprint is taken from the OEFastFPDatabase object with the given index.

mol

If the method is called with an OEMolBase object, a fingerprint is generated from this molecule before looping over the fingerprints of the database and calculating similarities.

fp

If the method is called with an OEFingerPrint object, its type has to match the type of the OEFastFPDatabase.

opts

The OEFPDatabaseOptions object contains the settings available to control the search (i.e. similarity measure parameters). Cutoff and order parameters are ignored as the results are not filtered or sorted.
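What a single row of raw scores looks like can be sketched without the toolkit. The example below models fingerprints as Python integers and uses the Tanimoto measure; the function raw_scores is a hypothetical stand-in and not the library method.

```python
def popcount(bits: int) -> int:
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    common = popcount(a & b)
    total = popcount(a) + popcount(b) - common
    return common / total if total else 0.0


def raw_scores(query: int, database):
    """One score per database fingerprint, in database order, unfiltered."""
    return [tanimoto(query, fp) for fp in database]


database = [0b1111, 0b0011, 0b0101, 0b1000]
scores = raw_scores(database[1], database)  # query taken from the database by index
print(scores)
```

Because the scores keep the database order, position k in the result corresponds to fingerprint k, so no index needs to be stored alongside each score.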

GetScores

OESystem::OEIterBase<OESimScore> *GetScores(const size_t idx,
                                            const OEFPDatabaseOptions &opts) const
OESystem::OEIterBase<OESimScore> *GetScores(const OEFingerPrint &fp,
                                            const OEFPDatabaseOptions &opts) const
OESystem::OEIterBase<OESimScore> *GetScores(const OEChem::OEMolBase &mol,
                                            const OEFPDatabaseOptions &opts) const

Performs similarity calculations between a molecule or fingerprint and the fingerprints stored in the OEFastFPDatabase object. It returns an iterator over the calculated similarity scores (OESimScore). The results are filtered according to the cutoff and order parameters specified in opts, but are not sorted.

idx

If the method is called with an integer index, the query fingerprint is taken from the OEFastFPDatabase object with the given index.

mol

If the method is called with an OEMolBase object, a fingerprint is generated from this molecule before looping over the fingerprints of the database and calculating similarities.

fp

If the method is called with an OEFingerPrint object, its type has to match the type of the OEFastFPDatabase.

opts

The OEFPDatabaseOptions object controls all the parameters that determine the search (i.e. similarity measure parameters, score cutoff and order).
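Filtering without sorting can be sketched in plain Python. The example assumes that a score passes the filter when it is greater than or equal to the cutoff, which is an assumption about the cutoff semantics; filtered_scores is a hypothetical helper, and each hit is modeled as an (index, score) tuple.

```python
def popcount(bits: int) -> int:
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    common = popcount(a & b)
    total = popcount(a) + popcount(b) - common
    return common / total if total else 0.0


def filtered_scores(query: int, database, cutoff: float):
    """Keep (index, score) pairs at or above the cutoff, in database order."""
    hits = []
    for idx, fp in enumerate(database):
        score = tanimoto(query, fp)
        if score >= cutoff:
            hits.append((idx, score))
    return hits


database = [0b1111, 0b0011, 0b0101, 0b1000]
hits = filtered_scores(0b0011, database, cutoff=0.4)
print(hits)
```

Since the hits are no longer contiguous after filtering, each one has to carry its own fingerprint index.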

GetSortedScores

OESystem::OEIterBase<OESimScore> *GetSortedScores(const size_t idx,
                                                  const OEFPDatabaseOptions &opts) const
OESystem::OEIterBase<OESimScore> *GetSortedScores(const OEFingerPrint &fp,
                                                  const OEFPDatabaseOptions &opts) const
OESystem::OEIterBase<OESimScore> *GetSortedScores(const OEChem::OEMolBase &mol,
                                                  const OEFPDatabaseOptions &opts) const

Performs similarity calculations between a molecule or fingerprint and the fingerprints stored in the OEFastFPDatabase object. It returns an iterator over the calculated similarity scores (OESimScore) in sorted order. Each OESimScore holds a similarity score and the index of the corresponding fingerprint of the database.

idx

If the method is called with an integer index, the query fingerprint is taken from the OEFastFPDatabase object with the given index.

mol

If the method is called with an OEMolBase object, a fingerprint is generated from this molecule before looping over the fingerprints of the database and calculating similarities.

fp

If the method is called with an OEFingerPrint object, its type has to match the type of the OEFastFPDatabase.

opts

The OEFPDatabaseOptions object controls all the parameters that determine the search (i.e. similarity measure parameters, score cutoff, order and limit).
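The interplay of cutoff, order, and limit can be sketched as follows. The example assumes descending order, a cutoff that keeps scores greater than or equal to it, and a limit of 0 meaning "no limit"; these semantics and the helper sorted_scores are assumptions made for illustration.

```python
def popcount(bits: int) -> int:
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    common = popcount(a & b)
    total = popcount(a) + popcount(b) - common
    return common / total if total else 0.0


def sorted_scores(query: int, database, limit: int = 0, cutoff: float = 0.0):
    """Return (score, index) pairs, best first; limit=0 returns every hit."""
    scored = ((tanimoto(query, fp), idx) for idx, fp in enumerate(database))
    hits = [(score, idx) for score, idx in scored if score >= cutoff]
    hits.sort(key=lambda hit: (-hit[0], hit[1]))  # ties broken by index
    return hits[:limit] if limit else hits


database = [0b1111, 0b0011, 0b0101, 0b1000]
top_two = sorted_scores(0b0011, database, limit=2)
print(top_two)
```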

GetSparseMatrix

OESystem::OEIterBase<OESimScorePair> *GetSparseMatrix(const OEFPDatabaseOptions &opts) const

Performs \(N \times N\) similarity calculations between all pairs of fingerprints stored in the OEFastFPDatabase object and returns, for each fingerprint, either the top K scores or all scores above a cutoff. The sparse matrix is returned as an iterator over OESimScorePair objects. The maximum number of scores to return is set with OEFPDatabaseOptions.SetLimit, and the score cutoff is set with OEFPDatabaseOptions.SetCutoff. If no limit is set, all scores above the cutoff value are returned. If a limit is set, at most OEFPDatabaseOptions.GetLimit scores are returned for each fingerprint, even if more scores satisfy the cutoff. A limit should be set for best performance.

opts

The OEFPDatabaseOptions object controls all the parameters that determine the search (i.e. similarity measure parameters).

Hint

This operation scales with \(N^2\) in memory when limit = 0 (or when no limit is set), so it can easily overwhelm the system memory for larger databases. The best practice is to set a reasonable limit that captures the scores of interest.
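The effect of the limit on the size of the result can be sketched in plain Python. The example models each entry as an (i, j, score) triple and skips the self-pair on the diagonal; whether the library includes the diagonal is not stated in the source, so that choice, the descending order, and the helper sparse_matrix are assumptions.

```python
def popcount(bits: int) -> int:
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    common = popcount(a & b)
    total = popcount(a) + popcount(b) - common
    return common / total if total else 0.0


def sparse_matrix(fps, limit: int = 0, cutoff: float = 0.0):
    """Per-row top hits as (i, j, score) triples; limit=0 keeps every hit."""
    pairs = []
    for i, a in enumerate(fps):
        row = [(tanimoto(a, b), j) for j, b in enumerate(fps) if j != i]
        row = [hit for hit in row if hit[0] >= cutoff]
        row.sort(key=lambda hit: (-hit[0], hit[1]))
        if limit:
            row = row[:limit]
        pairs.extend((i, j, score) for score, j in row)
    return pairs


fps = [0b1111, 0b0011, 0b0101, 0b1000]
top_one = sparse_matrix(fps, limit=1)
everything = sparse_matrix(fps)
print(len(top_one), len(everything))
```

With a limit of K the result holds at most N * K entries, whereas without a limit it can grow to roughly N * N entries.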

GetVariogram

OEGraphSim::OEFPVariogram *GetVariogram(const std::vector<float>& obsdata,
                                        const OEFPDatabaseOptions &opts,
                                        const size_t nrbins=200u) const

Performs similarity calculations between all pairs of fingerprints stored in the OEFastFPDatabase object. It returns the empirical variogram over the calculated similarity scores with respect to the measurements provided in the obsdata parameter in an OEFPVariogram object.

obsdata

User-provided empirical measurements for each fingerprint in the database.

opts

The OEFPDatabaseOptions object controls all the parameters that determine the scoring (i.e. similarity measure parameters). Cutoff and order parameters are ignored as the results are not filtered or sorted.

nrbins

Number of bins in the returned OEFPVariogram object.

Note

The empirical variogram is defined over distances rather than similarities. It is therefore not possible to calculate a variogram using Tversky similarity. For all other similarity measures, the empirical variogram is calculated using \(distance = 1 - similarity\).

Hint

This method calculates similarities identically to OEFastFPDatabase.GetAllScores but is not bound by the system memory. It can be used to quickly obtain statistics on larger databases.

Note

When the OEFastFPDatabase object is initialized with OEFastFPDatabaseMemoryType_CUDA, nrbins is limited to at most 1024.

Hint

The returned OEFPVariogram object also contains a histogram, but note that this histogram is over distances rather than similarities.
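An empirical variogram over fingerprint distances can be sketched with the textbook semivariance estimator, in which each distance bin holds half the mean squared difference of the observations of the pairs falling into that bin. The source does not state which estimator GraphSim TK uses, so the estimator, the equal-width binning, and the helper names below are assumptions.

```python
from itertools import combinations


def popcount(bits: int) -> int:
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    common = popcount(a & b)
    total = popcount(a) + popcount(b) - common
    return common / total if total else 0.0


def empirical_variogram(fps, obsdata, nrbins):
    """Return (semivariance per bin, pair count per bin) over distances."""
    sums = [0.0] * nrbins
    counts = [0] * nrbins
    for (a, za), (b, zb) in combinations(zip(fps, obsdata), 2):
        distance = 1.0 - tanimoto(a, b)
        bin_idx = min(int(distance * nrbins), nrbins - 1)
        sums[bin_idx] += (za - zb) ** 2
        counts[bin_idx] += 1
    gamma = [s / (2 * c) if c else None for s, c in zip(sums, counts)]
    return gamma, counts


fps = [0b1111, 0b0011, 0b0101, 0b1000]
obsdata = [1.0, 2.0, 4.0, 8.0]  # one made-up measurement per fingerprint
gamma, counts = empirical_variogram(fps, obsdata, nrbins=4)
print(gamma)
print(counts)
```

The counts list plays the role of the histogram over distances mentioned above, and empty bins are reported as None because no semivariance can be estimated for them.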

IsValid

bool IsValid() const

Returns whether the database was initialized correctly.

NumFingerPrints

size_t NumFingerPrints() const

Returns the number of OEFingerPrint objects stored in the database.

SortedSearch

unsigned SortedSearch(OESimSearchResult &result,
                      const OEChem::OEMolBase &mol,
                      const OEFPDatabaseOptions &opts) const

Performs multi-threaded similarity calculations between a molecule and the fingerprints stored in the OEFastFPDatabase object. The method combines the functionality of the OEFastFPDatabase.GetSortedScores and the OEFastFPDatabase.GetHistogram methods.

result

The OESimSearchResult object that stores the result of the search along with the progress of the search and the histogram of all scores.

mol

The query molecule. A fingerprint is generated from this molecule before looping over the fingerprints of the database and calculating similarities.

opts

The OEFPDatabaseOptions object controls all the parameters that determine the search (i.e. similarity measure parameters). The OEFastFPDatabase.SortedSearch method can use multiple threads to accelerate the search process. The number of processors used can be controlled by the OEFPDatabaseOptions.SetNumProcessors method.

The OEFastFPDatabase.SortedSearch method returns:

Note

This method is currently only available in OEFastFPDatabaseMemoryType_MemoryMapped and OEFastFPDatabaseMemoryType_InMemory modes.

Example:

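from openeye import oechem
from openeye import oegraphsim

# fpdb is assumed to be a valid OEFastFPDatabase object created beforehand

# build the query molecule from a SMILES string
query = oechem.OEGraphMol()
oechem.OESmilesToMol(query, "Cc1c(c2cc(ccc2n1C(=O)c3ccc(cc3)Cl)OC)CC(=O)O")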

limit = 5
opts = oegraphsim.OEFPDatabaseOptions(limit, oegraphsim.OESimMeasure_Tanimoto)

nrbins = 5
result = oegraphsim.OESimSearchResult(nrbins)
status = fpdb.SortedSearch(result, query, opts)
print("Search status = {}".format(oegraphsim.OESimSearchStatusToName(status)))
print("Number of searched = {}".format(result.NumSearched()))

# print scores
for score in result.GetSortedScores():
    print("{:.3f}".format(score.GetScore()))

# print histogram
hist = result.GetHistogram()
bounds = hist.GetBinBoundaries()
for idx, count in enumerate(hist.GetCounts()):
    print("[{:.3f}-{:.3f}] = {}".format(bounds[idx], bounds[idx+1], count))

The output of the code snippet above might look like this:

Search status = Finished
Number of searched = 1000
0.950
0.912
0.901
0.886
0.859
[0.000-0.200] = 428
[0.200-0.400] = 312
[0.400-0.600] = 225
[0.600-0.800] = 25
[0.800-1.000] = 10