OEFastFPDatabase

class OEFastFPDatabase

The OEFastFPDatabase class is designed to perform rapid CUDA-accelerated, in-memory or memory-mapped fingerprint searches using the popcount method. Each OEFastFPDatabase object is associated with a fingerprint type (OEFPTypeBase) that is set when the database is initialized from a pre-generated binary fingerprint file.
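As an illustration of what a popcount-based similarity calculation looks like, the following is a minimal sketch that computes a Tanimoto score from packed bit words. It does not use GraphSim TK; the `BitWords` type and the `PopCount` and `PopcountTanimoto` names are hypothetical, and the real implementation may differ.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a fingerprint: bits packed into 64-bit words.
using BitWords = std::vector<std::uint64_t>;

// Counts the set bits in one 64-bit word.
inline std::size_t PopCount(std::uint64_t w)
{
  return std::bitset<64>(w).count();
}

// Tanimoto similarity from popcounts: |A & B| / |A | B|.
double PopcountTanimoto(const BitWords &a, const BitWords &b)
{
  assert(a.size() == b.size());
  std::size_t both = 0;
  std::size_t either = 0;
  for (std::size_t i = 0; i < a.size(); ++i)
  {
    both += PopCount(a[i] & b[i]);
    either += PopCount(a[i] | b[i]);
  }
  // Two empty fingerprints are scored as 0.0 in this sketch (an assumption).
  return either == 0 ? 0.0
                     : static_cast<double>(both) / static_cast<double>(either);
}
```

A 256-bit fingerprint corresponds to four 64-bit words in this representation.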

Note

GraphSim TK currently supports the popcount search method only for fingerprints whose size is a multiple of 256. This means that the OEFastFPDatabase class currently does not support:

For the fingerprint types listed above, the original OEFPDatabase class can be utilized.

Note

OEFastFPDatabase gives essentially identical results to OEFPDatabase. However, OEFPDatabase calculates similarity scores in double precision while OEFastFPDatabase uses single precision (float). As a result, small similarity score differences can be observed.
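The size of such differences can be illustrated with a minimal sketch that computes the same ratio in single and in double precision and compares the two with a tolerance. The function names and the tolerance of 1e-6 are assumptions chosen for illustration, not part of GraphSim TK.

```cpp
#include <cassert>
#include <cmath>

// Tanimoto ratio from bit counts, evaluated in single precision.
float TanimotoFloat(unsigned both, unsigned either)
{
  assert(either != 0);
  return static_cast<float>(both) / static_cast<float>(either);
}

// The same ratio evaluated in double precision.
double TanimotoDouble(unsigned both, unsigned either)
{
  assert(either != 0);
  return static_cast<double>(both) / static_cast<double>(either);
}

// True if two scores agree within a tolerance suited to single precision.
bool ScoresAgree(double a, double b, double tol = 1e-6)
{
  return std::fabs(a - b) <= tol;
}
```

When comparing scores produced by the two database classes, a tolerance-based comparison like `ScoresAgree` is safer than testing for exact equality.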

See also

Code Example

Figure: Schematic representation of the fast fingerprint search process

Constructors

OEFastFPDatabase(const std::string &dbfile,
                 unsigned int memtype=OEFastFPDatabaseMemoryType::Default)

Constructs an OEFastFPDatabase object.

dbfile
The name of the file which contains the fingerprint data. The file has to be generated with the OECreateFastFPDatabaseFile function.
memtype
Defines whether the fingerprints are pre-loaded into GPU-memory, CPU-memory or memory-mapped during the search process. This value has to be from the OEFastFPDatabaseMemoryType namespace.

Note

If the OEFastFPDatabase object cannot be initialized with the OEFastFPDatabaseMemoryType.CUDA option, the following warning message is issued: Warning: OEFastFPDatabase::OEFastFPDatabase() : no CUDA-enabled device available falling back to memory-mapped type! As a rule of thumb, 1 million fingerprints require 0.6 GB of GPU memory. GPU memory can be queried using the nvidia-smi command from a terminal.
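The rule of thumb can be turned into a quick estimate before choosing a memory type. The following is a minimal sketch; the function names are hypothetical and the free-memory figure has to be supplied by the caller, for example after reading it from nvidia-smi.

```cpp
#include <cassert>
#include <cstddef>

// Rule of thumb quoted above: about 0.6 GB of GPU memory per million fingerprints.
constexpr double kGigabytesPerMillionFPs = 0.6;

// Estimated GPU memory (in GB) needed to hold the given number of fingerprints.
double EstimateGpuGigabytes(std::size_t numFingerprints)
{
  return kGigabytesPerMillionFPs * static_cast<double>(numFingerprints) / 1.0e6;
}

// True if the estimate fits into the free GPU memory reported by the caller.
bool LikelyFitsOnGpu(std::size_t numFingerprints, double freeGigabytes)
{
  assert(freeGigabytes >= 0.0);
  return EstimateGpuGigabytes(numFingerprints) <= freeGigabytes;
}
```

With this estimate, 10 million fingerprints need roughly 6 GB, so they would be expected to fit on a device with 8 GB free, while 20 million would not.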

GetAllScores

OESystem::OEIterBase<float> *GetAllScores(const OEFPDatabaseOptions &opts) const

Performs \(N \times N\) similarity calculations between all pairs of fingerprints stored in the OEFastFPDatabase object. It returns an iterator over the calculated similarity scores. The scores are not sorted, but returned as a flattened square matrix.

For all similarity measures other than Tversky, the returned ‘matrix’ will be symmetrical.

opts
The OEFPDatabaseOptions object controls all the parameters that determine the search i.e. similarity measure parameters. Cutoff and order parameters are ignored as the results are not filtered or sorted.

Note

This operation scales with \(N^2\) in memory and can easily overwhelm the system memory for larger databases. For large databases, the OEFastFPDatabase.GetRawScores method is therefore recommended, since it returns a single row of the similarity matrix at a time.
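To make the flattened layout and the quadratic memory cost concrete, here is a minimal sketch. It assumes the scores have been copied into a `std::vector<float>` in row-major order; the row-major ordering is an assumption (it only matters for the asymmetric Tversky measure), and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reads score (i, j) from a flattened n x n matrix, assuming row-major order.
float ScoreAt(const std::vector<float> &flat, std::size_t n,
              std::size_t i, std::size_t j)
{
  assert(flat.size() == n * n && i < n && j < n);
  return flat[i * n + j];
}

// Bytes needed to hold all n x n single-precision scores at once.
std::size_t AllScoresBytes(std::size_t n)
{
  return n * n * sizeof(float);
}
```

With 4-byte floats, a database of 100,000 fingerprints already needs about 40 GB for the full matrix, which is why the row-by-row approach is preferable for large databases.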

GetFingerPrint

bool GetFingerPrint(OEFingerPrint& fp, size_t idx) const

Copies the \(idx^{th}\) fingerprint of the database into fp and returns true on success.

Warning

This function returns false if the fingerprint index is not identical to the corresponding molecule index. This can occur if the fingerprint binary file was generated by a multi-threaded process. If fingerprints need to be accessed directly through the OEFastFPDatabase.GetFingerPrint method, the fingerprint file should be generated in single-threaded mode. This can be done by calling SetNumProcessors(1) on the options class used for creating the binary file.

GetFPTypeBase

const OEFPTypeBase *GetFPTypeBase() const

Returns the fingerprint type of the OEFastFPDatabase object. An OEFastFPDatabase object can only store fingerprints with identical types.

GetHistogram

Attention

This is a preliminary API until 2019.Apr and may be improved based on user feedback. It is currently available in C++ and Python.

OEFPHistogram *GetHistogram(const OEFPDatabaseOptions &opts, size_t nrbins=200) const

Performs similarity calculations between all pairs of fingerprints stored in the OEFastFPDatabase object. It returns the histogram over the calculated similarity scores in an OEFPHistogram object.

opts
The OEFPDatabaseOptions object controls all the parameters that determine the search i.e. similarity measure, \(\alpha\) and \(\beta\) parameters for Tversky similarity. Cutoff and order parameters are ignored as the results are not filtered or sorted.
nrbins
Number of bins in the returned OEFPHistogram object.

Note

For all similarity measures other than Tversky, the histogram will only contain upper-triangular similarity scores (excluding the diagonal). In case of the asymmetric Tversky similarity measure, the histogram for the whole \(N \times N\) matrix is returned.

Note

When the OEFastFPDatabase object is initialized with OEFastFPDatabaseMemoryType.CUDA, nrbins is limited to at most 1024.

Hint

This method calculates similarities identically to OEFastFPDatabase.GetAllScores but is not bound by system memory. It can be used to quickly obtain statistics on larger databases.
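The upper-triangular binning described in the notes above can be sketched as follows. For simplicity this illustration takes a full row-major matrix as input, which is exactly what the real method avoids; the function name and the equal-width binning over [0, 1] are assumptions, not the documented behavior of OEFPHistogram.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Bins the upper-triangular scores (i < j) of a row-major n x n matrix
// into nrbins equal-width bins over [0, 1].
std::vector<std::size_t> UpperTriangleHistogram(const std::vector<float> &flat,
                                                std::size_t n,
                                                std::size_t nrbins)
{
  assert(nrbins > 0 && flat.size() == n * n);
  std::vector<std::size_t> counts(nrbins, 0);
  for (std::size_t i = 0; i < n; ++i)
  {
    for (std::size_t j = i + 1; j < n; ++j)
    {
      const float s = flat[i * n + j];
      assert(s >= 0.0f && s <= 1.0f);
      // A score of exactly 1.0 is placed into the last bin.
      const std::size_t bin =
          std::min(nrbins - 1, static_cast<std::size_t>(s * nrbins));
      ++counts[bin];
    }
  }
  return counts;
}
```

For N fingerprints the histogram then summarizes N(N-1)/2 scores, rather than the N² scores of the full matrix.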

GetMemoryType

unsigned int GetMemoryType() const

Returns the memory type of the fingerprint database. The return value is taken from the OEFastFPDatabaseMemoryType namespace.

GetMemoryTypeString

std::string GetMemoryTypeString() const

Returns the string representation of the memory type of the fingerprint database.

GetRawScores

OESystem::OEIterBase<float> *GetRawScores(const size_t fpidx,
                                          const OEFPDatabaseOptions &opts) const
OESystem::OEIterBase<float> *GetRawScores(const OEFingerPrint &fp,
                                          const OEFPDatabaseOptions &opts) const
OESystem::OEIterBase<float> *GetRawScores(const OEChem::OEMolBase &mol,
                                          const OEFPDatabaseOptions &opts) const

Performs similarity calculations between a molecule or a fingerprint and the fingerprints stored in the OEFastFPDatabase object. It returns an iterator over the calculated similarity scores. The scores are not sorted, but returned in the same order as the database. The number of elements in the returned iterator is equal to the number of fingerprints in the database.

fpidx
If the method is called with an integer index, the query fingerprint is taken from the OEFastFPDatabase object with the given index.
mol
If the method is called with an OEMolBase object, then a fingerprint is generated from this molecule before looping over the fingerprints of the database and calculating similarities.
fp
If the method is called with an OEFingerPrint object, then its type has to match with the type of the OEFastFPDatabase.
opts
The OEFPDatabaseOptions object controls all the parameters that determine the search i.e. similarity measure parameters. Cutoff and order parameters are ignored as the results are not filtered or sorted.
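Because the raw scores come back in database order, the position of a score is also the index of the fingerprint it belongs to. The following minimal sketch assumes the scores have been copied from the iterator into a `std::vector`; the `BestHitIndex` name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Returns the database index of the highest score.
// Position k in rawScores belongs to fingerprint k of the database.
template <typename Score>
std::size_t BestHitIndex(const std::vector<Score> &rawScores)
{
  assert(!rawScores.empty());
  return static_cast<std::size_t>(std::distance(
      rawScores.begin(),
      std::max_element(rawScores.begin(), rawScores.end())));
}
```

If several fingerprints share the highest score, this sketch returns the first one.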

GetScores

OESystem::OEIterBase<OESimScore> *GetScores(const size_t idx,
                                            const OEFPDatabaseOptions &opts) const
OESystem::OEIterBase<OESimScore> *GetScores(const OEFingerPrint &fp,
                                            const OEFPDatabaseOptions &opts) const
OESystem::OEIterBase<OESimScore> *GetScores(const OEChem::OEMolBase &mol,
                                            const OEFPDatabaseOptions &opts) const

Performs similarity calculations between a molecule or fingerprint and the fingerprints stored in the OEFastFPDatabase object. It returns an iterator over the calculated similarity scores (OESimScore). The results are filtered according to the cutoff and order specified in opts, but not sorted.

idx
If the method is called with an integer index, the query fingerprint is taken from the OEFastFPDatabase object with the given index.
mol
If the method is called with an OEMolBase object, then a fingerprint is generated from this molecule before looping over the fingerprints of the database and calculating similarities.
fp
If the method is called with an OEFingerPrint object, then its type has to match with the type of the OEFastFPDatabase.
opts
The OEFPDatabaseOptions object controls all the parameters that determine the search i.e. similarity measure parameters, score cutoff and order.
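The effect of a score cutoff without sorting can be sketched as below. The `ScoredHit` struct is a hypothetical stand-in for OESimScore, and keeping scores equal to the cutoff is an assumption made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for OESimScore: a database index plus its score.
struct ScoredHit
{
  std::size_t idx;
  float score;
};

// Keeps every score at or above the cutoff, preserving database order.
std::vector<ScoredHit> FilterByCutoff(const std::vector<float> &rawScores,
                                      float cutoff)
{
  assert(cutoff >= 0.0f && cutoff <= 1.0f);
  std::vector<ScoredHit> hits;
  for (std::size_t k = 0; k < rawScores.size(); ++k)
  {
    if (rawScores[k] >= cutoff)
      hits.push_back({k, rawScores[k]});
  }
  return hits;
}
```

Each retained hit carries its database index, so the filtered results can still be mapped back to molecules even though positions no longer line up with the database.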

GetSortedScores

OESystem::OEIterBase<OESimScore> *GetSortedScores(const size_t idx,
                                                  const OEFPDatabaseOptions &opts) const
OESystem::OEIterBase<OESimScore> *GetSortedScores(const OEFingerPrint &fp,
                                                  const OEFPDatabaseOptions &opts) const
OESystem::OEIterBase<OESimScore> *GetSortedScores(const OEChem::OEMolBase &mol,
                                                  const OEFPDatabaseOptions &opts) const

Performs similarity calculations between a molecule or fingerprint and the fingerprints stored in the OEFastFPDatabase object. It returns an iterator over the calculated similarity scores (OESimScore) in sorted order. Each OESimScore holds a similarity score and index of the corresponding fingerprint of the database.

idx
If the method is called with an integer index, the query fingerprint is taken from the OEFastFPDatabase object with the given index.
mol
If the method is called with an OEMolBase object, then a fingerprint is generated from this molecule before looping over the fingerprints of the database and calculating similarities.
fp
If the method is called with an OEFingerPrint object, then its type has to match with the type of the OEFastFPDatabase.
opts
The OEFPDatabaseOptions object controls all the parameters that determine the search i.e. similarity measure parameters, score cutoff, order and limit.
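Sorting and limiting can be sketched as follows. `RankedHit` is a hypothetical stand-in for OESimScore; descending order and the convention that a limit of 0 means "no limit" are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for OESimScore: a database index plus its score.
struct RankedHit
{
  std::size_t idx;
  float score;
};

// Sorts hits by descending score and keeps at most `limit` of them.
// A limit of 0 is treated as "no limit" in this sketch.
std::vector<RankedHit> SortAndLimit(std::vector<RankedHit> hits,
                                    std::size_t limit)
{
  std::stable_sort(hits.begin(), hits.end(),
                   [](const RankedHit &a, const RankedHit &b) {
                     return a.score > b.score;
                   });
  if (limit != 0 && hits.size() > limit)
    hits.resize(limit);
  assert(limit == 0 || hits.size() <= limit);
  return hits;
}
```

Because the sort is stable, hits with equal scores keep their database order.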

GetSparseMatrix

OESystem::OEIterBase<OESimScorePair> *GetSparseMatrix(const OEFPDatabaseOptions &opts) const

Performs \(N \times N\) similarity calculations between all pairs of fingerprints stored in the OEFastFPDatabase object and returns either the top K scores for each fingerprint or all scores above a cutoff for each fingerprint. The sparse matrix is returned as an iterator over OESimScorePair objects. The limit of scores to return can be set using OEFPDatabaseOptions.SetLimit and the cutoff of scores to return can be set using OEFPDatabaseOptions.SetCutoff. If no limit is set, all scores above the cutoff value will be returned. If a limit is set, only OEFPDatabaseOptions.GetLimit scores will be returned regardless of whether more scores fall within the cutoff range. A limit should be set for best performance.

opts
The OEFPDatabaseOptions object controls all the parameters that determine the search i.e. similarity measure parameters.

Hint

This operation scales with \(N^2\) in memory when the limit is 0 (or is not set), and so can easily overwhelm the system memory for larger databases. Best practice is to set a reasonable limit that will capture the scores of interest.
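How a limit keeps the result sparse can be sketched as below. This illustration works on an in-memory row-major matrix; `PairScore` is a hypothetical stand-in for OESimScorePair, and excluding self-pairs, keeping scores equal to the cutoff, and treating a limit of 0 as "no limit" are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for OESimScorePair: two indices and their score.
struct PairScore
{
  std::size_t i;
  std::size_t j;
  float score;
};

// For each row, keeps the `limit` best scores at or above the cutoff.
// A limit of 0 keeps every score that passes the cutoff.
std::vector<PairScore> TopKPerRow(const std::vector<float> &flat, std::size_t n,
                                  std::size_t limit, float cutoff)
{
  assert(flat.size() == n * n);
  std::vector<PairScore> out;
  for (std::size_t i = 0; i < n; ++i)
  {
    std::vector<PairScore> row;
    for (std::size_t j = 0; j < n; ++j)
    {
      if (j != i && flat[i * n + j] >= cutoff)
        row.push_back({i, j, flat[i * n + j]});
    }
    std::stable_sort(row.begin(), row.end(),
                     [](const PairScore &a, const PairScore &b) {
                       return a.score > b.score;
                     });
    if (limit != 0 && row.size() > limit)
      row.resize(limit);
    out.insert(out.end(), row.begin(), row.end());
  }
  return out;
}
```

With a limit of K, the output holds at most N·K entries instead of up to N², which is the reason a limit is recommended.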

GetVariogram

Attention

This is a preliminary API until 2019.Apr and may be improved based on user feedback. It is currently available in C++ and Python.

OEGraphSim::OEFPVariogram *GetVariogram(const std::vector<float>& obsdata,
                                        const OEFPDatabaseOptions &opts,
                                        const size_t nrbins=200) const

Performs similarity calculations between all pairs of fingerprints stored in the OEFastFPDatabase object. It returns the empirical variogram over the calculated similarity scores with respect to the measurements provided in the obsdata parameter, in an OEFPVariogram object.

obsdata
User-provided empirical measurements for each fingerprint in the database.
opts
The OEFPDatabaseOptions object controls all the parameters that determine the scoring i.e. similarity measure parameters. Cutoff and order parameters are ignored as the results are not filtered or sorted.
nrbins
Number of bins in the returned OEFPVariogram object.

Note

The empirical variogram is defined over distances rather than similarities. It is therefore not possible to calculate a variogram using Tversky similarity. For all other similarity measures, the empirical variogram is calculated using \(distance = 1 - similarity\).

Hint

This method calculates similarities identically to OEFastFPDatabase.GetAllScores but is not bound by system memory. It can be used to quickly obtain statistics on larger databases.

Note

When the OEFastFPDatabase object is initialized with OEFastFPDatabaseMemoryType.CUDA, nrbins is limited to at most 1024.

Hint

The returned OEFPVariogram object also contains a histogram, but note that this histogram is over distances rather than similarities.
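For readers unfamiliar with empirical variograms, the following minimal sketch uses the textbook semivariance definition: for each distance bin, the average of half the squared difference of the observations over all pairs falling in that bin. Whether OEFPVariogram uses exactly this estimator is an assumption; the `VariogramSketch` struct, the function name, and the in-memory row-major similarity matrix are hypothetical and serve only as illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: per-bin semivariance and per-bin pair counts.
struct VariogramSketch
{
  std::vector<double> gamma;
  std::vector<std::size_t> counts;
};

// Bins each pair (i < j) by distance = 1 - similarity and averages
// 0.5 * (obs[i] - obs[j])^2 within each bin.
VariogramSketch EmpiricalVariogram(const std::vector<float> &flat,
                                   const std::vector<float> &obsdata,
                                   std::size_t nrbins)
{
  const std::size_t n = obsdata.size();
  assert(nrbins > 0 && flat.size() == n * n);
  VariogramSketch v{std::vector<double>(nrbins, 0.0),
                    std::vector<std::size_t>(nrbins, 0)};
  for (std::size_t i = 0; i < n; ++i)
  {
    for (std::size_t j = i + 1; j < n; ++j)
    {
      const double dist = 1.0 - static_cast<double>(flat[i * n + j]);
      assert(dist >= 0.0 && dist <= 1.0);
      const std::size_t bin =
          std::min(nrbins - 1, static_cast<std::size_t>(dist * nrbins));
      const double diff = static_cast<double>(obsdata[i]) - obsdata[j];
      v.gamma[bin] += 0.5 * diff * diff;
      ++v.counts[bin];
    }
  }
  // Turn the per-bin sums into per-bin averages; empty bins stay at 0.
  for (std::size_t b = 0; b < nrbins; ++b)
  {
    if (v.counts[b] > 0)
      v.gamma[b] /= static_cast<double>(v.counts[b]);
  }
  return v;
}
```

The `counts` member of this sketch plays the role of the histogram over distances mentioned in the hint above.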

IsValid

bool IsValid() const

Returns whether the database was initialized correctly.

NumFingerPrints

size_t NumFingerPrints() const

Returns the number of OEFingerPrint objects stored in the database.