Fingerprint Database

The following four examples perform the same task, detailed below:

  1. reading a query structure

  2. printing out the similarity score between the fingerprint of this query and the fingerprint generated for each molecule read from a database file.

In Listing 9, after importing the query structure and generating its path fingerprint, the program loops over the database file creating a path fingerprint for each structure. Then the program calculates the Tanimoto similarity between the fingerprint of the query and the database entry by calling the OETanimoto function.

Listing 9: Similarity calculation from file

import sys

from openeye import oechem
from openeye import oegraphsim

if len(sys.argv) != 3:
    oechem.OEThrow.Usage("%s <queryfile> <targetfile>" % sys.argv[0])

ifs = oechem.oemolistream()
if not ifs.open(sys.argv[1]):
    oechem.OEThrow.Fatal("Unable to open %s for reading" % sys.argv[1])

qmol = oechem.OEGraphMol()
if not oechem.OEReadMolecule(ifs, qmol):
    oechem.OEThrow.Fatal("Unable to read query molecule")
qfp = oegraphsim.OEFingerPrint()
oegraphsim.OEMakeFP(qfp, qmol, oegraphsim.OEFPType_Path)

if not ifs.open(sys.argv[2]):
    oechem.OEThrow.Fatal("Unable to open %s for reading" % sys.argv[2])

tfp = oegraphsim.OEFingerPrint()
for tmol in ifs.GetOEGraphMols():
    oegraphsim.OEMakeFP(tfp, tmol, oegraphsim.OEFPType_Path)
    print("%.3f" % oegraphsim.OETanimoto(qfp, tfp))
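
To make the score printed in Listing 9 concrete, the following is a minimal pure-Python sketch of the Tanimoto coefficient computed on two sets of "on" bit positions. It does not use the toolkit. The function name tanimoto and the choice to return 0.0 for two empty fingerprints are assumptions of this sketch, not documented OETanimoto behavior.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two collections of set-bit positions."""
    a, b = set(bits_a), set(bits_b)
    both = len(a & b)      # bits set in both fingerprints
    either = len(a | b)    # bits set in at least one fingerprint
    if either == 0:
        # Assumption of this sketch: two empty fingerprints score 0.0.
        return 0.0
    return both / either


query_bits = {1, 4, 7, 9}
target_bits = {1, 4, 8}
# Two shared bits out of five distinct bits.
print("%.3f" % tanimoto(query_bits, target_bits))  # prints 0.400
```

The score is 1.0 only when both fingerprints have exactly the same bits set, and it drops toward 0.0 as the number of unshared bits grows.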

In Listing 10 only the code block that is different from Listing 9 is shown.

In this example, it is assumed that the fingerprints are pre-calculated and stored in an OEB binary file as generic data attached to the corresponding molecules. The program loops over the file and accesses the pre-generated fingerprints or calculates them if they are not available.

The obvious advantage of this process is that the fingerprints only have to be generated once, when the binary file is created. This can be significantly faster than generating the fingerprints on-the-fly every time the program is executed.

See also

The Storage and Retrieval section shows an example of how to generate an OEB binary file that stores molecules along with their corresponding fingerprints.

Listing 10: Similarity calculation from OEB file

tfp = oegraphsim.OEFingerPrint()
for tmol in ifs.GetOEGraphMols():
    if tmol.HasData("PATH_FP"):
        tfp = tmol.GetData("PATH_FP")
    else:
        oechem.OEThrow.Warning("Unable to access fingerprint for %s" % tmol.GetTitle())
        oegraphsim.OEMakeFP(tfp, tmol, oegraphsim.OEFPType_Path)
    print("%.3f" % oegraphsim.OETanimoto(qfp, tfp))

Listing 11 differs from Listing 9 in that it uses an OEFPDatabase object to store the generated fingerprints. The OEFPDatabase class is designed to perform in-memory fingerprint searches.

Listing 11: Similarity calculation with fingerprint database from file

fpdb = oegraphsim.OEFPDatabase(qfp.GetFPTypeBase())
for tmol in ifs.GetOEGraphMols():
    fpdb.AddFP(tmol)

for score in fpdb.GetScores(qfp):
    print("%.3f" % score.GetScore())

After building the fingerprint database, the scores can be accessed by the OEFPDatabase.GetScores method. This will return an iterator over the similarity scores calculated.

Note

The OEFPDatabase only stores fingerprints and not the molecules from which they are generated. A correspondence between a molecule and its fingerprint stored in the database can be established by using the index returned by the OEFPDatabase.AddFP method.

See also

Listing 13 shows how to keep track of the correspondence between a fingerprint added to an OEFPDatabase object and the molecule from which it was calculated.
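
The index-based bookkeeping described in the note above can be sketched in plain Python. The TinyFPStore class below is a hypothetical stand-in, not the OEFPDatabase API. It only mimics the idea that adding a fingerprint returns its position, which the caller records next to a molecule title. The zero-based indexing and the Tanimoto scoring are assumptions of this sketch.

```python
class TinyFPStore:
    """Hypothetical stand-in for a fingerprint-only store (not OEFPDatabase)."""

    def __init__(self):
        self._fps = []

    def add_fp(self, bits):
        # Store the fingerprint and return its zero-based position.
        self._fps.append(frozenset(bits))
        return len(self._fps) - 1

    def scores(self, query_bits):
        # Yield (index, Tanimoto score) for every stored fingerprint.
        query = frozenset(query_bits)
        for idx, fp in enumerate(self._fps):
            union = len(query | fp)
            yield idx, (len(query & fp) / union if union else 0.0)


store = TinyFPStore()
titles = {}  # index -> molecule title, kept outside the store
for title, bits in [("mol-A", {1, 2, 3}), ("mol-B", {2, 3, 4, 5})]:
    titles[store.add_fp(bits)] = title

for idx, score in store.scores({2, 3}):
    print("%s %.3f" % (titles[idx], score))
# prints:
# mol-A 0.667
# mol-B 0.500
```

Because the store never sees the molecules, the titles mapping is the only link from a score back to the structure that produced it.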

In the last example (Listing 12), OEFPDatabase is used again to store the fingerprints. If the fingerprint can be read from the input OEB binary file, it is added directly to the database. Otherwise, the fingerprint is generated on-the-fly by passing the OEMolBase molecule itself to the OEFPDatabase.AddFP method.

Listing 12: Similarity calculation with fingerprint database from OEB

fpdb = oegraphsim.OEFPDatabase(qfp.GetFPTypeBase())
for tmol in ifs.GetOEGraphMols():
    if tmol.HasData("PATH_FP"):
        tfp = tmol.GetData("PATH_FP")
        fpdb.AddFP(tfp)
    else:
        oechem.OEThrow.Warning("Unable to access fingerprint for %s" % tmol.GetTitle())
        fpdb.AddFP(tmol)

for score in fpdb.GetScores(qfp):
    print("%.3f" % score.GetScore())

Searching with User-defined Similarity Measures

By default, the Tanimoto similarity is used when calling either the OEFPDatabase.GetScores method or the OEFPDatabase.GetSortedScores method. Another built-in measure can be selected by calling the OEFPDatabase.SetSimFunc method with a value from the OESimMeasure namespace. Each constant in this namespace corresponds to one of the built-in similarity calculation methods.

There is also a facility to use user-defined similarity measures when searching a fingerprint database. The following example shows how a similarity calculation can be implemented by deriving from the OESimFuncBase class.

Formula: \(Sim_{Simpson}(A,B) = \frac{bothAB}{\min((onlyA + bothAB),(onlyB + bothAB))}\)

class SimpsonSimFunc(oegraphsim.OESimFuncBase):

    def __call__(self, fpA, fpB):
        onlyA, onlyB, bothAB, neitherAB = oechem.OEGetBitCounts(fpA, fpB)
        if onlyA + onlyB == 0:
            return 1.0
        if bothAB == 0:
            return 0.0
        sim = float(bothAB)
        sim /= min(float(onlyA + bothAB), float(onlyB + bothAB))
        return sim

    def GetSimTypeString(self):
        return "Simpson"

    def CreateCopy(self):
        return SimpsonSimFunc().__disown__()
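
As a worked check of the arithmetic, the following pure-Python sketch evaluates the same Simpson expression directly from bit counts, without the toolkit. The function name simpson and the example counts are illustrative assumptions. The early returns mirror the branches of the class above.

```python
def simpson(only_a, only_b, both_ab):
    """Simpson similarity from bit counts (illustrative helper)."""
    if only_a + only_b == 0:
        return 1.0  # no bit is set in only one of the fingerprints
    if both_ab == 0:
        return 0.0  # no shared bits
    return both_ab / min(only_a + both_ab, only_b + both_ab)


# A has 9 bits set, B has 4 bits set, and 3 of them are shared.
print("%.3f" % simpson(only_a=6, only_b=1, both_ab=3))  # prints 0.750
```

With these counts the Tanimoto score would be 3 / (6 + 1 + 3) = 0.300. The Simpson measure normalizes by the smaller fingerprint only, so it scores a small fingerprint that is largely contained in a larger one much higher.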

After the similarity calculation is implemented, it can be set on an OEFPDatabase object. From then on, searches of that database use the new similarity calculation.

fpdb = oegraphsim.OEFPDatabase(oegraphsim.OEFPType_Path)
fpdb.SetSimFunc(SimpsonSimFunc())

See also