Fingerprint Database
The following four examples perform the same task: each reads a query structure, then prints the similarity score between the fingerprint of that query and the fingerprint generated for each molecule read from a database file.
In Listing 9, after importing the query structure and generating its path fingerprint, the program loops over the database file, creating a path fingerprint for each structure. It then calculates the Tanimoto similarity between the fingerprint of the query and that of each database entry by calling the OETanimoto function.
Listing 9: Similarity calculation from file
import sys
from openeye import oechem
from openeye import oegraphsim

if len(sys.argv) != 3:
    oechem.OEThrow.Usage("%s <queryfile> <targetfile>" % sys.argv[0])

# read the query molecule and generate its path fingerprint
ifs = oechem.oemolistream()
if not ifs.open(sys.argv[1]):
    oechem.OEThrow.Fatal("Unable to open %s for reading" % sys.argv[1])
qmol = oechem.OEGraphMol()
if not oechem.OEReadMolecule(ifs, qmol):
    oechem.OEThrow.Fatal("Unable to read query molecule")
qfp = oegraphsim.OEFingerPrint()
oegraphsim.OEMakeFP(qfp, qmol, oegraphsim.OEFPType_Path)

# reuse the stream for the target file and score each molecule
if not ifs.open(sys.argv[2]):
    oechem.OEThrow.Fatal("Unable to open %s for reading" % sys.argv[2])
tfp = oegraphsim.OEFingerPrint()
for tmol in ifs.GetOEGraphMols():
    oegraphsim.OEMakeFP(tfp, tmol, oegraphsim.OEFPType_Path)
    print("%.3f" % oegraphsim.OETanimoto(qfp, tfp))
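For intuition about the number that Listing 9 prints, the Tanimoto coefficient can be sketched in plain Python over sets of on-bit positions. This is an illustration only, not the OEGraphSim implementation: the helper name `tanimoto` and the toy bit sets are hypothetical, and returning 0.0 for two empty fingerprints is a convention chosen here rather than something the toolkit is known to do.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    both = len(bits_a & bits_b)      # bits set in both fingerprints
    union = len(bits_a | bits_b)     # bits set in at least one fingerprint
    if union == 0:
        return 0.0                   # assumption: empty vs. empty scores 0.0
    return both / union


# toy data (hypothetical): one query and three "database" fingerprints
query = {1, 4, 7, 9}
targets = [{1, 4, 7, 9}, {1, 4, 8}, {2, 3}]

scores = [tanimoto(query, t) for t in targets]
for s in scores:
    print("%.3f" % s)
```

An identical fingerprint scores 1.0, a partially overlapping one scores between 0 and 1, and one with no shared bits scores 0.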
In Listing 10, only the code block that differs from Listing 9 is shown. This example assumes that the fingerprints are pre-calculated and stored in an OEB binary file as generic data attached to the corresponding molecules. The program loops over the file and accesses the pre-generated fingerprints, or calculates them if they are not available.
The obvious advantage of this process is that the fingerprints only have to be generated once, when the binary file is created. This can be significantly faster than generating the fingerprints on the fly every time the program is executed.
See also
The Storage and Retrieval section shows an example of how to generate an OEB binary file that stores molecules along with their corresponding fingerprints.
Listing 10: Similarity calculation from OEB file
tfp = oegraphsim.OEFingerPrint()
for tmol in ifs.GetOEGraphMols():
    if tmol.HasData("PATH_FP"):
        # use the fingerprint stored with the molecule
        tfp = tmol.GetData("PATH_FP")
    else:
        # fall back to generating the fingerprint on the fly
        oechem.OEThrow.Warning("Unable to access fingerprint for %s" % tmol.GetTitle())
        oegraphsim.OEMakeFP(tfp, tmol, oegraphsim.OEFPType_Path)
    print("%.3f" % oegraphsim.OETanimoto(qfp, tfp))
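The "use the stored fingerprint, otherwise compute it" pattern of Listing 10 can be sketched without the toolkit. Everything below is hypothetical: records are plain dictionaries, `make_fp` is a stand-in that treats the set of characters in a SMILES string as a fingerprint, and the `stats` counter exists only to show how often the fallback runs.

```python
def make_fp(smiles):
    """Hypothetical stand-in for fingerprint generation (not a real fingerprint)."""
    return frozenset(smiles)


def get_fp(record, stats):
    """Return the fingerprint stored on the record, or compute one if missing."""
    fp = record.get("PATH_FP")
    if fp is None:
        stats["computed"] += 1       # count the expensive fallback path
        fp = make_fp(record["smiles"])
    return fp


records = [
    {"smiles": "CCO", "PATH_FP": frozenset("CO")},  # fingerprint already attached
    {"smiles": "CCN"},                              # no stored fingerprint
]
stats = {"computed": 0}
fps = [get_fp(r, stats) for r in records]
print("fingerprints computed on the fly:", stats["computed"])
```

Only the record without attached data triggers a computation, which is where the time saving of a pre-built file comes from.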
Listing 11 differs from Listing 9 in that it uses an OEFPDatabase object to store the generated fingerprints. The OEFPDatabase class is designed to perform in-memory fingerprint searches.
Listing 11: Similarity calculation with fingerprint database from file
fpdb = oegraphsim.OEFPDatabase(qfp.GetFPTypeBase())
for tmol in ifs.GetOEGraphMols():
    fpdb.AddFP(tmol)
for score in fpdb.GetScores(qfp):
    print("%.3f" % score.GetScore())
After building the fingerprint database, the scores can be accessed with the OEFPDatabase.GetScores method, which returns an iterator over the calculated similarity scores.
Note
The OEFPDatabase only stores fingerprints, not the molecules from which they are generated. A correspondence between a molecule and its fingerprint stored in the database can be established by using the index returned by the OEFPDatabase.AddFP method.
See also
Listing 13 shows how to keep track of the correspondence between a fingerprint added to an OEFPDatabase object and the molecule from which it is calculated.
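The index bookkeeping described in the note can be sketched with a toy container. `TinyFPStore` is a hypothetical stand-in written for this illustration, not the OEFPDatabase class; the only behavior it borrows from the text is that adding a fingerprint returns an index that can be recorded alongside the molecule.

```python
class TinyFPStore:
    """Toy fingerprint-only container; add() returns the index of the new entry."""

    def __init__(self):
        self._fps = []

    def add(self, fp):
        self._fps.append(fp)
        return len(self._fps) - 1

    def get(self, idx):
        return self._fps[idx]


store = TinyFPStore()
titles = {}  # index in the store -> molecule title kept outside the store

# toy (title, fingerprint) pairs; the bit sets are made up
for title, fp in [("aspirin", {1, 2}), ("caffeine", {2, 3}), ("ibuprofen", {1, 5})]:
    idx = store.add(fp)
    titles[idx] = title

print(titles[1], sorted(store.get(1)))
```

Because the store never sees the titles, the mapping from index to molecule has to live in a separate structure, here a dictionary.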
In the last example (Listing 12), OEFPDatabase is used again to store the fingerprints. If the fingerprint can be read from the OEB binary input file, it is added directly to the database; otherwise the fingerprint is generated on the fly by passing the OEMolBase molecule itself to the OEFPDatabase.AddFP method.
Listing 12: Similarity calculation with fingerprint database from OEB
fpdb = oegraphsim.OEFPDatabase(qfp.GetFPTypeBase())
for tmol in ifs.GetOEGraphMols():
    if tmol.HasData("PATH_FP"):
        # add the pre-calculated fingerprint directly
        tfp = tmol.GetData("PATH_FP")
        fpdb.AddFP(tfp)
    else:
        # let the database generate the fingerprint from the molecule
        oechem.OEThrow.Warning("Unable to access fingerprint for %s" % tmol.GetTitle())
        fpdb.AddFP(tmol)
for score in fpdb.GetScores(qfp):
    print("%.3f" % score.GetScore())
Sorted Search
Similarity searching based on a 2D representation of molecular structure (such as fingerprints) is one of the most common approaches for virtual screening. A molecule that is structurally similar to an active molecule is more likely to be active.
A virtual screening strategy involves going through a molecule database and calculating the similarity between a reference structure and each of the molecules, followed by ranking the similarity scores in descending order to identify molecules that are the most similar to the reference structure.
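The score-then-rank strategy just described can be sketched in a few lines of plain Python. This is a toy under stated assumptions: fingerprints are sets of on-bit indices, `tanimoto` and `top_hits` are hypothetical helper names, and `heapq.nlargest` is simply one convenient way to keep the best few scores, not a statement about how the toolkit ranks internally.

```python
import heapq


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient over sets of on-bit indices (0.0 if both are empty)."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def top_hits(query, database, limit):
    """Return up to `limit` (score, index) pairs, best score first."""
    scored = ((tanimoto(query, fp), idx) for idx, fp in enumerate(database))
    return heapq.nlargest(limit, scored, key=lambda pair: pair[0])


# toy database of four fingerprints (made-up bit sets)
database = [{1, 2, 3}, {1, 2}, {7, 8}, {1, 2, 3, 4}]
hits = top_hits({1, 2, 3}, database, 2)
for score, idx in hits:
    print("Tanimoto score %4.3f  index %d" % (score, idx))
```

Each hit carries the index of the database entry, which is what allows the matching structure to be looked up afterwards.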
Listing 13 shows how to search a molecule database using the OEFPDatabase.GetSortedScores method to identify analogs based on their fingerprint similarity. First, the molecules are imported using an OEMolDatabase object. Then an OEFPDatabase object is created to store the corresponding fingerprints. While iterating over the molecules, the fingerprints are added to the database by calling the OEFPDatabase.AddFP method. If a molecule cannot be accessed from the OEMolDatabase object, an empty fingerprint is added to the OEFPDatabase object. This ensures that the indices of the two databases stay synchronized.
After generating the fingerprints, the program reads reference molecules (in SMILES format) from standard input. Each input SMILES string is parsed and the fingerprint database is searched to identify the structures with the highest similarity scores. Finally, the SMILES strings of the best hits are printed along with their scores.
Listing 13: Similarity search in memory
import sys
from openeye import oechem
from openeye import oegraphsim

if len(sys.argv) != 2:
    oechem.OEThrow.Usage("%s <database>" % sys.argv[0])

ifs = oechem.oemolistream()
if not ifs.open(sys.argv[1]):
    oechem.OEThrow.Fatal("Cannot open database molecule file!")

# load molecules
moldb = oechem.OEMolDatabase(ifs)
nrmols = moldb.GetMaxMolIdx()

# generate fingerprints
fpdb = oegraphsim.OEFPDatabase(oegraphsim.OEFPType_Path)

# placeholder added for unreadable molecules to keep indices aligned
emptyfp = oegraphsim.OEFingerPrint()
emptyfp.SetFPTypeBase(fpdb.GetFPTypeBase())

mol = oechem.OEGraphMol()
for idx in range(0, nrmols):
    if moldb.GetMolecule(mol, idx):
        fpdb.AddFP(mol)
    else:
        fpdb.AddFP(emptyfp)

nrfps = fpdb.NumFingerPrints()
timer = oechem.OEWallTimer()

while True:
    # read query SMILES from stdin
    sys.stdout.write("Enter SMILES> ")
    sys.stdout.flush()  # make sure the prompt is shown before blocking on input
    line = sys.stdin.readline()
    line = line.rstrip()
    if len(line) == 0:
        sys.exit(0)

    # parse query
    query = oechem.OEGraphMol()
    if not oechem.OESmilesToMol(query, line):
        oechem.OEThrow.Warning("Invalid SMILES string")
        continue

    # calculate similarity scores
    timer.Start()
    scores = fpdb.GetSortedScores(query, 5)
    oechem.OEThrow.Info("%5.2f seconds to search %i fingerprints" % (timer.Elapsed(), nrfps))

    # use the index of each score to fetch the matching molecule
    hit = oechem.OEGraphMol()
    for si in scores:
        if moldb.GetMolecule(hit, si.GetIdx()):
            smi = oechem.OEMolToSmiles(hit)
            oechem.OEThrow.Info("Tanimoto score %4.3f %s" % (si.GetScore(), smi))
As mentioned before, OEFPDatabase is a fingerprint container and does not store the corresponding molecules. The molecules therefore have to be stored in a separate container, in this case an OEMolDatabase object. When the OEFPDatabase.GetSortedScores method returns the iterator over the best similarity scores, the associated index can be used to access the corresponding structure in the OEMolDatabase object.
In the above example, the entire database was searched to identify structurally similar molecules. However, the user can also specify a segment of the database to be searched by providing a begin and end index.
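Restricting a search to a segment of the database can be sketched as follows. The function `search_segment` and its half-open `[begin, end)` convention are assumptions made for this illustration; the source does not state how OEFPDatabase interprets its begin and end indices, so the API reference should be consulted for the actual semantics.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient over sets of on-bit indices (0.0 if both are empty)."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def search_segment(query, database, begin, end, limit):
    """Score only entries with begin <= index < end; return the best `limit` hits."""
    end = min(end, len(database))    # clamp so an oversized end index is harmless
    scored = [(tanimoto(query, database[idx]), idx) for idx in range(begin, end)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# toy database of four fingerprints (made-up bit sets)
database = [{1, 2, 3}, {1, 2}, {7, 8}, {1, 2, 3, 4}]

# skip entry 0 and search only indices 1..3
hits = search_segment({1, 2, 3}, database, 1, 4, 2)
for score, idx in hits:
    print("%.3f  index %d" % (score, idx))
```

The returned indices still refer to positions in the full database, so they can be used for molecule lookup without any offset correction.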
See also
OEMolDatabase class in the OEChem TK manual
Examples of fingerprint searches in the API section:
OEFPDatabase.GetScores method
OEFPDatabase.GetSortedScores method
Searching with User-defined Similarity Measures
By default, the Tanimoto similarity is used when calling either the OEFPDatabase.GetScores method or the OEFPDatabase.GetSortedScores method. Other similarity measures can be selected by calling the OEFPDatabase.SetSimFunc method with a value from the OESimMeasure namespace. Each constant in this namespace corresponds to one of the built-in similarity calculation methods.
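To see how the choice of measure changes a score, the sketch below evaluates three common measures on the same pair of toy fingerprints. The formulas are the textbook Tanimoto, Dice and Cosine definitions written in terms of bit counts; whether they match the toolkit's built-in constants term for term is an assumption, and the `MEASURES` table and `similarity` helper are hypothetical names.

```python
import math


def bit_counts(bits_a, bits_b):
    """Return (onlyA, onlyB, bothAB) for fingerprints given as sets of on-bits."""
    return len(bits_a - bits_b), len(bits_b - bits_a), len(bits_a & bits_b)


# textbook definitions expressed with onlyA, onlyB and bothAB
MEASURES = {
    "tanimoto": lambda oa, ob, both: both / (oa + ob + both),
    "dice": lambda oa, ob, both: 2 * both / (oa + ob + 2 * both),
    "cosine": lambda oa, ob, both: both / math.sqrt((oa + both) * (ob + both)),
}


def similarity(bits_a, bits_b, measure="tanimoto"):
    """Score two fingerprints with the named measure (Tanimoto by default)."""
    oa, ob, both = bit_counts(bits_a, bits_b)
    if both == 0:
        return 0.0                   # also avoids division by zero for empty inputs
    return MEASURES[measure](oa, ob, both)


a, b = {1, 2, 3, 4}, {3, 4, 5}
for name in ("tanimoto", "dice", "cosine"):
    print("%-8s %.3f" % (name, similarity(a, b, name)))
```

The same pair of fingerprints yields different values under each measure, so scores are only comparable when the same measure is used throughout a search.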
There is also a facility to use user-defined similarity measures when searching a fingerprint database. The following example shows how a similarity calculation can be implemented by deriving from the OESimFuncBase class.
Formula: \(Sim_{Simpson}(A,B) = \frac{bothAB}{\min((onlyA + bothAB),(onlyB + bothAB))}\)
class SimpsonSimFunc(oegraphsim.OESimFuncBase):

    def __call__(self, fpA, fpB):
        onlyA, onlyB, bothAB, neitherAB = oechem.OEGetBitCounts(fpA, fpB)
        # no differing bits: the fingerprints are identical
        if onlyA + onlyB == 0:
            return 1.0
        # no common bits: nothing to divide
        if bothAB == 0:
            return 0.0
        sim = float(bothAB)
        sim /= min(float(onlyA + bothAB), float(onlyB + bothAB))
        return sim

    def GetSimTypeString(self):
        return "Simpson"

    def CreateCopy(self):
        return SimpsonSimFunc().__disown__()
After implementing the similarity calculation, it can be added to an OEFPDatabase object. From then on, this new similarity calculation is used for searches.
fpdb = oegraphsim.OEFPDatabase(oegraphsim.OEFPType_Path)
fpdb.SetSimFunc(SimpsonSimFunc())
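The arithmetic of the Simpson measure can be checked by hand with a toolkit-free sketch. The function below mirrors the branches of `SimpsonSimFunc.__call__` but works on plain sets of on-bit indices; the name `simpson` and the toy bit sets are hypothetical.

```python
def simpson(bits_a, bits_b):
    """Simpson similarity: bothAB / min(onlyA + bothAB, onlyB + bothAB)."""
    only_a = len(bits_a - bits_b)
    only_b = len(bits_b - bits_a)
    both = len(bits_a & bits_b)
    if only_a + only_b == 0:
        return 1.0                   # no differing bits
    if both == 0:
        return 0.0                   # no common bits
    return both / min(only_a + both, only_b + both)


partial = simpson({1, 2, 3, 4}, {3, 4, 5})   # 2 shared bits, smaller set has 3
subset = simpson({1, 2}, {1, 2, 3, 4})       # smaller set fully contained
disjoint = simpson({1, 2}, {3, 4})           # nothing in common
print("%.3f %.3f %.3f" % (partial, subset, disjoint))
```

Because the denominator is the on-bit count of the smaller fingerprint, a fingerprint that is fully contained in another scores 1.0, which is the main practical difference from Tanimoto.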
See also
The User-defined Similarity Measures section