Similarity Searching of Large Molecule Files
Problem
You want to perform a 2D similarity search on large molecule files.
See also
If you have access to OpenEye toolkits version 2017.Jun or later, please use the example in Rapid Similarity Searching of Large Molecule Files.
Solution
In order to solve this problem two databases are utilized:
The OEMolDatabase class of OEChem TK is used to provide fast read-only random access to molecular file formats that are readable by OEChem TK. This class is designed to handle large molecule files that cannot be held in memory all at once.
The OEFPDatabase class of GraphSim TK is used to perform rapid in-memory fingerprint searches.
See also
The Manipulating Large Molecule Files recipe for the introduction to the OEMolDatabase class.
This recipe discusses three code examples: pre-generating fingerprints, creating a molecule database index file, and searching fingerprints.
Pre-generating Fingerprints
To perform a 2D similarity search, fingerprints must first be generated so that they can be compared to assess molecular similarity. Generating fingerprints for a large molecule file can be time-consuming since it involves:
either performing a large number of substructure searches (in the case of the MACCS key fingerprint)
or exhaustively enumerating certain sub-graphs of a molecule (in the case of path, tree or circular fingerprints)
However, fingerprints can be pre-generated and saved into an .oeb file for later use, thereby eliminating the time-consuming on-the-fly generation of the fingerprints when performing similarity searches.
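To give a feel for why path-style fingerprint generation is costly, here is a minimal pure-Python sketch of the general idea: enumerate every simple path up to a given length in a molecular graph and hash each path label into a bit position. This is a toy illustration only, not how GraphSim TK builds its fingerprints; the names `ATOMS`, `BONDS`, `enumerate_paths` and `path_fingerprint`, the CRC32 hash and the 64-bit size are all assumptions made for the example.

```python
import zlib
from typing import Dict, List, Set

# Toy molecular graph for ethanol (C-C-O), hydrogens omitted.
# Hypothetical data structures chosen only for this illustration.
ATOMS: Dict[int, str] = {0: "C", 1: "C", 2: "O"}
BONDS: Dict[int, List[int]] = {0: [1], 1: [0, 2], 2: [1]}


def enumerate_paths(atoms: Dict[int, str],
                    bonds: Dict[int, List[int]],
                    max_bonds: int) -> Set[str]:
    """Return the labels of all simple paths containing up to max_bonds bonds."""
    labels: Set[str] = set()

    def extend(path: List[int]) -> None:
        forward = "-".join(atoms[i] for i in path)
        backward = "-".join(atoms[i] for i in reversed(path))
        # A path and its reverse are the same fragment: keep one canonical label.
        labels.add(min(forward, backward))
        if len(path) - 1 == max_bonds:
            return
        for nbr in bonds[path[-1]]:
            if nbr not in path:  # simple paths only, never revisit an atom
                extend(path + [nbr])

    for start in atoms:
        extend([start])
    return labels


def path_fingerprint(atoms: Dict[int, str],
                     bonds: Dict[int, List[int]],
                     max_bonds: int = 5,
                     nbits: int = 64) -> Set[int]:
    """Hash every path label into one of nbits positions; return the set bits."""
    return {zlib.crc32(label.encode("ascii")) % nbits
            for label in enumerate_paths(atoms, bonds, max_bonds)}


if __name__ == "__main__":
    print(sorted(enumerate_paths(ATOMS, BONDS, 5)))
    print(sorted(path_fingerprint(ATOMS, BONDS)))
```

Even in this toy form the number of paths grows quickly with molecule size and path length, which is the reason for computing the fingerprints once and storing them.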
The code below shows how to loop over the molecules of an input file stream, generate all four built-in fingerprint types, attach them to the molecule as generic data, and write the molecule into the output stream. If the output stream is an .oeb file, the fingerprints attached to the molecule are written to the binary file as well. This provides a convenient way to transfer the pre-generated fingerprints along with the molecules from one application to another.
from openeye import oechem
from openeye import oegraphsim


def GenerateFingerPrints(ifs, ofs):

    # progress reporting while processing a large file
    dots = oechem.OEDots(100000, 1000, "molecules")

    # the four built-in fingerprint types and the tags they are stored under
    fptypes = [(oegraphsim.OEFPType_Circular, "circular"),
               (oegraphsim.OEFPType_Tree, "tree"),
               (oegraphsim.OEFPType_Path, "path"),
               (oegraphsim.OEFPType_MACCS166, "maccs")]

    fp = oegraphsim.OEFingerPrint()
    for mol in ifs.GetOEGraphMols():
        dots.Update()
        for (fptype, fptag) in fptypes:
            # generate the fingerprint and attach it as generic data
            oegraphsim.OEMakeFP(fp, mol, fptype)
            mol.SetData(fptag, fp)
        oechem.OEWriteMolecule(ofs, mol)

    dots.Total()
Usage:
prompt > python3 fp2oeb.py drugs.sdf drugs-fp.oeb
Creating Molecule Database Index File
The next step is to generate an index file that will accelerate the process of loading the .oeb molecule file that is generated by the fp2oeb.py script above.
See also
See more details in the Creating Molecule Database Index File section.
Usage:
prompt > python3 moldb_create.py drugs-fp.oeb
Running the above command will generate the drugs-fp.oeb.idx molecule database index file.
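The reason an index file speeds up loading can be illustrated with a small pure-Python sketch: scan the file once, record the byte offset at which each record starts, and later seek directly to any record. This is only a conceptual analogy using SD-file style `$$$$` delimiters; it does not reproduce the actual .idx format written by OEChem TK, and the function names `build_offset_index`, `read_record` and `demo` are hypothetical.

```python
import os
import tempfile
from typing import List, Tuple

RECORD_END = b"$$$$\n"  # SD-file record delimiter


def build_offset_index(path: str) -> List[int]:
    """Scan the file once and return the byte offset where each record starts."""
    offsets: List[int] = []
    with open(path, "rb") as fh:
        pos = fh.tell()
        at_record_start = True
        for line in iter(fh.readline, b""):
            if at_record_start:
                offsets.append(pos)
                at_record_start = False
            if line == RECORD_END:
                at_record_start = True
            pos = fh.tell()
    return offsets


def read_record(path: str, offsets: List[int], idx: int) -> bytes:
    """Jump straight to record idx without reading the records before it."""
    chunks: List[bytes] = []
    with open(path, "rb") as fh:
        fh.seek(offsets[idx])
        for line in iter(fh.readline, b""):
            chunks.append(line)
            if line == RECORD_END:
                break
    return b"".join(chunks)


def demo() -> Tuple[List[int], bytes]:
    """Index a tiny three-record file and fetch the last record directly."""
    data = b"mol1\nfoo\n$$$$\nmol2\n$$$$\nmol3\nbar\nbaz\n$$$$\n"
    with tempfile.NamedTemporaryFile(delete=False, suffix=".sdf") as tmp:
        tmp.write(data)
    try:
        offsets = build_offset_index(tmp.name)
        return offsets, read_record(tmp.name, offsets, 2)
    finally:
        os.remove(tmp.name)


if __name__ == "__main__":
    print(demo())
```

The one-time scan is the expensive part; once the offsets are saved next to the molecule file, opening the database no longer requires parsing every record.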
Searching Fingerprints
The last step is to perform the similarity search, i.e., to find the molecules with the highest fingerprint similarity to a query molecule. In the GetSimilarMolecules function, a fingerprint is first generated for the query molecule. The target molecules are then loaded into the OEMolDatabase object. By looping over the molecules stored in the molecule database, the fingerprints that have been attached to the molecules as generic data are retrieved and inserted into the OEFPDatabase object. If a molecule has no fingerprint attached, a warning is issued and the fingerprint is generated on the fly.
Calling the OEFPDatabase.GetSortedScores method returns an iterator over the calculated similarity scores in sorted order.
from openeye import oechem
from openeye import oegraphsim


def GetSimilarMolecules(qmol, ifs, ofs, fptype, fptag, numhits):

    # generate the query fingerprint
    qfp = oegraphsim.OEFingerPrint()
    oegraphsim.OEMakeFP(qfp, qmol, fptype)

    # random-access view of the target molecule file
    moldb = oechem.OEMolDatabase(ifs)

    # in-memory fingerprint database of the requested type
    fpdb = oegraphsim.OEFPDatabase(fptype)

    # maps fingerprint index -> molecule index; the two diverge if a
    # molecule cannot be read and is skipped below
    molidxs = []

    mol = oechem.OEGraphMol()
    for idx in range(moldb.GetMaxMolIdx()):
        if not moldb.GetMolecule(mol, idx):
            continue
        if mol.HasData(fptag):
            # use the pre-generated fingerprint stored as generic data
            tfp = mol.GetData(fptag)
            fpdb.AddFP(tfp)
        else:
            # no stored fingerprint: generate it on the fly
            oechem.OEThrow.Warning("Unable to access fingerprint for %s" % mol.GetTitle())
            fpdb.AddFP(mol)
        molidxs.append(idx)

    sdtag = "%s-fingerprint-score" % fptag

    hit = oechem.OEGraphMol()
    scores = fpdb.GetSortedScores(qfp, numhits)
    for s in scores:
        if moldb.GetMolecule(hit, molidxs[s.GetIdx()]):
            oechem.OESetSDData(hit, sdtag, str(s.GetScore()))
            oechem.OEWriteMolecule(ofs, hit)
Usage:
prompt > python3 moldb_simcalc.py -query .ism -target drugs-fp.oeb -out .ism -fingerprint tree
c1ncccc1
Running the above command will generate the following output:
CC(=O)Oc1ccccc1C(=O)O acetsali
Cc1c(c(=O)n(n1C)c2ccccc2)N(C)C aminopy
CC(C)NCC(COc1ccccc1CC=C)O alprenol
Cn1cnc2c1c(=O)n(c(=O)n2C)C caffeine
c1nc2c(=O)[nH]c(nc2n1COCCO)N acyclovi
CC(C)NCC(COc1ccc(cc1)CC(=O)N)O atenolol
If the output file is an .sdf molecule file, then the similarity score will be attached to each molecule with the <tree-fingerprint-score> tag.
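Conceptually, a sorted fingerprint search scores the query against every stored fingerprint and keeps the best few hits. The following pure-Python sketch shows that idea with fingerprints modeled as sets of "on" bit positions and the Tanimoto coefficient as the measure. It is an illustration of the technique rather than the GraphSim TK implementation; `tanimoto` and `sorted_scores` are hypothetical names, and returning 0.0 for two empty fingerprints is an assumption of this sketch.

```python
import heapq
from typing import List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by on-bits in either set."""
    union = len(a | b)
    # Assumption of this sketch: two empty fingerprints score 0.0.
    return len(a & b) / union if union else 0.0


def sorted_scores(query: Set[int],
                  database: List[Set[int]],
                  numhits: int) -> List[Tuple[int, float]]:
    """Return up to numhits (index, score) pairs, highest score first."""
    scored = ((tanimoto(query, fp), idx) for idx, fp in enumerate(database))
    best = heapq.nlargest(numhits, scored, key=lambda pair: pair[0])
    return [(idx, score) for score, idx in best]


if __name__ == "__main__":
    query = {1, 2, 3, 4}
    database = [{1, 2, 3, 4}, {1, 2}, {5, 6}, {1, 2, 3}]
    for idx, score in sorted_scores(query, database, 2):
        print(idx, score)
```

The returned index plays the same role as the score index in the listing above: it identifies which stored fingerprint produced the score, so the matching molecule can be fetched from the molecule database.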
Discussion
The OEFPDatabase class provides several ways to search fingerprints:
By default, OETanimoto calculation is used to quantify the degree of resemblance between two fingerprints. However, other built-in similarity coefficients (Cosine, Dice, Euclidean, Manhattan, Tversky) and user-defined similarity measures can also be used.
By default, the OEFPDatabase.GetSortedScores method returns similarity scores in descending order i.e. it identifies molecules that are most similar to the query. However, the order can be reversed to identify molecules that are most dissimilar to the query.
It allows searching the entire fingerprint database or only a segment of the database.
It allows setting a cut-off value for the similarity scores returned.
It provides sorted (OEFPDatabase.GetSortedScores) and un-sorted (OEFPDatabase.GetScores) access to the fingerprint similarities.
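The search options listed above can be made concrete with a small pure-Python sketch in which fingerprints are sets of "on" bit positions. It shows textbook definitions of a few of the named coefficients together with a search function that accepts a measure, a cut-off, a sort direction and a database segment. None of this is the GraphSim TK API: the function names, the parameter names and the cut-off semantics (keep scores at or above the cut-off when descending, at or below it when ascending) are assumptions chosen for the illustration.

```python
import math
from typing import Callable, List, Optional, Set, Tuple

Measure = Callable[[Set[int], Set[int]], float]


def tanimoto(a: Set[int], b: Set[int]) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a: Set[int], b: Set[int]) -> float:
    total = len(a) + len(b)
    return 2.0 * len(a & b) / total if total else 0.0


def cosine(a: Set[int], b: Set[int]) -> float:
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def tversky(a: Set[int], b: Set[int], alpha: float, beta: float) -> float:
    both = len(a & b)
    denom = alpha * len(a - b) + beta * len(b - a) + both
    return both / denom if denom else 0.0


def search(query: Set[int],
           database: List[Set[int]],
           measure: Measure = tanimoto,
           cutoff: Optional[float] = None,
           descending: bool = True,
           begin: int = 0,
           end: Optional[int] = None) -> List[Tuple[int, float]]:
    """Score database[begin:end] against the query and return sorted hits."""
    stop = len(database) if end is None else end
    hits: List[Tuple[int, float]] = []
    for idx in range(begin, stop):
        score = measure(query, database[idx])
        if cutoff is not None:
            # Assumed semantics: the cut-off bounds the scores in the
            # direction of the search (minimum if descending, maximum if not).
            if descending and score < cutoff:
                continue
            if not descending and score > cutoff:
                continue
        hits.append((idx, score))
    hits.sort(key=lambda hit: hit[1], reverse=descending)
    return hits


if __name__ == "__main__":
    query = {1, 2, 3, 4}
    database = [{1, 2, 3, 4}, {1, 2}, {5, 6}, {1, 2, 3}]
    print(search(query, database, cutoff=0.5))
    print(search(query, database, descending=False, cutoff=0.5))
    print(search(query, database, measure=dice, begin=1, end=3))
```

With alpha and beta both set to 1 the Tversky form reduces to Tanimoto, and with both set to 0.5 it reduces to Dice, which is a quick sanity check on the definitions.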
See also in OEChem TK manual
Theory
Molecular Database Handling chapter
Generic Data chapter
API
OECreateMolDatabaseIdx function
OEMolDatabase class
See also in GraphSim TK manual
Theory
Fingerprint Generation chapter
API
OEFingerPrint class
OEFPDatabase class
OEMakeFP function