Similarity Searching of Large Molecule Files
You want to perform a 2D similarity search on large molecule files.
If you have access to OpenEye toolkits version 2017.Jun or later, please use the example in Rapid Similarity Searching of Large Molecule Files instead.
To solve this problem, two databases are used:
- The OEMolDatabase class of OEChem TK is used to provide fast read-only random access to molecular file formats that are readable by OEChem TK. This class is designed to handle large molecule files that cannot be held in memory all at once.
- The OEFPDatabase class of GraphSim TK is used to perform rapid in-memory fingerprint searches.
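This recipe discusses three code examples: generating a fingerprint file (fp2oeb.py), creating a molecule database index file (moldb_create.py), and searching the indexed file for similar molecules (moldb_simcalc.py).
To perform a 2D similarity search, fingerprints must first be generated so that they can be compared to assess molecular similarity. Generating fingerprints for a large molecule file can be time-consuming since it involves: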
- either performing a large number of substructure searches (in the case of the MACCS key fingerprint)
- or exhaustively enumerating certain sub-graphs of a molecule (in the case of path, tree or circular fingerprints)
However, fingerprints can be pre-generated and saved into an .oeb file for later use, thereby eliminating the time-consuming on-the-fly generation of fingerprints when performing similarity searches.
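To give a feel for why enumeration-based fingerprints are expensive to generate, here is a toy sketch in plain Python. It is not the GraphSim TK algorithm: it assumes a molecule is given as a list of element symbols plus a list of bonded atom-index pairs, and the names `path_fingerprint`, `max_bonds`, and `nbits` are hypothetical. Every simple path up to a fixed length is enumerated and hashed to a bit position, which is the kind of work that pre-generation avoids repeating at search time.

```python
import zlib


def path_fingerprint(atoms, bonds, max_bonds=3, nbits=1024):
    """Toy path fingerprint: hash every simple path of up to max_bonds bonds.

    atoms is a list of element symbols, bonds a list of (i, j) index pairs.
    The result is a Python integer used as a bitset of width nbits.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    bits = 0

    def extend(path):
        nonlocal bits
        forward = "-".join(atoms[i] for i in path)
        backward = "-".join(atoms[i] for i in reversed(path))
        # A path and its reverse describe the same fragment.
        canonical = min(forward, backward)
        # crc32 is used because it is stable between interpreter runs.
        bits |= 1 << (zlib.crc32(canonical.encode()) % nbits)
        if len(path) - 1 < max_bonds:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return bits


# Ethanol as a heavy-atom graph: C-C-O
ethanol = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

The number of paths grows quickly with molecule size and `max_bonds`, so computing the bitset once and storing it next to the molecule pays off when the same file is searched repeatedly.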
The code below shows how to loop over the molecules of an input file stream, generate all four built-in fingerprint types, attach them to the molecule as generic data, and write the molecule into the output stream. If the output stream is an .oeb file, the fingerprints attached to the molecule are written to the binary file as well. This provides a convenient way to transfer the pre-generated fingerprints along with the molecules from one application to another.
from openeye import oechem, oegraphsim


def GenerateFingerPrints(ifs, ofs):
    # Report progress while looping over a large input file.
    dots = oechem.OEDots(100000, 1000, "molecules")

    # The four built-in fingerprint types and the generic data tag used for each.
    fptypes = [(oegraphsim.OEFPType_Circular, "circular"),
               (oegraphsim.OEFPType_Tree, "tree"),
               (oegraphsim.OEFPType_Path, "path"),
               (oegraphsim.OEFPType_MACCS166, "maccs")]

    fp = oegraphsim.OEFingerPrint()
    for mol in ifs.GetOEGraphMols():
        dots.Update()
        for (fptype, fptag) in fptypes:
            oegraphsim.OEMakeFP(fp, mol, fptype)
            # Attach the fingerprint to the molecule as generic data.
            mol.SetData(fptag, fp)
        # The attached fingerprints are only preserved in .oeb output.
        oechem.OEWriteMolecule(ofs, mol)

    dots.Total()
prompt > python3 fp2oeb.py drugs.sdf drugs-fp.oeb
Creating Molecule Database Index File
The next step is to generate an index file that will accelerate the process of loading the .oeb molecule file that is generated by the fp2oeb.py script above.
See more details in the Creating Molecule Database Index File section.
prompt > python3 moldb_create.py drugs-fp.oeb
Running the above command will generate the drugs-fp.oeb.idx molecule database index file.
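Conceptually, an index of this kind lets a reader jump straight to a given record instead of parsing everything that precedes it. The sketch below illustrates the idea with byte offsets into an SD-like text stream. It is not the actual .idx file format, and `build_offset_index` and `read_record` are hypothetical helper names.

```python
import io


def build_offset_index(stream, delimiter=b"$$$$\n"):
    """Scan the stream once and return the byte offset where each record starts."""
    offsets = []
    pos = stream.tell()
    start = pos
    for line in stream:
        pos += len(line)
        if line == delimiter:
            # A delimiter closes the current record; the next one starts after it.
            offsets.append(start)
            start = pos
    return offsets


def read_record(stream, offsets, idx, delimiter=b"$$$$\n"):
    """Jump directly to record idx without reading the records before it."""
    stream.seek(offsets[idx])
    lines = []
    for line in stream:
        if line == delimiter:
            break
        lines.append(line)
    return b"".join(lines)


# Hypothetical three-record file held in memory for the example.
data = io.BytesIO(b"mol A\n$$$$\nmol B\nline 2\n$$$$\nmol C\n$$$$\n")
offsets = build_offset_index(data)
third = read_record(data, offsets, 2)
```

The one-time scan is the expensive part. Once the offsets are stored, any record can be fetched with a single seek, which is what makes random access to a large file practical.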
The last step is to perform the similarity search, i.e., to find the molecules with the highest fingerprint similarity to a query molecule. In the GetSimilarMolecules function, a fingerprint is first generated for the query molecule. The target molecules are then loaded into the OEMolDatabase object. By looping over the molecules stored in the molecule database, the fingerprints that have been attached to the molecules as generic data are retrieved and inserted into the OEFPDatabase object.
Calling the OEFPDatabase.GetSortedScores method returns an iterator over the calculated similarity scores in sorted order.
from openeye import oechem, oegraphsim


def GetSimilarMolecules(qmol, ifs, ofs, fptype, fptag, numhits):
    # Generate the fingerprint of the query molecule.
    qfp = oegraphsim.OEFingerPrint()
    oegraphsim.OEMakeFP(qfp, qmol, fptype)

    # Open the target molecule file for random access.
    moldb = oechem.OEMolDatabase(ifs)

    fpdb = oegraphsim.OEFPDatabase(fptype)

    # Fill the fingerprint database in molecule order. The score index matches
    # the molecule index only as long as no molecule is skipped in this loop.
    mol = oechem.OEGraphMol()
    for idx in range(moldb.GetMaxMolIdx()):
        if not moldb.GetMolecule(mol, idx):
            continue
        if mol.HasData(fptag):
            # Reuse the pre-generated fingerprint stored as generic data.
            tfp = mol.GetData(fptag)
            fpdb.AddFP(tfp)
        else:
            # Fall back to generating the fingerprint on the fly.
            oechem.OEThrow.Warning("Unable to access fingerprint for %s" % mol.GetTitle())
            fpdb.AddFP(mol)

    # Write the best hits, each annotated with its similarity score.
    sdtag = "%s-fingerprint-score" % fptag
    hit = oechem.OEGraphMol()
    scores = fpdb.GetSortedScores(qfp, numhits)
    for s in scores:
        if moldb.GetMolecule(hit, s.GetIdx()):
            oechem.OESetSDData(hit, sdtag, str(s.GetScore()))
            oechem.OEWriteMolecule(ofs, hit)
prompt > python3 moldb_simcalc.py -query .ism -target drugs-fp.oeb -out .ism -fingerprint tree c1ncccc1
Running the above command will generate the following output:
CC(=O)Oc1ccccc1C(=O)O acetsali
Cc1c(c(=O)n(n1C)c2ccccc2)N(C)C aminopy
CC(C)NCC(COc1ccccc1CC=C)O alprenol
Cn1cnc2c1c(=O)n(c(=O)n2C)C caffeine
c1nc2c(=O)[nH]c(nc2n1COCCO)N acyclovi
CC(C)NCC(COc1ccc(cc1)CC(=O)N)O atenolol
If the output file is a .sdf molecule file, then the similarity score will be attached to each molecule with the <tree-fingerprint-score> tag.
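The flow of GetSimilarMolecules can be mimicked in plain Python to make the index bookkeeping explicit. This sketch assumes fingerprints are stored as Python integers used as bitsets. `tanimoto` and `top_hits` are hypothetical helpers, not GraphSim TK calls, and the data is made up. The point is that each score carries the position of its fingerprint, which is then used to fetch the matching molecule.

```python
import heapq


def tanimoto(a, b):
    """Tanimoto similarity of two bitsets stored as Python integers."""
    both = bin(a & b).count("1")
    either = bin(a | b).count("1")
    return both / either if either else 0.0


def top_hits(query_fp, target_fps, numhits):
    """Return (index, score) pairs for the numhits most similar targets."""
    scored = ((tanimoto(query_fp, fp), idx) for idx, fp in enumerate(target_fps))
    return [(idx, score) for score, idx in heapq.nlargest(numhits, scored)]


# Hypothetical data: titles play the role of the molecule database,
# fingerprints the role of the in-memory fingerprint database.
titles = ["mol-a", "mol-b", "mol-c", "mol-d"]
fingerprints = [0b1111, 0b0011, 0b1100, 0b0111]

hits = top_hits(0b0011, fingerprints, numhits=2)
hit_titles = [titles[idx] for idx, _ in hits]
```

Only the fingerprints need to be held in memory for scoring. The molecules themselves are fetched by index for the few hits that are actually written out.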
The OEFPDatabase class provides several ways to search fingerprints:
- By default, OETanimoto calculation is used to quantify the degree of resemblance between two fingerprints. However, other built-in similarity coefficients (Cosine, Dice, Euclidean, Manhattan, Tversky) and user-defined similarity measures can also be used.
- By default, the OEFPDatabase.GetSortedScores method returns similarity scores in descending order i.e. it identifies molecules that are most similar to the query. However, the order can be reversed to identify molecules that are most dissimilar to the query.
- It allows searching the entire fingerprint database or only a segment of the database.
- It allows setting a cut-off value for the similarity scores returned.
- It provides sorted (OEFPDatabase.GetSortedScores) and unsorted (OEFPDatabase.GetScores) access to the fingerprint similarities.
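To make the search options listed above more concrete, here is a plain-Python sketch of some of the similarity coefficients and of a search with a cut-off, a result limit, and a reversible sort order. It is only an illustration: fingerprints are modeled as sets of "on" bit positions, all function and parameter names are hypothetical, the cut-off is assumed to be a minimum score, and the Euclidean and Manhattan measures are left out because their exact GraphSim TK definitions are not given here.

```python
import math


def _counts(a, b):
    """Bits set in a, bits set in b, and bits set in both."""
    return len(a), len(b), len(a & b)


def tanimoto(a, b):
    na, nb, nc = _counts(a, b)
    return nc / (na + nb - nc) if (na or nb) else 0.0


def dice(a, b):
    na, nb, nc = _counts(a, b)
    return 2 * nc / (na + nb) if (na or nb) else 0.0


def cosine(a, b):
    na, nb, nc = _counts(a, b)
    return nc / math.sqrt(na * nb) if (na and nb) else 0.0


def tversky(a, b, alpha=1.0, beta=1.0):
    na, nb, nc = _counts(a, b)
    denom = alpha * (na - nc) + beta * (nb - nc) + nc
    return nc / denom if denom else 0.0


def search(query, targets, simfunc=tanimoto, cutoff=0.0, descending=True, limit=None):
    """Score every target, drop scores below cutoff, sort, and truncate."""
    scores = [(idx, simfunc(query, fp)) for idx, fp in enumerate(targets)]
    scores = [(idx, s) for idx, s in scores if s >= cutoff]
    scores.sort(key=lambda pair: pair[1], reverse=descending)
    return scores[:limit]


query = {1, 2, 3, 4}
targets = [{1, 2, 3, 4}, {3, 4, 5, 6}, {7, 8}]
most_similar = search(query, targets, simfunc=dice, limit=2)
most_dissimilar = search(query, targets, descending=False, limit=1)
```

With `alpha` and `beta` both set to 1 the Tversky formula reduces to Tanimoto, and with both set to 0.5 it reduces to Dice, which is a quick sanity check for any implementation.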
See also in OEChem TK manual