You want to perform 2D similarity search on large molecule files.
If you have access to OpenEye toolkits version 2017.Jun or later, please use the example in Rapid Similarity Searching of Large Molecule Files instead.
To solve this problem, two databases are utilized:
This recipe discusses three code examples:
To perform a 2D similarity search, fingerprints have to be generated that can be compared to assess molecular similarity. Generating fingerprints for a large molecule file can be time-consuming since it involves:
However, fingerprints can be pre-generated and saved into an .oeb file for later use, thereby eliminating the time-consuming on-the-fly generation of fingerprints when performing similarity searches.
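To make "comparing fingerprints" concrete, here is a minimal pure-Python sketch of the Tanimoto coefficient, a widely used measure for bit fingerprints. It is a conceptual illustration only: the fingerprints are modeled as plain sets of on-bit positions, the bit values are made up, and the function name tanimoto is hypothetical rather than part of any toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen in this sketch for two empty fingerprints
    shared = len(fp_a & fp_b)
    # |A and B| / |A or B|, with the union computed from the set sizes
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical on-bit positions, not produced by any toolkit
query = {1, 4, 7, 9}
target = {1, 4, 8, 9, 12}
print(tanimoto(query, target))
```

Computing the score is cheap once the bits exist; the expensive part is producing the bits from the molecule, which is why storing them alongside the molecules pays off.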
The code below shows how to loop over the molecules of an input file stream, generate all four built-in fingerprint types, attach them to the molecule as generic data, and write the molecule into the output stream. If the output stream is an .oeb file, the fingerprints attached to the molecule are written to the binary file as well. This provides a convenient way to transfer the pre-generated fingerprints along with the molecules from one application to another.
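from openeye import oechem
from openeye import oegraphsim


def GenerateFingerPrints(ifs, ofs):

    # progress reporter for long-running input files
    dots = oechem.OEDots(100000, 1000, "molecules")

    # each built-in fingerprint type is stored under its own generic data tag
    fptypes = [(oegraphsim.OEFPType_Circular, "circular"),
               (oegraphsim.OEFPType_Tree, "tree"),
               (oegraphsim.OEFPType_Path, "path"),
               (oegraphsim.OEFPType_MACCS166, "maccs")]

    fp = oegraphsim.OEFingerPrint()
    for mol in ifs.GetOEGraphMols():
        dots.Update()

        for (fptype, fptag) in fptypes:
            oegraphsim.OEMakeFP(fp, mol, fptype)
            mol.SetData(fptag, fp)

        # generic data is only preserved when the output format is .oeb
        oechem.OEWriteMolecule(ofs, mol)

    dots.Total()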
prompt > python3 fp2oeb.py drugs.sdf drugs-fp.oeb
The next step is to generate an index file that will accelerate access to the .oeb molecule file generated by fp2oeb.py. See the Creating Molecule Database Index File section for more details.
prompt > python3 moldb_create.py drugs-fp.oeb
Running the above command will generate the database index file.
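Why an index speeds things up can be shown with a small pure-Python sketch. It is not the toolkit's index format: it assumes a hypothetical line-oriented file with one record per line, and the helper names build_offset_index and read_record are made up for this illustration. The idea is to record once where each record starts, so that any record can later be fetched with a single seek instead of a scan from the top of the file.

```python
import io


def build_offset_index(stream):
    """Return the byte offset at which each line-oriented record starts."""
    offsets = []
    stream.seek(0)
    while True:
        pos = stream.tell()
        if not stream.readline():
            break  # end of file reached
        offsets.append(pos)
    return offsets


def read_record(stream, offsets, idx):
    """Jump straight to record idx instead of scanning from the beginning."""
    stream.seek(offsets[idx])
    return stream.readline().decode().rstrip("\n")


# Hypothetical three-record SMILES file held in memory
data = io.BytesIO(b"c1ccccc1 benzene\nCCO ethanol\nc1ncccc1 pyridine\n")
index = build_offset_index(data)
print(read_record(data, index, 2))
```

Building the index costs one pass over the file, after which retrieving the top hits of a search touches only the records that are actually needed.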
The last step is to perform the similarity search, i.e., to find the molecules with the highest fingerprint similarity to a query molecule. In the GetSimilarMolecules function, a fingerprint is first generated for the query molecule. The target molecules are then loaded into an OEMolDatabase object. By looping over the molecules stored in the molecule database, the fingerprints that have been attached to the molecules as generic data are retrieved and inserted into the OEFPDatabase object. If a molecule carries no pre-generated fingerprint under the requested tag, a warning is issued and its fingerprint is generated on the fly.
Calling the OEFPDatabase.GetSortedScores method returns an iterator over the calculated similarity scores in sorted order.
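from openeye import oechem
from openeye import oegraphsim


def GetSimilarMolecules(qmol, ifs, ofs, fptype, fptag, numhits):

    # fingerprint of the query molecule
    qfp = oegraphsim.OEFingerPrint()
    oegraphsim.OEMakeFP(qfp, qmol, fptype)

    moldb = oechem.OEMolDatabase(ifs)

    fpdb = oegraphsim.OEFPDatabase(fptype)

    mol = oechem.OEGraphMol()
    for idx in range(moldb.GetMaxMolIdx()):
        # note: a skipped record shifts later fingerprint indices
        # relative to the molecule database indices
        if not moldb.GetMolecule(mol, idx):
            continue
        if mol.HasData(fptag):
            # reuse the pre-generated fingerprint stored as generic data
            tfp = mol.GetData(fptag)
            fpdb.AddFP(tfp)
        else:
            # fall back to generating the fingerprint on the fly
            oechem.OEThrow.Warning("Unable to access fingerprint for %s" % mol.GetTitle())
            fpdb.AddFP(mol)

    sdtag = "%s-fingerprint-score" % fptag
    hit = oechem.OEGraphMol()
    scores = fpdb.GetSortedScores(qfp, numhits)
    for s in scores:
        # fetch each hit from the molecule database by its index
        if moldb.GetMolecule(hit, s.GetIdx()):
            oechem.OESetSDData(hit, sdtag, str(s.GetScore()))
            oechem.OEWriteMolecule(ofs, hit)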
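The ranking step can be pictured with a short pure-Python sketch. It is a conceptual stand-in, not the toolkit's implementation: fingerprints are modeled as sets of on-bit positions, Tanimoto is assumed as the similarity measure, and the function name sorted_scores and the sample data are hypothetical. Unreadable records are kept as None placeholders so that every position in the fingerprint list still matches the position of its molecule in the file.

```python
import heapq


def sorted_scores(query, fingerprints, numhits):
    """Return the numhits best (index, score) pairs, highest score first.

    fingerprints is a list of on-bit sets; None marks a record that could
    not be read, so list positions stay aligned with the molecule file.
    """
    def tanimoto(a, b):
        shared = len(a & b)
        union = len(a) + len(b) - shared
        return shared / union if union else 0.0

    # score only the readable records, but keep their original indices
    scored = ((idx, tanimoto(query, fp))
              for idx, fp in enumerate(fingerprints) if fp is not None)
    return heapq.nlargest(numhits, scored, key=lambda pair: pair[1])


# Hypothetical fingerprints; index 1 stands for an unreadable record
targets = [{1, 2, 3}, None, {1, 2, 3, 4}, {7, 8}]
hits = sorted_scores({1, 2, 3, 4}, targets, numhits=2)
print(hits)
```

Using heapq.nlargest keeps only numhits candidates in memory at a time, which matters when the target file is large and only a handful of hits are wanted.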
prompt > python3 moldb_simcalc.py -query .ism -target drugs-fp.oeb -out .ism -fingerprint tree c1ncccc1
Running the above command will generate the following output:
CC(=O)Oc1ccccc1C(=O)O acetsali
Cc1c(c(=O)n(n1C)c2ccccc2)N(C)C aminopy
CC(C)NCC(COc1ccccc1CC=C)O alprenol
Cn1cnc2c1c(=O)n(c(=O)n2C)C caffeine
c1nc2c(=O)[nH]c(nc2n1COCCO)N acyclovi
CC(C)NCC(COc1ccc(cc1)CC(=O)N)O atenolol
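If the output file is an .sdf molecule file, then the similarity score will be attached to each molecule as SD data under the tag built in GetSimilarMolecules (the fingerprint tag followed by -fingerprint-score).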
The OEFPDatabase class provides several ways to search fingerprints: