Similarity Searching of Large Molecule Files


You want to perform a 2D similarity search on large molecule files.

See also

If you have access to OpenEye toolkits version 2017.Jun or later, please use the example in Rapid Similarity Searching of Large Molecule Files instead.


Difficulty Level

[Difficulty rating: two chili pepper icons]


Two databases are used to solve this problem:

  • The OEMolDatabase class of OEChem TK is used to provide fast, read-only random access to molecular file formats that are readable by OEChem TK. This class is designed to handle large molecule files that cannot be held in memory all at once.

  • The OEFPDatabase class of GraphSim TK is used to perform rapid in-memory fingerprint searches.

This recipe discusses three code examples: pre-generating fingerprints, creating a molecule database index file, and searching fingerprints.

Pre-generating Fingerprints

In order to perform a 2D similarity search fingerprints have to be generated that can be compared to assess molecular similarity. Generating fingerprints for a large molecule file can be time-consuming since it involves:

  • either performing a large number of substructure searches (in the case of the MACCS key fingerprint)

  • or exhaustively enumerating certain sub-graphs of a molecule (in the case of path, tree or circular fingerprints)

However, fingerprints can be pre-generated and saved into an .oeb file for later use, thereby eliminating the time-consuming on-the-fly generation of fingerprints when performing similarity searches.
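To make the cost of sub-graph enumeration concrete, here is a minimal pure-Python sketch of a hashed path fingerprint. It is only an illustration under simplifying assumptions: the function name path_fingerprint, the atoms/bonds input layout, the CRC32 hashing, and the 64-bit size are all hypothetical and are not how GraphSim TK builds its path fingerprints.

```python
import zlib


def path_fingerprint(atoms, bonds, max_bonds=3, nbits=64):
    """Toy hashed path fingerprint (illustration only, not the GraphSim TK algorithm).

    atoms: list of element symbols; bonds: list of (i, j) atom index pairs.
    Every simple path of up to max_bonds bonds sets one bit.
    """
    # Build an adjacency list from the bond list.
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    bits = set()

    def walk(path):
        # Canonicalize direction so a path and its reverse hash identically.
        labels = [atoms[i] for i in path]
        key = min("-".join(labels), "-".join(reversed(labels)))
        # CRC32 is deterministic across runs, unlike Python's built-in str hash.
        bits.add(zlib.crc32(key.encode()) % nbits)
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:
                walk(path + [nxt])

    # Start a path enumeration from every atom.
    for start in range(len(atoms)):
        walk([start])
    return bits


# Ethanol heavy atoms: C-C-O
ethanol = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

The number of paths grows quickly with molecule size and path length, which is why computing such fingerprints once and storing them pays off for large files.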

The code below shows how to loop over the molecules of an input file stream, generate all four built-in fingerprint types, attach them to the molecule as generic data, and write the molecule into the output stream. If the output stream is an .oeb file, the fingerprints attached to the molecule are written to the binary file as well. This provides a convenient way to transfer the pre-generated fingerprints along with the molecules from one application to another.

from openeye import oechem, oegraphsim


def GenerateFingerPrints(ifs, ofs):
    # Progress indicator for long-running loops.
    dots = oechem.OEDots(100000, 1000, "molecules")

    # Fingerprint types paired with the generic data tag used to store each one.
    fptypes = [(oegraphsim.OEFPType_Circular, "circular"),
               (oegraphsim.OEFPType_Tree, "tree"),
               (oegraphsim.OEFPType_Path, "path"),
               (oegraphsim.OEFPType_MACCS166, "maccs")]

    fp = oegraphsim.OEFingerPrint()
    for mol in ifs.GetOEGraphMols():
        dots.Update()
        for (fptype, fptag) in fptypes:
            # Generate the fingerprint and attach it to the molecule as generic data.
            oegraphsim.OEMakeFP(fp, mol, fptype)
            mol.SetData(fptag, fp)
        oechem.OEWriteMolecule(ofs, mol)

    dots.Total()

Download code and drugs.sdf supporting data file


prompt > python3 drugs.sdf drugs-fp.oeb

Creating Molecule Database Index File

The next step is to generate an index file that will accelerate the process of loading the .oeb molecule file that is generated by the script above.

See also

See more details in the Creating Molecule Database Index File section.

Download code


prompt > python3 drugs-fp.oeb

Running the above command will generate the drugs-fp.oeb.idx molecule database index file.
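The idea behind an index file can be sketched in pure Python: record where each molecule record starts so that any record can later be read with a single seek instead of a scan from the top of the file. This is only a conceptual sketch on SD-style text records (terminated by a $$$$ line); the helper names build_offset_index and read_record are hypothetical, and the actual .idx format written by OEChem TK is not shown or implied here.

```python
import io


def build_offset_index(stream):
    """Return the byte offset at which each $$$$-terminated record starts."""
    offsets = []
    pos = stream.tell()
    at_record_start = True
    for line in iter(stream.readline, b""):
        if at_record_start:
            offsets.append(pos)
            at_record_start = False
        if line.strip() == b"$$$$":
            # The next line read (if any) begins a new record.
            at_record_start = True
        pos = stream.tell()
    return offsets


def read_record(stream, offsets, idx):
    """Seek straight to record idx and read it up to its $$$$ terminator."""
    stream.seek(offsets[idx])
    lines = []
    for line in iter(stream.readline, b""):
        lines.append(line)
        if line.strip() == b"$$$$":
            break
    return b"".join(lines)


# Two placeholder records standing in for a large molecule file.
data = io.BytesIO(b"aspirin\nline\n$$$$\ncaffeine\nline\n$$$$\n")
offsets = build_offset_index(data)
second = read_record(data, offsets, 1)
```

Building the offsets requires one full pass over the file, which is why it is worth doing once and saving the result next to the molecule file.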

Searching Fingerprints

The last step is to perform the similarity search, i.e., to find the molecules with the highest fingerprint similarity to a query molecule. In the GetSimilarMolecules function, a fingerprint is first generated for the query molecule. The target molecules are then loaded into the OEMolDatabase object. By looping over the molecules stored in the molecule database, the fingerprints that have been attached to the molecules as generic data are retrieved and inserted into the OEFPDatabase object.

Calling the OEFPDatabase.GetSortedScores method returns an iterator over the calculated similarity scores in sorted order.
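Conceptually, a sorted top-N search scores every stored fingerprint against the query and keeps only the best hits. The following pure-Python sketch illustrates that idea with fingerprints modeled as sets of on-bit positions; the names tanimoto and get_sorted_scores are hypothetical stand-ins and do not reproduce the OEFPDatabase implementation.

```python
import heapq


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def get_sorted_scores(fingerprints, query, numhits):
    """Return the numhits best (index, score) pairs, highest score first."""
    scored = ((idx, tanimoto(query, fp)) for idx, fp in enumerate(fingerprints))
    # nlargest keeps only numhits candidates in memory while scanning.
    return heapq.nlargest(numhits, scored, key=lambda pair: pair[1])


fingerprints = [{1, 2, 3}, {1, 2}, {7, 8}, {1, 2, 3, 4}]
hits = get_sorted_scores(fingerprints, {1, 2, 3}, 2)
```

Each returned index identifies the position of the fingerprint in the list, in the same way that the index of a score is used below to fetch the matching molecule from the molecule database.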

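from openeye import oechem, oegraphsim


def GetSimilarMolecules(qmol, ifs, ofs, fptype, fptag, numhits):
    # Generate the query fingerprint with the same type as the targets.
    qfp = oegraphsim.OEFingerPrint()
    oegraphsim.OEMakeFP(qfp, qmol, fptype)

    # Random-access view of the target molecule file.
    moldb = oechem.OEMolDatabase(ifs)

    # In-memory fingerprint database used for the search.
    fpdb = oegraphsim.OEFPDatabase(fptype)

    mol = oechem.OEGraphMol()
    for idx in range(moldb.GetMaxMolIdx()):
        if not moldb.GetMolecule(mol, idx):
            # Note: a skipped record shifts later fpdb indices relative to moldb.
            continue
        if mol.HasData(fptag):
            # Reuse the pre-generated fingerprint stored as generic data.
            tfp = mol.GetData(fptag)
            fpdb.AddFP(tfp)
        else:
            # Fall back to generating the fingerprint on the fly.
            oechem.OEThrow.Warning("Unable to access fingerprint for %s" % mol.GetTitle())
            fpdb.AddFP(mol)

    sdtag = "%s-fingerprint-score" % fptag

    hit = oechem.OEGraphMol()
    scores = fpdb.GetSortedScores(qfp, numhits)
    for s in scores:
        # Fetch each hit by index and tag it with its similarity score.
        if moldb.GetMolecule(hit, s.GetIdx()):
            oechem.OESetSDData(hit, sdtag, str(s.GetScore()))
            oechem.OEWriteMolecule(ofs, hit)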

Download code

prompt > python3 -query .ism -target drugs-fp.oeb -out .ism -fingerprint tree


Running the above command will generate the following output:

CC(=O)Oc1ccccc1C(=O)O acetsali
Cc1c(c(=O)n(n1C)c2ccccc2)N(C)C aminopy
CC(C)NCC(COc1ccccc1CC=C)O alprenol
Cn1cnc2c1c(=O)n(c(=O)n2C)C caffeine
c1nc2c(=O)[nH]c(nc2n1COCCO)N acyclovi
CC(C)NCC(COc1ccc(cc1)CC(=O)N)O atenolol

If the output file is an .sdf molecule file, then the similarity score will be attached to each molecule with the <tree-fingerprint-score> tag.


The OEFPDatabase class provides several ways to search fingerprints:

  • By default, the Tanimoto coefficient (OETanimoto) is used to quantify the degree of resemblance between two fingerprints. However, other built-in similarity coefficients (Cosine, Dice, Euclidean, Manhattan, Tversky) and user-defined similarity measures can also be used.

  • By default, the OEFPDatabase.GetSortedScores method returns similarity scores in descending order, i.e., it identifies the molecules that are most similar to the query. However, the order can be reversed to identify the molecules that are most dissimilar to the query.

  • It allows searching the entire fingerprint database or only a segment of the database.

  • It allows setting a cut-off value for the similarity scores returned.

  • It provides sorted (OEFPDatabase.GetSortedScores) and unsorted (OEFPDatabase.GetScores) access to the fingerprint similarities.
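The search options listed above can be illustrated with a small pure-Python sketch. Fingerprints are again modeled as sets of on-bit positions, and every name here (tanimoto, dice, cosine, search and its parameters) is hypothetical. In particular, the cutoff in this sketch simply keeps scores at or above a threshold, which is an assumption and may not match the toolkit's exact cutoff semantics.

```python
import math


def tanimoto(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a, b):
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0


def search(fingerprints, query, measure=tanimoto, cutoff=0.0,
           descending=True, begin=0, end=None):
    """Score the [begin, end) segment, drop scores below cutoff, then sort."""
    end = len(fingerprints) if end is None else end
    hits = []
    for idx in range(begin, end):
        score = measure(query, fingerprints[idx])
        if score >= cutoff:
            hits.append((idx, score))
    # descending=True lists the most similar entries first.
    hits.sort(key=lambda pair: pair[1], reverse=descending)
    return hits


fingerprints = [{1, 2, 3}, {1, 2}, {7, 8}, {1, 2, 3, 4}]
query = {1, 2, 3}
most_similar = search(fingerprints, query, cutoff=0.7)
most_dissimilar = search(fingerprints, query, descending=False)
segment_only = search(fingerprints, query, measure=dice, begin=1, end=3)
```

Passing the similarity measure as a function argument mirrors the idea of swapping in a user-defined measure without changing the search loop.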
