Similarity Searching of Large Molecule Files

Problem

You want to perform a 2D similarity search on large molecule files.

See also

If you have access to OpenEye toolkits version 2017.Jun or later, please use the example in Rapid Similarity Searching of Large Molecule Files instead.

Ingredients

Difficulty Level

Two chili peppers

Solution

Two databases are used to solve this problem:

  • The OEMolDatabase class of OEChem TK is used to provide fast, read-only random access to molecular file formats that are readable by OEChem TK. This class is designed to handle large molecule files that cannot be held in memory all at once.
  • The OEFPDatabase class of GraphSim TK is used to perform rapid in-memory fingerprint searches.

This recipe discusses three code examples: pre-generating fingerprints, creating a molecule database index file, and searching fingerprints.

Pre-generating Fingerprints

To perform a 2D similarity search, fingerprints have to be generated for the molecules and then compared to assess molecular similarity. Generating fingerprints for a large molecule file can be time-consuming since it involves:

  • either performing a large number of substructure searches (in the case of the MACCS key fingerprint)
  • or exhaustively enumerating certain sub-graphs of a molecule (in the case of path, tree or circular fingerprints)

However, fingerprints can be pre-generated and saved into an .oeb file for later use, which eliminates the time-consuming on-the-fly generation of fingerprints when performing similarity searches.
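To give a feel for why sub-graph enumeration is costly, here is a minimal pure-Python sketch of a path-style fingerprint. It is only an illustration and not the algorithm GraphSim TK uses: the function name toy_path_fingerprint, the 64-bit size, the CRC32 hashing and the atoms/bonds input format are all assumptions made for this example.

```python
import zlib


def toy_path_fingerprint(atoms, bonds, max_bonds=3, num_bits=64):
    """Return an int whose set bits encode the linear paths of a molecular graph.

    atoms: list of element symbols, indexed by atom number (hypothetical input format)
    bonds: list of (i, j) atom index pairs
    """
    # Build an adjacency list for the undirected molecular graph.
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    bits = 0

    def extend(path):
        nonlocal bits
        # Pick one direction so that A-B-C and C-B-A set the same bit.
        labels = [atoms[i] for i in path]
        key = "-".join(min(labels, labels[::-1]))
        bits |= 1 << (zlib.crc32(key.encode()) % num_bits)
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:  # simple paths only, no atom visited twice
                extend(path + [nxt])

    # Every atom is a starting point, so the work grows quickly with
    # molecule size, branching and the maximum path length.
    for start in range(len(atoms)):
        extend([start])
    return bits


# Heavy-atom graphs of ethanol (C-C-O) and propane (C-C-C)
ethanol = toy_path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propane = toy_path_fingerprint(["C", "C", "C"], [(0, 1), (1, 2)])
```

Because the enumeration restarts from every atom of every molecule, doing it once and storing the result is what makes repeated searches cheap.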

The code below shows how to loop over the molecules of an input file stream, generate all four built-in fingerprint types, attach them to the molecule as generic data, and write the molecule into the output stream. If the output stream is an .oeb file, the fingerprints attached to the molecule are written to the binary file as well. This provides a convenient way to transfer the pre-generated fingerprint along with the molecules from one application to another.

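from openeye import oechem
from openeye import oegraphsim


def GenerateFingerPrints(ifs, ofs):

    # Report progress while looping over the molecules
    dots = oechem.OEDots(100000, 1000, "molecules")

    # The four built-in fingerprint types and the generic data tag used for each
    fptypes = [(oegraphsim.OEFPType_Circular, "circular"),
               (oegraphsim.OEFPType_Tree, "tree"),
               (oegraphsim.OEFPType_Path, "path"),
               (oegraphsim.OEFPType_MACCS166, "maccs")]

    fp = oegraphsim.OEFingerPrint()
    for mol in ifs.GetOEGraphMols():
        dots.Update()
        for (fptype, fptag) in fptypes:
            # Generate the fingerprint and attach it to the molecule as generic data
            oegraphsim.OEMakeFP(fp, mol, fptype)
            mol.SetData(fptag, fp)
        # The attached fingerprints are written only if the output is an .oeb file
        oechem.OEWriteMolecule(ofs, mol)

    dots.Total()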

Download code

fp2oeb.py and drugs.sdf supporting data file

Usage:

prompt > python3 fp2oeb.py drugs.sdf drugs-fp.oeb

Creating Molecule Database Index File

The next step is to generate an index file that will accelerate the process of loading the .oeb molecule file that is generated by the fp2oeb.py script above.

See also

See more details in the Creating Molecule Database Index File section.

Download code

moldb_create.py

Usage:

prompt > python3 moldb_create.py drugs-fp.oeb

Running the above command will generate the drugs-fp.oeb.idx molecule database index file.
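The idea behind such an index can be illustrated with a small pure-Python sketch: scan the file once, remember the byte offset at which each record starts, and later seek straight to any record. This is a conceptual sketch only; it does not reproduce the actual .idx file format, and the helper names build_offset_index and read_record, as well as the placeholder SD-style records, are assumptions made for the example.

```python
import io


def build_offset_index(stream):
    """Scan a binary SD-style stream once and return each record's start offset."""
    offsets = []
    stream.seek(0)
    record_start = 0
    for line in iter(stream.readline, b""):
        if line.strip() == b"$$$$":  # SD record terminator
            offsets.append(record_start)
            record_start = stream.tell()
    return offsets


def read_record(stream, offsets, idx):
    """Jump directly to record idx without reading the records before it."""
    stream.seek(offsets[idx])
    lines = []
    for line in iter(stream.readline, b""):
        lines.append(line)
        if line.strip() == b"$$$$":
            break
    return b"".join(lines)


# Placeholder records standing in for a large molecule file
data = io.BytesIO(
    b"mol-a\n(record body)\n$$$$\n"
    b"mol-b\n(record body)\n$$$$\n"
    b"mol-c\n(record body)\n$$$$\n"
)
offsets = build_offset_index(data)
third = read_record(data, offsets, 2)
```

Saving the offsets to disk avoids repeating the full scan every time the molecule file is opened, which is the role the index file plays here.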

Searching Fingerprints

The last step is to perform the similarity search, i.e., to find the molecules with the highest fingerprint similarity to a query molecule. In the GetSimilarMolecules function, a fingerprint is first generated for the query molecule. The target molecules are then loaded into the OEMolDatabase object. By looping over the molecules stored in the molecule database, the fingerprints that have been attached to the molecules as generic data are retrieved and inserted into the OEFPDatabase object.

Calling the OEFPDatabase.GetSortedScores method returns an iterator over the calculated similarity scores in sorted order.

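from openeye import oechem
from openeye import oegraphsim


def GetSimilarMolecules(qmol, ifs, ofs, fptype, fptag, numhits):

    # Generate the fingerprint of the query molecule
    qfp = oegraphsim.OEFingerPrint()
    oegraphsim.OEMakeFP(qfp, qmol, fptype)

    # Molecule database providing read-only random access to the targets
    moldb = oechem.OEMolDatabase(ifs)

    # In-memory fingerprint database of the requested fingerprint type
    fpdb = oegraphsim.OEFPDatabase(fptype)

    mol = oechem.OEGraphMol()
    for idx in range(moldb.GetMaxMolIdx()):
        if not moldb.GetMolecule(mol, idx):
            continue
        if mol.HasData(fptag):
            # Reuse the pre-generated fingerprint stored as generic data
            tfp = mol.GetData(fptag)
            fpdb.AddFP(tfp)
        else:
            # Fall back to generating the fingerprint from the molecule
            oechem.OEThrow.Warning("Unable to access fingerprint for %s" % mol.GetTitle())
            fpdb.AddFP(mol)

    sdtag = "%s-fingerprint-score" % fptag

    # Fetch each hit from the molecule database and attach its score as SD data
    hit = oechem.OEGraphMol()
    scores = fpdb.GetSortedScores(qfp, numhits)
    for s in scores:
        if moldb.GetMolecule(hit, s.GetIdx()):
            oechem.OESetSDData(hit, sdtag, str(s.GetScore()))
            oechem.OEWriteMolecule(ofs, hit)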

Download code

moldb_simcalc.py

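Usage:

prompt > python3 moldb_simcalc.py -query .ism -target drugs-fp.oeb -out .ism -fingerprint tree

c1ncccc1

Here a file name that consists only of the .ism extension makes OEChem TK use standard input (for the query) and standard output (for the hits), so the query SMILES c1ncccc1 is typed in after the command. Running the above command will generate the following output: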

CC(=O)Oc1ccccc1C(=O)O acetsali
Cc1c(c(=O)n(n1C)c2ccccc2)N(C)C aminopy
CC(C)NCC(COc1ccccc1CC=C)O alprenol
Cn1cnc2c1c(=O)n(c(=O)n2C)C caffeine
c1nc2c(=O)[nH]c(nc2n1COCCO)N acyclovi
CC(C)NCC(COc1ccc(cc1)CC(=O)N)O atenolol

If the output file is a .sdf molecule file, then the similarity score will be attached to each molecule with the <tree-fingerprint-score> tag.
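For downstream scripts that do not use the toolkit, the score can be read back from the SD data item as plain text. The sketch below assumes the standard SD layout (a "> <tag>" header line, the value lines, then a blank line); the helper name read_sd_tag is hypothetical, and the record and the value 0.25 are made-up placeholders, not output of the command above.

```python
def read_sd_tag(record_text, tag):
    """Return the value of an SD data item in one record, or None if absent."""
    lines = record_text.splitlines()
    header = "<%s>" % tag
    for i, line in enumerate(lines):
        if line.startswith(">") and header in line:
            value_lines = []
            for value in lines[i + 1:]:
                if not value.strip():  # a blank line ends the data item
                    break
                value_lines.append(value)
            return "\n".join(value_lines)
    return None


# Placeholder record: the molecule block is omitted and the score is invented
record = "\n".join([
    "acetsali",
    "(header and atom/bond blocks omitted)",
    "M  END",
    "> <tree-fingerprint-score>",
    "0.25",
    "",
    "$$$$",
    "",
])
score_text = read_sd_tag(record, "tree-fingerprint-score")
```

The value comes back as a string because the score is written with str(); convert it with float() before comparing scores numerically.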

Discussion

The OEFPDatabase class provides several ways to search fingerprints:

  • By default, the OETanimoto calculation is used to quantify the degree of resemblance between two fingerprints. However, other built-in similarity coefficients (Cosine, Dice, Euclidean, Manhattan, Tversky) and user-defined similarity measures can also be used.
  • By default, the OEFPDatabase.GetSortedScores method returns similarity scores in descending order, i.e., it identifies the molecules that are most similar to the query. However, the order can be reversed to identify the molecules that are most dissimilar to the query.
  • It allows searching the entire fingerprint database or only a segment of the database.
  • It allows setting a cut-off value for the similarity scores returned.
  • It provides sorted (OEFPDatabase.GetSortedScores) and unsorted (OEFPDatabase.GetScores) access to the fingerprint similarities.
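The options listed above can be pictured with a small pure-Python sketch in which fingerprints are plain integers used as bit sets. This is not the GraphSim TK API: the function names and the parameters limit, cutoff, descending, begin and end are hypothetical stand-ins for the corresponding search options, and the cutoff is treated simply as a minimum score.

```python
def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Bits in common divided by bits set in either fingerprint."""
    either = popcount(a | b)
    return popcount(a & b) / either if either else 0.0


def dice(a, b):
    """Twice the bits in common divided by the total bits set in both."""
    total = popcount(a) + popcount(b)
    return 2 * popcount(a & b) / total if total else 0.0


def sorted_scores(query, fps, limit=0, cutoff=None, measure=tanimoto,
                  descending=True, begin=0, end=None):
    """Return (index, score) pairs sorted by score."""
    end = len(fps) if end is None else end
    # Score only the requested segment of the database
    scored = [(idx, measure(query, fps[idx])) for idx in range(begin, end)]
    if cutoff is not None:
        scored = [pair for pair in scored if pair[1] >= cutoff]
    scored.sort(key=lambda pair: pair[1], reverse=descending)
    return scored[:limit] if limit else scored


query = 0b1111
fps = [0b1111, 0b0011, 0b110000, 0b0111]

top_two = sorted_scores(query, fps, limit=2)                 # most similar
least_similar = sorted_scores(query, fps, limit=1, descending=False)
above_half = sorted_scores(query, fps, cutoff=0.5)           # score threshold
segment = sorted_scores(query, fps, begin=1, end=3)          # part of the database
with_dice = sorted_scores(query, fps, limit=1, measure=dice) # other coefficient
```

The returned index plays the same role as the index obtained from each score in the recipe: it identifies which database entry to fetch for the hit.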
