Drawing Fingerprint Score Histogram


You want to plot the histogram of molecular similarity scores of a dataset (see Figure 1).


Figure 1. Similarity scores of molecules with the same activity class


Difficulty level

../_images/chilly.png ../_images/chilly.png


Download code


See also the Usage subsection.


The plot_similarity_score_histogram function shows how to calculate the similarity scores histogram by calling the the OEFastFPDatabase.GetHistogram method. The OEFastFPDatabase.GetHistogram method calculates the similarity scores for all pairs of fingerprints stored in the fingerprint database and returns the scores in OEFPHistogram object (line 19). The plot is then initialized iterating over the bins of the OEFPHistogram object. Finally, the average similarity score along with the corresponding legend are plotted.


For symmetric similarity measures, the histogram will only contain upper-triangular similarity scores (excluding the diagonal). In case of the asymmetric OETversky similarity measure, the histogram for the whole NxN matrix is returned.

 1def plot_similarity_score_histogram(fpdb, opts, nrbins):
 2    """
 3    Plots similarity score histograms
 5    :type fpdb: oegraphsim.OEFastFPDatabase
 6    :type opts: oegraphsim.OEFPDatabaseOptions
 7    :type  nrbins: int
 8    """
10    nrfps = fpdb.NumFingerPrints()
11    fptype = fpdb.GetFPTypeBase()
12    memtypestr = fpdb.GetMemoryTypeString()
13    simfuncstr = oegraphsim.OEGetSimilarityMeasureName(opts.GetSimFunc())
15    # get histogram
17    timer = oechem.OEWallTimer()
18    timer.Start()
19    hist = fpdb.GetHistogram(opts, nrbins)
20    oechem.OEThrow.Info("%5.2f sec to get histogram for %d fingerprints (%s)" % (timer.Elapsed(), nrfps, memtypestr))
22    # plot histogram
24    plt.xlim(hist.GetMin(), hist.GetMax())
26    centers = [c for c in hist.GetBinCenters()]
27    densities = [d for d in hist.GetDensity()]
29    plt.plot(centers, densities, 'r', linewidth=2, drawstyle='steps-mid')
31    plt.axis([hist.GetMin(), hist.GetMax(), 0.0,  max(densities)])
33    plt.title(fptype.GetFPTypeString(), fontsize=10)
35    plt.ylabel('Probability', fontsize=18)
36    plt.xlabel('Similarity score (%s)' % simfuncstr, fontsize=18)
37    plt.yticks(fontsize=16)
38    plt.xticks(fontsize=16)
40    mean = hist.Mean()
41    plt.plot([mean, mean], [0.0, 1.0], linestyle='dashed', color='blue', linewidth=2, label='mean=%.3f' % hist.Mean())
43    plt.legend(loc='upper right', fontsize=20)




First use the makefastfp.py script (described in the Rapid Similarity Searching of Large Molecule Files recipe) with drugs.sdf supporting data to generate a binary fingerprint file.

prompt > python3 makefastfp.py -in drugs.sdf -fpdbfname drugs-tree.fpbin -fptype Tree

Then the fphist2img.py script can be used to plot the histogram of the similarity scores.

prompt > python3 fphist2img.py -fpdbfname drugs-tree.fpbin -out drugs-tree.svg

Command Line Parameters

Simple parameter list
    fingerprint database options
      -memorytype : Fingerprint database memory type

    fingerprint database search options
      -simfunc : Similarity measure

    input/output options
      -fpdbfname : Input filename of binary fingerprint database
      -out : Output filename of the generated image

    histogram options
      -nrbins : Number of bins in the histogram


When using GraphSim TK to identify molecules that are similar to a query molecule, no default cutoff value is given. The reason for this that the score distribution highly depends on not only the similarity measure used to calculate the scores (see Table 1) but also the fingerprint types (see Table 2).

Table 1: Histograms of various similarity measures of the same dataset with Tree fingerprint type




../_images/Thrombin-Inhibitor-tree-tanimoto.svg ../_images/Thrombin-Inhibitor-tree-euclid.svg ../_images/Thrombin-Inhibitor-tree-tversky.svg
Table 2: Histograms of Tanimoto scores of the same dataset with various fingerprint types




../_images/Thrombin-Inhibitor-tree-tanimoto.svg ../_images/Thrombin-Inhibitor-circular-tanimoto.svg ../_images/Thrombin-Inhibitor-path-tanimoto.svg

Plotting the similarity score distribution of molecules that belong the same activity class as the query as well as molecules with different activity classes could gives an idea about what a reasonable cutoff value would be.

Table 3: Similarity scores of molecule with the same and different activity classes

molecules with same activity class

molecules with different activity classes

../_images/Renin-Inhibitor-tree-tanimoto.svg ../_images/MDDR-tree-tanimoto.svg


the fphist2img.py script can run in three different modes that determine how fingerprints are stored and searched in the OEFastFPDatabase object:

prompt > python3 fphist2img.py --help -memorytype
Contents of parameter -memorytype
   Aliases : -m
   Type : string
   Default : memory-mapped
   Legal values : memory-mapped in-memory CUDA
   Brief : Fingerprint database memory type

 API : OEFastFPDatabaseMemoryType namespace

The in-memory mode involves pre-loading all fingerprints into memory prior to and performing the search in the memory. While this represents the fastest way to perform similarity searches once the fingerprints are loaded, searches are limited by memory availability.


The memory-mapped mode has no load time penalty or memory limitation but the search itself takes more time.


GPU-enabled calculation that provides 200x faster calculation than the two CPU modes above. CUDA mode involves pre-loading all fingerprints into GPU memory prior to performing similarity calculations. While this represents the fastest way to perform similarity searches once the fingerprints are loaded, searches are limited by GPU memory availability and will fall back to the memory-mapped CPU mode if the entire set of fingerprints cannot be preloaded into the GPU memory.

See also

Graph 1. The performance of generation of similarity score histogram
  • 16M 4096-bit long tree fingerprints = 8GB data

  • CPU benchmarked on m4.10xlarge AWS instance (single threaded)

  • GPU benchmarked on p3.8xlarge AWS instance

  • In-memory timings are extrapolated from shorter runs

See also in GraphSim TK manual



See also in matplotlib documentation