Drawing Fingerprint Score Histogram
Problem
You want to plot the histogram of molecular similarity scores of a dataset (see Figure 1).
Ingredients
Difficulty level
Download
Solution
The plot_similarity_score_histogram function shows how to calculate the similarity score histogram by calling the OEFastFPDatabase.GetHistogram method. This method calculates the similarity scores for all pairs of fingerprints stored in the fingerprint database and returns them in an OEFPHistogram object (line 19). The plot is then built from the bin centers and densities of the OEFPHistogram object. Finally, the average similarity score is drawn as a dashed line, together with its legend.
Note
For symmetric similarity measures, the histogram will only contain upper-triangular similarity scores (excluding the diagonal). In case of the asymmetric OETversky similarity measure, the histogram for the whole NxN matrix is returned.
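To make the note concrete, the following minimal sketch counts how many scores end up in the histogram for a database of N fingerprints. The function name expected_score_count is hypothetical and not part of GraphSim TK; it only restates the arithmetic of the note (upper triangle without the diagonal versus the whole NxN matrix).

```python
def expected_score_count(nrfps, symmetric=True):
    """Number of similarity scores that contribute to the histogram.

    Hypothetical helper, not part of GraphSim TK: it only restates the
    counting rule described in the note above.
    """
    if nrfps < 0:
        raise ValueError("nrfps must be non-negative")
    if symmetric:
        # Upper triangle without the diagonal: N * (N - 1) / 2 pairs.
        return nrfps * (nrfps - 1) // 2
    # Asymmetric measure (e.g. Tversky): the whole N x N matrix.
    return nrfps * nrfps


if __name__ == "__main__":
    for n in (10, 1000):
        print(n, expected_score_count(n), expected_score_count(n, symmetric=False))
```

The count grows quadratically with the number of fingerprints in both cases, which is why the timing message in the function below is worth keeping for larger databases.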
 1  def plot_similarity_score_histogram(fpdb, opts, nrbins):
 2      """
 3      Plots similarity score histograms
 4
 5      :type fpdb: oegraphsim.OEFastFPDatabase
 6      :type opts: oegraphsim.OEFPDatabaseOptions
 7      :type nrbins: int
 8      """
 9
10      nrfps = fpdb.NumFingerPrints()
11      fptype = fpdb.GetFPTypeBase()
12      memtypestr = fpdb.GetMemoryTypeString()
13      simfuncstr = oegraphsim.OEGetSimilarityMeasureName(opts.GetSimFunc())
14
15      # get histogram
16
17      timer = oechem.OEWallTimer()
18      timer.Start()
19      hist = fpdb.GetHistogram(opts, nrbins)
20      oechem.OEThrow.Info("%5.2f sec to get histogram for %d fingerprints (%s)" % (timer.Elapsed(), nrfps, memtypestr))
21
22      # plot histogram
23
24      plt.xlim(hist.GetMin(), hist.GetMax())
25
26      centers = [c for c in hist.GetBinCenters()]
27      densities = [d for d in hist.GetDensity()]
28
29      plt.plot(centers, densities, 'r', linewidth=2, drawstyle='steps-mid')
30
31      plt.axis([hist.GetMin(), hist.GetMax(), 0.0, max(densities)])
32
33      plt.title(fptype.GetFPTypeString(), fontsize=10)
34
35      plt.ylabel('Probability', fontsize=18)
36      plt.xlabel('Similarity score (%s)' % simfuncstr, fontsize=18)
37      plt.yticks(fontsize=16)
38      plt.xticks(fontsize=16)
39
40      mean = hist.Mean()
41      plt.plot([mean, mean], [0.0, 1.0], linestyle='dashed', color='blue', linewidth=2, label='mean=%.3f' % hist.Mean())
42
43      plt.legend(loc='upper right', fontsize=20)
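For readers without a fingerprint database at hand, the sketch below reproduces the kind of data the function above plots (bin centers, per-bin densities, and the mean) from a plain list of scores. It is not the OEFPHistogram implementation: the names score_histogram and mean_score are hypothetical, the score range is assumed to be [0, 1], and "density" is assumed to mean the fraction of scores falling in each bin, which may differ from what GetDensity actually returns.

```python
def score_histogram(scores, nrbins, lo=0.0, hi=1.0):
    """Bin scores into nrbins equal-width bins over [lo, hi].

    Returns (centers, densities), where densities are fractions of the
    total number of scores (an assumption, see the text above).
    """
    if nrbins <= 0:
        raise ValueError("nrbins must be positive")
    if not scores:
        raise ValueError("scores must not be empty")

    width = (hi - lo) / nrbins
    counts = [0] * nrbins
    for s in scores:
        if not lo <= s <= hi:
            raise ValueError("score %r outside [%r, %r]" % (s, lo, hi))
        # A score equal to hi is placed in the last bin.
        idx = min(int((s - lo) / width), nrbins - 1)
        counts[idx] += 1

    total = len(scores)
    centers = [lo + (i + 0.5) * width for i in range(nrbins)]
    densities = [c / total for c in counts]
    return centers, densities


def mean_score(scores):
    """Arithmetic mean of the scores (drawn as the dashed line in the plot)."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    demo_scores = [0.1, 0.1, 0.3, 0.9]
    centers, densities = score_histogram(demo_scores, nrbins=2)
    print(centers, densities, mean_score(demo_scores))
```

The returned lists can be passed to plt.plot with drawstyle='steps-mid' in the same way as the centers and densities lists in the function above.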
Usage
First use the makefastfp.py script (described in the Rapid Similarity Searching of Large Molecule Files recipe) with the drugs.sdf supporting data to generate a binary fingerprint file.
prompt > python3 makefastfp.py -in drugs.sdf -fpdbfname drugs-tree.fpbin -fptype Tree
Then the fphist2img.py script can be used to plot the histogram of the similarity scores.
prompt > python3 fphist2img.py -fpdbfname drugs-tree.fpbin -out drugs-tree.svg
Command Line Parameters
Simple parameter list
fingerprint database options
-memorytype : Fingerprint database memory type
fingerprint database search options
-simfunc : Similarity measure
input/output options
-fpdbfname : Input filename of binary fingerprint database
-out : Output filename of the generated image
histogram options
-nrbins : Number of bins in the histogram
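When adapting the recipe to a standalone script, the parameter list above could be mirrored with the standard argparse module, as in the hedged sketch below. This is not how fphist2img.py parses its arguments (its parser is not shown in this recipe). The function build_parser is hypothetical, marking -fpdbfname and -out as required is an assumption, and the defaults of -simfunc and -nrbins are left unset because the parameter list does not state them. The -m alias, the legal values, and the memory-mapped default of -memorytype are taken from the help output in the Performance section.

```python
import argparse

# Legal values as listed in the -memorytype help output of this recipe.
MEMORY_TYPES = ("memory-mapped", "in-memory", "CUDA")


def build_parser():
    """Hypothetical argparse mirror of the fphist2img.py parameter list."""
    parser = argparse.ArgumentParser(
        prog="fphist2img.py",
        description="Plot the histogram of fingerprint similarity scores",
    )
    # input/output options (treated as required here: an assumption)
    parser.add_argument("-fpdbfname", required=True,
                        help="Input filename of binary fingerprint database")
    parser.add_argument("-out", required=True,
                        help="Output filename of the generated image")
    # fingerprint database options
    parser.add_argument("-memorytype", "-m", choices=MEMORY_TYPES,
                        default="memory-mapped",
                        help="Fingerprint database memory type")
    # fingerprint database search options (default not stated in the recipe)
    parser.add_argument("-simfunc", default=None, help="Similarity measure")
    # histogram options (default not stated in the recipe)
    parser.add_argument("-nrbins", type=int, default=None,
                        help="Number of bins in the histogram")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```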
Discussion
When using GraphSim TK to identify molecules that are similar to a query molecule, no default cutoff value is given. The reason is that the score distribution depends strongly not only on the similarity measure used to calculate the scores (see Table 1) but also on the fingerprint type (see Table 2).
Table 1 (score histograms by similarity measure): Tanimoto | Euclid | Tversky
Table 2 (score histograms by fingerprint type): Tree | Circular | Path
Plotting the similarity score distribution of molecules that belong to the same activity class as the query, as well as that of molecules from different activity classes, can give an idea of what a reasonable cutoff value would be.
Score distributions shown: molecules with same activity class | molecules with different activity classes
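One way to turn the two distributions into a number is sketched below. It is only an illustrative heuristic, not a GraphSim TK feature or a recommendation of this recipe: the cutoff is taken as a chosen percentile of the scores of molecules from different activity classes, and the fraction of same-class molecules scoring above it is reported. The names percentile and suggest_cutoff, and the default percentile of 95, are assumptions made for the example.

```python
import math


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list, with 0 < q <= 100."""
    if not sorted_values:
        raise ValueError("sorted_values must not be empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    rank = max(1, math.ceil(q * len(sorted_values) / 100.0))
    return sorted_values[rank - 1]


def suggest_cutoff(same_class, different_class, q=95.0):
    """Heuristic cutoff from two score lists (illustrative only).

    cutoff   : q-th percentile of the different-class scores
    retained : fraction of same-class scores strictly above the cutoff
    """
    if not same_class:
        raise ValueError("same_class must not be empty")
    cutoff = percentile(sorted(different_class), q)
    retained = sum(1 for s in same_class if s > cutoff) / len(same_class)
    return cutoff, retained


if __name__ == "__main__":
    # Made-up scores, for demonstration only.
    same = [0.2, 0.5, 0.6, 0.7]
    different = [0.1, 0.2, 0.3, 0.4]
    print(suggest_cutoff(same, different, q=75.0))
```

A higher percentile rejects more molecules from other activity classes at the cost of retaining fewer same-class molecules. The trade-off depends on the similarity measure and fingerprint type, which is the point made in the discussion above.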
Performance
The fphist2img.py script can run in three different modes that determine how fingerprints are stored and searched in the OEFastFPDatabase object:
prompt > python3 fphist2img.py --help -memorytype
Contents of parameter -memorytype
Aliases : -m
Type : string
Default : memory-mapped
Legal values : memory-mapped in-memory CUDA
Brief : Fingerprint database memory type
Detail
API : OEFastFPDatabaseMemoryType namespace
See also
OEFastFPDatabaseMemoryType namespace in the GraphSim TK manual
See also in GraphSim TK manual
Theory
Fingerprint Generation chapter
API
OEGetSimilarityMeasureName function
OEFastFPDatabase class
OEFPDatabaseOptions class
OEFPHistogram class
See also in matplotlib documentation
matplotlib.pyplot