Plotting ROC Curves of Fingerprint Similarity

Problem

You want to evaluate how well a molecular fingerprint method discriminates between a set of molecules with the same activity class and a set of decoys. As the Drawing ROC Curve recipe shows, the performance of binary classification methods can be visualized (and evaluated) by depicting ROC curves and calculating the AUC (area under the curve).

To cross-validate the fingerprint method, the generated plot (see Figure 1) depicts one ROC curve (thin green line) for each molecule of the active set, as well as the mean ROC curve (thick green line) and its standard deviation (gray region). This graph, along with the AUC, gives a clear picture of how well a fingerprint method identifies molecules that are in the same activity class.

For more details about how to interpret a ROC curve and an AUC, see the Drawing ROC Curve recipe.

Figure 1. Validation of Tree fingerprint method as binary classification for ACE inhibitors

Ingredients

This recipe is based on the scikit-learn example: Receiver Operating Characteristic (ROC) with cross validation

Difficulty level

[Image: three chili pepper icons]

Download

Download code

fprocs2img.py

See also the Usage subsection.

Solution

The plot_ROC_curves function calculates and depicts the ROC response for each molecule of the same activity class. Prior to calling plot_ROC_curves, two fingerprint databases are initialized with a specific fingerprint type (Tree, Path, or Circular). The first, active_fpdb, stores the fingerprints of molecules that belong to the same activity class. The second, decoy_fpdb, stores the fingerprints of decoy molecules, i.e., molecules with different or unknown activity classes.

The plot_ROC_curves function iterates over all the fingerprints in the active dataset and, for each one, calculates the similarity scores against both the actives and the decoys by calling the OEFastFPDatabase.GetScores method (the active_scores and decoy_scores lists). The options object (OEFPDatabaseOptions) used for calculating similarity scores is created with a limit of zero, which ensures that all similarity scores are returned. The calculated similarity scores are then combined, and the ROC curve and AUC are calculated (and accumulated).
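To make the ROC and AUC step concrete, the following is a minimal pure-Python sketch of how ROC points and a trapezoidal AUC can be derived from labeled similarity scores. It is an illustration only, not scikit-learn's implementation; the helper names roc_points and trapezoid_auc and the score values are hypothetical.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists by sweeping a threshold from high to low score."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        # Consume all items that share a score so ties yield a single point.
        score = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
               for k in range(1, len(fpr)))


# Hypothetical Tanimoto scores of one query fingerprint against each set.
active_scores = [0.9, 0.8, 0.4]
decoy_scores = [0.7, 0.3, 0.2, 0.1]

# Actives are the positive class (1), decoys the negative class (0).
labels = [1] * len(active_scores) + [0] * len(decoy_scores)
scores = active_scores + decoy_scores

fpr, tpr = roc_points(labels, scores)
roc_auc = trapezoid_auc(fpr, tpr)
```

In this toy data, 11 of the 12 possible active/decoy pairs rank the active above the decoy, which is the fraction the AUC measures.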

After calculating (and depicting) all the ROC curves to see the variance of the curve for each molecule of the active set (thin green lines), the mean ROC curve (thick green line) and standard deviation (gray region) are also depicted. Finally, the mean AUC (area under curve) and its standard deviation are calculated and plotted.
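The averaging step can be sketched without any plotting. Because each per-molecule ROC curve has its own set of FPR values, the curves are first resampled onto a common FPR grid, and the mean and standard deviation are then taken column-wise. The example below is a simplified pure-Python sketch under stated assumptions: interp_linear is a hypothetical stand-in for numpy's interp, the two curves are invented, and the grid has 5 points instead of the 100 used in the recipe.

```python
from statistics import mean, pstdev


def interp_linear(x, xp, fp):
    """Linearly interpolate the curve (xp, fp) at x; xp must be strictly increasing."""
    if x <= xp[0]:
        return fp[0]
    for k in range(1, len(xp)):
        if x <= xp[k]:
            t = (x - xp[k - 1]) / (xp[k] - xp[k - 1])
            return fp[k - 1] + t * (fp[k] - fp[k - 1])
    return fp[-1]


# Two hypothetical per-query ROC curves as (fpr, tpr) lists.
curves = [
    ([0.0, 0.5, 1.0], [0.0, 0.8, 1.0]),
    ([0.0, 0.25, 1.0], [0.0, 0.5, 1.0]),
]

mean_fpr = [0.0, 0.25, 0.5, 0.75, 1.0]  # common FPR grid

# Resample every curve onto the common grid.
tprs = [[interp_linear(x, fpr, tpr) for x in mean_fpr] for fpr, tpr in curves]

# Column-wise mean and population standard deviation across the curves.
mean_tpr = [mean(col) for col in zip(*tprs)]
std_tpr = [pstdev(col) for col in zip(*tprs)]

# Clip the +/- 1 std. dev. band to the valid TPR range [0, 1].
tprs_upper = [min(m + s, 1.0) for m, s in zip(mean_tpr, std_tpr)]
tprs_lower = [max(m - s, 0.0) for m, s in zip(mean_tpr, std_tpr)]
```

The population standard deviation (pstdev) is used here because it matches the default behavior of numpy's std.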

The generated graph reveals how well the fingerprint method performs for the given activity class and how the retrieval rate is affected when different molecules from the active set are used as queries to retrieve the other actives.

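from numpy import interp, linspace, maximum, mean, minimum, std
from openeye import oegraphsim
from sklearn.metrics import auc, roc_curve


def plot_ROC_curves(plt, active_fpdb, decoy_fpdb):
    """
    Plots the ROC curves and calculates the AUC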

    :type plt: matplotlib.pyplot
    :type active_fpdb: oegraphsim.OEFastFPDatabase
    :type decoy_fpdb: oegraphsim.OEFastFPDatabase
    """

    tprs = []
    aucs = []
    mean_fpr = linspace(0, 1, 100)

    limit = 0  # zero means no limit all scores returned
    opts = oegraphsim.OEFPDatabaseOptions(limit, oegraphsim.OESimMeasure_Tanimoto)
    fp = oegraphsim.OEFingerPrint()
    for fpidx in range(active_fpdb.NumFingerPrints()):
        if not active_fpdb.GetFingerPrint(fp, fpidx):
            continue

        decoy_scores = [si.GetScore() for si in decoy_fpdb.GetScores(fp, opts)]
        active_scores = [si.GetScore() for si in active_fpdb.GetScores(fp, opts)]

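        # actives are the positive class (1), decoys the negative class (0);
        # the flags must be in the same order as the combined scores
        active_flags = [1] * len(active_scores) + [0] * len(decoy_scores)
        scores = active_scores + decoy_scores
        # roc_curve returns (fpr, tpr, thresholds), in that order
        fpr, tpr, thresholds = roc_curve(active_flags, scores)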
        tprs.append(interp(mean_fpr, fpr, tpr))
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)
        plt.plot(fpr, tpr, color='green', linewidth=1, alpha=0.10, label=None)

    mean_tpr = mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = std(aucs)

    plt.plot(mean_fpr, mean_tpr, color='green',
             label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
             linewidth=2.0, alpha=0.80)

    std_tpr = std(tprs, axis=0)
    tprs_upper = minimum(mean_tpr + std_tpr, 1)
    tprs_lower = maximum(mean_tpr - std_tpr, 0)
    plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=0.33,
                     label=r'$\pm$ 1 std. dev.')

Usage

This recipe uses fprocs2img.py and two pre-generated binary fingerprint files: ACE-tree.fpbin and negative-tree.fpbin.

The following command will generate the image shown in Figure 1.

prompt > python3 fprocs2img.py -active ACE-tree.fpbin -decoy negative-tree.fpbin fprocs.svg

The binary fingerprint files can be generated with the makefastfp.py script described in the Rapid Similarity Searching of Large Molecule Files recipe.

Command Line Parameters

Simple parameter list
    fingerprint database options
      -memorytype : Fingerprint database memory type

    input/output options
      -active : Input filename of binary fingerprint database of actives
      -decoy : Input filename of binary fingerprint database of decoys
      -out : Output filename of the generated image

Discussion

Figure 1 reveals that the Tree fingerprint method, with a very high AUC (0.96), performs very well for the ACE inhibitors of the Briem-Lessel validation set ([Briem-Lessel-2000]). In comparison, the same Tree fingerprint method performs poorly for the PAF antagonist activity class of the same validation set (see Figure 2).

Figure 2. Validation of Tree fingerprint method as binary classification for PAF Antagonist

The default fingerprint types (Tree, Path, and Circular) available in GraphSim TK are rigorously calibrated on the Briem-Lessel [Briem-Lessel-2000], Hert-Willett [Hert-Willett-2004], and Grant [Grant-2006] benchmark sets. GraphSim TK also provides facilities to construct user-defined fingerprints with the following adjustable parameters:

  • the atom and bond typing that determines which atom and bond properties are encoded into the fingerprints
  • the size of the fragments that are exhaustively enumerated during the fingerprint generation
  • the size of the generated fingerprint (in bits)

However, an effective user-defined fingerprint cannot be designed without understanding the effect of the different parameters on the overall performance.

The default fingerprint size used in GraphSim TK for the Tree, Path, and Circular fingerprint types is 4096 bits. By reducing the fingerprint size to 512 bits, the fingerprint search can be accelerated about 8-fold, but not without reducing the power to discriminate between structurally similar and dissimilar molecules (see Table 1).
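The size trade-off can be illustrated with a toy example. The sketch below represents a fingerprint as a set of "on" bit positions and maps it into a smaller bit vector by taking each position modulo the new size. This modulo mapping is an assumption made purely for illustration and is not GraphSim TK's hashing scheme; the bit positions and the helper names tanimoto and fold are hypothetical. It shows how distinct bits can collide in a smaller fingerprint, making dissimilar molecules look more similar.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fold(bits, size):
    """Map bit positions into a smaller bit vector by taking them modulo size."""
    return {b % size for b in bits}


# Hypothetical 'on' bit positions in a 4096-bit space.
mol_a = {5, 517, 1030, 2000}
mol_b = {5, 1029, 1542, 3000}

# Similarity in the full 4096-bit space: one shared bit out of seven.
full = tanimoto(mol_a, mol_b)

# Similarity after mapping into 512 bits: collisions create extra shared bits.
folded = tanimoto(fold(mol_a, 512), fold(mol_b, 512))

# Ratio of bit counts that each comparison has to process.
size_ratio = 4096 // 512
```

Here the Tanimoto score rises from 1/7 to 0.5 after the mapping, while the ratio of bit counts, 4096 / 512 = 8, is consistent with the roughly 8-fold speed-up quoted above.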

Table 1. The performance

See also in numpy documentation

See also in sklearn documentation

See also in matplotlib documentation