Predict hERG Toxicity of Drug-Like Molecules

In this tutorial, we will use the hERG toxicity floe to classify molecule toxicity as ‘High’ or ‘Low’. The floe predicts the hERG toxicity of each molecule and explains the prediction in terms of chemical moieties. It also reports a confidence for each prediction and other useful information (listed at the end of this tutorial). The Floe Report provides a summary of the output.

Note: All other pretrained floes to predict molecule properties work the same way.

This tutorial uses the following floe:

hERG Toxicity Prediction for Small Molecules using ML and Cheminfo Fingerprints

Note

If you need to create a tutorial project, please see the Setup Directions for the Machine Learning Model Building Tutorials.

Floe Input

The input dataset, P₁, contains several OERecords. The floe expects an OEMol in each record; these are the molecules for which the model will predict hERG toxicity. Note: uploading .csv, .sdf, and other common file formats to Orion should automatically convert them to datasets.

Here is a sample record from the dataset:

OERecord (
    Molecule(Chem.Mol) : c1ccc(c(c1)NC(=O)N)OC[C@H](CN2CCC3(CC2)Cc4cc(ccc4O3)Cl)O
)

The records can also contain a String field with known hERG toxicity values (High or Low) to validate the predictions against. The sample dataset below does not contain this field.
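If your molecules start out as SMILES strings, one way to prepare them for upload is to write a small .csv file first. The following is a minimal sketch using only the Python standard library. The column headers "SMILES" and "hERG Toxicity" and the file name herg_input.csv are assumptions made for illustration; they are not names the floe is documented to require.

```python
import csv

# Hypothetical column headers; Orion's conversion may expect different ones.
SMILES_COLUMN = "SMILES"
LABEL_COLUMN = "hERG Toxicity"  # optional known class, "High" or "Low"


def write_input_csv(path, molecules):
    """Write (smiles, label) pairs to a CSV file; label may be None."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([SMILES_COLUMN, LABEL_COLUMN])
        for smiles, label in molecules:
            if label not in (None, "High", "Low"):
                raise ValueError(f"unexpected label: {label!r}")
            # An empty cell means no known value to validate against.
            writer.writerow([smiles, label or ""])


molecules = [
    # The sample molecule from the record above, with no known label.
    ("c1ccc(c(c1)NC(=O)N)OC[C@H](CN2CCC3(CC2)Cc4cc(ccc4O3)Cl)O", None),
]
write_input_csv("herg_input.csv", molecules)
```

Rejecting any label other than High or Low before upload keeps the optional validation field consistent with the two classes the model predicts.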

Input Data

Sample Molecules dataset

Run hERG Toxicity Floe

Choose the hERG Toxicity Prediction for Small Molecules using ML and Cheminfo Fingerprints Floe. Click “Launch Floe” to bring up the Job Form. The parameters can be specified as below.

Input Small Molecule Dataset for Prediction: Choose P₁ from above.
Output hERG Toxicity: All predicted molecules will be saved to this file. Change the name to one you will recognize.

**Figure 1.** The Job Form.

That’s it! Let’s run the floe. Click the “Start Job” button.

Analyze OEModel Floe Report

Figure 2 shows an image of the expected Floe Report. These statistics can help you analyze the predictions for the input molecules.

The top row of histograms shows a summary of the data used to train the hERG ML model. The red dotted lines mark the upper and lower quartiles of the data. The second row shows the histogram summary of the prediction data for the input dataset P₁. It is worthwhile to compare the x-axis ranges of the training and prediction data. If they differ substantially, the prediction quality may deteriorate.
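The range comparison can also be done numerically once you have the values in hand. The sketch below assumes you have noted the training x-axis range of each histogram and have the corresponding property values for your input molecules. The property names and all numbers are hypothetical placeholders, not values from the hERG model.

```python
def fraction_outside(training_range, values):
    """Return the fraction of values outside the (low, high) training range."""
    low, high = training_range
    if not values:
        return 0.0
    outside = sum(1 for value in values if value < low or value > high)
    return outside / len(values)


# Hypothetical numbers read off the report histograms, not real model data.
training_ranges = {"Molecular Weight": (150.0, 650.0), "XLogP": (-1.0, 7.0)}
prediction_values = {
    "Molecular Weight": [320.4, 410.9, 702.3, 288.1],
    "XLogP": [2.1, 3.8, 5.5, 1.2],
}

for name, values in prediction_values.items():
    share = fraction_outside(training_ranges[name], values)
    print(f"{name}: {share:.0%} of input molecules outside the training range")
```

A large share of molecules outside the training range for any property suggests treating the corresponding predictions with extra caution.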

Below the histograms, you can see the hyperparameters of the neural network used for the hERG model. Finally, there is a histogram of the output predictions and of the confidence for each prediction. There is also a plot of prediction confidence versus the actual output.

Analyze the Output

On the Data page, activate the data that the floe produced. This should have the same name you chose for the Output hERG Toxicity field of your floe. The data can be activated by clicking on the circle with a plus sign.

In the Spreadsheet Panel on the Analyze page, you can see the molecules, their predicted hERG toxicity values, and the explanation of the output.

Output Columns and Explanations

Class Confidence (hERG Toxicity): How confident the model is in its prediction, expressed as High, Medium, or Low. Here, High, Medium, and Low refer to the prediction confidence and not to the actual toxicity.
Contributions (hERG Toxicity): Explanation of the prediction based on a local model. Based on the choice of molecule explainer (atom by default), different parts will be color-annotated, with red denoting “vote against hERG active” and blue denoting the opposite.
Scope: If there is an error or warning, this will state what caused the issue.
Predict (hERG Toxicity): Predicts toxicity as High or Low. The background color indicates how confident the model is: green (most confident), yellow (average confidence), or red (less confident or out of scope). A red or yellow background means that an error or warning was issued; it is reported in the additional Scope column of the output.
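If you work with the output outside the Spreadsheet Panel, the columns above can be combined to pick out predictions that deserve a second look. This is a minimal sketch that assumes each output row is available as a Python dictionary keyed by the column names listed above; how you obtain such rows is not covered here, and the example values, including the Scope text, are made up.

```python
CONFIDENCE_COLUMN = "Class Confidence (hERG Toxicity)"
PREDICTION_COLUMN = "Predict (hERG Toxicity)"
SCOPE_COLUMN = "Scope"


def needs_review(row):
    """Flag a row whose confidence is not High or that carries a Scope note."""
    confidence = row.get(CONFIDENCE_COLUMN, "")
    scope = row.get(SCOPE_COLUMN, "") or ""
    return confidence != "High" or bool(scope.strip())


# Hypothetical rows for illustration only.
rows = [
    {PREDICTION_COLUMN: "Low", CONFIDENCE_COLUMN: "High", SCOPE_COLUMN: ""},
    {PREDICTION_COLUMN: "High", CONFIDENCE_COLUMN: "Medium", SCOPE_COLUMN: ""},
    {PREDICTION_COLUMN: "High", CONFIDENCE_COLUMN: "High",
     SCOPE_COLUMN: "example warning text"},
]

flagged = [row for row in rows if needs_review(row)]
print(f"{len(flagged)} of {len(rows)} predictions flagged for review")
```

Note that the check looks only at the confidence and Scope columns, so a confident High toxicity prediction is not flagged; whether to review those as well depends on your project.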

Note

We assign an ID number to the molecule in each record. The IDs follow a linear ordering over all molecules. So, if you activate both the successful and the failed predictions and sort them by ID, the order should be the same as in the input.
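The same reordering can be sketched in code. The example assumes the successful and failed records have each been loaded as a list of dictionaries and that the assigned number is stored under a field called "ID"; that field name and the sample records are hypothetical.

```python
def restore_input_order(successes, failures, id_field="ID"):
    """Merge both outputs and sort by the assigned ID to recover input order."""
    return sorted(successes + failures, key=lambda record: record[id_field])


# Hypothetical records; only the ID field matters for the ordering.
successes = [{"ID": 0, "name": "mol_a"}, {"ID": 2, "name": "mol_c"}]
failures = [{"ID": 1, "name": "mol_b"}]

ordered = restore_input_order(successes, failures)
print([record["name"] for record in ordered])
```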