ML Predict: Use Pretrained Classification Fingerprint Model to Predict Properties of Molecules

In this tutorial, we will use a trained, fully connected neural network classification model to predict whether the NegIC50 value of each molecule is Low, Medium, or High. The floe predicts and explains the property of each molecule and also generates confidence intervals. The Floe Report summarizes the output.

This tutorial uses the following floe:

  • ML Predict: Classification using Fingerprints for Small Molecules

To learn how to build a machine learning model to use for prediction, read the tutorial for the ML Build: Classification Model with Tuner using Fingerprints for Small Molecules Floe.

Note

If you need to create a tutorial project, please see the Setup Directions for the Machine Learning Model Building Tutorials.

Floe Input

The required inputs are:

  • Molecule Dataset P1

  • TensorFlow Machine Learning Classification Model Dataset M1

The P1 dataset contains several OERecords. The floe expects an OEMol from each record, which is the molecule whose physical properties the model will predict.

Here is a sample record from the dataset:

OERecord (
    Molecule(Chem.Mol) : c1ccc(c(c1)NC(=O)N)OC[C@H](CN2CCC3(CC2)Cc4cc(ccc4O3)Cl)O
)

A record can also contain a string field with known physical property values to validate against. This sample dataset does not contain such a field.

The M1 dataset contains one or more machine learning models. To learn how to generate these models, read the previous tutorial on building models in Orion. The attached model is a sample trained TensorFlow model that predicts classes of IC50 values: Low, Medium, and High. In addition, the floe uses Convex Box and Monte Carlo Dropout for domain of application (DOA) and error bar analysis.
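To give a feel for the error bar analysis, here is a minimal sketch of Monte Carlo Dropout in plain Python rather than TensorFlow. It is not the floe's implementation: the toy weights, the dropout rate, the four-bit fingerprint, and the function names are all assumptions made for illustration, and dropout is applied to the inputs of a single-layer model only to keep the example short. The idea is to leave dropout active at prediction time, run many stochastic forward passes, and read the spread of the class probabilities as an uncertainty estimate.

```python
import math
import random
import statistics

CLASSES = ["Low", "Medium", "High"]

# Hypothetical weights: one row per class, one column per fingerprint bit.
TOY_WEIGHTS = [
    [0.9, -0.4, 0.1, -0.7],
    [0.1, 0.5, 0.3, 0.0],
    [-0.8, 0.2, -0.2, 0.9],
]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_pass(bits, rng, drop_rate):
    """Run one forward pass with dropout applied to the input features."""
    keep = 1.0 - drop_rate
    # Inverted dropout: zero each feature with probability drop_rate and
    # rescale the survivors so the expected input is unchanged.
    dropped = [b / keep if rng.random() < keep else 0.0 for b in bits]
    scores = [sum(w * x for w, x in zip(row, dropped)) for row in TOY_WEIGHTS]
    return softmax(scores)


def mc_dropout_predict(bits, n_passes=200, drop_rate=0.2, seed=0):
    """Return (predicted class, mean probabilities, per-class std dev)."""
    rng = random.Random(seed)
    passes = [stochastic_pass(bits, rng, drop_rate) for _ in range(n_passes)]
    per_class = list(zip(*passes))  # regroup probabilities by class
    means = [statistics.fmean(p) for p in per_class]
    spreads = [statistics.pstdev(p) for p in per_class]
    best = max(range(len(CLASSES)), key=means.__getitem__)
    return CLASSES[best], means, spreads


label, means, spreads = mc_dropout_predict([1, 0, 1, 1])
print(label, [round(m, 3) for m in means], [round(s, 3) for s in spreads])
```

A larger standard deviation for the winning class suggests that the prediction is sensitive to which features are dropped, which is one way to motivate a lower confidence label.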

Run Classification Property Prediction Floe

Choose the ML Predict: Classification using Fingerprints for Small Molecules Floe. Click “Launch Floe” to bring up the Job Form. Specify the parameters as follows.

  • Input Small Molecule Dataset: Select P1.

  • Input TensorFlow Model: Select M1. Make sure this matches the ID of the input TensorFlow model; in this case, it will be 1.

  • Output Property for Fingerprint-based Regression: The predicted molecules will be saved under the name entered in this field, so change it to something meaningful to you.

  • Model ID of which TensorFlow model to use to predict: Use this field to choose the ID of the model you would like to use to predict the properties of your molecules. You can find the model ID on the Analyze page of the model dataset. More information about how to select this model is in the How-to Guide for using built ML models for the property prediction and verification of unseen molecules. Although that guide covers a regression model on fingerprints, the procedure is similar here. For our case, we can use either 1 or 5 for the model ID.

Figure 1. The Job Form.

  • Property Validation Field: If your small molecule dataset contains a baseline value for a physical property (sourced from elsewhere), you can use this option to get a statistical comparison between this value and the model prediction. Note that this should be a string field containing the strings used during training.
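The kind of comparison the Property Validation Field enables can be sketched in plain Python. The floe's actual statistics are not described in this tutorial, so the accuracy and confusion counts below are only an assumed example, and the record layout (dicts with `baseline` and `predicted` keys) is hypothetical.

```python
from collections import Counter

CLASSES = ("Low", "Medium", "High")


def compare_classes(records):
    """Return (accuracy, confusion counts) for baseline vs predicted labels."""
    confusion = Counter()
    for rec in records:
        baseline, predicted = rec["baseline"], rec["predicted"]
        # The baseline must use the same strings the model was trained on.
        if baseline not in CLASSES:
            raise ValueError(f"unexpected baseline label: {baseline!r}")
        confusion[(baseline, predicted)] += 1
    total = sum(confusion.values())
    correct = sum(n for (b, p), n in confusion.items() if b == p)
    accuracy = correct / total if total else 0.0
    return accuracy, confusion


# Hypothetical records standing in for a validated prediction dataset.
records = [
    {"baseline": "Low", "predicted": "Low"},
    {"baseline": "High", "predicted": "Medium"},
    {"baseline": "Medium", "predicted": "Medium"},
    {"baseline": "High", "predicted": "High"},
]
accuracy, confusion = compare_classes(records)
print(f"accuracy = {accuracy:.2f}")
for (baseline, predicted), count in sorted(confusion.items()):
    print(f"{baseline:>6} -> {predicted:<6} {count}")
```

The label check mirrors the requirement that the validation field contain the strings used during training; a baseline such as "low" or "L" would be rejected here rather than silently counted as a miss.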

That’s it! Let’s run the floe. Click the “Start Job” button.

Analyze OEModel Floe Report

The Floe Report contains statistics that help analyze the predictions for the input molecules. It includes histograms of the training data and the input data.

Figure 2. Histograms for the summary of the training and input data.

Figure 3 shows the hyperparameters with which the model was trained and the histogram of the predicted IC classes.

Figure 3. Hyperparameters of model training and IC class prediction.

Finally, there is a link to an Explanation of Interesting Molecules. This page shows annotated images that explain predicted molecules with Low, Medium, and High confidence.

Figure 4. Interesting molecules shown in the Floe Report.

Analyze Output

  • On the Data page, activate the dataset that the floe produced by clicking on the circle with a plus sign. The dataset should have the same name you chose in the Output Prediction field for your floe.

Figure 5. How to activate a dataset.

  • On the Analyze page, you should see the molecules, their predicted classes (Low, Medium, or High), and the explanation of the output.

Output Definitions

  • There are three output ports: Success, Failure, and No Confidence.

  • Success Data: (a) The data falls within the scope; the explainer has a green background. (b) The data falls at the edge of the scope; the explainer has a yellow background.

  • Failure Data: The molecule (a) is too large or too small, or (b) has an unknown atom.

  • No Confidence Data: The property of the molecule falls outside the scope of the training set. In this case, the model makes a prediction but gives no guarantees. The explainer image has a red background.
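The routing rules above can be summarized in a short sketch. The status names (`in_scope`, `edge`, `out_of_scope`, `failed`) and the function itself are hypothetical and are not part of the floe's API; only the port names and background colors come from the definitions above.

```python
def route_record(status):
    """Map a hypothetical scope status to (output port, explainer background)."""
    routes = {
        "in_scope": ("Success", "green"),
        "edge": ("Success", "yellow"),
        "out_of_scope": ("No Confidence", "red"),
        # Assumption: failed molecules (wrong size or unknown atom) are
        # treated here as having no explainer image.
        "failed": ("Failure", None),
    }
    if status not in routes:
        raise ValueError(f"unknown status: {status!r}")
    return routes[status]


for status in ("in_scope", "edge", "out_of_scope", "failed"):
    print(status, "->", route_record(status))
```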

Output Columns and Explanations

  • Class Confidence (Property Unit): High, Medium, or Low confidence in the prediction.

  • Contributions: Explanation of the prediction based on a local model. The background color of the image indicates the scope: green means the molecule falls within the scope of the predictor, yellow means a warning was issued, and red means no confidence. Depending on the choice of molecule explainer (fragment by default), different parts of the molecule are color-annotated, with blue denoting a greater contribution towards the physical property and red denoting the opposite.

  • Scope: The cause of the error or warning, if there is one.

  • Prediction (Property Unit): The predicted physical property of the molecule, stored in a string field.

Figure 6. Output columns of the spreadsheet on the Analyze page.

Note

We assign an ID number to each record molecule. The IDs follow a linear ordering over all molecules, so if you activate all the Success, Failure, and No Confidence predictions and sort them by ID number, the order should be the same as for the input.
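As a sketch of that reassembly outside of Orion, assume each output has been exported to a list of dicts with a hypothetical `id` key (the actual field name in your dataset may differ):

```python
def restore_input_order(success, failure, no_confidence, id_key="id"):
    """Merge the three outputs and sort by record ID to recover input order."""
    merged = [*success, *failure, *no_confidence]
    return sorted(merged, key=lambda rec: rec[id_key])


# Hypothetical exported records; "port" only marks where each one came from.
success = [{"id": 0, "port": "Success"}, {"id": 3, "port": "Success"}]
failure = [{"id": 2, "port": "Failure"}]
no_confidence = [{"id": 1, "port": "No Confidence"}]

ordered = restore_input_order(success, failure, no_confidence)
print([(rec["id"], rec["port"]) for rec in ordered])
```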