Build Machine Learning Regression Models for Property Prediction of Small Molecules

In this tutorial, we will train several regression models to predict the IC50 concentration at which molecules react with pyrrolamides. The training data is known as the pyrrolamides dataset (P1). The results will be visualized in both the Analyze page and the Floe Report. We will use the Floe Report to analyze the models and choose a good one. Finally, we will use the chosen model to predict the properties of unseen molecules.

This tutorial uses the following floe:

  • ML Build: Regression Model with Tuner using Fingerprints for Small Molecules

Warning

This tutorial keeps the default parameters and builds approximately 2.5 K machine learning models. While the per-model cost is low, the total cost depends on the size of the dataset and can be high. For instance, with the pyrrolamides dataset (~1 K molecules), the run costs around $100. To understand how the floe works, we suggest building less expensive models by referring to the Cheaper and Faster Version of the tutorial.

Note

If you need to create a tutorial project, please see the Setup Directions for the Machine Learning Model Building Tutorials.

Floe Input

This floe uses the pyrrolamides dataset (P1). It expects two things from each record:

  • An OEMol, which is the molecule to train the models on.

  • A Float value that contains the regression property to be learned. For this example, it is the IC50 concentration.

Here is a sample record from the dataset:

OERecord (
    *Molecule(Chem.Mol)* : c1ccc(c(c1)NC(=O)N)OC[C@H](CN2CCC3(CC2)Cc4cc(ccc4O3)Cl)O
    *NegIC50(Float)* : 3.04
)
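Before launching the floe, it can help to sanity-check that every record carries both required fields. The sketch below is illustrative only: it models a record as a plain Python dict (not the actual OERecord API) and stores the molecule as a SMILES string; the field names Molecule and NegIC50 match the sample above.

```python
# Illustrative sketch: validate that a training "record" (a plain dict here,
# standing in for an OERecord) has a molecule and a float response value.
def validate_record(record):
    """Raise ValueError unless the record has a molecule and a Float response."""
    mol = record.get("Molecule")
    if not isinstance(mol, str) or not mol:
        raise ValueError("record needs a molecule (SMILES string in this sketch)")
    if not isinstance(record.get("NegIC50"), float):
        raise ValueError("record needs a Float response field (NegIC50)")
    return True

sample = {
    "Molecule": "c1ccc(c(c1)NC(=O)N)OC[C@H](CN2CCC3(CC2)Cc4cc(ccc4O3)Cl)O",
    "NegIC50": 3.04,
}
```

Running `validate_record` over your dataset before upload catches records with a missing response value early, rather than partway through a paid floe run.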

Run OEModel Building Floe

Choose the ML Build: Regression Model with Tuner using Fingerprints for Small Molecules Floe. Click “Launch Floe” to bring up the Job Form. The parameters can be specified as below.

  • Input Small Molecules to train machine learning models on: Select the pyrrolamides dataset or your own dataset.

  • Outputs parameter: You can change the names of the built ML models to be saved in Orion, but we will use the defaults for this tutorial.

  • Response Value Field: This is the property to be learned. The drop-down field generates a list of options based on the input dataset. For our data, choose NegIC50.

  • Are we using Keras tuner: Indicates whether to use a hyperparameter optimizer with a Keras tuner. Keep this On (the default).

  • What kind of Keras tuner to use: Keep Hyperband as the default or change to another option.

  • Number of models to show in Floe Report: This field prevents memory blowup in case you generate >1 K models. In such cases, viewing the top 20–50 models should suffice.

  • Preprocess Molecule: Set to On.

    • Keeps the largest molecule if more than one is present in a record.

    • Sets the pH value to neutral.

    • Applies the Blockbuster Filter.

  • If you don’t preprocess the molecule, you can still apply the Blockbuster Filter if needed.

  • Negative Log: When On, this allows you to transform the training values to a negative log. Keep it Off for this example.

  • Molecule Explainer Type: Select how you want to view the machine learning results.

    • Atom annotates every atom by its degree of contribution toward the final result.

    • Fragment does the same for every molecule fragment (generated by OEMedChem TK) and is the preferred method of medicinal chemists.

    • Combined produces visualizations of both techniques together.
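The Negative Log option described above corresponds to a simple transform on the training values. A minimal sketch, assuming the raw values are concentrations (as when converting IC50 to pIC50):

```python
import math

def negative_log(values):
    """Transform raw concentration-style values to their negative log10
    (e.g. an IC50 of 1e-6 M becomes a pIC50 of 6.0)."""
    return [-math.log10(v) for v in values]
```

Training on negative-log values compresses concentrations spanning many orders of magnitude into a range that is easier for a neural network to fit.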

  • The default parameters create a working model, but we can tweak some parameters to learn more about the functionality.

Advanced: Cheminfo Fingerprint Options: These parameters contain the cheminformatics information the model will be built on. We can add more values to the fingerprint parameters as shown in the image. You can experiment by adding other parameter values.


Advanced: Neural Network Hyperparameters Options: These are the more specific machine learning parameters. Again, we can use the defaults or choose to add or modify some values based on the How-To Guide.

  • Dropouts: These can be added to prevent overfitting.

  • Sets of Hidden Layers: Every row lists the candidate node sizes for one layer. Each node count from one layer is combined with one from the next layer to define a network; by default, this yields three-layer networks. Plugging in the formula stated in the How-to Guide (Building Optimal Models), we should reduce the number of nodes to prevent overfitting. Change Layer 1 to 10, 25, 100; Layer 2 to 20, 50, 75; and Layer 3 to 50, 75, 100.

  • Sets of regularization layers: This field sets L2 regularization for each network layer, making it a -1 separated 3-tuple list as well.

  • Learning Rate: Increasing this parameter speeds things up, although the algorithm may not detect the minima if this value is too high. We can train our model on multiple learning rate values. Use the default values for this tutorial.

  • Max Epochs: If the dataset is big, consider increasing the epoch size, as it may take the algorithm longer to converge. Use the default values for this tutorial.

  • Activation: ReLU is a commonly used activation function for most models. Change this parameter to one of the options in the list (see How-to-Guide on building Optimal Models).

  • Batch Size: This defines the number of samples that will be propagated through the network. With larger batches, there is a significant degradation in the quality of the model, as measured by its ability to generalize. However, too small of a batch size may cause the model to take a very long time to converge. For a dataset size of ~2 K, 64 is a reasonable batch size. However, for datasets of approximately 100 K, we may want to increase the batch size to at least 5 K. For this tutorial, set the Batch Size to 64.

  • Neural Network Ensemble Size: This determines how many models are trained per run; the ensemble of predictions provides a robust confidence interval.
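To see why the hidden-layer rows above multiply into many candidate networks, the grid of suggested values can be enumerated with itertools.product. This is an illustrative sketch of the combinatorics, not the floe's internal code:

```python
from itertools import product

# Candidate node sizes per layer, as suggested in the tutorial above.
layer1 = [10, 25, 100]
layer2 = [20, 50, 75]
layer3 = [50, 75, 100]

# One three-layer architecture per combination: 3 * 3 * 3 = 27 candidates.
architectures = list(product(layer1, layer2, layer3))
```

Each extra value added to a row multiplies the number of architectures (and models to train), which is how the default settings end up building thousands of models.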
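The batch-size guidance above can be framed as gradient-update steps per epoch: a larger batch means fewer updates per pass over the data. A small sketch of that arithmetic:

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of gradient updates in one pass over the dataset."""
    return math.ceil(n_samples / batch_size)
```

With ~2 K molecules and a batch size of 64, each epoch performs 32 updates; with ~100 K molecules and a batch size of 5 K, each epoch performs 20 updates, which keeps per-epoch time manageable on the larger dataset.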
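The ensemble idea above can be sketched in a few lines: train several models, then combine their predictions into a point estimate plus a spread. The standard deviation here is a simple stand-in for the confidence interval the floe reports, not its exact method:

```python
from statistics import mean, stdev

def ensemble_prediction(predictions):
    """Combine one prediction per ensemble member into (mean, spread)."""
    return mean(predictions), stdev(predictions)
```

For example, three ensemble members predicting 3.0, 3.2, and 2.8 yield a mean of 3.0 with a spread of 0.2; a wide spread flags molecules the model is unsure about.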


That’s it! Let’s run the floe. Click the “Start Job” button.

Analysis of Output and Floe Report

After the floe has finished running, click the link on the Floe Report tab in your window to preview the report. Since the report is large, it may take a while to load; if so, try refreshing, or pop the report out into a new window. All results in the Floe Report are based on the validation data.

More information on the basics of the summary statistics of built models can be found in the Tutorial for Floe Report Summary Statistics <link>.

The first figure shows histograms that summarize statistics on the whole input data.


The graphs below show the Mean Absolute Error (MAE) for the neural network hyperparameters, which helps us analyze how sensitive the hyperparameter values are. The lower the number, the better the model for that parameter. For instance, we see from the graph that a dropout value of 0.1 and maxr of 5 are better choices for parameters in future model build runs.
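For reference, the Mean Absolute Error plotted in these graphs is just the average absolute difference between true and predicted values:

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all validation samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, true values [3.0, 2.0] against predictions [2.5, 2.5] give an MAE of 0.5, in the same units as the response value.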


There is also a heatmap of the two most sensitive hyperparameters. We can choose these parameter values to be closer to the lowest MAEs in the heatmap.


Next, we tabulate all models built by the fully connected network, sorted by R2 score on the validation data. On the face of it, the first row should be the best model, since it has the best validation score. But several other factors besides R2 determine this, starting with VLoss in the next column. To look at a sample model, click its Model Link and examine the training curves under Neural Network Hyperparameters and Fingerprint Parameters.
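For reference, the R2 score used to rank the table compares the residual error of the predictions against the variance of the true values; a perfect model scores 1.0. A minimal sketch:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    m = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

For example, true values [1.0, 2.0, 3.0] with predictions [1.0, 2.0, 4.0] give an R2 of 0.5: the single unit of squared error is half the total variance around the mean.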


For each model, a linear color scale shows the rank. The cheminformatics and machine learning parameters the model was trained on are also shown.


The training and validation MAE follow a similar trend, which is a good sign of not overfitting. Had they diverged, we might have to go back and tweak parameters such as the number of hidden layer nodes, dropouts, and regularizers. Too much fluctuation in the graph may also suggest we need a slower learning rate.


The Regression Output plots real versus predicted values to show how well the predictions correlate with the training data.

Another section allows you to click on an interesting molecule to see the annotated explainer of the machine learning model.


While this tutorial shows how to build regression models, building an optimal one is a nontrivial task. Refer to the How-to Guide to learn how to optimize model building based on the data available.

After we have chosen a model, we can use it to predict and validate against unseen molecules. Go to the next tutorial, Use Pretrained Model to Predict Generic Property of Molecules, to learn how.