ML Build: Regression Model with Tuner using Fingerprints for Small Molecules
In this tutorial, we will train several regression models to predict the IC50 concentrations of small molecules. The training data is known as the pyrrolamides dataset (P1). The results will be visualized in both the Analyze page and the Floe Report. We will use the Floe Report to analyze the models and choose a good one. Finally, we will use the chosen model to predict the properties of some unseen molecules.
This tutorial uses the following floe:
ML Build: Regression Model with Tuner using Fingerprints for Small Molecules
Warning
This tutorial uses the default parameters and builds approximately 2.5 K machine learning models. While the per-model cost is low, the total cost can be high depending on the size of the dataset. For instance, with the pyrrolamides dataset (~1 K molecules), it costs around $100. To learn how the floe works, we suggest building less expensive models by referring to the cheaper and faster version of the tutorial.
Note
If you need to create a tutorial project, please see the Setup Directions for the Machine Learning Model Building Tutorials.
Floe Input
This floe uses the pyrrolamides dataset (P1). It expects two things from each record:
An OEMol, which is the molecule to train the models on.
A Float value that contains the regression property to be learned. For this example, it is the IC50 concentration.
Here is a sample record from the dataset:
OERecord (
*Molecule(Chem.Mol)* : c1ccc(c(c1)NC(=O)N)OC[C@H](CN2CCC3(CC2)Cc4cc(ccc4O3)Cl)O
*NegIC50(Float)* : 3.04
)
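The record layout above can be mimicked with plain Python types as a quick sanity check of your own input data. This is only an illustrative stand-in (the field names follow the sample record; the molecule is shown as a SMILES string rather than an actual OEMol object):

```python
# Stand-in for one input record: the molecule plus the float response value.
# The floe expects exactly these two pieces of information per record.
record = {
    "Molecule": "c1ccc(c(c1)NC(=O)N)OC[C@H](CN2CCC3(CC2)Cc4cc(ccc4O3)Cl)O",
    "NegIC50": 3.04,
}

# The response value must be a Float field, not a string
assert isinstance(record["Molecule"], str)
assert isinstance(record["NegIC50"], float)
```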
Input Data
Run the Model Building Floe
Choose the ML Build: Regression Model with Tuner using Fingerprints for Small Molecules Floe. Click “Launch Floe” to bring up the Job Form. The parameters can be specified as below.
Input Small Molecules to Train Machine Learning Models: Select the pyrrolamides dataset or your own dataset.
Outputs: You can change the names of the built ML models to be saved in Orion, but we will use the defaults for this tutorial.
Response Value Field: This is the property to be learned. The drop-down field generates a list of options based on the input dataset. For our data, choose NegIC50.
Are We Using Keras Tuner: Indicates whether to optimize hyperparameters with the Keras Tuner. Keep this On (the default).
What Kind of Keras Tuner to Use: Keep Hyperband as the default or change to another option.
Number of Models to Show in Floe Report: This field limits the size of the report to prevent memory blowup in case you generate >1 K models. In such cases, viewing the top 20–50 models should suffice.
Preprocess Molecule: Set to On.
Keeps the largest molecule if more than one is present in a record.
Sets the pH value to neutral.
If you don’t preprocess the molecule, you can still apply the Blockbuster Filter if needed.
Negative Log: When On, this allows you to transform the training values to a negative log. Keep it Off for this example.
Molecule Explainer Type: Select the model visualization method.
Atom annotates every atom by its degree of contribution towards the final result.
Fragment does this for every molecule fragment (generated by OEMedChem TK) and is the preferred method of medicinal chemists.
Combined produces visualizations of both techniques combined.
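The Negative Log option above corresponds to the standard pIC50-style transform. As a rough sketch (assuming the IC50 value is expressed in molar units; the exact units and scaling used by the floe may differ):

```python
import math

def negative_log(ic50_molar):
    """Convert an IC50 concentration (in molar) to its negative log (pIC50)."""
    return -math.log10(ic50_molar)

# Example: a 1 micromolar IC50 maps to a pIC50 of about 6
print(negative_log(1e-6))  # ≈ 6.0
```

The transform compresses concentrations spanning several orders of magnitude onto a roughly linear scale, which is generally easier for a regression model to fit.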
The default parameters create a working model. But we can tweak some parameters to learn more about the functionalities.
The Advanced: Cheminformatics Fingerprint Options parameters contain the cheminformatics information the model will be built on. We can add more values to the fingerprint parameters as shown in Figure 3. You can experiment by adding other parameter values.
For suggestions on tweaking, read the How-to-Guide on Building Optimal Models.
The Advanced: Neural Network Hyperparameter Options include more specific machine learning parameters. Again, we can use the defaults or choose to add or modify some values based on the How-To Guide.
Dropouts: These can be added to prevent overfitting.
Sets of Hidden Layers: Each row lists the candidate node sizes for one hidden layer. A network is defined by picking one value from each row, so the three default rows together generate a set of 3-layer networks. Plugging in the formula stated in the How-to-Guide on Building Optimal Models, we should reduce the number of nodes to prevent overfitting. Change Layer 1 to 10, 25, 100; Layer 2 to 20, 50, 75; and Layer 3 to 50, 75, 100.
Note that there is a bug in the current description of this parameter. It should read the following: “Eg: by default each lists first tuple will create NN with 3 hidden layers of size 10,25,50. (Vertically pick first value of each row). Followed by 10,25,75 (row1col1, row2col1, row3col2) | 10,25,100 (row1col1, row2col1, row3col3)| 10,50,50 (row1col1, row2col2, row3col1)”
Sets of Regularization Layers: This field sets the L2 regularization for each network layer; like the hidden layers, it is a -1-separated list of 3-tuples. Note that the current defaults are 0.04,0.02 and 0.02,0.01,0.01. There is a bug in this parameter's description as well.
Learning Rate: Increasing this parameter speeds up training, although the algorithm may overshoot the minima if the value is too high. We can train our model on multiple learning rate values. Use the default values for this tutorial.
Max Epochs: If the dataset is big, consider increasing the maximum number of epochs, as the algorithm may take longer to converge. Use the default values for this tutorial.
Activation: ReLU is a commonly used activation function for most models. Change this parameter to one of the options in the list (see How-to-Guide on Building Optimal Models).
Batch Size: This defines the number of samples that will be propagated through the network. With overly large batches, the quality of the model degrades significantly, as measured by its ability to generalize. However, too small a batch size may cause the model to take a very long time to converge. For a dataset of ~2 K molecules, 64 is a reasonable batch size; for datasets of approximately 100 K, we may want to increase it to at least 5 K. For this tutorial, set the Batch Size to 64 or keep the Adjust Batch Size toggle On.
Neural Network Ensemble Size: This determines how many models are trained per run; the ensemble provides a robust confidence interval for the predictions.
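The "vertical pick" behavior described for Sets of Hidden Layers amounts to a Cartesian product over the rows. This is an illustrative reconstruction of that enumeration, not the floe's actual code; the row values below are the ones suggested above:

```python
from itertools import product

# Candidate node sizes per hidden layer (the rows suggested in this tutorial)
layer1 = [10, 25, 100]
layer2 = [20, 50, 75]
layer3 = [50, 75, 100]

# Each network architecture picks one value from each row
architectures = list(product(layer1, layer2, layer3))

print(len(architectures))  # 27 candidate 3-layer networks
print(architectures[0])    # (10, 20, 50)
```

This shows why the model count grows quickly: three rows of three values already yield 27 architectures, before multiplying by the other hyperparameter choices.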
That’s it! Let’s run the floe. Click the “Start Job” button.
Analysis of Output and Floe Report
After the floe has finished running, you can see your results in the Floe Report. To reach the Floe Report, navigate to the Jobs Tab on the Floe page and then click on the job that you want to inspect. Open the Floe Report by clicking on the panel. You may also choose it from the Reports section on the Job Panel.
The report is large, so refresh or pop the report to a new window if it takes a while to load. All results in the Floe Report are based on the validation data.
Figure 5 shows histograms that summarize statistics on the whole input data.
The graphs below show the Mean Absolute Error (MAE) for each neural network hyperparameter, which helps us analyze how sensitive the model is to each hyperparameter value. The lower the number, the better the model for that parameter value. For instance, we see from the graph that a dropout value of 0.1 and a maxr of 5 are better choices for future model build runs.
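The sensitivity analysis behind these plots amounts to grouping the trained models by a hyperparameter value and averaging their validation MAE. A minimal sketch with made-up numbers (the dropout values and MAEs below are illustrative, not taken from the report):

```python
from collections import defaultdict

# (dropout, validation MAE) pairs for a handful of hypothetical models
results = [
    (0.0, 0.62), (0.0, 0.58),
    (0.1, 0.45), (0.1, 0.49),
    (0.2, 0.55), (0.2, 0.53),
]

# Bucket validation MAEs by the hyperparameter value being analyzed
by_dropout = defaultdict(list)
for dropout, mae in results:
    by_dropout[dropout].append(mae)

# Average MAE per value; the lowest mean marks the most promising setting
mean_mae = {d: sum(v) / len(v) for d, v in by_dropout.items()}
best = min(mean_mae, key=mean_mae.get)
print(best)  # 0.1
```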
Figure 7 shows a heatmap of the two most sensitive hyperparameters. For future runs, we can choose values for these parameters closer to the lowest MAEs in the heatmap.
Next, we tabulate all models built by the fully connected network, sorted by their R2 score on the validation data. On the face of it, the first row should be the best model, since it has the best score on the validation data. But several other factors besides R2 determine this, starting with VLoss in the next column. To look at a sample model, click on the Model Link and look at the training curves under Neural Network Hyperparameters and Fingerprint Parameters.
For each model, a linear color scale shows the rank. The cheminformatics and machine learning parameters the model was trained on are also shown.
The training and validation MAE follow a similar trend, which is a good sign of not overfitting. Had they diverged, we might have to go back and tweak parameters such as the number of hidden layer nodes, dropouts, and regularizers. Too much fluctuation in the graph may also suggest we need a slower learning rate.
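A quick way to flag the divergence described above is to watch the gap between the training and validation MAE curves over epochs. A sketch with hypothetical numbers (not from a real run):

```python
# Training and validation MAE per epoch (illustrative curves)
train_mae = [0.90, 0.70, 0.55, 0.45, 0.38, 0.33]
val_mae   = [0.95, 0.78, 0.65, 0.60, 0.59, 0.61]

# The gap between the curves at each epoch
gaps = [v - t for t, v in zip(train_mae, val_mae)]

# A gap that keeps widening while training error falls suggests overfitting
overfitting = gaps[-1] > 2 * gaps[0]
print(overfitting)  # True
```

If the curves track each other closely instead, the model is likely generalizing; heavy oscillation in either curve points back at the learning rate.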
The Regression Output plots real versus predicted values to show how well the training is correlated.
Another section allows you to click on an interesting molecule to see the annotated explainer of the machine learning model.
While this tutorial shows how to build regression models, building an optimal one is a nontrivial task. Refer to the How-to-Guide on Building Optimal Models based on the data available.
After we have found a model choice, we can use it to predict and validate against unseen molecules. Go to the Use Pretrained Regression Models to Predict Properties of Molecules tutorial to learn how.