ML Build: Regression Model with Tuner on User-Based Feature Vector Input

In this tutorial, we will train several regression models to predict the IC50 concentrations at which molecules become reactive to pyrrolamides. The training data is known as the PyrrolamidesA dataset. The regression models will be built on user-provided custom features and neural network hyperparameters. The results will be visualized in both the Analyze page and the Floe Report. We will use the Floe Report to analyze the models and choose a good one. Finally, we will use a chosen model to predict the properties of some unseen molecules.

This tutorial uses the following floe:

  • ML Build: Regression Model using Feature Input

Warning

This tutorial keeps the default parameters and builds approximately 1K machine learning models. While the per-model cost is low, the total cost can be high depending on the size of the dataset. For instance, with the pyrrolamides dataset (~1K molecules), it costs around $100. To understand how the floe works, we suggest building fewer models by referring to the Cheaper and Faster Version of the tutorial.

Note

If you need to create a tutorial project, please see the Setup Directions for the Machine Learning Model Building Tutorials.

Floe Input

This floe takes the PyrrolamidesA dataset as input. The input dataset contains several OERecords. The floe expects three things from each record:

  • An OEMol, which is the molecule to train the models on.

  • A Float value, which contains the regression property to be learned. For this example, it is the NegIC50 concentration.

  • A FloatVec value, which contains the custom features on which the models will be trained. For this example, it is the FloatVec field called “Feature” and contains 11 properties of the molecule, such as PSA, XlogP, and so on.

Here is a sample record from the dataset:

OERecord (
    *Molecule(Chem.Mol)* : c1cc(c(nc1)N2CCC(CC2)NC(=O)c3cc(c([nH]3)Cl)Cl)[N+](=O)[O-]
    *NegIC50(Float)* : -2.6
    *Feature(FloatVec)*: 25, 1.26, 2, 0.33, 4, 2, 1, 0, 0, 4, 104.15
)

Run OEModel Building Floe

  • Choose the ML Build: Regression Model using Feature Input Floe and click “Launch Floe” to bring up the Job Form. The parameters can be specified as described below.

  • Input Small Molecules to Train Machine Learning Models On: Select the PyrrolamidesA dataset or your own dataset.

  • Outputs: The output will be saved to the names in these fields, so change them to something meaningful to you. We will use the defaults for this tutorial.

  • Response Value Field: This is the property to be learned. The drop-down field generates a list of options based on the input dataset. For our data, choose NegIC50.

  • Custom Feature: This is the input training data field for the model. These features can be anything from molecule descriptors to docking interactions to electrostatic properties. All float vectors need to be of the same size. This allows scientists to generate learnable sets of parameters on which the models are to be trained. Thus, we are no longer bound by fingerprints, which, though effective, have certain caveats. For this tutorial, choose Feature.

  • Number of Models to Show in Floe Report: This field prevents memory blowup in case you generate >1K models. In such instances, viewing the top 20–50 models should suffice.


Figure 1. The Job Form.
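As noted above, all custom feature vectors must be the same length. Before launching the floe, this is easy to sanity-check. The sketch below uses a plain in-memory view of the data with illustrative field names, not the Orion record API:

```python
# Hypothetical in-memory view of the dataset: one feature vector per record.
records = [
    {"NegIC50": -2.6, "Feature": [25, 1.26, 2, 0.33, 4, 2, 1, 0, 0, 4, 104.15]},
    {"NegIC50": -1.9, "Feature": [31, 0.88, 1, 0.41, 3, 1, 0, 1, 0, 5, 98.20]},
]

# Collect the set of distinct feature-vector lengths; it should have exactly one element.
lengths = {len(r["Feature"]) for r in records}
if len(lengths) != 1:
    raise ValueError(f"Feature vectors differ in length: {sorted(lengths)}")
print(f"All {len(records)} records have feature vectors of length {lengths.pop()}")
```

Running a check like this before training avoids a failed job caused by a single malformed record.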

  • Preprocess Molecule performs the following steps. Set it to On.

    • Keeps the largest molecule if more than one is present in a record.

    • Sets the pH value to neutral.

  • If you don’t preprocess the molecule, you can still apply the Blockbuster Filter if needed.

  • Negative Log: When On, this applies -Log10 to training field values. For example, this is helpful when the data needs to be converted from IC50 to NegLogIC50. Keep it Off for this example.

  • Number of Top Features to Explain: Select how many features you want shown when explaining the machine learning predictions.

    • The model-agnostic explainer lists which features of the input vector play significant roles in predicting the Response Value Field. This parameter adjusts how many top features are shown. Its value should be less than or equal to the length of the input feature vector (11 in this case). We recommend showing fewer than 10 of the most significant features, as the explainer diagrams are difficult to interpret if there is too much data.

  • Are We Using Keras Tuner: Indicates whether to use a hyperparameter optimizer with a Keras Tuner. Keep this On by default.

  • What Kind of Keras Tuner to Use: Keep Hyperband as the default or change to another option.


Figure 2. Additional Options parameters.
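The Negative Log option described above is simply -log10 applied to each training value. For example, assuming IC50 values expressed in molar units:

```python
import math

def neg_log10(value):
    """Apply -log10, e.g. to convert an IC50 into a NegLogIC50 (pIC50-style) value."""
    return -math.log10(value)

# An IC50 of 1 µM (1e-6 M) corresponds to a NegLog value of ~6.
print(neg_log10(1e-6))
```

Since our PyrrolamidesA data already contains NegIC50 values, the toggle stays Off for this tutorial.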

  • The default parameters create a working model. But we can tweak some parameters to learn more about the functionalities.

Advanced: Neural Network Hyperparameters Options: These are more specific machine learning parameters. We can use the defaults or choose to add or modify some values based on guidance from the How-to-Guide on Building Optimal Models.

  • Dropouts: These can be added to prevent overfitting.

  • Sets of Hidden Layers: Every row is the node size of each layer.

    • Since it is a -1 separated list, by default there are 3-layer networks of size (250,150,50) and (150,100,80).

    • This gives total node counts of 450 (250+150+50) and 330 (150+100+80) in the default models.

    • Plugging in the formula stated in the How-to-Guide on Building Optimal Models, we should reduce the number of nodes to prevent overfitting.

    Note that there is a bug in the current description of this parameter. It should read as follows: “E.g., by default, each list’s first tuple will create a NN with 3 hidden layers of size 10,25,50 (vertically pick the first value of each row), followed by 10,25,75 (row1col1, row2col1, row3col2), 10,25,100 (row1col1, row2col1, row3col3), 10,50,50 (row1col1, row2col2, row3col1), and so on.”

  • Sets of Regularization Layers: This field sets L2 regularization for each network layer, making it a -1 separated list of tuples as well. Note that the current defaults are 0.04,0.02 and 0.02,0.01,0.01. There is a bug in this parameter description as well.

  • Learning Rate: Increasing this parameter speeds things up, although the algorithm may not detect the minima if this value is too high. We can train our model on multiple learning rate values. Use the default values for this tutorial.

  • Max Epochs: If the dataset is big, consider increasing the epoch size, as it may take the algorithm longer to converge. Use the default values for this tutorial.

  • Activation: ReLU is a commonly used activation function for most models. If needed, change this parameter to one of the options in the list (see How-to-Guide on building Optimal Models).

  • Batch Size: This defines the number of samples that will be propagated through the network. With larger batch sizes, there is a significant degradation in the quality of the model, as measured by its ability to generalize. However, too small a batch size may cause the model to take a very long time to converge. For a dataset size of ~2K, 64 is a reasonable batch size. However, for datasets of approximately 100K, we may want to increase the batch size to at least 5K.

  • Set the batch size to 512 or keep the Adjust Batch Size toggle turned On.


Figure 3. Neural network hyperparameter options.
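The vertical-pick scheme for the hidden-layer lists amounts to a Cartesian product over per-layer candidate sizes. The sketch below illustrates the idea with the sizes from the corrected description; it is an approximation of the behavior, not the floe's actual implementation:

```python
from itertools import product

def parse_rows(values):
    """Split a -1 separated flat list into per-layer candidate rows."""
    rows, current = [], []
    for v in values:
        if v == -1:
            rows.append(current)
            current = []
        else:
            current.append(v)
    rows.append(current)
    return rows

# Illustrative candidates: layer 1 from [10], layer 2 from [25, 50], layer 3 from [50, 75, 100].
raw = [10, -1, 25, 50, -1, 50, 75, 100]
rows = parse_rows(raw)  # [[10], [25, 50], [50, 75, 100]]

# One network per combination: (10, 25, 50), (10, 25, 75), (10, 25, 100),
# (10, 50, 50), ... six combinations in total.
for combo in product(*rows):
    print(combo)
```

This matches the enumeration order in the corrected description: the last layer's candidates vary fastest.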

That’s it! Let’s run the floe. Click the “Start Job” button.

Note

The default memory requirement for each cube has been set to moderate to keep the price low. Although this floe is even cheaper than the fingerprint-based floes, if you are training larger datasets, be mindful of the time it takes for the cubes to run to completion. Increasing the memory will let Orion assign instances with more processing power, resulting in faster completion. For large datasets, the cubes may in some cases run out of memory (indicated in the log report); increasing cube memory will fix this as well.

Analysis of Output and Floe Report

After the floe has finished running, click the link on the Floe Report tab in your window to preview the report. Since the report is big, it may take a while to load. Try refreshing or popping the report out to a new window if this is the case. All results reported in the Floe Report are on the validation data.

Figure 4 shows statistics for all input data. The histograms represent the spread of data for each property of the input molecules. Skewed histograms or outliers may indicate the need for the data preprocessing floe in order to build better models.


Figure 4. Histograms for the summary statistics of the input data.

The graphs in Figure 5 show the Mean Absolute Error (MAE) for different values of the neural network hyperparameters. They indicate how model performance fluctuates with change in a single hyperparameter when all others are kept constant. They help us analyze the sensitivity of different hyperparameter values and plan future model builds accordingly.

There is also a plot of the two hyperparameters to which the MAE is most sensitive, plotted against each other.
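The sensitivity analysis behind these plots amounts to grouping the per-model results by a single hyperparameter value and comparing the mean MAE per group. A toy sketch (the result records here are invented for illustration, not actual floe output):

```python
from collections import defaultdict

# Hypothetical per-model results: learning rate, batch size, and validation MAE.
results = [
    {"lr": 0.001, "batch": 64,  "mae": 0.42},
    {"lr": 0.001, "batch": 128, "mae": 0.45},
    {"lr": 0.01,  "batch": 64,  "mae": 0.58},
    {"lr": 0.01,  "batch": 128, "mae": 0.61},
]

def mean_mae_by(results, key):
    """Average the validation MAE over all models sharing the same value of one hyperparameter."""
    groups = defaultdict(list)
    for r in results:
        groups[r[key]].append(r["mae"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

# In this invented data, lr=0.001 averages a lower MAE than lr=0.01.
print(mean_mae_by(results, "lr"))
```

A large spread between the group means signals a hyperparameter worth tuning carefully in future builds.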


Figure 5. Summary of the model hyperparameters on the validation set and the hyperparameters to which the MAE is most sensitive.

Next, we tabulate all models built by the fully connected network. These models are sorted by R2 score (on validation data). On the face of it, the first row should be the best model, since it has the best score on the validation data. But several other factors besides the best R2 determine this, starting with VLoss in the next column. To look at a sample model, click on the Model Link.


Figure 6. List of generated fully connected neural network models.

For each model, a linear color scale shows the rank. The cheminformatics and machine learning parameters the model was trained on are also shown.


Figure 7. Neural network regression model and LIME analysis.

The Neural Network Epoch Training Plots show the curves for the training and validation MAE. These follow a similar trend, which is a good sign of not overfitting. Had they diverged, we might have to go back and tweak parameters such as the number of hidden layer nodes, dropouts, and regularizers. Too much fluctuation in the graph may also suggest we need a slower learning rate.


Figure 8. Regression outputs.

The Regression Output plot shows real versus predicted values, indicating how well the predictions correlate with the training data.
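The agreement this plot visualizes is what the R2 score and MAE reported elsewhere in the Floe Report quantify. A minimal sketch of both metrics (the sample values are invented):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 minus residual variance over total variance."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error between real and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented NegIC50-style values for illustration.
y_true = [-2.6, -1.9, -3.1, -2.2]
y_pred = [-2.4, -2.0, -3.0, -2.5]
print(round(r2_score(y_true, y_pred), 3), round(mae(y_true, y_pred), 3))  # prints 0.815 0.175
```

An R2 close to 1 and a small MAE together indicate a tight diagonal in the real-versus-predicted plot.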

Below that, annotated explainers of the machine learning model are shown. You can click on interesting molecules to see explanations of some exemplary central and outlier molecules. Note that unlike fingerprints, we cannot map features back to molecules, so our explainer is a graph listing the features that figured most strongly in the prediction. The Number of Top Features to Explain parameter can be adjusted to change the x-axis of this plot.
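Because the features cannot be mapped back onto the molecule, the explainer ranks raw input features by their influence on the prediction. That general idea can be approximated with permutation importance: shuffle one feature column at a time and measure how much the error grows. This is a toy sketch of the technique using an invented linear model, not the floe's actual explainer:

```python
import random

def mae(ys, ps):
    return sum(abs(y - p) for y, p in zip(ys, ps)) / len(ys)

def permutation_importance(predict, X, y, n_features, seed=0):
    """Return the per-feature increase in MAE when that feature column is shuffled."""
    rng = random.Random(seed)
    base = mae(y, [predict(row) for row in X])
    scores = []
    for j in range(n_features):
        col = [row[j] for row in X]
        rng.shuffle(col)  # destroy the relationship between feature j and the target
        Xp = [row[:j] + [col[i]] + row[j + 1:] for i, row in enumerate(X)]
        scores.append(mae(y, [predict(row) for row in Xp]) - base)
    return scores

# Toy setup: only feature 0 matters, so it should dominate the importance scores.
X = [[float(i), float(i % 3)] for i in range(50)]
y = [2.0 * row[0] for row in X]
predict = lambda row: 2.0 * row[0]
scores = permutation_importance(predict, X, y, n_features=2)
print(scores)  # feature 0 gets a large score; feature 1 scores 0
```

Sorting the features by such a score and keeping the top few is essentially what the Number of Top Features to Explain parameter controls in the report.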

The final figure is a plot of the features which contributed either towards or against the response value the model was trained on. It offers useful insight into the features that made the most contributions across all the molecules in the validation set.


Figure 9. Explanations of outlier and central molecules, as well as the contributions of various features.

This tutorial shows how to build regression models; it’s important to remember that building an optimized model is a nontrivial task. Refer to the How-to-Guide on Building Optimal Models.

After we have chosen a model, we can use it to predict and/or validate against unseen molecules. Go to the Use Pretrained Regression Models to Predict Properties of Molecules tutorial for more information.