Tutorial: Building Machine Learning Regression Models on Fingerprints for Physical Property Prediction of Small Molecules

OpenEye Model Building is a tool to build machine learning models that predict a physical property of small molecules.

In this tutorial, we will train several regression models to predict IC50 concentrations at which molecules become reactive to pyrrolamides. The training data is known as the Pyrrolamides dataset. The regression models will be built on combinations of fingerprint and neural network hyperparameters. The results will be visualized in both the Analyze page and the Floe report. We will use the Floe report to analyze the models and choose a good one. Finally, we will use the chosen model to predict the property of some unseen molecules.

This tutorial uses the following Floe: ML Regression Model Building using Fingerprints for Small Molecules.

Warning

This tutorial keeps the default parameters and builds around 1k machine learning models. While the per-model cost is low, the total cost can be high depending on the size of the dataset. For instance, with the Pyrrolamides dataset (~1k molecules) it costs around $100. To understand how the Floe works, we suggest building fewer models by referring to the Cheap and Faster Version of the tutorial.

Create a Tutorial Project

Note

If you have already created a Tutorial project, you can re-use the existing one.

Log into Orion and click the home button at the top of the blue ribbon on the left of the Orion interface. Then click the ‘Create New Project’ button, enter Tutorial as the name of the project in the pop-up window, and click ‘Save’.

Orion home page

Floe Input

The Floe requires an input dataset file, “Input Small Molecules to train machine learning models on”, in which each record contains an OEMolField.

There needs to be a separate field in this file containing the Float physical property to train the network on. This field needs to be selected in the promoted field parameter Response Value Field.

The Pyrrolamides dataset contains several OERecord(s). As stated, the Floe expects two things from each record:

  • An OEMol which is the molecule to train the models on

  • A Float value which contains the regression property to be learnt. For this example, it is the NegIC50 concentration.

Here is a sample record from the dataset:

OERecord (
    *Molecule(Chem.Mol)* : c1ccc(c(c1)NC(=O)N)OC[C@H](CN2CCC3(CC2)Cc4cc(ccc4O3)Cl)O
    *NegIC50(Float)* : 3.04
)

The input molecules and the training field values can be filtered and transformed by the preprocessing units:

  • Promoted parameter Negative Log : Applies -Log10 on training field values

  • Promoted parameter Preprocess Molecule :

    • Keeps the largest molecule if more than one is present in a record.

    • Sets pH value to neutral.

  • Promoted parameter Blockbuster Filter :
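The Negative Log transform can be sketched in a few lines of Python. This is an illustration only, not the Floe's implementation: the function name `negative_log10` and the sample values are hypothetical, and we assume the raw values are strictly positive.

```python
import math


def negative_log10(value: float) -> float:
    """Return -log10(value); the input must be strictly positive."""
    if value <= 0:
        raise ValueError("negative log is undefined for non-positive values")
    return -math.log10(value)


# Hypothetical raw response values (for example, concentrations).
raw_values = [1e-3, 5e-6, 2.5e-8]

# Smaller raw values map to larger transformed values.
transformed = [negative_log10(v) for v in raw_values]
```

If the response field already holds negative-log values, applying the transform again would distort the training target.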

Run OEModel Building Floe

  • Click on the ‘Floe’ button in the left menu bar

  • Click on the ‘Floes’ tab

  • Under Categories select Packages

  • Select OpenEye Model Building

  • Click on the ML Regression Model Building using Fingerprints for Small Molecules and a Job Form will pop up. Specify the following parameter settings in the Job Form.

  • Click on the Input Small Molecules to Train Machine Learning Models on. button

  • Select the Pyrrolamides dataset or your own dataset.

  • Under the Options tab, select the ‘Response Value Field’ which the model will train on. This field dynamically generates a list of columns to choose from based on the uploaded dataset’s columns. For our data, it is NegIC50.

  • Select how many model reports you want to see in the final Floe report. This field prevents memory blowup in case you generate >1k models. In such cases viewing the top 20-50 models should suffice.

  • Next, turn on the Boolean parameter Preprocess Molecule. This preps every molecule by removing salts, charge, and multiple components, keeping the largest molecule. It also sets the ionization of the molecule to neutral pH.

  • Optionally, apply the blockbuster filter as well

  • Select how you want to view the ‘Molecule Explainer Type’ for machine learning results prediction.

    • ‘Atom’ annotates every atom by its degree of contribution towards the final result.

    • ‘Fragment’ does this for every molecule fragment (generated by the oemedchem toolkit) and is the preferred method of medicinal chemists.

    • ‘Combined’ produces visualisations of both these techniques combined.

  • If the training values need to be transformed to negative log, turn on the Negative Log switch as well. This is helpful when, for instance, the data needs to be converted from IC50 to NegLogIC50. Keep it off for this example, since the Pyrrolamides response field is already NegIC50.

  • Change the default name of outputs to something recognizable.

  • You can let the Floe run at this point and it should run successfully with default parameters. However, we can tweak a few parameters to learn more about the functionality.

  • Open the Advanced: Cheminformatics Fingerprint Options. This tab has all the cheminformatics parameters the model will be built on. We can change or add more values to the fingerprint parameters as shown in the image.

  • For suggestions on tweaking, read the How-to-Guide on building Optimal Models
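Adding values to a parameter list increases the number of models quickly. The sketch below assumes that one model is built for every combination of the listed values (a full grid), which is our reading of "a combination of fingerprint and neural network hyperparameters" rather than something this tutorial states outright. The parameter names and values are hypothetical placeholders, not the Floe's defaults.

```python
from itertools import product

# Hypothetical grid: names and values are placeholders for illustration.
grid = {
    "fingerprint_type": ["circular", "tree", "path"],
    "max_radius": [3, 5],
    "dropout": [0.0, 0.1, 0.2],
    "learning_rate": [0.001, 0.01],
}


def count_models(grid: dict) -> int:
    """Number of combinations when every listed value is crossed with every other."""
    total = 1
    for values in grid.values():
        total *= len(values)
    return total


# Enumerate the combinations explicitly to confirm the count.
combinations = list(product(*grid.values()))
models_before = count_models(grid)  # 3 * 2 * 3 * 2 = 36

# Adding a single extra dropout value grows the grid by a third.
grid["dropout"].append(0.3)
models_after = count_models(grid)  # 3 * 2 * 4 * 2 = 48
```

Under this assumption, the total cost scales with the product of the list lengths, so trimming even one list is an effective way to build fewer models.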

Next we move to the Advanced: Neural Network Hyperparameters Options. This is where the Machine Learning parameters are listed. Again, we can leave them as defaults, or choose to add/modify a few values based on the How-To-Guide.

  • We add Dropouts to prevent overfitting.

  • Next, let us inspect the parameter Sets of Hidden Layers.

    • Since it is a -1 separated list, by default there are 3-layer networks of size (250,150,50) and (150,100,80).

    • This gives a total node size of 330 (150+100+80) and 240 (100+80+60) in the default models.

    • Plugging these into the formula stated in the How-to-Guide on building Optimal Models, we should probably reduce the number of nodes to prevent overfitting.

  • Another important hyperparameter is the ‘Sets of regularisation Layers’. This field sets L2 regularisation for each network layer, making it a -1 separated 3-tuple list as well.

  • Increasing the Learning Rates is a way of speeding things up, although the algorithm may not detect the minima if this value is too high. We can train our model on multiple learning rate values. Leave defaults here.

  • If the dataset is big, we may consider increasing the Max Epochs size as it may take the algorithm longer to converge. Leave defaults here.

  • Activation: ReLU is the most commonly used activation function. Change to one of the other options in the list if needed (see How-to-Guide on building Optimal Models).

  • The Batch Size defines the number of samples that are propagated through the network at a time. With larger batch sizes, the quality of the model, as measured by its ability to generalize, can degrade significantly. However, with too small a batch size the model may take a very long time to converge. For a dataset size of ~2k, 64 is probably a good batch size. However, for datasets of around 100k, we may want to increase the batch size to at least 5k.

    • Set Batch size to 64
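The arithmetic behind the hidden layer sets and the batch size can be sketched as follows. We assume here that the layer sets are supplied as a flat list of integers with -1 separating one network from the next; the exact input format in the Job Form may differ. The helper names `parse_layer_sets` and `steps_per_epoch` are hypothetical, and the layer sizes are the ones used in the node-count arithmetic above.

```python
import math


def parse_layer_sets(values, separator=-1):
    """Split a flat list of layer sizes into one tuple per network."""
    sets, current = [], []
    for v in values:
        if v == separator:
            if current:
                sets.append(tuple(current))
            current = []
        else:
            current.append(v)
    if current:
        sets.append(tuple(current))
    return sets


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of weight updates needed to pass over the dataset once."""
    return math.ceil(n_samples / batch_size)


layer_sets = parse_layer_sets([150, 100, 80, -1, 100, 80, 60])
node_totals = [sum(s) for s in layer_sets]  # [330, 240]

small_steps = steps_per_epoch(2_000, 64)       # 32 updates per epoch
large_steps = steps_per_epoch(100_000, 5_000)  # 20 updates per epoch
```

With a batch size of 64, a 100k dataset would need 1563 updates per epoch, which illustrates why a larger batch size is suggested for larger datasets.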

  • That’s it! Let’s go ahead and run the Floe!

Note

The default memory requirement for each cube has been set to moderate to keep the price low. If you are training on larger datasets, be mindful of the time it takes for the cubes to run to completion. Increasing the memory lets Orion assign instances with more processing power, resulting in faster completion. In some cases, the cubes may run out of memory (indicated in the log report) for large datasets. Increasing cube memory will fix this as well.

Analysis of Output and Floe Report

After the Floe has finished running, click the link on the ‘Floe Report’ tab in your window to preview the report. Since the report is big, it may take a while to load. Try refreshing or popping the report out to a new window if this is the case. All results reported in the Floe report are on the validation data. The top part summarizes statistics on the whole input data.

The graphs show the Mean Absolute Error (MAE) for different values of the cheminformatics and neural network hyperparameters. They help us analyze how sensitive the results are to each hyperparameter value and plan future model builds accordingly. For instance, the graph below shows that a dropout of 0.1 and a maxr of 5 are better choices for future model build runs.

There is also a plot of the two most sensitive hyperparameters against each other. In the example below, the two most sensitive parameters are the regularisation 1 and regularisation 0. Choosing values around the minimum in the MAE heatmap (0 for reglayer1 and 0.04 for reglayer2) should build better models in future runs.

Next, we tabulate the list of all models built by the fully connected network. These models are sorted by R2 Score (for validation data). On the face of it, the first row should be the best model since it has the best score on the validation data. But several other factors besides the best R2 can determine this, starting with Loss in the next column. Hence, let’s look at a sample model by clicking the ‘Model Link’. This will take you to a new Floe report page.
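For reference, the two metrics used in the report can be computed as in the sketch below. These are the standard textbook definitions of MAE and R2; the tutorial does not state exactly how the Floe computes them, and the sample values are made up. Lower MAE is better, while higher R2 is better.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between real and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up validation values, for illustration only.
y_true = [3.0, 4.0, 5.0, 6.0]
y_pred = [3.5, 4.0, 4.5, 6.0]

mae = mean_absolute_error(y_true, y_pred)  # 0.25
r2 = r2_score(y_true, y_pred)              # 0.9
```

Because the two metrics weigh errors differently, two models can rank differently by R2 and by MAE, which is one reason to look beyond the first row of the table.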

For each model, we have a linear color scale showing the rank. We also have the cheminformatics and machine learning parameters the model was trained on.

Let’s look at the training curves under ‘Neural Network Epoch Training Plots’. We see that the training and validation MAE follow a similar trend which is a good sign of not overfitting. Had they been diverging, we might have to go back and tweak parameters such as number of hidden layer nodes, dropouts, regularizers etc. If there were too much fluctuation in the graph it would suggest we need a slower learning rate.
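The visual check described above can be approximated numerically. The sketch below is a simple heuristic of our own, not something the Floe reports: it flags a run as diverging when, over the last few epochs, the validation MAE rises while the training MAE keeps falling. The function name, the window size and both sets of curves are hypothetical.

```python
def is_diverging(train_mae, val_mae, window=4):
    """True if validation MAE rose while training MAE fell over the last `window` epochs."""
    if len(train_mae) != len(val_mae):
        raise ValueError("curves must have the same number of epochs")
    if len(train_mae) < window:
        raise ValueError("not enough epochs for the requested window")
    train_falling = train_mae[-1] < train_mae[-window]
    val_rising = val_mae[-1] > val_mae[-window]
    return train_falling and val_rising


# Hypothetical per-epoch MAE curves.
healthy_train = [1.00, 0.80, 0.65, 0.55, 0.50, 0.47]
healthy_val = [1.10, 0.90, 0.75, 0.66, 0.60, 0.58]

overfit_train = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18]
overfit_val = [1.10, 0.90, 0.85, 0.90, 0.98, 1.05]

healthy_flag = is_diverging(healthy_train, healthy_val)  # False
overfit_flag = is_diverging(overfit_train, overfit_val)  # True
```

A check like this only compares two points per curve, so noisy curves are still best judged by eye on the plots.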

The Regression Outputs section plots real versus predicted values to show how well they correlate. Below that, click on a molecule of interest to see the annotated explainer of the machine learning model.

While this tutorial shows how to build regression models, building an optimized model is a non-trivial task. Refer to the How-to-Guide on building Optimal Models.

After we have chosen a model, we can use it to predict and/or validate against unseen molecules. Go to the next tutorial, Use Pretrained Model to predict generic property of molecules, to do this.