ML ReBuild: Transfer Learn ML Regression Model Using Fingerprints for Small Molecules

In machine learning, transfer learning is the process of retraining or repurposing a previously built ML model for a new problem or domain. Imagine having built a QSAR model for solubility on a dataset of X molecules. Now we acquire a new set of molecules with measured solubility, only a fraction of X in size. These molecules differ from the initial training set in either moiety or pharmacophore. We would like to update our previous model to include knowledge of this new data. Next, consider the problem of having built a QSAR model on thermodynamic solubility. We want to repurpose this model to predict kinetic solubility, since the training data for kinetic solubility is not large enough to build a robust model on its own. In both examples, we wish to reuse previously built, robust ML models to extend to a new domain or to solve a new but similar problem.

In this tutorial, we will use the ML ReBuild: Transfer Learn ML Regression Model using Fingerprints for Small Molecules Floe to train a previously built ML model using transfer learning.

We will provide you with a built ML model, but if you want to build one from scratch, refer to the tutorial Building Machine Learning Regression Models for Property Prediction of Small Molecules.

Note

If you need to create a tutorial project, please see the Setup Directions for the Machine Learning Model Building Tutorials.

Floe Input

The inputs required for the floe are:

  • Molecule Dataset (P1) with Float response value

  • Tensorflow Machine Learning Model Dataset (M1) to transfer learn on

For P1, uploading a .csv, .sdf, or similar file to Orion should automatically convert it to a dataset. The solubility dataset contains several OERecords. The floe expects two things from each record:

  • An OEMol, which is the molecule to train the models on.

  • A Float value that contains the regression property to be learned. For this example, it is the solubility (logμM) value.

Here is a sample record from the dataset:

OERecord (
    *Molecule(Chem.Mol)* : c1ccc(c(c1)NC(=O)N)OC[C@H](CN2CCC3(CC2)Cc4cc(ccc4O3)Cl)O
    *Solubility loguM(Float)* : 3.04
)
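If your data starts as a flat file rather than an Orion dataset, a CSV with a SMILES column and a float response column is all the floe needs; Orion converts it to a dataset on upload. Below is a minimal sketch using only the Python standard library (the file name and example rows are hypothetical; the response column name should match whatever your model was trained on):

```python
import csv

# Hypothetical rows: a SMILES string plus the float response to learn.
rows = [
    ("c1ccc(cc1)O", 1.52),
    ("CCO", 2.10),
]

with open("solubility_input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["SMILES", "Solubility loguM"])  # header row
    writer.writerows(rows)
```

On upload, each row becomes one OERecord with a molecule field and a float field, matching the sample record shown above.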

The M1 dataset contains one or more machine learning models. To learn how to generate these models, read previous tutorials on building ML models.

The model has been trained on logμM solubility values from ChEMBL data. The retraining dataset comes from a different version of ChEMBL and has no molecules in common with the original training set.

Note that this response value must exactly match the response value on which the models in M1 were trained.

Run OEModel Building Floe

Choose the ML ReBuild: Transfer Learn ML Regression Model using Fingerprints for Small Molecules Floe. Click “Launch Floe” to bring up the Job Form. The parameters can be specified as below.

  • Input Small Molecules to Train Machine Learning Models: Select the P1 dataset or your own dataset.

  • Input Tensorflow Model: Select the pretrained solubility model M1 or your own trained model.

  • Outputs: You can change the names of the built ML models to be saved in Orion, but we will use the defaults for this tutorial.

  • Select Models for Transfer Learning Training: If your input ML model dataset has more than one model, you need to pick which model you want to transfer learn on. If the default -1 is retained, it will train all the available models in parallel. This can be expensive if the M1 dataset has many models.

    • For our example, our M1 dataset has only one model with model ID 2040. So keeping this parameter at -1 or setting it to 2040 will have the same result.

  • Response Value Field: Select the property which the model will train on. This parameter dynamically generates a list of columns to choose from based on the input dataset P1. For this tutorial, it is solubility (logμM). Note that the response value must exactly match the response value on which the models in the M1 dataset were trained.

  • Number of Models to Show in Floe Report: This parameter prevents excessive memory use in case you generate more than 1,000 models. In such cases, viewing the top 20-50 models should suffice.

  • Preprocess Molecule: Set to On.

    • Keeps only the largest molecule.

    • Sets the pH to neutral.

  • If you don’t preprocess the molecule, you can still apply the Blockbuster Filter if necessary.

  • Negative Log: When On, this allows you to transform the training values (solubility in this case) to a negative log. Keep it Off for this example.

  • As a general rule, select the previous three parameters based on the settings used for building the original model.

  • Molecule Explainer Type: Select the model visualization method.

    • Atom annotates every atom by its degree of contribution toward the final result.

    • Fragment does the same for every molecular fragment (generated by OEMedChem TK) and is the method preferred by medicinal chemists.

    • Combined produces visualizations of both techniques combined.
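To make the Preprocess Molecule and Negative Log options concrete, here is a small illustrative sketch (not the floe's actual implementation). Keeping the largest "."-separated SMILES fragment approximates the "keep only the largest molecule" step, and a -log10 transform is the kind of conversion the Negative Log toggle applies to raw training values:

```python
import math

def largest_fragment(smiles):
    # Counterions and solvents appear as '.'-separated fragments in SMILES.
    # We approximate "largest" by string length; a real implementation
    # would compare heavy-atom counts on parsed molecules.
    return max(smiles.split("."), key=len)

def negative_log(value):
    # Transform a raw concentration (e.g., uM) to a negative log scale.
    return -math.log10(value)

largest_fragment("CC(=O)O.[Na+]")   # keeps the acetate fragment "CC(=O)O"
negative_log(100.0)                 # -> -2.0
```

Since our tutorial data is already on a log scale (logμM), the Negative Log toggle stays Off.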

How to change layers to train on for transfer learning

Turn On the “Show Cube Parameters” toggle at the bottom of the Job Form and scroll to the Parallel Transfer Learning Neural Network Regression Training Cube. The Layers to Freeze parameter sets the number of initial layers of the model that will NOT be retrained. Increase this value if the retraining dataset is smaller than the original training data.
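The idea behind freezing layers can be shown with a toy sketch in plain Python (the weight values and two-layer structure are invented for illustration; this is not the cube's code). The first layer is held fixed while only the final layer is updated on the new data:

```python
# Toy illustration of "Layers to Freeze": a two-layer model where the
# first layer (w1, b1) is frozen and only the final layer (w2, b2)
# is retrained on the new data.

def forward(x, w1, b1, w2, b2):
    h = w1 * x + b1          # frozen "feature" layer
    return w2 * h + b2       # retrained output layer

def retrain(data, w1, b1, w2, b2, lr=0.1, epochs=500):
    # Stochastic gradient descent on squared error; w1 and b1 are
    # deliberately never updated, mimicking frozen layers.
    for _ in range(epochs):
        for x, y in data:
            h = w1 * x + b1
            err = (w2 * h + b2) - y
            w2 -= lr * err * h
            b2 -= lr * err
    return w2, b2

new_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # small retraining set
w2, b2 = retrain(new_data, w1=1.0, b1=0.0, w2=0.5, b2=0.5)
```

The frozen layer preserves what the pretrained model learned, while the retrained layer adapts to the new data (here it converges to roughly y = 2x + 1). Freezing more layers is safer when the retraining set is small, since fewer parameters must be estimated from it.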

That’s it! Let’s run the floe. Click the “Start Job” button.

Analysis of Output and Floe Report

After the floe has finished, you can see your results in the Floe Report. To reach the Floe Report, navigate to the Jobs Tab on the Floe page and then click on the job that you want to inspect. Open the Floe Report by clicking on the panel. You may also choose it from the Reports section on the Job Panel.

The report is large, so refresh or pop the report to a new window if it takes a while to load. All results in the Floe Report are based on the validation data.

Note that the Floe Report is designed to give a comparative study of the different models built. In this example, since we performed transfer learning on only one model, the graphs may be less insightful than those with several data points.

The information shown on the Analyze page looks similar to that of a regular ML model-building floe, except that there is no hyperparameter tuning. Please refer to the Analysis of Output and Floe Report section for more information.