Tutorial: Predict Solubility of Druglike Molecules
OpenEye Model Building is a tool for building machine learning models that predict physical properties of small molecules.
In this tutorial, we will use the solubility Floe to predict molecule solubility in log(uM). The Floe predicts the solubility of each molecule and explains the prediction. It also reports confidence intervals and other information, described below. The Floe report provides a summary of the output.
Note: All other pretrained Floes that a user builds to predict molecule properties work the same way.
This tutorial uses the following Floe:
- Solubility Prediction for Small Molecules using Machine Learning and Cheminfo Fingerprints
Create a Tutorial Project
If you have already created a Tutorial project, you can reuse the existing one.
Log into Orion and click the home button at the top of the blue ribbon on the left of the Orion interface. Then click the ‘Create New Project’ button, enter Tutorial as the project name in the pop-up window, and click ‘Save’.
The input dataset contains several OERecords. The Floe expects an OEMol in each record. These are the molecules for which the model will predict solubility. Let this dataset be P_1.
Here is a sample record from the dataset:
OERecord ( Molecule(Chem.Mol) : c1ccc(c(c1)NC(=O)N)OC[C@H](CN2CCC3(CC2)Cc4cc(ccc4O3)Cl)O )
The dataset may also contain a float field with solubility values to validate against. The dataset used in this tutorial does not contain such a field.
Run OEModel-Apply Floe
Click on the ‘Floe’ button in the left menu bar
Click on the ‘Floes’ tab
Under Categories select Packages
Select OpenEye Model Building
A list of Floes will now be visible
Click on the Solubility Prediction for Small Molecules using Machine Learning and Cheminfo Fingerprints Floe, and a Job Form will pop up. Specify the following parameter settings in the Job Form.
For the Input Dataset, choose P_1 from above.
All predicted molecules are saved to the dataset named in the Output Solubility field. Change the default name to something recognizable.
That’s it! The Floe should run and generate an output dataset and a Floe report.
The default memory requirement for each cube is set to a moderate value to keep the price low. If you are predicting the solubility of larger datasets, be mindful that the cubes may run out of memory (this is indicated in the log report). Increasing the memory setting in the cube parameters should fix this.
Analyze OEModel Floe Report
Here is a sample image of what the Floe report should look like:
The top part lists the hyperparameters with which the model was trained. Next come a histogram of the output predictions and the confidence associated with each prediction. There is also a plot of prediction confidence versus the actual output. These overall statistics help analyze the predictions for the input molecules.
There is also a plot of outlier predictions, identified using:
MAD Technique: Median Absolute Deviation
Percentile Technique: Percentile Outlier
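The two outlier techniques named above can be sketched in plain Python. This is only an illustration, not the Floe's implementation: the MAD variant shown is the common modified z-score, and the cutoff of 3.5, the 5th/95th percentile bounds, the function names, and the sample values are all assumptions of ours.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds `threshold` (assumed cutoff)."""
    med = statistics.median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = statistics.median(abs_dev)
    if mad == 0:
        # All deviations collapse to zero; nothing can be ranked as an outlier.
        return [False] * len(values)
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [0.6745 * d / mad > threshold for d in abs_dev]

def percentile_outliers(values, lower_pct=5, upper_pct=95):
    """Flag values outside the [lower_pct, upper_pct] percentile band (assumed bounds)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [v < low or v > high for v in values]

# Hypothetical predicted solubility values in loguM.
predictions = [2.1, 2.3, 2.2, 2.4, 2.0, 2.5, 2.2, 6.8]
print(mad_outliers(predictions))
print(percentile_outliers(predictions))
```

The MAD approach is robust to the outliers themselves because it relies on medians, while the percentile approach always flags a fixed share of the tails, even when the data contain no unusual values.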
There is also a plot of the number of active (1) fingerprint bits versus the solubility, to check whether any correlation exists.
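To make the idea behind the fingerprint-bit plot concrete, here is a minimal sketch that counts active bits and computes a Pearson correlation against solubility. The fingerprints and solubility values are made-up toy data, and the helper names are hypothetical; the Floe report itself only shows the plot.

```python
import math

def count_active_bits(fingerprint):
    """Count the bits set to 1 in a fingerprint given as a sequence of 0/1 ints."""
    return sum(1 for bit in fingerprint if bit == 1)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical 8-bit fingerprints and predicted solubility values (loguM).
fingerprints = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
]
solubility = [4.0, 3.0, 2.0, 1.0]

bit_counts = [count_active_bits(fp) for fp in fingerprints]
r = pearson(bit_counts, solubility)
print(bit_counts, round(r, 3))
```

In this toy data, solubility falls as the bit count rises, so the coefficient is strongly negative. Real datasets will usually show a much weaker relationship, if any.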
Lastly, a link under Interesting Molecules leads to a page showing the explainer-annotated images for the outlier and central molecules.
Go to the data section of Orion and activate the data the Floe produced. It should have the name you chose for the Output Solubility field of your Floe. To activate the data, click the small plus sign in a circle right next to it.
Now go to the analyze page in Orion, where you should be able to see the molecules, their predicted solubility values, and the explanation of the output.
There are three output ports: Success, Failure, and No Confidence.
Failure Data: (a) The molecule is too large or too small, or (b) the molecule has an unknown atom.
No Confidence Data: The molecule’s property falls outside the scope of the training set. In this case, the model predicts with no guarantees. The explainer image has a red background.
Success Data: (a) The molecule falls within scope, and the explainer has a green background, or (b) the molecule falls at the edge of scope, and the explainer has a yellow background.
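The routing rules above can be summarized as a small lookup. The sketch below is purely illustrative: the `Scope` enum, its member names, and the `route` function are hypothetical and do not correspond to any Orion or Floe API.

```python
from enum import Enum

class Scope(Enum):
    """Hypothetical scope statuses; the names are ours, not the Floe's."""
    WITHIN = "within scope"
    EDGE = "edge of scope"
    OUTSIDE = "outside scope"
    INVALID = "invalid molecule"  # too large, too small, or unknown atom

def route(scope):
    """Return (output port, explainer background color) for a scope status."""
    if scope is Scope.INVALID:
        return ("Failure", None)  # the tutorial states no background color for failures
    if scope is Scope.OUTSIDE:
        return ("No Confidence", "red")
    if scope is Scope.EDGE:
        return ("Success", "yellow")
    return ("Success", "green")

for status in Scope:
    print(status.value, "->", route(status))
```

Note that both green and yellow molecules land on the Success port, so the background color is the only way to tell a clean prediction from an edge-of-scope warning there.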
The output columns and their explanations are:
Class Confidence(Solubility (loguM)): High, Med or Low confidence in prediction.
Class Prediction(Solubility (loguM)): High, Med or Low prediction of solubility. Determined by the range provided by chembl18.
Contributions: Explanation of the prediction based on a local model. A red image background means no confidence, yellow is a warning, and green means the molecule falls within the scope of the predictor. Depending on the choice of molecule explainer (Fragment by default), different parts of the molecule are color annotated: blue denotes a greater contribution towards the physical property (solubility), while red denotes the opposite.
Scope: The cause of the issue, if there is an error or warning.
Prediction(Solubility (loguM)): Solubility prediction of the molecule in loguM.
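For readers who want to relate a loguM prediction to a concentration, the following sketch converts between the two. It assumes that loguM means the base-10 logarithm of the concentration in micromolar, which the tutorial does not state explicitly; the function names are hypothetical.

```python
import math

def log_um_to_molar(log_um):
    """Convert log10(concentration in uM) to a molar concentration (assumes base 10)."""
    return (10 ** log_um) * 1e-6

def molar_to_log_um(molar):
    """Convert a molar concentration to log10(concentration in uM) (assumes base 10)."""
    return math.log10(molar / 1e-6)

# Under this assumption, a prediction of 2.0 loguM corresponds to 100 uM.
print(log_um_to_molar(2.0))
print(molar_to_log_um(1e-4))
```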
We assign an ID (#) to each record molecule, following a linear ordering over all molecules. If you activate all the successful, no confidence, and failure predictions and sort them by #, the order should be the same as in the input.
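If you export the three outputs, the same reordering can be done outside Orion. The sketch below represents records as plain dictionaries with a "#" key; this layout, the "port" key, and the sample IDs are assumptions for illustration only and do not reflect how OERecords are actually stored.

```python
# Hypothetical exported records from each output port, keyed by the assigned ID (#).
success = [{"#": 0, "port": "Success"}, {"#": 3, "port": "Success"}]
no_confidence = [{"#": 2, "port": "No Confidence"}]
failure = [{"#": 1, "port": "Failure"}]

# Merge all three outputs and sort by ID to recover the input order.
combined = sorted(success + no_confidence + failure, key=lambda rec: rec["#"])

for rec in combined:
    print(rec["#"], rec["port"])
```

Because the IDs are assigned over all input molecules before routing, a gap in the sorted IDs would indicate that one of the three outputs was left out of the merge.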