Tutorial: Predict Solubility of Druglike Molecules

OpenEye Model Building is a tool for building machine learning models that predict physical properties of small molecules.

In this tutorial, we will use the solubility floe to predict molecule solubility in log(uM). The floe predicts the solubility of each molecule and explains the prediction in terms of chemical moieties. It also gives confidence intervals and other useful information. The floe report provides a summary of the output.
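
To make the log(uM) unit concrete, here is a minimal sketch of the arithmetic. It assumes that log means the base-10 logarithm of the micromolar concentration, which this tutorial does not state explicitly. The function names are illustrative and are not part of any OpenEye API.

```python
import math


def log_um_to_um(log_um: float) -> float:
    """Convert a log10(uM) value to a concentration in uM (assumes base 10)."""
    return 10.0 ** log_um


def um_to_log_um(conc_um: float) -> float:
    """Convert a concentration in uM to log10(uM)."""
    if conc_um <= 0:
        raise ValueError("concentration must be positive")
    return math.log10(conc_um)


def um_to_mg_per_ml(conc_um: float, mol_weight: float) -> float:
    """Convert uM to mg/mL given a molecular weight in g/mol.

    uM * 1e-6 gives mol/L; multiplying by g/mol gives g/L, which equals mg/mL.
    """
    return conc_um * 1e-6 * mol_weight


# Under the base-10 assumption, a prediction of 2.0 log(uM) is 100 uM.
print(log_um_to_um(2.0))
```

Under this assumption, each unit of difference in the predicted value corresponds to a tenfold change in concentration.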

Note: All other pretrained floes to predict molecule properties work the same way.

This tutorial uses the following Floe:

  • Solubility Prediction for Small Molecule using ML and Cheminfo Fingerprints.

Create a Tutorial Project


If you have already created a Tutorial project you can re-use the existing one.

Log in to Orion and click the home button at the top of the blue ribbon on the left of the Orion interface. Then click the ‘Create New Project’ button, enter Tutorial as the project name in the pop-up window, and click ‘Save’.


Orion home page

Floe Input

The input dataset contains several OERecords. The floe expects an OEMol from each record. These are the molecules for which the model will predict solubility. Note: uploading .csv, .sdf, and other common file formats to Orion should automatically convert them to datasets. Let this dataset be P_1.

Here is a sample record from the dataset:

OERecord (
    Molecule(Chem.Mol) : c1ccc(c(c1)NC(=O)N)OC[C@H](CN2CCC3(CC2)Cc4cc(ccc4O3)Cl)O
)

The dataset can also include a float field containing measured solubility values to validate against. The sample dataset used here does not contain that field.
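
As a rough illustration of preparing such an input file, the sketch below builds a small CSV with one molecule per row using only the Python standard library. The column names (SMILES, Name, Solubility) and the helper build_input_csv are hypothetical. How Orion maps CSV columns to record fields is not described in this tutorial, so check the resulting dataset after upload.

```python
import csv
import io


def build_input_csv(rows, include_solubility=False):
    """Return CSV text with one molecule per row.

    Column names are hypothetical placeholders, not names required by Orion.
    """
    fieldnames = ["SMILES", "Name"]
    if include_solubility:
        # Optional measured values to validate predictions against.
        fieldnames.append("Solubility")
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",  # drop keys that are not listed columns
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


rows = [
    {
        "SMILES": "c1ccc(c(c1)NC(=O)N)OC[C@H](CN2CCC3(CC2)Cc4cc(ccc4O3)Cl)O",
        "Name": "example_1",
    },
]
csv_text = build_input_csv(rows)
print(csv_text)
```

Passing include_solubility=True adds the optional validation column. Rows without a value for it are written with an empty cell.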

Run OEModel-Apply Floe

  • Click on the ‘Floes’ button in the left menu bar

  • Click on the ‘Floes’ tab

  • Under the ‘Categories’ tab select ‘OpenEye Model Building’ package

  • In the search bar enter Solubility

  • A list of Floes will now be visible to the right

  • Launch the floe Solubility Prediction for Small Molecule using ML and Cheminfo Fingerprints and a Job Form will pop up. Specify the following parameter settings in the Job Form.

  • For the Input Dataset, choose P_1 from above

  • The predicted molecules will be saved to a dataset with the name given in the Output Solubility field. Change the default name to something recognizable.


That’s it! The floe should run and generate an output dataset and a floe report.

Analyze OEModel Floe Report

Here is a sample image of what the floe report should look like:


The top part contains a histogram summary of the data used to train the Solubility ML model. The red dotted lines mark the upper and lower quartiles of the data. This is followed by a histogram summary of the input prediction data P_1. It is worthwhile to compare the x-axis ranges of the training and prediction data. If they differ substantially, the prediction quality may deteriorate.
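
The range comparison described above can also be done numerically. The following is a minimal sketch, assuming you have the plotted values for the training and prediction data available as plain lists of numbers. The helper names are illustrative, and the floe report may compute its quartiles differently.

```python
import statistics


def summarize(values):
    """Return the min, max, and lower/upper quartiles of a list of numbers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "max": max(values), "q1": q1, "q3": q3}


def fraction_outside_training_range(training_values, prediction_values):
    """Fraction of prediction values that fall outside the training min/max."""
    low, high = min(training_values), max(training_values)
    outside = [v for v in prediction_values if v < low or v > high]
    return len(outside) / len(prediction_values)


# Toy numbers, for illustration only.
training = [1, 2, 3, 4, 5, 6, 7]
prediction = [0, 3, 8, 5]
print(summarize(training))
print(fraction_outside_training_range(training, prediction))
```

A large fraction of prediction values outside the training range is a signal to treat the corresponding predictions with more caution.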

The next part lists the hyperparameters of the neural network used to train the Solubility model. It is followed by a histogram of the output predictions and the confidence of each prediction. There is also a plot of prediction confidence versus the actual output. These overall statistics help you analyze the predictions for the input molecules.


We also have a plot for outliers prediction using:

Lastly, there is a link to a page under Interesting Molecules that shows the explainer-annotated images for the outlier and central molecules.

Analyze Output

  • Go to the data section of Orion and activate the data the floe produced. It should have the same name you chose for the Output Solubility field of your floe. To activate the data, click the small plus sign in a circle right next to it.

  • Next, go to the Analyze page in Orion. You should be able to see the molecules, their predicted solubility values, and the explanation of the output.

The output columns and their explanations are:

  • Confidence(Solubility loguM): How confident the model is in its prediction, on a scale of 0 to 1

  • Contributions(Solubility loguM): Explanation of the prediction based on a local model. If the image has a red or yellow background, an error or warning was issued. Depending on the choice of molecule explainer (Fragment by default), different parts of the molecule are color-annotated, with blue denoting higher solubility and red denoting the opposite.

  • HighestTanimotoSimilarity: The highest 2D Tanimoto similarity to any molecule in the training set

  • HighestSimilarPropertyValue: The solubility value on record for that training-set molecule

    • Together, these two fields tell us whether there is a similar molecule in the training set and, if so, what its solubility value is

  • Scope: if there is an error or warning, what caused the issue

  • Class Predict(Solubility loguM): Predicts the property as High, Medium, or Low. The background color indicates how confident the model is: green (most confident), yellow (average confidence), or red (less confident or out of scope)

  • Prediction(Solubility loguM): Prediction of solubility of the molecule in loguM
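
To illustrate what the HighestTanimotoSimilarity and HighestSimilarPropertyValue columns describe, here is a minimal sketch of a nearest-neighbor lookup by Tanimoto similarity. It assumes that fingerprints are represented as plain sets of on-bit indices. The fingerprint type and implementation used by the floe are not specified in this tutorial, and the function names are illustrative.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def nearest_training_molecule(query_bits, training_set):
    """Return (highest_similarity, property_value) over the training set.

    training_set is a list of (bits, property_value) pairs.
    """
    best_similarity, best_value = -1.0, None
    for bits, value in training_set:
        similarity = tanimoto(query_bits, bits)
        if similarity > best_similarity:
            best_similarity, best_value = similarity, value
    return best_similarity, best_value


# Toy fingerprints and property values, for illustration only.
training_set = [({1, 2}, 1.5), ({2, 3, 4}, 3.2)]
print(nearest_training_molecule({2, 3, 4, 5}, training_set))
```

A low highest similarity suggests the query molecule has no close analog in the training data, which is one reason to double-check the Scope column for that molecule.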



We assign an ID (#) to each record molecule. The IDs follow a linear ordering over all molecules, so if you activate both the successful and the failed predictions and sort them by #, the order should be the same as in the input.
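
If you export both outputs and work with them outside Orion, the same reordering is a single sort. This is a minimal sketch that assumes each record is available as a Python dict with the ID stored under the key "#". That representation is hypothetical and is not the OERecord API.

```python
def restore_input_order(successes, failures, id_field="#"):
    """Merge successful and failed records and sort them by their assigned ID."""
    return sorted(successes + failures, key=lambda record: record[id_field])


# Toy records, for illustration only.
successes = [{"#": 0, "status": "ok"}, {"#": 2, "status": "ok"}]
failures = [{"#": 1, "status": "failed"}]
merged = restore_input_order(successes, failures)
print([record["#"] for record in merged])
```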