Solubility Prediction for Small Molecules using ML and Cheminfo Fingerprints

This floe predicts the solubility of small, drug-like molecules in log μM. Predictions come from a TensorFlow-based fully connected neural network regression model. The domain of applicability and the error bars are estimated with a convex box approach and a TensorFlow-based probabilistic fully connected neural network. Both TensorFlow models were trained on 2D fingerprints.

Finally, this floe uses LIME, a model-agnostic explainer, to explain the predicted solubility of the molecule(s). The floe is cheap and quick, adding about 1.5 seconds for a property prediction of 10 molecules.
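Because predictions are reported in log μM, it can help to convert to and from more familiar units. The sketch below assumes a base-10 logarithm. The function names are illustrative and are not part of the floe.

```python
import math


def molar_to_log_um(solubility_mol_per_l: float) -> float:
    """Convert a solubility in mol/L to log10 of its value in micromolar."""
    if solubility_mol_per_l <= 0:
        raise ValueError("solubility must be positive")
    # 1 mol/L equals 1e6 micromolar.
    return math.log10(solubility_mol_per_l * 1e6)


def logs_to_log_um(log_s: float) -> float:
    """Convert logS (log10 of mol/L) to log10 of micromolar."""
    # log10(x * 1e6) = log10(x) + 6
    return log_s + 6.0


def log_um_to_molar(log_um: float) -> float:
    """Convert a log10 micromolar value back to mol/L."""
    return 10.0 ** log_um / 1e6
```

For example, a solubility of 1 mM (0.001 mol/L) is 1000 μM, which corresponds to a value of about 3 in log μM.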

Outputs:

Failure Dataset: The molecule (a) is too large or too small, or (b) has an atom not encountered in the training set.

No Confidence Dataset: The molecule is deemed out of scope compared to the training set (details below). In this case, the model predictions are unreliable. The explainer image has a red background.

Success Dataset: The molecule falls either (a) within scope, in which case the explainer image has a green background, or (b) at the edge of scope, in which case the explainer image has a yellow background.

Molecules outside the scope of the training set are sent to the “No Confidence” port because their predictions are unreliable. The scope is defined by the ranges of molecular weight, atom count, polar surface area, and calculated logP spanned by the training set molecules. These ranges are reported in the Floe Report.
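The range-based scope check described above can be sketched as a simple bounding box over the four descriptors. This is a minimal illustration, not the floe's implementation. The descriptor keys, the function names, and especially the `edge_fraction` rule for deciding what counts as "edge of scope" are assumptions, since the documentation does not state how the edge is determined.

```python
from typing import Dict, List, Tuple

# Hypothetical descriptor keys; the floe's actual field names are not documented here.
DESCRIPTORS = ("mol_weight", "atom_count", "polar_surface_area", "logp")

Box = Dict[str, Tuple[float, float]]


def fit_box(training: List[Dict[str, float]]) -> Box:
    """Record the (min, max) range of each descriptor over the training set."""
    return {
        name: (min(r[name] for r in training), max(r[name] for r in training))
        for name in DESCRIPTORS
    }


def classify(record: Dict[str, float], box: Box, edge_fraction: float = 0.05) -> str:
    """Return 'in_scope', 'edge_of_scope', or 'out_of_scope'.

    Assumption: a value within edge_fraction of either end of a range
    counts as the edge of scope.
    """
    status = "in_scope"
    for name in DESCRIPTORS:
        low, high = box[name]
        value = record[name]
        if value < low or value > high:
            return "out_of_scope"
        margin = edge_fraction * (high - low)
        if value < low + margin or value > high - margin:
            status = "edge_of_scope"
    return status


def route(status: str) -> Tuple[str, str]:
    """Map a scope status to the output dataset and explainer background color."""
    return {
        "in_scope": ("Success", "green"),
        "edge_of_scope": ("Success", "yellow"),
        "out_of_scope": ("No Confidence", "red"),
    }[status]
```

A molecule is rejected as soon as any single descriptor leaves its training range, which matches the idea of a box: every coordinate must lie inside its interval.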

Inputs
Name	Description	Type
Input Small Molecule(s) Dataset to Predict Property of	The dataset(s) to read records from.	Molecule Dataset

Explanation and Validation
Name	Description	Type
Molecule Explainer Type	Select the explainer visualization. Atom: annotate atoms only. Fragment: annotate fragments only. Combined: annotate both.	List
Property Validation Field	If the dataset contains a baseline solubility value in this field, the Floe Report includes a comparison between the baseline and the predictions.	Float
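A comparison between predictions and a baseline field is typically summarized with regression error metrics. The sketch below computes mean absolute error and root mean squared error. The documentation does not state which metrics the Floe Report uses, so the choice of metrics and the function name are assumptions.

```python
import math
from typing import Dict, Sequence


def compare_to_baseline(predicted: Sequence[float], baseline: Sequence[float]) -> Dict[str, float]:
    """Summarize prediction error against baseline values (both in log uM)."""
    if len(predicted) != len(baseline) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - b for p, b in zip(predicted, baseline)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a quick sense of whether a few molecules dominate the disagreement.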

Outputs
Name	Description	Type
Output Solubility	Output dataset to write to.	Dataset
No-confidence Solubility	Output dataset to write to.	Dataset
Failed Dataset Name	Output dataset to write to.	Dataset

Training Data Details

The solubility predictor has been trained on the ChEMBL 30 dataset released in 2022. It is an opaque solubility database. More information can be found on the ChEMBL blog.