Solubility Prediction for Small Molecules using ML and Cheminfo Fingerprints
This floe predicts the solubility of small, drug-like molecules in log μM. A TensorFlow-based fully connected neural network regression model produces the prediction. The domain of applicability and the error bars come from a convex box approach combined with a TensorFlow-based probabilistic fully connected neural network. Both TensorFlow models were trained on 2D fingerprints.
Finally, this floe uses LIME, a model-agnostic explainer, to explain the predicted solubility of the molecule(s). The floe is cheap and quick, adding about 1.5 seconds for a property prediction on 10 molecules.
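Because the output is reported in log μM, it can help to convert a prediction into more familiar units. The following is a minimal sketch, not part of the floe itself. It assumes the logarithm is base 10 (the text above does not state the base), and the function names are hypothetical.

```python
import math


def log_um_to_um(log_um: float) -> float:
    """Convert a log10(uM) value to a concentration in uM.

    Assumption: the floe's "log uM" is a base-10 logarithm.
    """
    return 10.0 ** log_um


def log_um_to_log_molar(log_um: float) -> float:
    """Convert log10(uM) to log10(mol/L), using 1 uM = 1e-6 mol/L."""
    return log_um - 6.0


def um_to_mg_per_ml(conc_um: float, mol_weight: float) -> float:
    """Convert uM to mg/mL given a molecular weight in g/mol.

    uM * 1e-6 gives mol/L; multiplying by g/mol gives g/L, which equals mg/mL.
    """
    return conc_um * 1e-6 * mol_weight


if __name__ == "__main__":
    predicted = 2.0  # hypothetical prediction in log uM
    conc = log_um_to_um(predicted)
    print(conc, log_um_to_log_molar(predicted), um_to_mg_per_ml(conc, 300.0))
```

Under the base-10 assumption, a prediction of 2.0 log μM corresponds to 100 μM, or a log molar solubility of -4.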
Outputs:
Failure Dataset: The molecule (a) is too large or too small, or (b) has an atom not encountered in the training set.
No Confidence Dataset: The molecule is deemed out of scope compared to the training set (details below). In this case, the model predictions are unreliable. The explainer image has a red background.
Success Dataset: The molecule falls either (a) within scope, in which case the explainer image has a green background, or (b) at the edge of scope, in which case the explainer image has a yellow background.
Molecules outside the scope of the training set are sent to the “No Confidence” port because their predictions are unreliable. Specifically, the scope is defined by the ranges of molecular weight, atom count, polar surface area, and calculated logP spanned by the training set molecules. These ranges are reported in the Floe Report.
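The range-based scope check described above can be illustrated with a small sketch. This is not the floe's implementation: the descriptor names, the `edge_fraction` margin used to decide "edge of scope", and the label strings are all assumptions made for illustration, since the text does not say how the edge is determined.

```python
from typing import Dict, List, Tuple

Descriptors = Dict[str, float]
Ranges = Dict[str, Tuple[float, float]]

# Hypothetical mapping from scope label to explainer background color.
BACKGROUND = {"success": "green", "success_edge": "yellow", "no_confidence": "red"}


def fit_box(training: List[Descriptors]) -> Ranges:
    """Record the min and max of each descriptor over the training set."""
    names = training[0].keys()
    return {
        name: (min(d[name] for d in training), max(d[name] for d in training))
        for name in names
    }


def classify(mol: Descriptors, box: Ranges, edge_fraction: float = 0.05) -> str:
    """Label a molecule as in scope, at the edge of scope, or out of scope.

    Assumption: "edge of scope" means within edge_fraction of a range's
    width from either bound. The real criterion is not given in the text.
    """
    near_edge = False
    for name, (low, high) in box.items():
        value = mol[name]
        if value < low or value > high:
            return "no_confidence"  # outside the box on at least one axis
        margin = edge_fraction * (high - low)
        if value < low + margin or value > high - margin:
            near_edge = True
    return "success_edge" if near_edge else "success"


if __name__ == "__main__":
    training = [
        {"mol_weight": 150.0, "atom_count": 10, "tpsa": 20.0, "logp": -1.0},
        {"mol_weight": 300.0, "atom_count": 22, "tpsa": 75.0, "logp": 2.1},
        {"mol_weight": 550.0, "atom_count": 40, "tpsa": 140.0, "logp": 5.0},
    ]
    box = fit_box(training)
    query = {"mol_weight": 700.0, "atom_count": 22, "tpsa": 75.0, "logp": 2.1}
    label = classify(query, box)
    print(label, BACKGROUND[label])
```

A box of independent per-descriptor ranges is cheap to compute and easy to report, which fits the statement that the ranges appear in the Floe Report. It does not capture correlations between descriptors, so a molecule can sit inside every range and still be unlike any training molecule.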
Name | Description | Type |
---|---|---|
Input Small Molecule(s) Dataset to Predict Property of | The dataset(s) to read records from. | Molecule Dataset |
Name | Description | Type |
---|---|---|
Molecule Explainer Type | Select explainer visualization. Atom: annotate atoms only; Fragment: annotate fragments only; Combined: annotate both. | List |
Property Validation Field | If the dataset has a baseline, the floe reports a comparison between the predictions and the baseline in the Floe Report. | Float |
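The kind of comparison that the Property Validation Field enables can be sketched as follows. The metrics the Floe Report actually contains are not specified here, so the choice of mean absolute error and root mean squared error, along with the function name, is an assumption for illustration only.

```python
import math
from typing import Dict, Sequence


def compare_to_baseline(
    predicted: Sequence[float], baseline: Sequence[float]
) -> Dict[str, float]:
    """Summarize the agreement between predictions and baseline values.

    Both sequences are assumed to be in the same units (log uM) and in
    the same molecule order.
    """
    if len(predicted) != len(baseline) or len(predicted) == 0:
        raise ValueError("predicted and baseline must be non-empty and equal in length")
    errors = [p - b for p, b in zip(predicted, baseline)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return {"n": float(n), "mae": mae, "rmse": rmse}


if __name__ == "__main__":
    # Hypothetical predictions and baseline values, both in log uM.
    print(compare_to_baseline([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))
```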
Name | Description | Type |
---|---|---|
Output Solubility | Output dataset to write to. | Dataset |
No-confidence Solubility | Output dataset to write to. | Dataset |
Failed Dataset Name | Output dataset to write to. | Dataset |
Training Data Details
The solubility predictor has been trained on the ChEMBL 30 dataset released in 2022. It is an opaque solubility database. More information can be found on the ChEMBL blog.