hERG Toxicity Prediction for Small Molecules using ML and Cheminfo Fingerprints

This floe predicts the hERG toxicity of small, drug-like molecules as active (toxic) or inactive (nontoxic). It runs a TensorFlow-based fully connected neural network regression model trained on 2D fingerprints. It uses convex box and Monte Carlo based approaches to determine the domain of applicability and to estimate error bars on the predictions.
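The text does not say which Monte Carlo scheme produces the error bars. As an illustration only, the sketch below uses one common choice, Monte Carlo dropout, on a toy logistic model over fingerprint bits: the same input is predicted many times with random bits dropped, and the spread of the results serves as the error bar. The function names, weights, drop rate, and sample count are all hypothetical and are not taken from the floe.

```python
import math
import random
import statistics


def predict_once(fingerprint, weights, bias, drop_rate, rng):
    """One stochastic forward pass of a toy logistic model.

    Each set fingerprint bit is kept with probability (1 - drop_rate);
    kept contributions are rescaled so the expected input is unchanged.
    """
    keep = 1.0 - drop_rate
    z = bias
    for bit, weight in zip(fingerprint, weights):
        if bit and rng.random() < keep:
            z += weight / keep
    return 1.0 / (1.0 + math.exp(-z))


def monte_carlo_prediction(fingerprint, weights, bias,
                           n_samples=200, drop_rate=0.2, seed=0):
    """Return (mean, standard deviation) over repeated stochastic passes.

    The standard deviation plays the role of the error bar.
    All default values here are assumptions made for the example.
    """
    rng = random.Random(seed)
    samples = [
        predict_once(fingerprint, weights, bias, drop_rate, rng)
        for _ in range(n_samples)
    ]
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    toy_fingerprint = [1, 0, 1, 1]        # hypothetical 4-bit fingerprint
    toy_weights = [0.8, -0.5, 1.2, 0.3]   # hypothetical model weights
    mean, error_bar = monte_carlo_prediction(toy_fingerprint, toy_weights, bias=-1.0)
    print(f"P(active) = {mean:.3f} +/- {error_bar:.3f}")
```

A wide spread across passes suggests the prediction is sensitive to small perturbations of the model and should be treated with more caution.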

Finally, this floe uses LIME, a model-agnostic explainer, to explain the predicted hERG toxicity of the molecule(s). The floe is cheap and quick, adding about 1.5 seconds for property prediction of 10 molecules.

Outputs:

Failure Dataset: The molecule (a) is too large or too small, or (b) has an atom not encountered in the training set.

No Confidence Dataset: The molecule is deemed out of scope compared to the training set (details below). In this case, the model predictions are unreliable. The explainer image has a red background.

Success Dataset: The molecule is either (a) within scope, in which case the explainer image has a green background, or (b) at the edge of scope, in which case the explainer image has a yellow background.

Molecules outside the scope of the training set are sent to the “No Confidence” port, as their predictions cannot be considered reliable. Specifically, the scope is defined by the ranges of molecular weight, atom count, polar surface area, and calculated logP spanned by the training set molecules. These ranges are reported in the Floe Report.
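The range-based scope check described above can be sketched as a box test on the four descriptors. This is a minimal illustration, not the floe's implementation: the numeric ranges are placeholders (the real ones are in the Floe Report), and the rule for "edge of scope", a 5% band inside each range boundary, is an assumption made for the example.

```python
# Hypothetical descriptor ranges; the actual values are reported in the Floe Report.
TRAINING_RANGES = {
    "mol_weight": (150.0, 700.0),
    "atom_count": (10, 50),
    "polar_surface_area": (0.0, 150.0),
    "clogp": (-2.0, 7.0),
}

# Assumed width of the "edge of scope" band, as a fraction of each range.
EDGE_FRACTION = 0.05


def classify_scope(descriptors, ranges=TRAINING_RANGES, edge_fraction=EDGE_FRACTION):
    """Return the explainer background color for a molecule.

    'red'    : at least one descriptor lies outside its training range
    'yellow' : all descriptors are in range, but at least one is near a boundary
    'green'  : all descriptors are comfortably inside their ranges
    """
    at_edge = False
    for name, (low, high) in ranges.items():
        value = descriptors[name]
        if value < low or value > high:
            return "red"
        margin = edge_fraction * (high - low)
        if value < low + margin or value > high - margin:
            at_edge = True
    return "yellow" if at_edge else "green"


def output_port(descriptors):
    """Map the scope result to the output dataset named in this document."""
    return "No Confidence" if classify_scope(descriptors) == "red" else "Success"


if __name__ == "__main__":
    molecule = {"mol_weight": 690.0, "atom_count": 30,
                "polar_surface_area": 75.0, "clogp": 3.0}
    print(classify_scope(molecule), output_port(molecule))
```

Checking each descriptor independently keeps the test cheap and easy to report, at the cost of ignoring correlations between descriptors.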

Inputs
Name	Description	Type
Input Small Molecule(s) Dataset to Predict Property of	The dataset(s) to read records from.	Molecule Dataset

Explanation and Validation
Name	Description	Type
Molecule Explainer Type	Select explainer visualization. Atom: annotate atoms only; Fragment: annotate fragments only; Combined: annotate both.	List
Property Validation Field	If the dataset has a baseline value in this field, the Floe Report includes a comparison between the baseline and the predictions.	Float

Outputs
Name	Description	Type
Output hERG Toxicity	Output dataset to write to.	Dataset
No Confidence hERG Toxicity	Output dataset to write to.	Dataset
Failed Dataset Name	Output dataset to write to.	Dataset

Training Data Details

We used a mix of the ChEMBL 240 and Riken datasets. About 10% of the molecules in the Riken dataset are active; the rest are inactive. We trained our model on all of the actives and a subset of the inactives.
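Keeping all actives and only some inactives is a form of undersampling the majority class. The sketch below shows one way this could be done, assuming records are dictionaries with a boolean "active" key. The function name, the record layout, and the inactive-to-active ratio are hypothetical; the text does not state how many inactives were kept or how they were chosen.

```python
import random


def build_training_set(records, inactives_per_active=1.0, seed=0):
    """Keep every active record and a random subset of the inactive ones.

    inactives_per_active is an assumed knob: the number of inactives kept
    for each active. The ratio actually used for the model is not given.
    """
    actives = [r for r in records if r["active"]]
    inactives = [r for r in records if not r["active"]]

    # Never ask for more inactives than are available.
    n_keep = min(len(inactives), round(inactives_per_active * len(actives)))

    rng = random.Random(seed)
    training = actives + rng.sample(inactives, n_keep)
    rng.shuffle(training)  # avoid having all actives at the front
    return training


if __name__ == "__main__":
    # Toy dataset with roughly 10% actives, mirroring the imbalance described above.
    toy = [{"id": i, "active": i < 10} for i in range(100)]
    subset = build_training_set(toy)
    print(len(subset), sum(r["active"] for r in subset))
```

Fixing the seed makes the sampled subset reproducible, which helps when comparing models trained on the same data.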