Frequently Asked Questions

For the model building floes, how do you compare multiple models to decide which one is the best?

The model building floes generate a comprehensive Floe Report to help with this comparison. The first page of the report lists the models sorted by their r2 scores, and the top model in the list can be picked for future predictions.

But we would suggest following the links to the individual models. These pages provide several other statistical performance measures, such as Kendall Tau, MSE, and MAE, which offer a more holistic picture of model quality. The model training graphs also help gauge performance. Essentially, picking a good model is a mix of engineering and art; hopefully our documentation, combined with the Floe Report, will help you choose the best one.
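
As a rough illustration (not part of the floes themselves), the per-model statistics mentioned above can be computed and compared as in the sketch below, assuming each model's predictions and the experimental values are available as NumPy arrays. The names and values are placeholders.

```python
# Illustrative sketch only: ranking candidate models by r2, Kendall Tau, MSE,
# and MAE, given each model's predictions against the same experimental values.
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def score_model(y_true, y_pred):
    tau, _ = kendalltau(y_true, y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "kendall_tau": tau,
        "mse": mean_squared_error(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
    }

y_true = np.array([-3.2, -1.5, -4.8, -2.1])        # experimental values (placeholder)
predictions = {                                    # per-model predictions (placeholder)
    "model_a": np.array([-3.0, -1.7, -4.5, -2.4]),
    "model_b": np.array([-2.1, -1.1, -5.9, -3.0]),
}
scores = {name: score_model(y_true, y_pred) for name, y_pred in predictions.items()}

# Sort by r2 (as on the first page of the report), then inspect the other metrics.
for name in sorted(scores, key=lambda n: scores[n]["r2"], reverse=True):
    print(name, scores[name])
```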

How good is your solubility model?

In machine learning, a trained model is only as good as its training data. For training, we used ChEMBL 30, which is open source but also a very noisy dataset. The Solubility Prediction for Small Molecules using ML and Cheminfo Fingerprints Floe is a use case of our builder floes that demonstrates the effectiveness of the pipeline.

Despite this, the solubility metrics show that it is a robust and decent model. In fact, we tested the model on several other solubility datasets with reasonable performance. For some datasets, such as Delaney, which contains nondrug-like molecules the model has never seen, the predictions are sent to the no-confidence port, as expected.

Does the model use or train on 3D features?

It trains on 2D cheminformatics-based fingerprints.

The inputs are still based on expert parameters such as fingerprints, which are all biased by the rules defined by the expert user. Any insights into how to overcome this flaw?

The input takes a large number of fingerprint parameters and generates models for all possible combinations of fingerprint and neural network hyperparameters. This means the user does not need to choose a fingerprint configuration; the builder floe will pick the best fingerprint for you. The top models in the Floe Report show which fingerprint works best, and the report also contains graphs showing how well the models did with respect to each change in a fingerprint parameter value.
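
As a loose sketch of what this sweep looks like in practice (the parameter names and values below are hypothetical placeholders, not the floe's actual promoted parameters):

```python
# Illustrative sketch only: train and score a model for every combination of
# fingerprint settings and network hyperparameters, then rank the results.
from itertools import product

fingerprint_grid = {"fp_type": ["circular", "path", "tree"],
                    "fp_bits": [1024, 2048, 4096],
                    "fp_radius": [2, 3]}
network_grid = {"hidden_layers": [1, 2],
                "nodes_per_layer": [64, 128]}

def expand(grid):
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

results = []
for fp_cfg in expand(fingerprint_grid):
    for nn_cfg in expand(network_grid):
        score = 0.0  # stand-in for training and evaluating a model with this configuration
        results.append(({**fp_cfg, **nn_cfg}, score))

# The best configuration -- fingerprint included -- is picked automatically.
best_cfg, best_score = max(results, key=lambda item: item[1])
```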

Are the predictions for crystalline solubility? Is the data for crystals?

ChEMBL 30 is opaque on this point; the database does not state whether the solubility measurements were made on crystalline material. The data is available at https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_30/, and more information can be found at https://chembl.blogspot.com/2022/03/chembl-30-released.html

What data was used to train the hERG toxicity model?

We used a mix of the ChEMBL and Riken datasets. The Riken dataset contains about 10% active compounds, with the rest inactive; we trained our model on all of the actives and some of the inactives. The Riken data is available at https://drugdesign.riken.jp/hERGdb/. For ChEMBL, we used the CHEMBL240 target: https://www.ebi.ac.uk/chembl/target_report_card/CHEMBL240/

Why do you prefer neural networks instead of something like XGBoost?

Because these are generic model building floes, neural networks provide the bandwidth to build models on both small and large datasets. That is, they are more scalable than either shallow learning (like XGBoost, which typically works well on smaller data) or deep learning (like Graph Convolution Networks, Graph Attention Networks, and so forth, which work better on larger data). One additional note: fingerprints typically work well when their length is 512 or 1024 bits or more; reducing the length below this leads to a large number of hashing collisions and other problems. In this setting, NNs work better than many shallow learning techniques.
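
A small, self-contained illustration of the collision point (plain Python, not toolkit code): folding the same set of hashed substructure identifiers into fewer bits forces more of them onto the same bit, so information is lost.

```python
# Count how many hypothetical substructure identifiers collide when folded
# into fingerprints of different lengths.
import random

random.seed(0)
n_features = 600                                   # hypothetical distinct substructures
feature_ids = random.sample(range(10**6), n_features)

for n_bits in (256, 512, 1024, 2048):
    occupied = {fid % n_bits for fid in feature_ids}
    collisions = n_features - len(occupied)
    print(f"{n_bits:5d} bits: {collisions:3d} of {n_features} features collide")
```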

Neural networks don’t always show good performance in low data regimes. What measures do you take to improve performance?

For low data regimes, the number of nodes and layers can be tuned so that the trained model fits smaller datasets better. While running the floe, the optimization algorithm can also be chosen more carefully; this can be changed in the cube parameters. The loss function plays a role in optimization as well; hinge loss is one example that works well on smaller data.
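
A minimal Keras sketch of those knobs (illustrative only, not the floe's actual cube code): a deliberately small network with an explicitly chosen optimizer and a configurable loss. Hinge loss, mentioned above, is shown as the default, but any Keras loss can be swapped in.

```python
import tensorflow as tf

def build_small_model(n_inputs, nodes=16, layers=1, loss="hinge", optimizer="adam"):
    # Fewer nodes and layers so the model does not overwhelm a small training set.
    model = tf.keras.Sequential([tf.keras.Input(shape=(n_inputs,))])
    for _ in range(layers):
        model.add(tf.keras.layers.Dense(nodes, activation="relu"))
    model.add(tf.keras.layers.Dense(1))
    model.compile(optimizer=optimizer, loss=loss)
    return model

# e.g. a small network for a few hundred fingerprinted molecules, with the
# optimizer chosen explicitly rather than left at its default.
model = build_small_model(n_inputs=2048, nodes=16, layers=1,
                          optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))
```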

How are the confidences computed?

We use three techniques, namely:

  1. Convex Box Approach: This maintains a [min, max] range of physical properties such as MW, XlogP, and PSA. If a molecule's property falls outside this range, it is assigned no confidence.

  2. Monte Carlo Dropout: This is a probabilistic technique for estimating epistemic uncertainty using the trained TensorFlow model. Inference is run in “training mode” with the dropout layers active, so repeated passes yield a distribution of predictions rather than a single output (see the sketch after this list).

  3. TensorFlow Probability Model: We train this separate model to estimate both aleatoric and epistemic uncertainty. A molecule is assigned low confidence if the reported distribution has a large spread (the cutoff was determined empirically).
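
A minimal sketch of the Monte Carlo Dropout idea in item 2 (assumptions only, not the actual cubes): call the trained model with training=True so the dropout layers stay active, repeat the forward pass, and use the spread of the predictions as the uncertainty estimate.

```python
import numpy as np

# `model` is assumed to be a trained tf.keras model that contains Dropout layers.
def mc_dropout_predict(model, x, n_passes=50):
    """Run repeated stochastic forward passes; return per-sample mean and spread."""
    preds = np.stack([model(x, training=True).numpy() for _ in range(n_passes)])
    return preds.mean(axis=0), preds.std(axis=0)

# Hypothetical usage: a spread larger than some empirically chosen cutoff would
# route the record to the low/no-confidence port.
# mean, spread = mc_dropout_predict(trained_model, fingerprints)
# low_confidence = spread > cutoff
```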

What methods do you find most effective for eliminating overfitting in your models?

Regularizers are very effective; we have used L2 regularization for our models. Dropout and other techniques have also proven effective. We also use early stopping, although, to reduce complexity, it is not exposed as a promoted parameter in our floes (it can still be accessed as a cube parameter). Our builder floes also have several parameters (like cross validation) to combat overfitting.
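
For illustration, the measures mentioned above map onto Keras roughly like this (a sketch under assumed settings, not the floe's actual cube code):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2048,)),                       # fingerprint length (placeholder)
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # L2 regularization
    tf.keras.layers.Dropout(0.2),                        # dropout
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Early stopping halts training once the validation loss stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                              restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.2, epochs=500, callbacks=[early_stop])
```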

What percentages do the solubility predictions have for high or medium confidences?

This is not about percentages but rather about the kind of data. If the model encounters data it has not seen before (epistemic uncertainty), it will put it into low confidence. When users build their own model, they can choose how rigorous they want to be by specifying cutoff values.

In the model builder floes, am I restricted to fingerprints, or can I add other descriptors, such as molecular properties or other calculated or experimental parameters?

This is restricted to fingerprints only.