Frequently Asked Questions

For model building floes, how do we compare multiple models to decide which one is the best?

The model builder floe generates a comprehensive Floe Report to help with this. The first page of this report lists the models sorted by R² score. The top model in the list can be picked for future predictions.

We also suggest following the links for individual models. These provide several other statistical performance measures, such as Kendall's Tau, MSE, and MAE, which give a more holistic view of model quality. In addition, the model training graphs help gauge performance. Essentially, picking a good model is a mix of engineering and art; the documentation together with the Floe Report should guide you to the best one.
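
As a rough illustration (not part of the floe itself), the same headline statistics can be reproduced offline with standard Python libraries if you export a model's predicted and experimental values; all data and model names below are made up:

    # Compare candidate models on the same held-out data using the statistics
    # shown in the Floe Report (R2, MSE, MAE, Kendall's Tau).
    import numpy as np
    from scipy.stats import kendalltau
    from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

    def score_model(y_true, y_pred):
        """Return the headline statistics for one model."""
        tau, _ = kendalltau(y_true, y_pred)
        return {
            "r2": r2_score(y_true, y_pred),
            "mse": mean_squared_error(y_true, y_pred),
            "mae": mean_absolute_error(y_true, y_pred),
            "kendall_tau": tau,
        }

    # Hypothetical per-model predictions on the same validation set.
    y_true = np.array([1.1, 0.5, -0.9, 2.3])
    models = {
        "model_a": np.array([1.2, 0.4, -0.8, 2.1]),
        "model_b": np.array([1.0, 0.6, -1.1, 2.4]),
    }

    # Rank by R2 (as the first page of the Floe Report does), then inspect
    # the remaining metrics before picking a model.
    ranked = sorted(
        ((name, score_model(y_true, y_pred)) for name, y_pred in models.items()),
        key=lambda item: item[1]["r2"],
        reverse=True,
    )
    for name, stats in ranked:
        print(name, stats)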

How good is the solubility model?

In machine learning, a trained model is only as good as its training data. For training, we used ChEMBL 30, which is an open-source but very noisy dataset. The solubility floe is a use case of our builder floes that shows the effectiveness of the pipeline.

Despite this, the solubility result metrics show that it is a robust and decent model. In fact, we tested it on several other solubility datasets with reasonable performance. For some datasets, such as the Delaney dataset, which contains non-drug-like molecules the model has never seen, the predictions are sent to the no-confidence port, as expected.

Does the model use or train on 3D features?

It trains on 2D cheminformatics-based fingerprints.

The inputs are based on expert parameters such as fingerprints, which are all biased by the rules that are defined by the expert user. Any insights into how to overcome this flaw?

The floe takes in a large number of fingerprint parameters and generates models for all possible combinations of fingerprint and neural network hyperparameters. This means the user does not need to choose a single fingerprint configuration; instead, the model builder will pick the best fingerprint. The top models in the Floe Report show which fingerprint works best, and the report also contains graphs showing how the models performed with respect to every change in fingerprint parameter values.
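
The sketch below illustrates the idea of the sweep in plain Python; the parameter names, ranges, and the scoring function are hypothetical stand-ins, not the floe's actual cube parameters:

    # Train one model per combination of fingerprint and network hyperparameters
    # and keep the ranking, so the best fingerprint emerges from the data.
    import random
    from itertools import product

    fingerprint_grid = {
        "fp_type": ["circular", "path", "tree"],
        "fp_bits": [1024, 2048, 4096],
        "fp_radius": [2, 3],
    }
    network_grid = {
        "hidden_layers": [1, 2],
        "nodes_per_layer": [128, 256],
    }

    def train_and_score(config):
        """Placeholder for fingerprint generation + training; returns a mock validation R2."""
        return random.random()

    keys = list(fingerprint_grid) + list(network_grid)
    values = list(fingerprint_grid.values()) + list(network_grid.values())
    results = []
    for combo in product(*values):
        config = dict(zip(keys, combo))
        results.append((config, train_and_score(config)))

    # The best-scoring configurations reveal which fingerprint worked best,
    # which is what the top of the Floe Report summarizes.
    results.sort(key=lambda item: item[1], reverse=True)
    print(results[0])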

Are the predictions for crystalline solubility? Is the data for crystals?

The solubility data in ChEMBL 30 is opaque in this respect. More information can be found at https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL3616299/.

Why do you prefer NNs instead of something like XGBoost?

Since this is a generic model building floe, neural networks provide the flexibility to build models on both small and large datasets. That is, neural networks scale better across dataset sizes than either shallow learning (like XGBoost, which typically works well on smaller data) or deep learning (like Graph Convolutional Networks (GCN) or Graph Attention Networks (GAT), which work better on larger data). Another point is that fingerprints typically work well only when their length is at least 512 or 1024 bits; reducing it below that leads to many hash collisions and other problems. In this setting, NNs work better than many shallow learning techniques.
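
The collision point can be illustrated with a small, self-contained simulation (not floe code): folding the same set of substructure hashes into fewer bits makes many distinct features land on the same bit:

    # Why very short fingerprints are problematic: distinct substructure hashes
    # folded into fewer bits collide far more often. All numbers are made up.
    import random

    random.seed(0)
    n_features = 500                                   # distinct substructures in a molecule set
    feature_hashes = random.sample(range(10**9), n_features)

    for n_bits in (128, 512, 1024, 4096):
        occupied = {h % n_bits for h in feature_hashes}
        collisions = n_features - len(occupied)
        print(f"{n_bits:5d} bits -> {collisions:3d} of {n_features} features collide")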

Neural networks don’t always show good performance in low data regimes. What measures can you take to improve performance?

For low data regimes, the number of nodes and layers can be tuned so that the trained model fits smaller datasets better. When running the floe, the optimization algorithm can also be chosen more carefully; this is available as a cube parameter. The loss function plays a role in optimization as well; hinge loss is one example that works well on smaller data.
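
As a minimal Keras sketch of these adjustments (a narrower and shallower network, an explicit optimizer, and a different loss), with purely illustrative values rather than the floe's defaults:

    # Small-data adjustments: fewer nodes/layers, explicit optimizer, alternate loss.
    import tensorflow as tf

    fingerprint_bits = 1024  # assumed input width

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(fingerprint_bits,)),
        tf.keras.layers.Dense(64, activation="relu"),   # narrower layer for low data
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1),
    ])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # chosen via cube parameters
        loss=tf.keras.losses.Huber(),  # robust loss; hinge loss is an option for classification
        metrics=["mae"],
    )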

How are the confidences computed?

We use three techniques, namely:

  1. Convex Box Approach: This maintains a [min, max] range of physical properties such as MW, XlogP, and PSA. If a molecule's properties fall outside this range, the prediction has no confidence.

  2. Monte Carlo Dropout: This is a probabilistic technique that estimates epistemic uncertainty using the trained TensorFlow model. The model is evaluated in training mode with the dropouts active, so it produces a distribution of outputs rather than a single value (a minimal sketch is shown after this list).

  3. TensorFlow Probability Model: We train this separate model to find both aleatoric and epistemic uncertainties. It assigns a molecule low confidence if the reported distribution has a large spread (the threshold was found empirically).
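
For illustration, here is a minimal Keras sketch of the Monte Carlo Dropout idea from item 2; it is not the floe's implementation, and the network and input width are assumptions:

    # Monte Carlo Dropout: calling a Keras model with training=True keeps dropout
    # active, so repeated passes over the same input give a distribution of
    # predictions whose spread can be read as epistemic uncertainty.
    import numpy as np
    import tensorflow as tf

    fingerprint_bits = 1024  # assumed input width

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(fingerprint_bits,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(1),
    ])

    x = np.random.rand(1, fingerprint_bits).astype("float32")  # one fingerprint

    # Repeated stochastic forward passes with dropout left on.
    samples = np.stack([model(x, training=True).numpy().squeeze() for _ in range(100)])
    mean, spread = samples.mean(), samples.std()
    print(f"prediction {mean:.3f} +/- {spread:.3f}")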

What methods are most effective to eliminate overfitting of your models?

Regularizers are very effective; we have used L2 regularization for our models. Dropout and other techniques have proven effective as well. We also use early stopping, although to reduce complexity it is not exposed as a promoted parameter in our floe (it can still be accessed as a cube parameter). The builder floe also has several parameters, such as cross validation, to combat overfitting.
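
In Keras terms, these measures look roughly like the sketch below; the values are illustrative and are not the floe's promoted or cube parameter defaults:

    # Anti-overfitting measures: L2 regularization, dropout, and early stopping.
    import tensorflow as tf

    fingerprint_bits = 1024  # assumed input width

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(fingerprint_bits,)),
        tf.keras.layers.Dense(
            256, activation="relu",
            kernel_regularizer=tf.keras.regularizers.l2(1e-4),  # L2 regularization
        ),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    # Early stopping halts training once validation loss stops improving.
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True
    )
    # model.fit(x_train, y_train, validation_split=0.2, epochs=500, callbacks=[early_stop])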

What percentage of solubility predictions can have high/medium confidence?

This is not about a fixed percentage but rather about the kind of data. If the model encounters data it has not seen before (epistemic uncertainty), it will place those predictions in low confidence. When users build their own models, they can choose how rigorous they want this check to be by specifying cut-off values.
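
As a purely illustrative sketch of how such cut-off values could bin predictions by the spread of their predictive distribution (the thresholds are hypothetical, not the floe's defaults):

    # Bin a prediction into a confidence class from the spread of its distribution.
    def confidence_class(spread, high_cutoff=0.3, medium_cutoff=0.6):
        if spread <= high_cutoff:
            return "high"
        if spread <= medium_cutoff:
            return "medium"
        return "low"

    print(confidence_class(0.2))   # high
    print(confidence_class(0.45))  # medium
    print(confidence_class(0.9))   # low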

In the model builder floe, am I restricted to fingerprints, or can I add other descriptors, such as molecular properties or other calculated or experimental parameters?

It is restricted to fingerprints only.