Frequently Asked Questions

For model building floes, how do we compare multiple models to decide which one is best?

The model builder floe generates a comprehensive floe report to help with this. The first page of this report lists the models sorted by R² score. The top model in the list can be picked for future predictions.

However, we suggest inspecting the links for the individual models. These give several other statistical performance measures, such as Kendall's tau, MSE, and MAE, which provide a more holistic picture of model quality. In addition, the model training graphs also help gauge performance. Essentially, picking a good model is a mix of engineering and art, but reading our docs combined with the floe report should help in choosing the best one.
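
As a rough illustration (not the floe's internal code), here is a minimal sketch of how these measures could be computed for a held-out set using scipy and scikit-learn; the arrays ``y_true`` and ``y_pred`` are placeholders for experimental and predicted values.

```python
# Minimal sketch: comparing a model's held-out predictions with several of
# the statistical measures shown in the floe report.
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Placeholder values standing in for experimental and predicted solubilities.
y_true = np.array([-2.1, -3.4, -1.0, -4.2, -2.8])
y_pred = np.array([-2.0, -3.1, -1.3, -4.0, -3.0])

tau, _ = kendalltau(y_true, y_pred)
print("R2         :", r2_score(y_true, y_pred))
print("MSE        :", mean_squared_error(y_true, y_pred))
print("MAE        :", mean_absolute_error(y_true, y_pred))
print("Kendall tau:", tau)
```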

How good is your solubility model?

In machine learning, a trained model is only as good as its training data. For training we used ChEMBL 30, which is open source but a very noisy dataset. The solubility floe is a use case of our builder floes that shows the effectiveness of the pipeline.

Despite this, the solubility result metrics show it is a robust and decent model. In fact, we tested it on several other solubility datasets with reasonable performance. For some datasets, like the Delaney set, which contains non-drug-like molecules the model has never seen, the predictions are sent to the no-confidence port as expected.

Does the model use/train on 3D features?

Not in the current release. It trains on 2D cheminformatics-based fingerprints.

The inputs are still based on expert parameters such as fingerprints, which are all biased by the rules defined by the expert user. Any insights into how to overcome this flaw?

The input takes a large number of fingerprint parameters and generates models for all possible combinations of fingerprint and neural network hyperparameters. This means the user does not need to choose a fingerprint configuration; instead, the builder floe will pick the best fingerprint for you. The top models in the floe report show which fingerprint works best. There are also graphs in the floe report that show how the models performed with respect to each change in fingerprint parameter value.
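
The sweep idea can be sketched as follows. This is only an illustration: the parameter names and the ``train_and_score`` helper are hypothetical, not the floe's actual API.

```python
# Minimal sketch of the sweep: enumerate fingerprint and network
# hyperparameter combinations, score a model for each, and rank them.
from itertools import product

fingerprint_grid = {"radius": [2, 3], "bits": [1024, 2048]}
network_grid = {"hidden_layers": [1, 2], "nodes": [64, 128]}

def train_and_score(radius, bits, layers, nodes):
    # Placeholder: in the real floe this step generates fingerprints,
    # trains a network, and returns a validation score.
    return 0.0

results = []
for radius, bits, layers, nodes in product(fingerprint_grid["radius"],
                                           fingerprint_grid["bits"],
                                           network_grid["hidden_layers"],
                                           network_grid["nodes"]):
    score = train_and_score(radius, bits, layers, nodes)
    results.append(((radius, bits, layers, nodes), score))

# Sort by score; the top entries show which fingerprint configuration worked best.
results.sort(key=lambda item: item[1], reverse=True)
```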

Are the predictions for crystalline solubility? Is the data for crystals?

The ChEMBL 30 solubility data is opaque on this point. More info can be found at https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL3616299/

What made you prefer NNs instead of, say, XGBoost?

Since we are shipping a generic model building floe, neural networks provide the flexibility to build models on both small and large datasets. That is, they are more scalable than either shallow learning (like XGBoost, which typically works well on smaller data) or deep learning (like graph convolutions, attention, and so forth, which work better on larger data). One more point to note: fingerprints typically work well when the length is at least 512 or 1024 bits. Reducing it below this leads to many hashing collisions and other problems. In this setting, NNs work better than many shallow learning techniques.
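
The collision effect is easy to see in a quick sketch. RDKit is used here purely for illustration (it is not necessarily the toolkit the floes use): folding a Morgan fingerprint into fewer bits forces distinct atom environments to share bits.

```python
# Minimal sketch: hashing collisions grow as the folded fingerprint shrinks.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# Unfolded (sparse) fingerprint: one count per distinct atom environment.
unfolded = AllChem.GetMorganFingerprint(mol, 2)
print("distinct environments:", len(unfolded.GetNonzeroElements()))

# As the folded length shrinks, fewer unique bits remain set because
# different environments collide onto the same bit.
for n_bits in (2048, 1024, 512, 128, 32):
    folded = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    print(f"{n_bits:5d} bits -> {folded.GetNumOnBits()} bits set")
```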

Neural networks don’t always show good performance in low data regimes. What measures do you take to improve performance?

For low data regimes, the number of nodes and layers can be tuned so that the trained model fits smaller datasets better. While running the floe, the optimization algorithm can also be chosen more carefully; this can be changed in the cube parameters. The loss function plays a role in optimization as well; hinge loss is one example that works well on smaller data.
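
As an illustrative sketch only (the layer sizes, optimizer, and loss below are assumptions, not the floe's defaults), this is the kind of compact network and explicit optimizer/loss choice that suits a low-data regime:

```python
# Minimal sketch: a compact network for small datasets, with the optimizer
# and loss function chosen explicitly.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,)),          # fingerprint length
    tf.keras.layers.Dense(32, activation="relu"),  # few nodes, few layers
    tf.keras.layers.Dense(1),
])

# The optimizer and loss function are the knobs discussed above; in the floe
# they are exposed as cube parameters.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
```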

How are the confidences computed?

We use three techniques, namely:

  1. Convex Box Approach: maintain a [min, max] range of physical properties such as MW, XLogP, and PSA. If a molecule's property falls outside the range, it is assigned no confidence.

  2. Monte Carlo Dropout: a probabilistic technique to estimate epistemic uncertainty using the trained TensorFlow model. Predictions are obtained in training mode with dropout active, so the model provides a distribution rather than a single output (see the sketch after this list).

  3. TensorFlow Probability Model: we train this separate model to capture both aleatoric and epistemic uncertainty. A molecule is assigned low confidence if the reported distribution has a large spread (the threshold is found empirically).
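
A minimal sketch of the Monte Carlo dropout idea (illustrative only, not the floe's code; the model here is untrained and the sizes are placeholders): run the model repeatedly with dropout active and treat the spread of the outputs as an epistemic-uncertainty estimate.

```python
import numpy as np
import tensorflow as tf

# Placeholder model with a dropout layer; in practice this is the trained model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1),
])

x = np.random.rand(1, 1024).astype("float32")  # stand-in for a fingerprint

# training=True keeps dropout on at prediction time, so each pass samples a
# slightly different sub-network and the outputs form a distribution.
samples = np.stack([model(x, training=True).numpy().ravel() for _ in range(50)])
mean, std = samples.mean(), samples.std()
print(f"prediction {mean:.3f} +/- {std:.3f}")  # a large std -> low confidence
```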

What methods do you find most effective for reducing overfitting in your models?

Regularizers are very effective; we use L2 regularization for our models. Dropout and other techniques have been proven effective as well. We also use early stopping, although that is not exposed as a promoted parameter in our floe to reduce complexity (it can be accessed as a cube parameter). We also have several parameters (like cross-validation) in our builder floe to combat overfitting.
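
For illustration only (layer sizes and hyperparameter values below are assumptions, not the floe's settings), these countermeasures look roughly like this in Keras:

```python
# Minimal sketch: L2 regularization, dropout, and early stopping.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,)),
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Early stopping halts training once the validation loss stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=10,
                                              restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.2, epochs=200,
#           callbacks=[early_stop])
```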

What percentage of the solubility predictions receive high or medium confidence?

This depends not on a percentage but on the kind of data. If the model encounters data it has not seen before (epistemic uncertainty), it will assign low confidence. When users build their own model, they can choose how rigorous they want this check to be (by specifying cutoff values).

In the model builder floe, am I restricted to fingerprints, or can I add other descriptors, such as molecular properties or other calculated or experimental parameters?

For this release, it is fingerprints only.