ML Build: Classification Model with Tuner Using Fingerprints for Small Molecules
This floe trains multiple machine learning (ML) neural network classification models on the properties of small molecules.
The models train on 2D fingerprints, which the floe generates itself. Every molecule in the input dataset needs a string-type property column to train on (for example, High or Low); inputs without one are ignored.
It builds machine learning models for all possible combinations of cheminformatics (fingerprint) and neural network hyperparameters provided in the advanced sections. Read the documentation to learn more about these parameters and how to set them for a given training data set.
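
As a rough illustration of why the number of models grows quickly, the sketch below enumerates every combination of a few hyperparameter lists using `itertools.product`. The parameter names and values are hypothetical placeholders, not the floe's defaults, and this is not the floe's internal implementation.

```python
from itertools import product

# Hypothetical example values; these are NOT the floe's defaults.
grid = {
    "min_radius": [0, 1],
    "max_radius": [2, 3],
    "bit_length": [2048, 4096],
    "dropout": [0.1, 0.2],
    "learning_rate": [0.001, 0.01],
}


def enumerate_combinations(grid):
    """Return one dict per combination of the listed values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


combinations = enumerate_combinations(grid)
print(len(combinations))  # 2 * 2 * 2 * 2 * 2 = 32
```

Each additional value in any list multiplies the total, so trimming even one list is an effective way to reduce the number of models built.
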
The floe generates a Floe Report containing details of the best models built. The user can pick any model built with this floe and use it to predict properties of other molecules in the ML Predict: Classification Using Fingerprints for Small Molecules Floe. The Floe Report presents detailed statistics on the hyperparameters. You can adjust them and rerun the floe to build better models (see documentation).
Furthermore, it picks the best of these models and fine-tunes them using KerasTuner.
In addition to a prediction, the built models provide an explanation of the prediction, a confidence interval, and the domain of application.
Warning: By default, this floe builds approximately 2,500 machine learning models. On a large dataset, this may be expensive, costing more than $100. Since multiple parameters contribute to this cost, consult this tutorial for how to build a cheaper version for practice. To build decent models, the dataset needs to contain at least 200 molecules (barring exceptions). We have performed stress tests for as many as 30,000 molecules. We recommend increasing the memory and disk space requirements of the cubes to run on larger datasets.

| Name | Description | Type |
|---|---|---|
| Input Small Molecules to Train Machine Learning Models On | Input dataset file with each record containing molecule and response value (string) to train on. | Molecule Dataset |

| Name | Description | Type |
|---|---|---|
| Response Value Field | Name of the field containing the primary data being trained on and predicted. | Float |
| Are We Using KerasTuner | If this is On, the models are fine-tuned using KerasTuner. | Bool |
| What Kind of KerasTuner to Use | Choose between Hyperband, RandomSearch, and BayesianOptimization. | String |
| Number of Models to Show in Floe Report | How many of the best models to provide in the Floe Report. By default, keeps the best 20 models (ranked by accuracy) so that memory requirements are met. | Int |
| Preprocess Molecule | For every molecule, stores only the largest component and adjusts ionization to neutral pH. | Bool |
| Apply Blockbuster Filter | Apply blockbuster filter. | Bool |
| Molecule Explainer Type | Select explainer visualization. Atom: annotate atoms only; Fragment: annotate fragments; Combined: annotate both. | List |

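
The "Number of Models to Show in Floe Report" parameter amounts to keeping the top N models by accuracy. A minimal sketch of that selection follows, assuming each result is a dict with hypothetical `model_id` and `accuracy` keys; the floe's actual record layout and ranking code are not described here.

```python
import heapq


def best_models(results, n=20):
    """Return the n results with the highest accuracy, best first."""
    return heapq.nlargest(n, results, key=lambda r: r["accuracy"])


# Hypothetical results; the record layout is an assumption for illustration.
results = [
    {"model_id": 1, "accuracy": 0.81},
    {"model_id": 2, "accuracy": 0.93},
    {"model_id": 3, "accuracy": 0.77},
    {"model_id": 4, "accuracy": 0.88},
]

top = best_models(results, n=2)
print([r["model_id"] for r in top])  # [2, 4]
```
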
| Name | Description | Type |
|---|---|---|
| Min Radius | Minimum radius for cheminfo fingerprint. | IntVec |
| Max Radius | Maximum radius for cheminfo fingerprint. | IntVec |
| Bit Length of Fingerprint | Bit length of cheminfo fingerprint. | IntVec |
| Type of Fingerprint | Type of cheminfo fingerprint. | IntVec |

| Name | Description | Type |
|---|---|---|
| Dropouts | List of dropout hyperparameters. | FloatVec |
| Sets of Hidden Layers | List(s) of hidden layers separated by -1. Input and output layers are determined by the data. For example, 150,100,50 creates a neural network with 3 hidden layers of size 150, 100, and 50. | IntVec |
| Sets of Regularization Layers | List(s) of regularization layers separated by -1. No regularization is applied to the input and output layers. | FloatVec |
| Learning Rates | List of all the learning rate hyperparameters used to train the model. | FloatVec |
| Max Epochs | Maximum number of epochs to train the model. | Int |
| Activation | Activation functions: ReLU, LeakyReLU, PReLU, tanh, SELU, ELU. | List |
| Batch Size | Batch size for training the model. | Int |

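
The -1 separator convention for "Sets of Hidden Layers" and "Sets of Regularization Layers" can be sketched as a simple split of a flat list. The function name `split_layer_sets` is hypothetical, and skipping empty groups (for example, after a trailing -1) is an assumption; the floe's own parsing may behave differently.

```python
def split_layer_sets(flat, separator=-1):
    """Split a flat list into sublists at each separator value."""
    sets, current = [], []
    for value in flat:
        if value == separator:
            # Assumption: empty groups are skipped rather than kept.
            if current:
                sets.append(current)
            current = []
        else:
            current.append(value)
    if current:
        sets.append(current)
    return sets


hidden = split_layer_sets([150, 100, 50, -1, 100, 50])
print(hidden)  # [[150, 100, 50], [100, 50]]
```

Under this reading, the input above describes two candidate architectures: one with three hidden layers and one with two.
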
| Name | Description | Type |
|---|---|---|
| Models Built | Output of generated models. | Dataset |
| Failure Output | Output of failure. | Dataset |