ML Build: Classification Model with Tuner using Fingerprints for Small Molecules
Category Paths
Follow one of these paths in the Orion user interface to find the floe.
Solution-based/Hit to Lead/Properties/Model Building
Task-based/ADME & Tox Assessment
Task-based/Data Science
Task-based/Cheminformatics
Description
This floe trains multiple machine learning neural network classification models on properties of small molecules.
The models train on 2D fingerprints, which are generated in the floe itself. Every molecule in the input dataset needs a String-type property to train on (such as High or Low). Molecules without this property are ignored.
It builds machine learning models for all possible combinations of cheminformatics (fingerprint) and neural network hyperparameters provided in the Advanced sections. Read the documentation to learn more about these parameters and how they should be set for a given training data set.
The floe generates a Floe Report containing details of the best models built. A user can pick any model built using this floe to predict properties of other molecules in the ML Predict: Classification using Fingerprints for Small Molecules Floe. The Floe Report presents detailed statistics on the hyperparameters; a user can review them, adjust the hyperparameters, and rerun the floe to build better models (see documentation).
Furthermore, it picks the best models and fine-tunes them using Keras Tuner to provide the best models.
In addition to prediction, the built models provide an explanation of predictions, a confidence interval, and the domain of application.
New Feature: XGBoost Baseline Model build. Model ID is 0.
Warning: By default, this floe builds about 2,500 machine learning models. On a large dataset, this may be expensive, costing more than $100. Since multiple parameters contribute to this cost, refer to this tutorial on how to build a cheaper version for practice.
A dataset needs at least 200 molecules to build decent models (barring exceptions). We have stress tested up to 30,000 molecules. It is recommended to increase the memory and disk space requirements of the cubes to run on larger datasets. More details on how the floe operates can be found in this [tutorial](https://docs.eyesopen.com/floe/modules/ml-modelbuilding/docs/source/tutorials/tutorials4.html) and this How-to Guide on building optimal machine learning models.
Promoted Parameters
Title in user interface (promoted name)
Inputs
Input Small Molecules to train machine learning models on (in): Input dataset in which each record contains a molecule and a response value to train on (Int or String, see Response Value Field).
Required
Type: data_source
Outputs
Fingerprint based Classification Models Built (out): Output of Fingerprint Classification ML Generated Models
Type: dataset_out
Default: Fingerprint Classification ML Model
Failed Models for Fingerprint based ML Classification (failed_out): Failed Models for Fingerprint-based ML Classification
Type: dataset_out
Default: Failed Models for Fingerprint based ML Classification
Options
Response Value Field: Int or String (val_field): Name of the field containing the primary data being trained on and predicted. Every molecule needs to have this value (will be ignored otherwise).
Required
Type: field_parameter
Are we using Keras Tuner (tn): If this is on, the floe fine-tunes the best models using Keras Tuner.
Type: boolean
Default: True
Choices: [True, False]
What kind of Keras Tuner to use (tp):
Type: string
Default: Hyperband
Choices: [‘Hyperband’, ‘RandomSearch’, ‘BayesianOptimization’]
Type of data splitting for validation (ts): For a scaffold split, first run the floe Data Processing of Small Molecule for ML Model Building to generate scaffolds. A random split is used by default.
Type: string
Default: Random
Choices: [‘Random’, ‘Scaffold’]
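The difference between the two split types can be sketched as follows. This is only an illustration of the general idea behind a scaffold split (molecules sharing a scaffold stay on the same side of the split), not the floe's actual implementation. The function name `scaffold_split`, the `test_fraction` argument, and the record layout are hypothetical; the scaffold strings stand in for whatever the data processing floe writes to each record.

```python
from collections import defaultdict

def scaffold_split(records, test_fraction=0.2):
    """Split (molecule_id, scaffold) pairs so that no scaffold spans both sets.

    Hypothetical helper: groups are assigned whole, largest first, and a
    group goes to the test set once adding it would overfill the training set.
    """
    groups = defaultdict(list)
    for mol_id, scaffold in records:
        groups[scaffold].append(mol_id)

    train_cutoff = (1.0 - test_fraction) * len(records)
    train_ids, test_ids = [], []
    # Sort by group size (descending), then by scaffold string for determinism.
    for scaffold in sorted(groups, key=lambda s: (-len(groups[s]), s)):
        members = groups[scaffold]
        if len(train_ids) + len(members) > train_cutoff:
            test_ids.extend(members)
        else:
            train_ids.extend(members)
    return train_ids, test_ids

# Made-up example records: (molecule id, scaffold label).
records = [
    ("m1", "c1ccccc1"), ("m2", "c1ccccc1"), ("m3", "c1ccccc1"),
    ("m4", "c1ccncc1"), ("m5", "c1ccncc1"),
    ("m6", "C1CCCCC1"), ("m7", "C1CCCCC1"),
    ("m8", "c1ccoc1"), ("m9", "c1ccsc1"), ("m10", "c1cc[nH]c1"),
]
train_ids, test_ids = scaffold_split(records)
```

A random split, by contrast, would shuffle the molecule ids and cut the list regardless of scaffold, so close analogues can land in both training and validation sets.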
Number of Models to show in Floe Report (number_of_models_to_show_in_floe_report): Number of best models to include in the Floe Report. By default, the best 5 models (ranked by accuracy) are kept so that the report meets memory requirements.
Type: integer
Default: 5
Preprocess Molecule (Preprocess Molecule): For every molecule, stores only largest component, adjusts ionization to neutral pH, rejects molecules that fail typecheck
Type: boolean
Default: True
Choices: [True, False]
Apply Blockbuster filter (Blockbuster Filter): Accept or reject molecules based on closeness to Blockbuster molecule properties. For details check toolkit oemolprop.
Type: boolean
Default: False
Choices: [True, False]
Advanced: Cheminformatics Fingerprint Options
Min Radius (minr): Minimum radius for the cheminformatics fingerprint. A separate model is built for each radius value, in every possible combination with the other parameters.
Required
Type: integer
Default: [0, 1, 3]
Max Radius (maxr): Maximum radius for the cheminformatics fingerprint. A separate model is built for each radius value, in every possible combination with the other parameters.
Required
Type: integer
Default: [3, 5]
Bit Length of Fingerprint (FP) (bitl): Bit length of the cheminformatics fingerprint. A separate model is built for each fingerprint length, in every possible combination with the other parameters.
Required
Type: integer
Default: [512, 1024, 2048, 4096]
Type of Fingerprint (FP) (ftype): Type of cheminformatics fingerprint. A separate model is built for each fingerprint type, in every possible combination with the other parameters. Supported types are Tree, Circular, and Path.
Required
Type: string
Default: [‘Tree’, ‘Circular’, ‘Path’]
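To make "all possible combinations" concrete, the sketch below enumerates the fingerprint settings implied by the default values in this section. It only illustrates the combinatorics; the variable names and tuple layout are assumptions, and the floe may filter or order combinations differently.

```python
from itertools import product

# Default values copied from the fingerprint parameters above.
min_radii = [0, 1, 3]
max_radii = [3, 5]
bit_lengths = [512, 1024, 2048, 4096]
fp_types = ["Tree", "Circular", "Path"]

# One candidate fingerprint setting per combination (hypothetical tuple layout).
fp_grid = list(product(min_radii, max_radii, bit_lengths, fp_types))

print(len(fp_grid))  # 72
print(fp_grid[0])    # (0, 3, 512, 'Tree')
```

Each of these fingerprint settings is then combined with every neural network hyperparameter setting, which is why trimming any one list reduces the total number of models multiplicatively.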
Advanced: Neural Network Hyperparameter Options
Cutoff ratio of input sample size to model parameter size (Model Ratio): If a model’s parameter size relative to the input sample size is greater than this ratio, the model is sent to failure. If every model exceeds this ratio, the smallest model is still sent for building.
Type: decimal
Default: 5
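One possible reading of this check is sketched below. It assumes a plain fully connected network, takes "parameter size" to mean the count of weights and biases, and takes "input sample" to mean the number of training molecules. These are all assumptions, as are the function names and the two-class output layer; the floe's exact definition may differ.

```python
def dense_param_count(n_inputs, hidden_layers, n_outputs):
    """Weights plus biases of a fully connected network (assumed architecture)."""
    sizes = [n_inputs, *hidden_layers, n_outputs]
    return sum((a + 1) * b for a, b in zip(sizes, sizes[1:]))

def exceeds_cutoff(n_params, n_samples, cutoff=5.0):
    """True if parameters per training sample exceed the cutoff (assumed meaning)."""
    return n_params / n_samples > cutoff

# 512-bit fingerprint, hidden layers 10/50/50, two output classes (assumed).
n_params = dense_param_count(512, (10, 50, 50), 2)
print(n_params)                         # 8332
print(exceeds_cutoff(n_params, 200))    # True
print(exceeds_cutoff(n_params, 30000))  # False
```

Under these assumptions, small datasets push most architectures over the cutoff, which is consistent with the advice to use at least 200 molecules.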
Dropouts (dr): List of dropout hyperparameters. Example: 0.2 means that a randomly picked 20% of the neural network nodes are turned off during training. This makes the model probabilistic. Dropout prevents overfitting by ensuring that no units are codependent with one another.
Required
Type: decimal
Default: [0.2]
Sets of Hidden Layers (hdl): Integer-only list(s) of hidden layer sizes. Each list (row) holds the candidate sizes for one hidden layer, and one value is picked from each row to form a network. Input and output layers are determined by the data. By default, the three rows create NN models with 3 hidden layers. All models must have at least two layers, so please keep at least two lists (rows). E.g., by default, picking the first value of each row (vertically) creates an NN with 3 hidden layers of size 10,50,50, followed by 10,50,75 (row1col1, row2col1, row3col2), 10,50,100 (row1col1, row2col1, row3col3), 10,75,50 (row1col1, row2col2, row3col1), and so on.
Required
Type: string
Default: [‘10,25’, ‘50,75’, ‘50,75,100’]
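The row and column picking described above corresponds to a Cartesian product over the rows. The following sketch reproduces the example architectures from the default value; the variable names are illustrative and the floe's internal parsing is not shown in this document.

```python
from itertools import product

# Default value of the Sets of Hidden Layers parameter.
hidden_layer_rows = ["10,25", "50,75", "50,75,100"]

# Parse each row into the candidate sizes for one hidden layer.
layer_choices = [[int(s) for s in row.split(",")] for row in hidden_layer_rows]

# Pick one size per row to form each architecture.
architectures = list(product(*layer_choices))

print(len(architectures))  # 12
print(architectures[:4])   # [(10, 50, 50), (10, 50, 75), (10, 50, 100), (10, 75, 50)]
```

Adding a value to any row multiplies the number of architectures, so widening a row is more expensive than it may look.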
Sets of Regularisation Layers (reg): List(s) of regularisation values, with sets separated by -1. No regularisation is applied to the input and output layers. Read horizontally: each 3-tuple corresponds to three hidden layers, with provisions to add or remove layers based on the Hidden Layer size assignment above. E.g., by default, each network will be fitted with (0.04, 0.02, 0.01) regularisation in hidden layers 1, 2, and 3 respectively. Another model built in parallel will be fitted with 0.02, 0.01, and 0.01 in layers 1, 2, and 3 respectively, and so on.
Required
Type: decimal
Default: [0.04, 0.02, 0.01, -1, 0.02, 0.01, 0.01]
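As an illustration of the -1 separator convention, the sketch below splits the default value into its per-network regularisation sets. The helper name `split_on_separator` is hypothetical and not part of the floe.

```python
def split_on_separator(values, sep=-1):
    """Split a flat list into sub-lists at each occurrence of the separator."""
    groups, current = [], []
    for v in values:
        if v == sep:
            groups.append(current)
            current = []
        else:
            current.append(v)
    groups.append(current)
    return groups

# Default value of the Sets of Regularisation Layers parameter.
reg_sets = split_on_separator([0.04, 0.02, 0.01, -1, 0.02, 0.01, 0.01])
print(reg_sets)  # [[0.04, 0.02, 0.01], [0.02, 0.01, 0.01]]
```

Each resulting set supplies one regularisation value per hidden layer, in layer order.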
Learning Rates (lrr): List of all the learning rate hyperparameters to train the model. This parameter determines how fast or slow we move towards the optimal weights. If the learning rate is very large, we will skip the optimal solution. If it is too small, we will need too many iterations to converge to the best values.
Required
Type: decimal
Default: [0.001]
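The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on the one-dimensional function f(w) = (w - 3)^2, whose optimum is w = 3. It is a generic illustration and unrelated to the optimizer the floe actually uses; the function name and step count are arbitrary.

```python
def gradient_descent(lr, steps=25, w=0.0):
    """Minimise f(w) = (w - 3)^2 with a fixed learning rate."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
        w -= lr * grad
    return w

w_good = gradient_descent(0.1)     # ends close to the optimum at 3
w_slow = gradient_descent(0.001)   # barely moves away from the start at 0
w_large = gradient_descent(1.1)    # overshoots further on every step
```

With the same number of steps, the moderate rate converges, the tiny rate has hardly left its starting point, and the large rate diverges.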
Max Epochs (ep): Maximum number of epochs to train the model. One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters.
Required
Type: integer
Default: 50
Activation (act): Activation Functions: relu, leakyrelu, prelu, tanh, selu, elu
Type: string
Default: relu
Choices: [‘relu’, ‘leakyrelu’, ‘prelu’, ‘tanh’, ‘selu’, ‘elu’]
Batch Size (bs): Batch size for training. The system back-propagates after scanning each batch to adjust its neural node weights based on how far the predictions were from the true responses. It should be chosen in tandem with the learning rate. Too small a batch size can get stuck in local minima, while too large a batch size might not converge.
Required
Type: integer
Default: 1024
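A quick way to reason about batch size is to count how many weight updates happen per epoch. The arithmetic below is generic and does not describe how the Adjust Batch Size option picks its value; the function name is illustrative.

```python
import math

def updates_per_epoch(n_train, batch_size):
    """Number of back-propagation updates in one pass over the training data."""
    return math.ceil(n_train / batch_size)

print(updates_per_epoch(200, 1024))    # 1
print(updates_per_epoch(30000, 1024))  # 30
print(updates_per_epoch(200, 32))      # 7
```

With the default batch size, a 200-molecule dataset yields a single update per epoch, which is one reason a smaller batch size can help on small training sets.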
Adjust Batch Size (ab): Adjust batch size automatically based on size of training data
Type: boolean
Default: True
Choices: [True, False]