ML Build: Regression Model with Tuner using Fingerprints for Small Molecules

This Floe trains multiple machine learning (ML) neural network regression models on physical properties of small molecules.

The models train on 2D fingerprints, which are generated within the Floe itself. Every molecule in the input dataset needs a physical property column to train on; molecules without one are ignored.

It builds machine learning models for all possible combinations of cheminformatics (fingerprint) and neural network hyperparameters provided in the Advanced sections. Read the documentation to learn more about these parameters and how they should be set for a given training data set.
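
Because every combination is built, the number of models grows multiplicatively with each list of values supplied. The following is a minimal sketch of that counting in plain Python. The dictionary keys and all values are hypothetical, chosen only for illustration; they are not the Floe's parameter identifiers or defaults.

```python
from itertools import product

# Hypothetical option values, chosen only to illustrate the counting.
grid = {
    "min_radius": [0, 1],
    "max_radius": [2, 3],
    "bit_length": [2048, 4096],
    "dropout": [0.1, 0.2],
    "hidden_layers": [(150, 100, 50), (200, 100)],
    "learning_rate": [0.001, 0.0001],
}

# One candidate model per element of the Cartesian product of all lists.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combinations))  # 2 * 2 * 2 * 2 * 2 * 2 = 64
print(combinations[0])
```

In this sketch, reducing any list from two values to one halves the total, which is the general idea behind building a cheaper run.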

The Floe then picks the best models and fine-tunes them using Keras Tuner.
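
How the Floe selects models internally is not described here. The description of the "Number of Models to show in Floe report" option states that the best models are kept based on r2 score, so the sketch below assumes a simple ranking by the coefficient of determination on held-out data. The function names, model names, and numbers are all hypothetical.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def best_models(predictions, y_true, keep=20):
    """Return (name, r2) pairs sorted by r2, best first, truncated to `keep`."""
    scored = [(name, r2_score(y_true, y_pred)) for name, y_pred in predictions.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:keep]


# Made-up held-out values and predictions from three hypothetical models.
y_true = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "model_a": [1.1, 1.9, 3.2, 3.8],
    "model_b": [2.5, 2.5, 2.5, 2.5],  # always predicts the mean, so r2 = 0
    "model_c": [1.0, 2.0, 3.0, 4.0],  # perfect predictions, so r2 = 1
}

top = best_models(predictions, y_true, keep=2)
print(top)
```

With these numbers, `model_c` ranks first, `model_a` second, and `model_b` is dropped.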

The Floe generates a Floe report containing details of the best models built. Users can pick any model and use it to predict properties of other molecules in the Floe ML Regression using Fingerprints for Small Molecules. The Floe report also presents detailed statistics on the hyperparameters; use these to adjust the hyperparameters and rerun the Floe to build better models (see documentation).

In addition to prediction, the built models provide an explanation of predictions, a confidence interval, and the domain of application.

Warning: By default, this Floe builds about 2,000 machine learning models, which may be expensive on a large dataset. Since multiple parameters contribute to this cost, refer to the documentation on how to build a cheaper version for practice. Building decent models generally requires a dataset of at least 100 molecules (barring exceptions). We have stress tested up to 30,000 molecules. For larger datasets, it is recommended to increase the Memory and Disk Space requirements of the cubes; please refer to the documentation on how to build models on a larger dataset.

Inputs
Name	Description	Type
Input Small Molecules to train machine learning models on.	Input dataset file with each record containing a molecule and a response value (float) to train on	Molecule Dataset

Options
Name	Description	Type
Response Value Field	Name of the field containing the primary data being trained on and predicted.	Float
Are we using Keras tuner	If this is on, the models are fine-tuned using Keras Tuner	Bool
What kind of keras tuner to use	Choose between Hyperband, RandomSearch, and Bayesian Optimization	String
Number of Models to show in Floe report	How many of the best models to provide in the Floe report. By default, keeps the best 20 models (based on r2 score), subject to memory requirements	Int
Are we training tensorflow probability	True: builds a TensorFlow Probability based neural network model for finding the domain of application and error bars. False: builds a TensorFlow neural network model for prediction and explanation (deterministic model)	Bool
Preprocess Molecule	For every molecule, stores only the largest component and adjusts ionization to neutral pH	Bool
Apply Blockbuster filter	Apply blockbuster filter	Bool
Negative Log	Transform the learning value to its negative log. Only for regression.	Bool
Molecule Explainer Type	Select explainer visualization. Atom: annotate atoms only, Fragment: Annotate Fragments, Combined: Annotate Both	List
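
As an illustration of the Negative Log option, the sketch below applies a negative logarithm to a few made-up response values. The base of the logarithm is not stated in this page; base 10 is assumed here, as is common when converting a molar concentration to a p-scale. The function name is hypothetical.

```python
import math


def negative_log(value):
    """Return -log10(value); defined only for strictly positive values.

    Base 10 is an assumption made for this sketch.
    """
    if value <= 0:
        raise ValueError("negative log requires a strictly positive value")
    return -math.log10(value)


# Made-up response values, e.g. concentrations in molar units.
responses = [1e-6, 1e-9, 0.5]
transformed = [negative_log(v) for v in responses]
print([round(t, 3) for t in transformed])
```

Zero or negative response values have no logarithm, so a dataset containing them would need to be cleaned before such a transform is applied.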

Advanced: Cheminfo Fingerprint Options
Name	Description	Type
Min Radius	Minimum radius for cheminfo fingerprint.	IntVec
Max Radius	Maximum radius for cheminfo fingerprint.	IntVec
Bit Length of FP	Bit Length of cheminfo fingerprint	IntVec
Type of FP	Type of cheminfo fingerprints	IntVec

Advanced: Neural Network Hyperparameter Options
Name	Description	Type
Dropouts	List of dropout hyperparameters.	FloatVec
Sets of Hidden Layers	List(s) of hidden layer sizes, with sets separated by -1. Input and output layers are determined by the data. E.g., 150,100,50 creates a neural network with 3 hidden layers of sizes 150, 100, and 50.	IntVec
Sets of Regularisation Layers	List(s) of regularisation values per layer, with sets separated by -1. No regularisation is applied to the input and output layers.	FloatVec
Learning Rates	List of all the learning rate hyperparameters to train model.	FloatVec
Max Epochs	Maximum number of epochs to train model.	Int
Activation	Activation Functions: ReLU, LeakyReLU, PReLU, tanh, SELU, ELU	List
Batch Size	Batch size for training regressor	Int
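
The -1 separator convention used by Sets of Hidden Layers and Sets of Regularisation Layers can be sketched as follows. This is a minimal illustration of the convention as described in the table, not the Floe's own parser; the function name and the regularisation values are hypothetical.

```python
def split_sets(flat, separator=-1):
    """Split a flat list into sub-lists at each separator value."""
    sets, current = [], []
    for value in flat:
        if value == separator:
            if current:  # skip empty sets from leading or repeated separators
                sets.append(current)
            current = []
        else:
            current.append(value)
    if current:
        sets.append(current)
    return sets


hidden = split_sets([150, 100, 50, -1, 200, 100])
regularisation = split_sets([0.04, 0.02, 0.01, -1, 0.04, 0.02])

print(hidden)          # [[150, 100, 50], [200, 100]]
print(regularisation)  # [[0.04, 0.02, 0.01], [0.04, 0.02]]
```

Under this reading, the first input describes two candidate architectures: one with three hidden layers and one with two.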

Outputs
Name	Description	Type
Models Built	Output of Generated Models	Dataset
Failure Output	Output of Failure	Dataset