ML Build: Classification Model with Tuner using Fingerprints for Small Molecules

This floe trains multiple ML neural network classification models on the properties of small molecules.

The models train on 2D fingerprints, which are generated in the floe itself. Every molecule in the input dataset needs a string-type property field to train on (such as High or Low). Records without this field are ignored.

It builds machine learning models for all possible combinations of cheminformatics (fingerprint) and neural network hyperparameters provided in the advanced sections. Read the documentation to learn more about these parameters and how they should be set for a given training data set.
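
As a rough illustration of how the number of models grows, the sketch below enumerates every combination of a small parameter grid. The parameter names and values are hypothetical placeholders chosen for the example; they are not the floe's defaults, and this is not the floe's internal code.

```python
from itertools import product

# Hypothetical parameter lists (assumed for illustration only).
grid = {
    "min_radius": [0, 1],
    "max_radius": [2, 3],
    "bit_length": [2048, 4096],
    "dropout": [0.1, 0.2],
    "learning_rate": [0.001, 0.01],
    "hidden_layers": [(150, 100, 50), (100, 50)],
}


def enumerate_combinations(grid):
    """Return one dict per combination of the listed parameter values."""
    keys = list(grid)
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


combos = enumerate_combinations(grid)
print(len(combos))  # 64 for the lists above (2 values in each of 6 lists)
```

Each additional value in any list multiplies the total, so shortening the lists is the most direct way to reduce the number of models built.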

The floe generates a Floe Report containing details of the best models built. The user can pick any model built using this floe and use it to predict properties of other molecules in the ML Predict: Classification using Fingerprints for Small Molecules Floe. The Floe Report presents detailed statistics on the hyperparameters; the user can analyze these, adjust the hyperparameters, and rerun the floe to build better models (see documentation).

Furthermore, the floe picks the best models and fine-tunes them using Keras Tuner.

In addition to prediction, the built models provide an explanation of predictions, a confidence interval, and the domain of application.

Warning: By default, this floe builds about 2,500 machine learning models. On a large dataset, this may be expensive, costing more than $100. Since multiple parameters contribute to this cost, refer to the documentation on how to build a cheaper version for practice. Building decent models generally requires a training set of more than ~200 molecules (barring exceptions). We have stress tested up to 30,000 molecules. For larger datasets, it is recommended to increase the memory and disk space requirements of the cubes.

Inputs
Name	Description	Type
Input Small Molecules to Train Machine Learning Models on	Input dataset file with each record containing a molecule and response value (String) to train on.	Molecule Dataset

Options
Name	Description	Type
Response Value Field	Name of the field containing the primary data being trained on and predicted.	String
Are We Using Keras Tuner*	If this is On, we fine-tune our algorithm using the Keras Tuner.	Bool
What Kind of Keras Tuner to Use	Choose between Hyperband, RandomSearch, and Bayesian Optimization.	String
Number of Models to Show in Floe Report	Number of best models to include in the Floe Report. By default, the best 20 models (ranked by accuracy) are kept so that memory requirements are met.	Int
Preprocess Molecule	For every molecule, keeps only the largest component and adjusts ionization to neutral pH.	Bool
Apply Blockbuster Filter	Apply Blockbuster filter.	Bool
Molecule Explainer Type	Select explainer visualization. Atom: annotate atoms only; Fragment: annotate fragments; Combined: annotate both.	List

Advanced: Cheminfo Fingerprint Options
Name	Description	Type
Min Radius	Minimum radius for cheminfo fingerprints.	IntVec
Max Radius	Maximum radius for cheminfo fingerprints.	IntVec
Bit Length of Fingerprints (FP)	Bit length of cheminfo fingerprints.	IntVec
Type of Fingerprints (FP)	Type of cheminfo fingerprints.	IntVec

Advanced: Neural Network Hyperparameter Options
Name	Description	Type
Dropouts	List of dropout hyperparameters.	FloatVec
Sets of Hidden Layers	List(s) of hidden layers separated by -1. Input and output layers will be determined by data. Example: 150,100,50 will create NN with 3 hidden layers of size 150, 100, 50.	IntVec
Sets of Regularization Layers	List(s) of regularization layers separated by -1. No regularization on input and output layers.	FloatVec
Learning Rates	List of all the learning rate hyperparameters to train models.	FloatVec
Max Epochs	Maximum number of epochs to train models.	Int
Activation	Activation Functions: ReLU, LeakyReLU, PReLU, tanh, SELU, ELU.	List
Batch Size	Batch size for training the classifier.	Int
Adjust Batch Size	Adjust batch size automatically based on size of training data.	Bool
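
To make the -1 separator convention used by Sets of Hidden Layers and Sets of Regularization Layers concrete, here is a minimal sketch of how such a flat list could be split into individual layer sets. The function name and the handling of empty groups are assumptions made for this example; the floe's own parsing may differ.

```python
def split_layer_sets(values, separator=-1):
    """Split a flat list into sub-lists at each separator value.

    Hypothetical helper: empty groups (for example, a trailing
    separator) are skipped.
    """
    layer_sets = []
    current = []
    for value in values:
        if value == separator:
            if current:
                layer_sets.append(current)
            current = []
        else:
            current.append(value)
    if current:
        layer_sets.append(current)
    return layer_sets


# Two sets of hidden layers: one with 3 layers and one with 2 layers.
hidden = split_layer_sets([150, 100, 50, -1, 100, 50])
print(hidden)  # [[150, 100, 50], [100, 50]]
```

Under this reading, an entry of 150,100,50,-1,100,50 describes two candidate networks, and each set counts as one value when the floe forms hyperparameter combinations.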

Outputs
Name	Description	Type
Models Built	Output of generated models.	Dataset
Failure Output	Output of failure.	Dataset