ML Build: Regression Model with Tuner using Feature Input

This Floe trains multiple ML neural network regression models on the physical properties of small molecules.

Every molecule in the input dataset must have a Float physical property field to train on. In addition, every molecule needs a set of input features, stored as a float vector, to train the models on. The models train on these user-provided feature vectors, which are normalized before training. The input molecule provides a basis for depiction against features.
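The text above says only that feature vectors are normalized before training, not how. As a minimal sketch, assuming per-column standardization (zero mean, unit variance), the step could look like the following. The function name `standardize_columns` and the sample data are hypothetical and not part of the floe.

```python
from statistics import mean, pstdev


def standardize_columns(vectors):
    """Scale each feature column to zero mean and unit variance.

    Assumption: the floe's actual normalization scheme is not documented
    here; standardization is used purely for illustration.
    """
    columns = list(zip(*vectors))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 instead of 0.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in vectors
    ]


# Three hypothetical molecules with two features each.
features = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
scaled = standardize_columns(features)
```

In this sketch a constant feature is mapped to zeros rather than raising a division error.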

The floe builds machine learning models on user-input features for all possible combinations of neural network hyperparameters provided in the advanced sections. Read the documentation to learn more about these parameters and how they should be set for a given training data set.
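To see why the number of models grows quickly, the sketch below enumerates every combination of a few hyperparameter lists. The dictionary keys and values are hypothetical placeholders, not the floe's defaults or its internal parameter names.

```python
from itertools import product

# Hypothetical values; the real lists come from the advanced options.
grid = {
    "dropout": [0.1, 0.2],
    "hidden_layers": [(150, 100, 50), (100, 50)],
    "learning_rate": [0.001, 0.01],
    "activation": ["ReLU", "ELU"],
}


def expand_grid(options):
    """Return one configuration dict per combination of option values."""
    keys = list(options)
    value_lists = [options[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


configs = expand_grid(grid)
print(len(configs))  # 2 * 2 * 2 * 2 = 16 configurations
```

Each extra value in any one list multiplies the total, so trimming a single list is an effective way to reduce the number of models built.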

The floe generates a Floe Report containing details of the best models built. The user can pick any model and use it to predict properties of other molecules in the ML Predict: Regression using Feature Input Floe. The Floe Report also presents detailed statistics on the hyperparameters, which the user can analyze to adjust them and rerun the floe to build better models (see documentation).

In addition to prediction, the built models provide an explanation of predictions, a confidence interval, and the domain of application. The explainer picks the strongest features towards and against the prediction and reports them as histograms.
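The selection of the strongest features towards and against a prediction can be pictured with a small sketch. It assumes the explainer yields one signed weight per feature (positive pushes the prediction up, negative pushes it down). The weights, feature names, and the function `strongest_features` are hypothetical, and this is not the floe's actual explainer code.

```python
def strongest_features(weights, k=3):
    """Return the k strongest positive and k strongest negative features."""
    ranked = sorted(weights.items(), key=lambda item: item[1])
    # Most negative weights come first in the ascending ranking.
    against = [(name, w) for name, w in ranked[:k] if w < 0]
    # Most positive weights come last; reverse to list the strongest first.
    towards = [(name, w) for name, w in reversed(ranked[-k:]) if w > 0]
    return towards, against


# Hypothetical signed weights for five features.
weights = {"f0": 0.4, "f1": -0.3, "f2": 0.1, "f3": -0.05, "f4": 0.0}
towards, against = strongest_features(weights, k=2)
```

Features with zero weight are excluded from both lists, since they neither support nor oppose the prediction.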

Warning: By default, this floe builds about 1,000 machine learning models. On a large dataset, this may be expensive. Since multiple parameters contribute to this cost, refer to the documentation on how to build a cheaper version for practice. Building decent models generally requires a dataset of more than ~200 molecules (barring exceptions). The floe has been stress tested with up to 50,000 molecules. For larger datasets, increasing the memory and disk space requirements of the cubes is recommended.

Inputs

Name

Description

Type

Input Small Molecules to Train Machine Learning Models On

Input dataset file with each record containing a molecule and response value (float) to train on.

Molecule Dataset

Options

Name

Description

Type

Response Value Field

Name of the field containing the primary data being trained on and predicted.

Float

Custom Feature

Field containing the feature vector to train the model on. Must be a float vector.

FloatVec

Number of Models to Show in Floe Report

Number of best models to include in the Floe Report. By default, the best 20 models (ranked by r2 score) are kept, so that memory requirements are met.

Int

Preprocess Molecule

For every molecule, keeps only the largest component and adjusts ionization to neutral pH.

Bool

Apply Blockbuster Filter

Apply Blockbuster filter.

Bool

Negative Log

Transform the learning value to its negative log. Only for regression. Off: Build TensorFlow neural network model for prediction and explanation (Deterministic Model).

Bool

Number of Top Features to Explain

Number of top features to provide LIME vote results for. These are the most important features determining the property prediction, based on the trained ML model.

List
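The Negative Log option in the table above transforms the response value before training. The sketch below assumes a base-10 logarithm and strictly positive response values; the source does not state the base, so treat both as assumptions. The function names are hypothetical.

```python
import math


def negative_log10(value):
    """Return -log10(value). Assumption: base 10, value must be positive."""
    if value <= 0:
        raise ValueError("negative log is undefined for non-positive values")
    return -math.log10(value)


def inverse_negative_log10(transformed):
    """Map a transformed value back to the original scale."""
    return 10.0 ** (-transformed)


# A hypothetical response value of 1e-6 maps to roughly 6.0.
transformed = negative_log10(1e-6)
restored = inverse_negative_log10(transformed)
```

A transform of this kind compresses values that span several orders of magnitude into a narrower range, which is a common reason to apply it to a regression target.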

Advanced: Neural Network Hyperparameter Options

Name

Description

Type

Dropouts

List of dropout hyperparameters.

FloatVec

Sets of Hidden Layers

List(s) of hidden layers, with sets separated by -1. Input and output layers are determined by the data. Example: 150,100,50 creates a neural network with 3 hidden layers of sizes 150, 100, and 50.

IntVec

Sets of Regularization Layers

List(s) of regularization layers, with sets separated by -1. No regularization is applied to the input and output layers.

FloatVec

Learning Rates

List of learning rates to train the models with.

FloatVec

Max Epochs

Maximum number of epochs to train model.

Int

Activation

Activation Functions: ReLU, LeakyReLU, PReLU, tanh, SELU, ELU.

List

Batch Size

Batch size for training regressor.

Int

Adjust Batch Size

Adjust batch size automatically based on size of training data.
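The Sets of Hidden Layers and Sets of Regularization Layers options above both use -1 as a separator between sets. As a minimal sketch of how such a flat list could be split into individual sets (the function name `split_layer_sets` is hypothetical and not the floe's own parser):

```python
def split_layer_sets(values, separator=-1):
    """Split a flat list into sublists at each separator value."""
    sets, current = [], []
    for value in values:
        if value == separator:
            # Close the current set; skip empty sets from repeated separators.
            if current:
                sets.append(current)
            current = []
        else:
            current.append(value)
    if current:
        sets.append(current)
    return sets


# Two architectures: hidden layers 150-100-50 and hidden layers 100-50.
print(split_layer_sets([150, 100, 50, -1, 100, 50]))
# prints [[150, 100, 50], [100, 50]]
```

Each resulting sublist describes one candidate network, so the number of sets is one of the factors in the total number of models trained.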

Outputs

Name

Description

Type

Models Built

Output of generated models.

Dataset

Failure Output

Output of failure.

Dataset