ML Build: Regression Model Using Feature Input
This floe trains multiple machine learning (ML) neural network regression models on the physical properties of small molecules.
Every molecule in the input dataset needs a float-type physical property column to train on. In addition, every molecule needs a set of user-provided features stored as a float vector. The models train on these feature vectors, which are normalized before training. The input molecule provides a basis for depiction against features.
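The page does not say which normalization scheme the floe applies. As an illustration only, the sketch below assumes per-column standardization (zero mean, unit variance); the function name `standardize_columns` and the sample values are hypothetical and not part of the floe.

```python
from statistics import mean, pstdev


def standardize_columns(vectors):
    """Scale each feature column to zero mean and unit variance.

    Assumption: z-score standardization. The floe may use a different scheme.
    """
    columns = list(zip(*vectors))
    means = [mean(col) for col in columns]
    # A constant column has zero standard deviation; divide by 1.0 instead.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in vectors
    ]


# Hypothetical feature vectors, one per molecule.
features = [
    [1.0, 200.0, 0.5],
    [2.0, 400.0, 0.5],
    [3.0, 600.0, 0.5],
]
normalized = standardize_columns(features)
```

After scaling, columns with very different ranges (here the first and second) contribute on a comparable scale during training.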
It builds machine learning models on user input features for all possible combinations of neural network hyperparameters provided in the advanced sections. Read the documentation to learn more about these parameters and how they should be set for a given training data set.
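To see how the number of models grows with the hyperparameter lists, the combinations can be sketched as a Cartesian product. The values below are hypothetical, not the floe's defaults, and the sketch assumes every list is crossed with every other list.

```python
from itertools import product

# Hypothetical hyperparameter lists, for illustration only.
grid = {
    "dropout": [0.1, 0.2, 0.3],
    "hidden_layers": [(150, 100, 50), (100, 50)],
    "learning_rate": [0.001, 0.01],
    "activation": ["ReLU", "tanh"],
}


def expand_grid(grid):
    """Return one configuration dict per combination of hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


configs = expand_grid(grid)
# 3 dropouts * 2 layer sets * 2 learning rates * 2 activations = 24 models
print(len(configs))
```

Removing a single value from any list shrinks the total multiplicatively, which is why trimming the lists is an effective way to reduce cost.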
This floe generates a Floe Report containing details of the best models built. The user can pick any model and use it to predict properties of other molecules in the ML Predict: Regression Using Feature Input Floe. The Floe Report presents detailed statistics on the hyperparameters. The user can adjust them and rerun the floe to build better models (see documentation).
In addition to prediction, the built models provide an explanation of the predictions, a confidence interval, and the domain of application. The explainer picks the strongest features towards and against the prediction and reports them as a histogram.
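The selection step of such an explainer can be sketched as follows. This is not the floe's explainer: it assumes signed per-feature contribution weights are already available, and the names `strongest_features` and `weights` are hypothetical.

```python
def strongest_features(weights, k):
    """Pick the k strongest features towards and against a prediction.

    weights maps a feature name to a signed contribution: positive values
    push towards the prediction, negative values push against it.
    """
    towards = sorted(
        ((name, w) for name, w in weights.items() if w > 0),
        key=lambda item: -item[1],
    )
    against = sorted(
        ((name, w) for name, w in weights.items() if w < 0),
        key=lambda item: item[1],
    )
    return towards[:k], against[:k]


# Hypothetical contributions for five features of one molecule.
weights = {"f0": 0.4, "f1": -0.3, "f2": 0.1, "f3": -0.05, "f4": 0.25}
towards, against = strongest_features(weights, k=2)
```

The two returned lists correspond to the two sides of the histogram: the bars supporting the prediction and the bars opposing it.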
Warning: By default, this floe builds approximately 1,000 machine learning models. On a large dataset, this may be expensive. Since multiple parameters contribute to this cost, refer to this tutorial for how to build a cheaper version. A dataset needs at least 200 molecules to build decent models (barring exceptions). We have performed stress tests on as many as 50,000 molecules. We recommend increasing the memory and disk space requirements of the cubes to run on larger datasets.
Name | Description | Type |
---|---|---|
Input Small Molecules to Train Machine Learning Models On | Input dataset file with each record containing a molecule and response value (float) to train on. | Molecule Dataset |

Name | Description | Type |
---|---|---|
Response Value Field | Name of the field containing the primary data being trained on and predicted. | Float |
Custom Feature | Field containing the feature vector to train the model on. Must be a float vector. | FloatVec |
Number of Models to Show in Floe Report | Number of best models to include in the Floe Report. By default, keeps the best 20 models (ranked by r2 score) so that memory requirements are met. | Int |
Preprocess Molecule | For every molecule, stores only the largest component and adjusts ionization to neutral pH. | Bool |
Apply Blockbuster Filter | Apply blockbuster filter. | Bool |
Negative Log | Transform learning value to negative log. Only for regression. Off: Build TensorFlow neural network model for prediction and explanation (deterministic model). | Bool |
Number of Top Features to Explain | Number of top features to provide results for LIME votes. These are the most important features that determine property prediction based on trained ML models. | List |

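
The Negative Log option can be illustrated with a short sketch. The base of the logarithm is not stated on this page; base 10 is assumed here because it is common for potency-style data. The function names are hypothetical.

```python
import math


def negative_log(values):
    """Transform strictly positive response values to -log10(value).

    Assumption: base-10 logarithm. The floe may use a different base.
    """
    for value in values:
        if value <= 0:
            raise ValueError(f"negative log is undefined for {value!r}")
    return [-math.log10(value) for value in values]


def inverse_negative_log(transformed):
    """Map transformed values back to the original scale."""
    return [10.0 ** (-t) for t in transformed]


# Hypothetical response values spanning several orders of magnitude.
responses = [1e-6, 2.5e-8, 0.001]
transformed = negative_log(responses)
```

The transform compresses values that span many orders of magnitude into a narrow range, which is usually easier for a regression model to fit. It is only defined for strictly positive response values.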
Name | Description | Type |
---|---|---|
Dropouts | List of dropout hyperparameters. | FloatVec |
Sets of Hidden Layers | List(s) of hidden layers separated by -1. Input and output layers are determined by the data. For example, 150,100,50 creates a neural network with 3 hidden layers of sizes 150, 100, and 50. | IntVec |
Sets of Regularization Layers | List(s) of regularization layers separated by -1. No regularization on input and output layers. | FloatVec |
Learning Rates | List of all the learning rate hyperparameters to train the model. | FloatVec |
Max Epochs | Maximum number of epochs to train the model. | Int |
Activation | Activation functions: ReLU, LeakyReLU, PReLU, tanh, SELU, ELU. | List |
Batch Size | Batch size for training the regressor. | Int |

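
The -1 separator convention used by Sets of Hidden Layers and Sets of Regularization Layers can be sketched as a small parser. This is an illustration of the format described in the table, not the floe's own code; `split_layer_sets` is a hypothetical helper.

```python
def split_layer_sets(values, separator=-1):
    """Split a flat list into layer sets at each separator value."""
    sets, current = [], []
    for value in values:
        if value == separator:
            # Close the current set; ignore empty sets from repeated separators.
            if current:
                sets.append(current)
                current = []
        else:
            current.append(value)
    if current:
        sets.append(current)
    return sets


# Two architectures: three hidden layers, then two hidden layers.
hidden_layer_sets = split_layer_sets([150, 100, 50, -1, 100, 50])
```

Each resulting set describes one candidate network, so supplying more sets increases the number of models the floe builds.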
Name | Description | Type |
---|---|---|
Models Built | Output of generated models. | Dataset |
Failure Output | Output of failure. | Dataset |