Data Processing of Small Molecules for ML Model Building
This floe analyzes and preprocesses small-molecule data for training machine learning (ML) models. It cleans each record by retaining only the largest molecule (if the record contains more than one), setting the molecule to neutral pH, and removing charges. It then examines molecular properties, including molecular weight, atom count, XLogP, rotatable bond count, and polar surface area (PSA), and sends outlier molecules to the Failure Port. The Floe Report describes these steps in detail.
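The rule the floe uses to flag outliers is not stated here. As an illustration only, the sketch below assumes Tukey's interquartile-range (IQR) fence on a single property; the function name `split_outliers`, the `mol_weight` key, and the multiplier `k` are hypothetical and are not part of the floe.

```python
from statistics import quantiles


def split_outliers(records, prop, k=1.5):
    """Split records into (passed, failed) using an IQR fence on one property.

    Assumption: Tukey's rule with multiplier k. The floe's actual
    outlier criterion may differ.
    """
    values = [r[prop] for r in records]
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the property
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    passed = [r for r in records if low <= r[prop] <= high]
    failed = [r for r in records if not (low <= r[prop] <= high)]
    return passed, failed


# Hypothetical molecular weights; the last one is far outside the rest.
weights = [100, 110, 120, 130, 140, 150, 160, 900]
records = [{"mol_weight": w} for w in weights]
passed, failed = split_outliers(records, "mol_weight")
```

In this sketch, `failed` plays the role of the records routed to the Failure Port, and `passed` holds the records that continue downstream.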
Duplicate molecules trigger a duplicate warning in the report, along with a box plot indicating the count. For float (regression) response values, duplicates are merged and assigned the average of their responses; if the variance of those values is too high, all of the duplicate molecules are rejected. For string or int (classification) responses, duplicates are assigned the most frequent response value.
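The duplicate-handling rules above can be sketched roughly as follows. This is a minimal illustration, not the floe's implementation: the molecule key (for example, a canonical SMILES string), the function name `merge_duplicates`, and the `max_variance` threshold are all assumptions, since the actual variance cutoff is not given here.

```python
from collections import Counter, defaultdict
from statistics import mean, pvariance


def merge_duplicates(records, max_variance=1.0):
    """Collapse duplicate molecules into one response each.

    records: iterable of (molecule_key, response) pairs.
    Returns (kept, rejected): a dict of key -> merged response, and a
    list of keys whose float responses disagreed too much.
    max_variance is a hypothetical threshold.
    """
    groups = defaultdict(list)
    for key, response in records:
        groups[key].append(response)

    kept, rejected = {}, []
    for key, values in groups.items():
        if all(isinstance(v, float) for v in values):
            # Regression: average the duplicates unless they are too spread out.
            if len(values) > 1 and pvariance(values) > max_variance:
                rejected.append(key)
            else:
                kept[key] = mean(values)
        else:
            # Classification: keep the most frequent label.
            kept[key] = Counter(values).most_common(1)[0][0]
    return kept, rejected


kept, rejected = merge_duplicates([
    ("CCO", 1.0), ("CCO", 1.2),                 # consistent floats: averaged
    ("c1ccccc1", 0.0), ("c1ccccc1", 10.0),      # high variance: rejected
    ("CC", "active"), ("CC", "active"), ("CC", "inactive"),  # majority label
])
```

On a tie between labels, `Counter.most_common` returns the label seen first; how the floe breaks ties is not specified here.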
The Floe Report also shows how the response correlates with the physical properties and gives a count of outliers. We recommend running this floe on noisy datasets before sending them to the model building floes.
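To make the idea of correlating the response with a physical property concrete, here is a small sketch that computes a Pearson correlation coefficient. The choice of Pearson correlation is an assumption, as the measure used in the Floe Report is not stated here, and the property and response values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        # Correlation is undefined when one variable is constant.
        return float("nan")
    return cov / (spread_x * spread_y)


# Hypothetical XLogP values and responses for five molecules.
xlogp = [0.5, 1.0, 2.0, 3.5, 4.0]
response = [5.1, 5.6, 6.4, 8.0, 8.3]
r = pearson(xlogp, response)
```

A value of `r` near 1 or -1 would suggest the response tracks that property closely, which is worth knowing before model building.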
Name | Description | Type |
---|---|---|
Input Small Molecules to Train Machine Learning Models On | Input dataset file with each record containing molecule and response value (float) to train on. | Molecule Dataset |

Name | Description | Type |
---|---|---|
Blockbuster Preprocess Molecule | For every molecule, apply Blockbuster filter. | Bool |
Response Value Field | Name of the field containing the primary data being trained on and predicted. | Float, Int, or String |
Response Analysis | Auto: detects regression or classification based on response value. Regression: expects Float. Classification: expects String or Int. | List |
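
The Auto setting of Response Analysis can be pictured with a short sketch that picks a mode from the type of the response values. This only mirrors the type rules listed above (Float for regression, String or Int for classification); the function name `detect_response_type` and the handling of mixed or empty input are assumptions, not the floe's documented behavior.

```python
def detect_response_type(values):
    """Return 'regression' for floats, 'classification' for ints or strings.

    Assumption: mixed or empty input is treated as an error here; the
    floe may handle such cases differently.
    """
    values = list(values)
    if not values:
        raise ValueError("no response values given")
    if all(isinstance(v, float) for v in values):
        return "regression"
    if all(isinstance(v, (int, str)) for v in values):
        return "classification"
    raise ValueError("mixed response types; choose the mode explicitly")


mode_a = detect_response_type([6.2, 7.1, 5.9])        # float responses
mode_b = detect_response_type(["active", "inactive"])  # string labels
mode_c = detect_response_type([1, 0, 1])               # integer classes
```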

Name | Description | Type |
---|---|---|
Output Property | Output dataset to write to. | Dataset |
Failure Property | Output dataset to write to. | Dataset |