Data Processing of Small Molecules for ML Model Building

This floe analyzes and preprocesses data for training machine learning (ML) models. It cleans each record by retaining the largest molecule (if a single record contains more than one), setting a neutral pH, and removing charges. The floe examines molecular properties including molecular weight, atom count, XLogP, rotatable bonds, and polar surface area (PSA). It removes outlier molecules and sends them to the Failure port. The floe report describes these steps in detail.

Duplicate molecules are flagged with a Duplicate warning, and the report includes a box plot indicating their count. For a float (regression) response, duplicates are assigned the average of their response values; if the variance of those values is too high, all of the duplicate molecules are rejected. For a string or int response, duplicates are assigned the most frequently occurring response value.
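
The duplicate-handling rules above can be sketched in plain Python. This is only an illustration, not the floe's implementation: the function name `resolve_duplicates`, the `max_variance` threshold and its default, the use of population variance, and the use of a string key (for example a canonical SMILES) to identify duplicates are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from statistics import mean, pvariance


def resolve_duplicates(records, max_variance=1.0):
    """Merge duplicate molecules (hypothetical helper, not the floe's API).

    records: iterable of (molecule_key, response) pairs, where molecule_key
    identifies a molecule (assumed here to be a canonical SMILES string).
    Returns (kept, rejected): a dict mapping each key to its merged response,
    and a list of keys whose duplicates were rejected.
    """
    groups = defaultdict(list)
    for key, response in records:
        groups[key].append(response)

    kept, rejected = {}, []
    for key, values in groups.items():
        if len(values) == 1:
            # Not a duplicate: keep the response unchanged.
            kept[key] = values[0]
        elif all(isinstance(v, float) for v in values):
            # Regression: reject the whole group if the values disagree
            # too much, otherwise keep their average.
            if pvariance(values) > max_variance:
                rejected.append(key)
            else:
                kept[key] = mean(values)
        else:
            # Classification (string or int): keep the most frequent value.
            kept[key] = Counter(values).most_common(1)[0][0]
    return kept, rejected
```

Calling `resolve_duplicates` on a list containing two float entries for the same key returns their mean when they agree within `max_variance`, and lists the key under `rejected` when they do not.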

The floe report provides additional details, such as the correlation of the response with physical properties and the count of outliers. We recommend running this floe on noisy datasets before sending them to the model building floes.
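
As a rough illustration of the kind of correlation summary the report describes, the sketch below computes a Pearson correlation between a numeric response and each property column. The source does not state which correlation measure the floe uses, so Pearson is an assumption, as are the function names `pearson` and `correlate_with_properties` and the dictionary layout of the property data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        # A constant column has no defined correlation.
        return float("nan")
    return cov / (spread_x * spread_y)


def correlate_with_properties(responses, properties):
    """Correlate the response with each property column.

    properties: dict mapping a property name (e.g. "XLogP") to a list of
    values, one per molecule, in the same order as responses.
    """
    return {name: pearson(values, responses) for name, values in properties.items()}
```

A value near 1 or -1 for a property suggests the response tracks that property closely, which is worth knowing before training a model on the cleaned data.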

Inputs
Name	Description	Type
Input Small Molecules to train machine learning models on	Input dataset file in which each record contains a molecule and a response value (float) to train on	Molecule Dataset

Options
Name	Description	Type
Blockbuster Preprocess Molecule	Apply the Blockbuster filter to every molecule	Bool
Response Value Field	Name of the field containing the primary data being trained on and predicted.	Float, Int, or String
Response Analysis	Auto: detects Regression or Classification based on the response value. Regression: expects Float. Classification: expects String or Int.	List
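
The Auto setting of Response Analysis can be pictured with a small type check like the one below. This is a minimal sketch of one plausible rule, not the floe's actual detection logic: the function name `detect_response_analysis` is hypothetical, and treating mixed or unsupported types as an error is an assumption.

```python
def detect_response_analysis(values):
    """Guess the analysis mode from the Python types of the response values.

    Returns "Regression" if every value is a float, and "Classification"
    if every value is a string or an int.
    """
    values = list(values)
    if not values:
        raise ValueError("no response values to inspect")
    if all(isinstance(v, float) for v in values):
        return "Regression"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, (int, str)) and not isinstance(v, bool) for v in values):
        return "Classification"
    raise ValueError("mixed or unsupported response types")
```

Setting Response Analysis explicitly to Regression or Classification avoids relying on any such inference when the type of the response field is already known.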

Outputs
Name	Description	Type
Output Property	Output dataset to write to	Dataset
Failure Property	Output dataset to which failed records, such as outlier molecules, are written	Dataset