Data Processing of Small Molecule for ML Model Building

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

  • Solution-based/Hit to Lead/Properties/Model Building

  • Task-based/ADME & Tox Assessment

  • Task-based/Data Science

  • Task-based/Cheminformatics

Description

This floe analyzes and preprocesses data for training machine learning models. The floe cleans each record by retaining the largest molecule (if multiple are present in a single record), adjusts the ionization to a neutral pH, and rejects molecules that fail typecheck. The floe examines molecule properties including molecular weight, atom count, XLogP, rotatable bonds, and polar surface area. It removes outlier molecules and sends them to the Failure Port. The Floe Report gives a detailed report on the results.

For duplicate molecules, the floe adds a duplicate warning and includes a box plot to indicate the count. For float (regression) response values, the duplicates are replaced by their average. If the variance of the float values is too high, all the duplicate molecules are rejected. For a string or int response, the duplicates are set to the response value with the highest count.
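The duplicate rules above can be sketched in a few lines of Python. This is only an illustration, not the floe's implementation: the function name `merge_duplicates`, the `(molecule_key, response)` record layout, and the `max_variance` threshold are all hypothetical, since the floe's actual variance limit is not stated here.

```python
from collections import Counter, defaultdict
from statistics import mean, pvariance


def merge_duplicates(records, max_variance=1.0):
    """Collapse duplicate molecules into one response per molecule.

    records: iterable of (molecule_key, response) pairs (hypothetical layout).
    max_variance: assumed threshold; the floe's real limit is not documented here.
    Returns (merged, rejected), where merged maps key -> response and
    rejected lists keys dropped because their float responses disagree too much.
    """
    grouped = defaultdict(list)
    for key, response in records:
        grouped[key].append(response)

    merged, rejected = {}, []
    for key, values in grouped.items():
        if all(isinstance(v, float) for v in values):
            # Regression-style response: reject noisy duplicates, else average.
            if len(values) > 1 and pvariance(values) > max_variance:
                rejected.append(key)
            else:
                merged[key] = mean(values)
        else:
            # String or int response: keep the most frequent value.
            merged[key] = Counter(values).most_common(1)[0][0]
    return merged, rejected


records = [
    ("CCO", 1.0), ("CCO", 1.2),            # consistent floats: averaged
    ("c1ccccc1", 0.0), ("c1ccccc1", 5.0),  # high variance: rejected
    ("CCN", "active"), ("CCN", "active"), ("CCN", "inactive"),
]
merged, rejected = merge_duplicates(records, max_variance=1.0)
```

In this sketch, rejected molecules would correspond to records sent to the Failure Port, and `merged` to the cleaned output.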

The Floe Report provides additional details, such as the correlation of responses with physical properties and the count of outliers. We recommend using this floe on noisy datasets before sending them to the model building floes. The floe has been stress tested on up to 30,000 molecules. For larger datasets (>10K), we recommend increasing the memory and disk space requirements of the cubes.

New Feature: This floe includes a scaffold analysis of the molecules in a dataset. The Floe Report provides statistical analysis for the molecule scaffolds (variants). Scaffold analysis parses molecular datasets to identify common structural frameworks. The floe can (a) uncolor molecules, (b) cluster scaffolds, and (c) either ‘evenly distribute’ the clustered scaffolds between train and test or ‘hold out’ novel scaffolds for testing to inspect model generalization. It can also (d) attempt to split the molecules into train and test sets based on the clustered scaffolds and (e) optionally, if a good split is found, tag the data as either scaffold train or scaffold test for the ML Build floes. Options a through e are found in the Advanced: Scaffold Splitting for Regression Options. All calculations are done in 2D. To run scaffold splitting on datasets larger than 30k molecules, increase the memory of the Scaffold Splitting cube.

Finally, there is a new option to featurize the molecules for training graph convolutional neural networks. This outputs a collection that contains tensor data for the featurized molecules. More details on how the floe operates can be found in this tutorial.

Promoted Parameters

Title in user interface (promoted name)

Inputs

Input Small Molecules to train machine learning models on (in): Input dataset file with each record containing molecule and response value (float) to train on.

  • Required

  • Type: data_source

Outputs

ML Data Processed Output Molecules (out): ML Data Processed Output of Modified Molecule

  • Type: dataset_out

  • Default: ML Data Processed Output Molecules

ML Data Processed Failure Molecules (failed_out): ML Data Processed Output of Failure Molecules

  • Type: dataset_out

  • Default: ML Data Processed Failure Molecules

Output Collection name for Graph Feature Vector (outc): Collection Feature Name that serves as input to the Floe: ML Build: Graph convolution Model on pregenerated Features for Small Molecules

  • Required

  • Type: collection_sink

  • Default: Graph Feature Collection

Options

Blockbuster Preprocess Molecule (blockbuster): For every molecule apply Blockbuster Filter

  • Type: boolean

  • Choices: [True, False]

Response Value Field (rf): Name of the field containing the primary data being trained on and predicted. Every molecule needs to have this value; molecules without it are ignored. Must be Float, Int, or String.

  • Required

  • Type: field_parameter

Response Analysis (tor):

  • Type: string

  • Default: Auto

  • Choices: [‘Classification’, ‘Regression’, ‘Auto’]

Remove molecules with outlier chemical properties (outl): Trim outlier: removes molecules with MW, AC, RB, etc. higher than 3*variance of the data, as shown in the histograms of the Floe Report. Cutoff: uses the next two options to remove specific ranges of molecule properties. Do not trim: does not remove any outliers.

  • Type: string

  • Default: Trim outlier

  • Choices: [‘Trim outlier’, ‘Do not trim’, ‘Cutoff’]

Feature for Min and Max Cutoff Values (mmr): Takes effect only when the parameter Remove molecules with outlier chemical properties is set to ‘Cutoff’. Chooses the feature to apply the cutoff to.

  • Type: string

  • Default: Mol Wt

  • Choices: [‘Mol Wt’, ‘Atom Count’, ‘XLogP’, ‘Rotatable Bonds’, ‘PSA’, ‘Response’]

Min and Max Cutoff Values (mmc): Takes effect only when the parameter Remove molecules with outlier chemical properties is set to ‘Cutoff’. Removes molecules whose value for the feature chosen in Feature for Min and Max Cutoff Values (MW by default) falls outside the given range. The first value is the lower cutoff and the second value is the upper cutoff.

  • Type: decimal

  • Default: [200, 800]
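As a rough illustration of the ‘Cutoff’ behavior described above, the following sketch keeps records whose chosen feature lies within the lower and upper cutoff and routes the rest to a failure list. The dictionary record layout, the function name `apply_cutoff`, and the choice of inclusive bounds are assumptions made for the example; the floe's own boundary handling is not specified here.

```python
def apply_cutoff(records, feature="Mol Wt", cutoff=(200.0, 800.0)):
    """Split records by a min/max cutoff on one property.

    records: list of dicts holding precomputed properties (hypothetical layout).
    feature: property to test, mirroring the mmr choices.
    cutoff: (lower, upper) pair, mirroring the mmc default of [200, 800].
    Bounds are treated as inclusive, which is an assumption.
    """
    low, high = cutoff
    kept, failed = [], []
    for rec in records:
        if low <= rec[feature] <= high:
            kept.append(rec)
        else:
            failed.append(rec)  # analogous to the Failure Port
    return kept, failed


molecules = [
    {"name": "fragment", "Mol Wt": 120.1},
    {"name": "lead", "Mol Wt": 412.5},
    {"name": "macrocycle", "Mol Wt": 950.0},
]
kept, failed = apply_cutoff(molecules)
```

With the default range, only the record named `lead` is kept; the other two fall outside 200 to 800.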

New Feature: Scaffold Splitting (doscf): Performs scaffold analysis of the molecules in the dataset. Parses molecular datasets to identify common structural cores. Molecules with similar scaffolds often share biological activities. Scaffold-based splits are also useful for dividing data into train and test sets for machine learning.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

New Feature: Graph Feature Vector Generation (dogcnn): Outputs a collection with graph node and edge features. These are PyTorch tensors that can be used to train graph convolutional neural network models.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Advanced Options: Scaffold Splitting for Training ML

Uncolor Molecule (uncolor): Cluster based on scaffold structure rather than atoms. Uncolors the molecule based on OEUncolorStrategy, including ConvertAtomTypeToC, ConvertBondTypeToSingle, RemoveAtomStereo, RemoveBondStereo, RemoveGroupStereo, RemoveAtomProperties, RemoveDimension

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Clustering Algorithm for Scaffolds (clustertype): Clusters on pairwise fingerprint (FP) Tanimoto distance to find similar substructures in scaffolds. For KMeans++, the floe tries different cluster sizes and chooses the best (minimum intra-cluster distance).

  • Type: string

  • Default: DBScan

  • Choices: [‘DBScan’, ‘KMeans++’]
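For readers unfamiliar with the metric, the pairwise Tanimoto distance that the clustering operates on can be sketched as follows. Fingerprints are represented here as sets of on-bit indices, which is an assumption made to keep the example self-contained; the floe's actual fingerprint type and clustering code are not shown in this document.

```python
def tanimoto_distance(fp_a, fp_b):
    """Return 1 - |A & B| / |A | B| for two fingerprints given as on-bit sets."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints are treated as identical (a convention choice).
        return 0.0
    return 1.0 - len(a & b) / union


def pairwise_distances(fingerprints):
    """Build the symmetric distance matrix a clustering algorithm would consume."""
    n = len(fingerprints)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = tanimoto_distance(fingerprints[i], fingerprints[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical on-bit sets for three scaffolds.
fps = [{1, 2, 3}, {2, 3, 4}, {10, 11}]
matrix = pairwise_distances(fps)
```

Here the first two scaffolds share two of four distinct bits, giving a distance of 0.5, while the third shares none with either and sits at distance 1.0.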

How to distribute scaffold splits for training (learnscope): Even Distribute: Make sure all types of clustered scaffolds are evenly distributed among test and train. Hold out: Keep novel scaffold clusters out of training to test the model's ability to generalize.

  • Type: string

  • Default: Hold out

  • Choices: [‘Even Distribute’, ‘Hold out’]

Train size for splitting total molecule data between train and test (sps): Train size for splitting the molecules into train and test based on clustered scaffolds. Note that there is no guarantee that the data contain scaffolds that can be split in the requested ratio; the Floe Report gives an estimate of the split achieved.

  • Type: decimal

  • Default: 0.9
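The difference between the two distribution strategies, and why the requested train size is not always achievable, can be illustrated with a small sketch. It assumes each molecule already carries a scaffold cluster label; the function `split_by_cluster` and its greedy largest-cluster-first rule are hypothetical and are not taken from the floe.

```python
from collections import defaultdict


def split_by_cluster(cluster_ids, train_size=0.9, strategy="Hold out"):
    """Return (train_indices, test_indices) for molecules labeled by cluster.

    cluster_ids[i] is the scaffold cluster of molecule i (assumed precomputed).
    """
    clusters = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        clusters[cid].append(idx)

    train, test = [], []
    if strategy == "Hold out":
        # Whole clusters go to one side, so test scaffolds are unseen in training.
        target = train_size * len(cluster_ids)
        for members in sorted(clusters.values(), key=len, reverse=True):
            if len(train) + len(members) <= target:
                train.extend(members)
            else:
                test.extend(members)
    elif strategy == "Even Distribute":
        # Every cluster contributes to both sides in roughly the requested ratio.
        for members in clusters.values():
            n_train = round(train_size * len(members))
            train.extend(members[:n_train])
            test.extend(members[n_train:])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(train), sorted(test)


# Ten molecules in three scaffold clusters of sizes 6, 3, and 1.
labels = [0] * 6 + [1] * 3 + [2]
held_train, held_test = split_by_cluster(labels, 0.8, "Hold out")
even_train, even_test = split_by_cluster(labels, 0.8, "Even Distribute")
```

In this example the hold-out split ends up with 7 of 10 molecules in training rather than the requested 8, because clusters cannot be divided. That is the kind of mismatch the train size note above refers to.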