ML Build: Graph Convolution Model on Pregenerated Features for Small Molecules

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

Description

This floe trains a graph convolutional neural network (GCNN) on pregenerated graph features for small molecules. The graph architecture uses PyTorch Geometric running on top of PyTorch DDP (Distributed Data Parallel) to train models, so training can be scaled to much larger datasets by increasing the GPU count. The floe takes as input the graph features generated by the Data Processing of Small Molecule for ML Model Building floe. The pipeline is designed to predict various small molecule properties and supports both regression and classification tasks, depending on the “Response Value Field” selected during data processing. Model Architecture: The core of this pipeline is a GCNN model, structured to learn from graph-structured data:

  • Graph Convolutional Layers: A GINEConv (Graph Isomorphism Network Edge Convolution) layer acts as the first layer, followed by GINConv (Graph Isomorphism Network Convolution) layers. These layers aggregate information from each node’s local neighborhood to learn molecular representations. Add more hidden layers to capture longer-range interactions.

  • Global Pooling Layer and Batching (Implicit): Pools the node representations of each molecule into a single vector that captures the overall structural and chemical information of the entire molecule.

  • MLP (Multi-Layer Perceptron) Layers: The molecule-level representation obtained from the graph layers is passed through a series of fully connected (MLP) layers to learn molecule-wide relations.

  • Output Layer: The final MLP layer outputs the predicted property. Regression: a ReLU activation function and a single output node for continuous property prediction. Classification: a sigmoid activation (for binary) or a softmax activation (for multiclass), with a number of output nodes equal to the number of classes.
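The neighborhood aggregation performed by the graph layers can be illustrated without any framework. The sketch below follows the update rules documented for PyTorch Geometric's GINEConv and GINConv, but omits the learned network that each layer applies after aggregation. The toy graph, the feature values, and the function names are assumptions made for illustration; this is not the floe's implementation.

```python
# Toy illustration of GINE/GIN neighborhood aggregation (no learned weights).
# Node features are lists of floats; edges are directed (source, target) pairs.

def gine_aggregate(x, edges, edge_attr, eps=0.0):
    """(1 + eps) * x_i + sum over neighbors j of ReLU(x_j + e_ji)."""
    out = [[(1.0 + eps) * v for v in feats] for feats in x]
    for (src, dst), e in zip(edges, edge_attr):
        for k in range(len(out[dst])):
            out[dst][k] += max(0.0, x[src][k] + e[k])
    return out


def gin_aggregate(x, edges, eps=0.0):
    """(1 + eps) * x_i + sum over neighbors j of x_j (edge features unused)."""
    out = [[(1.0 + eps) * v for v in feats] for feats in x]
    for src, dst in edges:
        for k in range(len(out[dst])):
            out[dst][k] += x[src][k]
    return out


# Hypothetical 3-atom chain 0-1-2, stored as directed edges in both directions.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
edge_attr = [[0.5, -2.0]] * 4  # the same made-up bond feature on every edge

h1 = gine_aggregate(x, edges, edge_attr)  # first layer uses edge features
h2 = gin_aggregate(h1, edges)             # later layers use node features only
```

Each layer moves information across one bond, so after the second layer atom 0 already depends on atom 2. This is why additional graph layers extend the range of interactions the model can capture.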

Training and Hyperparameter Considerations: Model Complexity and Overfitting: Choose a model complexity (number of layers, hidden units) that is significantly lower than the number of available data points. This helps prevent overfitting, especially when dealing with the limited datasets common in small molecule property design. Regularization techniques (e.g., dropout, weight decay) can also mitigate overfitting.

Loss Function: Regression: Mean Squared Error (L2), Mean Absolute Error (L1), R2, KL-Divergence, etc. Classification: Binary Cross-Entropy (BCE) for binary, Categorical Cross-Entropy for multiclass. Finally, the floe provides a comprehensive floe report that illustrates the quality of the built model on the validation set, along with training graphs and statistics.
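For reference, four of the loss functions named above can be written out in plain Python. This is a minimal sketch of the textbook definitions, not the floe's code; the function names are hypothetical. The cross-entropy version here takes class probabilities, whereas PyTorch's CrossEntropyLoss expects unnormalized logits.

```python
import math


def l2_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def bce_loss(prob, label):
    """Binary cross-entropy; prob is P(class 1) in (0, 1), label is 0 or 1."""
    terms = (y * math.log(p) + (1 - y) * math.log(1 - p)
             for p, y in zip(prob, label))
    return -sum(terms) / len(prob)


def categorical_cross_entropy(probs, labels):
    """probs: one row of class probabilities per sample; labels: class indices."""
    return -sum(math.log(row[y]) for row, y in zip(probs, labels)) / len(labels)
```

L2 penalizes large errors more heavily than L1, so L1 is the less outlier-sensitive choice of the two regression losses shown.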

Promoted Parameters

Title in user interface (promoted name)

Inputs

Collection: Input Graph Featurized for Small Molecules to train GCNN ML (ic): Collection output of the Graph Convolution Featurization floe. Contains the metadata attached to the input molecules.

  • Required

  • Type: collection_source

Options

Epoch (te): Total Number of Epochs for GCNN Model Training

  • Required

  • Type: integer

  • Default: 50

Type of Loss Function (losstype): Loss function to use; choose based on the type of model. Regression: L1, L2, R2, KL-Divergence. Binary classification: BCELoss. Multiclass classification: CrossEntropyLoss.

  • Required

  • Type: string

  • Default: L2

  • Choices: [‘L1’, ‘L2’, ‘R2’, ‘KL-Divergence’, ‘CrossEntropyLoss’, ‘BCELoss’]

Automatically adjust hidden layer size (auto_layer_size): The default graph convolution and MLP layer sizes are reduced until the model parameter-to-data ratio is less than the target ratio parameter below. If you turn this off, make sure to reduce the default sizes of the Graph Convolution Dimension and Fully Connected Layer Dimension parameters, because the defaults create a model of size 700k. The best way to use this option is to build an initial model with it ON, see which sizes the floe chooses for your input, and then improve on those.

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Automatically adjust batches size (auto_batch_size): Default batch size values are adjusted using a lookup table that references the training sample size

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Target ratio of input sample size to model parameter size (auto_ratio): Target size of the model relative to the input sample size. By default, the model’s trainable parameters are reduced until their number is roughly equal to the input sample size times this ratio. If your dataset has fewer than 1k samples, we recommend lowering the ratio to 1. If you have more data (>5K), a ratio between 2 and 5 is recommended.

  • Type: decimal

  • Default: 1.5
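The interaction between auto_layer_size and auto_ratio can be sketched as a loop that shrinks the layer widths until the parameter count fits the budget of samples times ratio. The halving schedule, the minimum width, the dense-layer parameter count, and the function names below are all assumptions for illustration; the floe's actual reduction rule is not documented here.

```python
def count_parameters(in_dim, dims):
    """Weights plus biases for a chain of dense layers in_dim -> dims[0] -> ..."""
    total, prev = 0, in_dim
    for d in dims:
        total += prev * d + d
        prev = d
    return total


def shrink_to_ratio(n_samples, in_dim, dh, dl, ratio=1.5, min_width=4):
    """Halve hidden widths until parameters <= ratio * n_samples.

    The last entry of dl is the output size and is never changed.
    """
    dh, dl = list(dh), list(dl)
    while count_parameters(in_dim, dh + dl) > ratio * n_samples:
        new_dh = [max(min_width, d // 2) for d in dh]
        new_dl = [max(min_width, d // 2) for d in dl[:-1]] + dl[-1:]
        if new_dh == dh and new_dl == dl:
            break  # every layer is already at the minimum width
        dh, dl = new_dh, new_dl
    return dh, dl


# Hypothetical example: 1000 training molecules, 32 input features per atom.
small_dh, small_dl = shrink_to_ratio(1000, 32, [1024, 512, 256], [128, 1])
```

With a large enough dataset the budget exceeds the default model size and the widths are returned unchanged, which matches the recommendation to raise the ratio only when more data is available.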

Advanced Options

Weight Decay for optimizers (weight_decay): Regularization strength used during training. A positive value applies L2 regularization; a negative value applies L1 regularization with the same magnitude.

  • Required

  • Type: decimal

  • Default: 0.001
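The sign convention for weight_decay can be illustrated with a small penalty function. This is a sketch of the rule as described, assuming the penalty is simply added to the loss; the function name is hypothetical, and the exact scaling used by the floe's optimizer may differ.

```python
def regularization_penalty(weights, weight_decay):
    """L2 penalty for a positive weight_decay, L1 penalty of the same
    magnitude for a negative one."""
    if weight_decay >= 0:
        return weight_decay * sum(w * w for w in weights)
    return abs(weight_decay) * sum(abs(w) for w in weights)
```

For example, with weights [1.0, -2.0], a value of 0.5 gives an L2 penalty of 0.5 * (1 + 4), while -0.5 gives an L1 penalty of 0.5 * (1 + 2).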

Include additional skip connection in last layer of GCNN using DeepGCNLayer (includeskipconnection): See https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.models.DeepGCNLayer.html

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Type of pool layer (poolinglayer):

  • Type: string

  • Default: mean

  • Choices: [‘max’, ‘mean’, ‘sum’]
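The three pooling choices differ only in how the atom-level feature rows of each molecule are combined into one molecule-level row. The sketch below assumes the batching convention used by PyTorch Geometric, where a batch vector assigns every node to a graph index; the function name and the plain-list data layout are illustrative assumptions.

```python
def global_pool(x, batch, mode="mean"):
    """Pool node feature rows into one row per graph.

    x: list of node feature rows; batch[i] is the graph index of node i.
    """
    groups = {}
    for feats, g in zip(x, batch):
        groups.setdefault(g, []).append(feats)
    pooled = []
    for g in sorted(groups):
        columns = list(zip(*groups[g]))  # one tuple per feature column
        if mode == "sum":
            pooled.append([sum(c) for c in columns])
        elif mode == "mean":
            pooled.append([sum(c) / len(c) for c in columns])
        elif mode == "max":
            pooled.append([max(c) for c in columns])
        else:
            raise ValueError("mode must be 'max', 'mean', or 'sum'")
    return pooled


# Two hypothetical molecules in one batch: nodes 0-1 and node 2.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
batch = [0, 0, 1]
```

Sum pooling grows with molecule size, mean pooling normalizes it away, and max pooling keeps only the strongest activation per feature, so the best choice depends on whether the target property scales with the number of atoms.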

dropout (dropout): Dropout for GCNN ML Model

  • Required

  • Type: decimal

  • Default: 0.2

Learning Rate (learning_rate): Learning Rate for GCNN ML Model

  • Required

  • Type: decimal

  • Default: 0.001

Graph Convolution Dimension (dh): Hidden layer dimensions for the graph convolution. The input features are convolved to the first hidden layer using GINEConv; each subsequent layer uses GINConv. Add more layers if the model needs to capture longer-range interactions. Choose the sizes so that the total model size is roughly half of the input dataset size. The default setting has a total of 70k nodes in the model. It is best to change this parameter when automatic adjustment is OFF.

  • Type: integer

  • Default: [1024, 512, 256]

Fully Connected Layer Dimension (dl): Fully connected layer dimensions for the MLP after the GCNN. The output of the last GCNN layer is fed to the first layer of the MLP, so the two must match. The fully connected layers narrow down to 1 output for regression and binary classification, and to 2 or more for multiclass classification.

  • Type: integer

  • Default: [128, 4]
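How the dh and dl lists chain together can be made explicit by pairing consecutive sizes. This is a minimal sketch under the assumption that each entry is the output width of one layer; the function name and the input feature size of 32 are hypothetical.

```python
def layer_shapes(in_dim, dh, dl):
    """Return (input, output) sizes for each graph layer and each MLP layer."""
    dims = [in_dim] + list(dh) + list(dl)
    pairs = list(zip(dims[:-1], dims[1:]))
    return pairs[:len(dh)], pairs[len(dh):]


# Defaults from this page with a made-up input feature size of 32.
graph_layers, mlp_layers = layer_shapes(32, [1024, 512, 256], [128, 4])
```

With the defaults, the first MLP layer takes the 256-wide output of the last graph layer as its input, and the final entry of dl is the number of outputs.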

Batch Size (batch_size): Batch size for GCNN

  • Required

  • Type: integer

  • Default: 64