ML Build: Graph Convolution Model on Pregenerated Features for Small Molecules
Category Paths
Follow one of these paths in the Orion user interface to find the floe.
Description
ML Build: Graph Convolution Model on Pregenerated Features for Small Molecules. The graph architecture uses PyTorch Geometric running on top of PyTorch DDP (Distributed Data Parallel) to train models, so training can be scaled to much larger datasets by increasing the GPU count. The floe takes as input the graph features generated by the Data Processing of Small Molecule for ML Model Building floe. This pipeline is designed to predict various small molecule properties and handles both regression and classification tasks, depending on the “Response Value Field” selected during data processing.
Model Architecture: The core of this pipeline is a GCNN model, structured to learn from graph-structured data:
Graph Convolutional Layers: A GINEConv (Graph Isomorphism Network Edge Convolution) layer acts as the first layer, followed by GINConv (Graph Isomorphism Network Convolution) layers. These layers aggregate information from each node’s local neighborhood to learn molecular representations. Add more hidden layers to capture longer-range interactions.
Global Pooling Layer and Batching (Implicit): Captures the overall structural and chemical information of the entire molecule.
MLP (Multi-Layer Perceptron) Layers: The molecule-level representation obtained from the graph layers is passed through a series of fully connected (MLP) layers to learn molecule-wide relations.
Output Layer: The final MLP layer outputs the predicted property.
Regression: A ReLU activation function and a single output node for continuous property prediction.
Classification: A sigmoid activation (for binary) or a softmax activation (for multi-class), with the number of output nodes equal to the number of classes.
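The neighborhood aggregation described above can be sketched in plain Python. This is only an illustration of the published GINE and GIN update rules, not the floe's implementation: the learned MLP that normally follows each aggregation is left out, and the function names, the three-atom toy graph, and the feature values are all made up for the example.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def gine_aggregate(x, edges, edge_attr, eps=0.0):
    """GINE-style step: (1 + eps) * x_i + sum over neighbors j of ReLU(x_j + e_ji).
    The learned MLP applied afterwards in a real layer is omitted (assumption)."""
    out = [[(1.0 + eps) * a for a in xi] for xi in x]
    for (src, dst), e in zip(edges, edge_attr):
        out[dst] = add(out[dst], relu(add(x[src], e)))
    return out

def gin_aggregate(x, edges, eps=0.0):
    """GIN-style step: (1 + eps) * x_i + sum over neighbors j of x_j (no edge features)."""
    out = [[(1.0 + eps) * a for a in xi] for xi in x]
    for src, dst in edges:
        out[dst] = add(out[dst], x[src])
    return out

# Hypothetical 3-atom chain 0-1-2; each bond is stored as two directed edges.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
edge_attr = [[0.5], [0.5], [0.5], [0.5]]   # made-up bond features
x = [[1.0], [2.0], [3.0]]                  # made-up atom features

h1 = gine_aggregate(x, edges, edge_attr)   # first layer uses edge features
h2 = gin_aggregate(h1, edges)              # later layers use node features only
print(h1)
print(h2)
```

After the second step, atom 0 carries information that originated at atom 2, two bonds away, which is why stacking more layers extends the range of interactions the model can see.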
Training and Hyperparameter Considerations:
Model Complexity and Overfitting: Choose a model complexity (number of layers, hidden units) that is significantly smaller than the number of available data points. This helps prevent overfitting, especially with the limited datasets common in small molecule property prediction. Regularization techniques (e.g., dropout, weight decay) can also mitigate overfitting.
Loss Function:
Regression: Mean Squared Error (L2), Mean Absolute Error (L1), R2, KL-Divergence, etc.
Classification: Binary Cross-Entropy (BCE) for binary, Categorical Cross-Entropy for multi-class.
Finally, the floe provides a comprehensive floe report showing the quality of the built model on the validation set, along with training graphs and statistics.
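For reference, the most common losses listed above can be written out in a few lines of plain Python. This is a minimal sketch of the textbook definitions; the function names are hypothetical, and the floe's own implementation (reduction, clamping, class weighting) may differ.

```python
import math

def l2_loss(pred, target):
    """Mean Squared Error (L2)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def l1_loss(pred, target):
    """Mean Absolute Error (L1)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def bce_loss(prob, label, tiny=1e-12):
    """Binary cross-entropy on predicted probabilities (sigmoid outputs)."""
    total = 0.0
    for p, y in zip(prob, label):
        total -= y * math.log(max(p, tiny)) + (1 - y) * math.log(max(1 - p, tiny))
    return total / len(prob)

def cross_entropy_loss(logits, class_index):
    """Categorical cross-entropy for one sample, computed from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[class_index]

mse = l2_loss([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
mae = l1_loss([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
bce = bce_loss([0.5], [1])
ce = cross_entropy_loss([0.0, 0.0], 0)
```

L2 penalizes large errors more heavily than L1, so L1 is the usual alternative when the response values contain outliers.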
Promoted Parameters
Title in user interface (promoted name)
Inputs
Collection: Input Graph Featurized for Small Molecules to train GCNN ML (ic): Collection output of the Graph Convolution Featurization floe. Contains the metadata attached to the input molecules.
Required
Type: collection_source
Options
Epoch (te): Total Number of Epochs for GCNN Model Training
Required
Type: integer
Default: 50
Type of Loss Function (losstype): Type of loss function. Choose based on the type of model. Regression: L1, L2, R2, KL-Divergence. Binary classification: BCELoss. Multiclass classification: CrossEntropyLoss.
Required
Type: string
Default: L2
Choices: [‘L1’, ‘L2’, ‘R2’, ‘KL-Divergence’, ‘CrossEntropyLoss’, ‘BCELoss’]
Automatically adjust hidden layer size (auto_layer_size): The default graph convolution and MLP layer sizes are reduced until the model’s parameter-to-data ratio is less than the target ratio parameter below. If you turn this off, make sure to reduce the default Graph Convolution Dimension and Fully Connected Layer Dimension, because the defaults create a model of size 700k. The best way to use this option is to build an initial model with it ON, see which sizes the floe chooses for your input, and then improve on those.
Type: boolean
Default: True
Choices: [True, False]
Automatically adjust batches size (auto_batch_size): The default batch size is adjusted using a lookup table keyed on the training sample size.
Type: boolean
Default: True
Choices: [True, False]
Target ratio of input sample size to model parameter size (auto_ratio): Sets the model’s parameter count relative to the input sample size. By default, the model’s trainable parameters are reduced until they are roughly equal to input*ratio. If your dataset has fewer than 1k samples, we recommend lowering the ratio to 1. If you have more data (>5K), we recommend a ratio between 2 and 5.
Type: decimal
Default: 1.5
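To make the input*ratio budget concrete, the sketch below counts the weights and biases of a chain of fully connected layers and halves the layer widths until the count fits the budget. This is an assumption-laden illustration only: the floe's actual reduction rule and parameter accounting are not documented here, the input feature dimension of 32 is made up, the graph layers are treated as plain dense layers, and both function names are hypothetical.

```python
def count_dense_params(dims):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(i * o + o for i, o in zip(dims, dims[1:]))

def shrink_to_budget(in_dim, hidden, n_samples, ratio, min_width=4):
    """Halve every hidden width until parameters <= n_samples * ratio.
    Hypothetical rule; the floe may shrink layers differently."""
    budget = n_samples * ratio
    hidden = list(hidden)
    while count_dense_params([in_dim] + hidden + [1]) > budget:
        if all(h <= min_width for h in hidden):
            break  # cannot shrink any further
        hidden = [max(min_width, h // 2) for h in hidden]
    return hidden

# Default dh and dl widths from this page, with an assumed input dimension of 32.
default_hidden = [1024, 512, 256, 128, 4]
full_size = count_dense_params([32] + default_hidden + [1])

# 10,000 training samples at the default ratio of 1.5 gives a budget of 15,000.
reduced = shrink_to_budget(32, default_hidden, n_samples=10_000, ratio=1.5)
reduced_size = count_dense_params([32] + reduced + [1])
print(full_size, reduced, reduced_size)
```

Under these assumptions the unreduced defaults come to several hundred thousand parameters, which is why leaving them unchanged on a dataset of a few thousand molecules invites overfitting.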
Advanced Options
Weight Decay for optimizers (weight_decay): Regularization strength used during training. A positive wgt_dcy applies L2 regularization; a negative wgt_dcy applies L1 regularization with the same magnitude.
Required
Type: decimal
Default: 0.001
Include additional skip connection in last layer of GCNN using DeepGCNLayer (includeskipconnection): https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.models.DeepGCNLayer.html
Type: boolean
Default: False
Choices: [True, False]
Type of pool layer (poolinglayer):
Type: string
Default: mean
Choices: [‘max’, ‘mean’, ‘sum’]
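The three pooling choices differ only in how per-atom feature vectors are collapsed into a single molecule-level vector. A minimal sketch, using a hypothetical helper and made-up features for a three-atom molecule:

```python
def global_pool(node_features, mode="mean"):
    """Collapse per-node feature vectors into one graph-level vector."""
    columns = list(zip(*node_features))  # one tuple per feature channel
    if mode == "sum":
        return [sum(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    if mode == "mean":
        return [sum(c) / len(c) for c in columns]
    raise ValueError(f"unknown pooling mode: {mode}")

nodes = [[1.0, 4.0], [2.0, 0.0], [3.0, 2.0]]  # 3 atoms, 2 features each

pooled_sum = global_pool(nodes, "sum")
pooled_max = global_pool(nodes, "max")
pooled_mean = global_pool(nodes, "mean")
```

Sum pooling grows with the number of atoms, so it suits properties that scale with molecule size; mean pooling is size-independent; max pooling keeps only the strongest signal in each feature channel.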
dropout (dropout): Dropout for GCNN ML Model
Required
Type: decimal
Default: 0.2
Learning Rate (learning_rate): Learning Rate for GCNN ML Model
Required
Type: decimal
Default: 0.001
Graph Convolution Dimension (dh): Graph convolution hidden layer dimensions. The input feature dimension is convolved to the first hidden layer using GINEConv, and each subsequent layer is produced by a GINConv layer. Add more layers if the model needs to capture longer-range interactions. Choose the dimensions so that the total size is roughly half of the input dataset size. The default setting has a total of 70k nodes in the model. This is the best parameter to change if automatic adjustment is OFF.
Type: integer
Default: [1024, 512, 256]
Fully Connected Layer Dimension (dl): Fully connected layer dimensions for the MLP after the GCNN. The output of the GCNN layers is fed to the first layer of the MLP, so the last layer of the GCNN matches the first layer of the MLP. The fully connected layers narrow down to 1 output for regression and binary classification, and to 2 or more for multiclass classification.
Type: integer
Default: [128, 4]
Batch Size (batch_size): Batch size for GCNN
Required
Type: integer
Default: 64