How-to Guide: Molecular Clustering with Three Specialized Floes
This guide explains the philosophy behind the molecular clustering floes (the ethos) and provides a framework for selecting the appropriate floe based on your drug discovery stage, dataset size, and goal (when to use each).
The Clustering Ethos: The Tiered Approach
The clustering strategy follows a “tiered clustering” philosophy, specifically designed to tame unfeasibly large hit lists and reduce output from tools like Gigadock and FastROCS to a manageable, “sensible” size within Orion.
The selection of the clustering method is based on molecule count. The core strategy is a tiered approach to data reduction:
- Tier I Clustering
  Clustering: 2D Scaffold (~25M compounds) Floe: Designed to reduce 10^6–10^7 hits down to 10^2 cluster heads.
- Tier II Clustering
  Clustering: 2D Fingerprint (≤100K compounds) Floe: For input of up to 100K molecules.
- Tier III Clustering
  Clustering: 3D Shape/Color (~10K compounds) Floe: For input of at most 10K molecules.
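The size-based selection can be sketched as a small helper. This is only an illustration: the function name `suggest_floe` is hypothetical, and the rule of picking the most detailed tier that the molecule count allows is an assumption. In practice the drug discovery stage and goal also matter.

```python
# Size limits come from the tier descriptions above; everything else
# (function name, "most detailed tier that fits" rule) is an assumption.
TIER_III_MAX = 10_000    # 3D Shape/Color: about 10K molecules
TIER_II_MAX = 100_000    # 2D Fingerprint: up to 100K molecules


def suggest_floe(molecule_count: int) -> str:
    """Return the floe tier suggested for a given number of molecules."""
    if molecule_count <= 0:
        raise ValueError("molecule_count must be positive")
    if molecule_count <= TIER_III_MAX:
        return "Clustering: 3D Shape/Color"
    if molecule_count <= TIER_II_MAX:
        return "Clustering: 2D Fingerprint"
    return "Clustering: 2D Scaffold"


for count in (5_000, 80_000, 5_000_000):
    print(count, "->", suggest_floe(count))
```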
For the Tier I scale, we use the Clustering: 2D Scaffold (~25M compounds) Floe to:
- Preprocess molecules (remove non-drug-like molecules and optionally replace heavy atoms).
- Use Bemis-Murcko-based scaffolds to cluster molecules together.
- Reserve subsamples from each cluster (molecule core) to provide reliable and diverse subsets.
- Create a statistical floe report to provide more insights into data.
- Shuffle and arrange molecules with common scaffolds to the same shards.
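The grouping and subsampling steps above can be illustrated with a minimal sketch. It assumes the scaffold of each molecule has already been computed (for example as a SMILES string by a cheminformatics toolkit). The function names, the sample records, and the `per_cluster` parameter are hypothetical and are not the floe's own interface.

```python
import random
from collections import defaultdict


def group_by_scaffold(records):
    """Group molecule IDs by their precomputed scaffold string."""
    clusters = defaultdict(list)
    for mol_id, scaffold in records:
        clusters[scaffold].append(mol_id)
    return dict(clusters)


def reserve_subsamples(clusters, per_cluster=2, seed=0):
    """Randomly keep up to `per_cluster` members of every scaffold cluster."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return {
        scaffold: rng.sample(members, min(per_cluster, len(members)))
        for scaffold, members in clusters.items()
    }


# Hypothetical (molecule ID, scaffold SMILES) pairs.
records = [
    ("mol-1", "c1ccccc1"),
    ("mol-2", "c1ccccc1"),
    ("mol-3", "c1ccccc1"),
    ("mol-4", "c1ccncc1"),
    ("mol-5", "C1CCNCC1"),
]
clusters = group_by_scaffold(records)
subsets = reserve_subsamples(clusters, per_cluster=2)
print(len(clusters), "scaffold clusters")
```

Sampling from every scaffold, rather than from the pooled hit list, keeps rare cores represented in the reduced set.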
For the Tier II and III scales, we use the Clustering: 2D Fingerprint (≤100K compounds) Floe and the Clustering: 3D Shape/Color (~10K compounds) Floe, respectively, to:
- By default, perform K-medoids, DBSCAN, and hierarchical clustering (with the option to remove one or more techniques).
- Make calculated distance matrices reusable for future clustering iterations.
- Subsample either randomly or based on a given hit list.
- Create statistical floe reports to provide more insights into data.
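To show how a clustering method can work directly from a pairwise distance matrix, here is a minimal K-medoids sketch. It uses the simple alternating (assign, then update) variant with caller-supplied starting medoids. Whether the floe uses this variant is not stated in this guide, so treat it as an illustration rather than the floe's implementation. The toy 1D points are made up.

```python
def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternating K-medoids on a precomputed square distance matrix."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                new_medoids.append(old)  # keep the medoid of an empty cluster
                continue
            # The new medoid minimizes the total distance to its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break  # converged
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy data: two well-separated groups on a line.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = k_medoids(dist, initial_medoids=[0, 1])
print(medoids, labels)
```

Because only `dist` is consulted, the same routine works for fingerprint-based and shape-based distances alike.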
While the fundamental clustering aspects of both Tier II and III are the same, they differ in pairwise molecule distance computation.
- Tier II uses 2D fingerprints and lets you change the fingerprint type (tree, circular, path, MACCS, or Lingo).
- Tier III uses 3D ROCS-based distance calculations and lets you calculate distance based on shape, color, or a combination of both. By default, the distance is calculated from a single conformer. However, if multiple conformers are provided, the minimum distance over all conformer pairs is used. Users can also choose between the overlap and overlay options when calculating molecule similarity.
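The two kinds of pairwise distance can be sketched as follows. Both functions are illustrations under stated assumptions. The 2D example assumes a Tanimoto-based distance on sets of on-bits, which this guide does not confirm as the floe's metric. The 3D example takes an arbitrary `conformer_distance` callable as a stand-in for the ROCS-based calculation and only demonstrates the minimum-over-conformers rule.

```python
def tanimoto_distance(bits_a, bits_b):
    """1 - Tanimoto similarity for fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # treat two empty fingerprints as identical
    return 1.0 - len(a & b) / union


def min_conformer_distance(confs_a, confs_b, conformer_distance):
    """Smallest distance over every pair of conformers of two molecules."""
    return min(
        conformer_distance(ca, cb) for ca in confs_a for cb in confs_b
    )


# 2D: two hypothetical fingerprints sharing 2 of 4 distinct bits.
d_2d = tanimoto_distance({1, 2, 3}, {2, 3, 4})

# 3D: conformers are stand-in numbers and the distance is a toy function.
d_3d = min_conformer_distance([0, 4], [5, 9], lambda x, y: abs(x - y))
print(d_2d, d_3d)
```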
Both floes output a collection that contains the computed distance matrix. Users can utilize this to recalibrate clustering parameters to get better results.
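Reusing a distance matrix means the expensive pairwise step runs once while the clustering parameters are varied. The sketch below illustrates this with a small, self-contained DBSCAN that reads a precomputed matrix. It is not the floe's implementation, and the `eps` and `min_samples` names follow common DBSCAN usage rather than the floe's parameter names.

```python
def dbscan(dist, eps, min_samples):
    """DBSCAN on a precomputed distance matrix; label -1 marks noise."""
    n = len(dist)
    # Each point counts as its own neighbor because dist[i][i] is 0.
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep growing
    return labels


# Compute the matrix once (toy 1D points), then recluster with new settings.
points = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(a - b) for b in points] for a in points]
tight = dbscan(dist, eps=1.5, min_samples=2)
loose = dbscan(dist, eps=10, min_samples=2)
print(tight, loose)
```

With the tighter radius the two groups stay separate, while the looser radius merges them. The isolated point is noise in both runs.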
This table shows additional details for the use cases of the three floes.
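| Floe Name | Clustering Type | Max Size | Input Type | Target |
|---|---|---|---|---|
| Clustering: 2D Scaffold (~25M compounds) | Scaffold | 10^7 | Collection or File | HTS/VS |
| Clustering: 2D Fingerprint (≤100K compounds) | Fingerprint | 10^5 | Data Record | Hit-to-Lead |
| Clustering: 3D Shape/Color (~10K compounds) | 3D Fixed Pose | 10^4 | Data Record | Hit-to-Lead |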
Cost Estimates
The following tables show estimates for cost and time based on input size.
Clustering: 2D Scaffold (~25M)
| Input Size | Cost ($) | Time (hr) |
|---|---|---|
| 1M | 4 | 1.5 |
| 5M | 13 | 3.5 |
| 10M | 26 | 6 |
| 23M | 65 | 22 |
Clustering: 2D Fingerprint (≤100K)
| Input Size | Cost ($) | Time |
|---|---|---|
| 1K | 2 | 3 min |
| 2.5K | 2 | 12 min |
| 10K | 3 | 15 min |
| 20K | 3 | 50 min |
| 40K | 3 | 50 min |
| 50K | 4 | 1.5 hr |
| 60K | 5 | 2 hr |
| 70K | 15 | 7 hr |
| 80K | 20 | 8 hr |
Clustering: 3D Shape/Color (~10K)
| Input Size | Cost ($) | Time |
|---|---|---|
| 1K | 2 | 36 min |
| 2K | 5 | 44 min |
| 5K | 6 | 3 hr |
| 10K | 150 | 7 hr |
Scaling Up
The following tips can make the clustering step run faster, especially for larger datasets.
- K-medoid: Keep the Number of K-Medoids Clusters parameter low. For a dataset of 10K molecules with 50 K-medoid cluster heads, the floe takes about 8 hours to run and costs about $10. In the Advanced: K-medoid parameter group, setting the Medoid Initialization Method parameter to k-medoids++ speeds up the floe.
- Hierarchical: To increase the speed, remove the Distance Threshold parameter. In the Advanced: Hierarchical parameter group, set Compute Full Tree to False and set an appropriate value for the Number of Hierarchical Clusters (as with K-medoid, 3–5 is a good default, but the best value varies with the data).
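To see why a fixed cluster count can be cheaper than a full tree, consider this minimal agglomerative sketch. It assumes average linkage and stops merging as soon as the requested number of clusters remains, so the later merges of the full tree are never computed. The linkage choice and the function name are illustrative assumptions, and the floe's internals may differ.

```python
def agglomerative(dist, n_clusters):
    """Average-linkage merging that stops once n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    merges = 0

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge the closest pair
        del clusters[y]
        merges += 1
    return [sorted(c) for c in clusters], merges


# Toy data: two well-separated groups on a line.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
result, merges = agglomerative(dist, n_clusters=2)
print(result, merges)
```

A full tree on six points needs five merges. Stopping at two clusters needs four, and the saving grows with dataset size when the requested cluster count is larger.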
To make the distance matrix computation faster:
- For the Clustering: 2D Fingerprint Floe, choose the MACCS fingerprint type.
- For the Clustering: 3D Shape/Color Floe, enabling the Calculate Minimum Similarity for Multi-conformer parameter requires the full conformer distance matrix, which can get computationally expensive. We recommend turning Auto Tile Size off and setting the tile size cube parameter to 50 in the Make 3D Similarity Tiles Cube.