How-to Guide: Molecular Clustering with Three Specialized Floes

This guide explains the philosophy behind the molecular clustering floes and provides a framework for selecting the appropriate floe based on your drug discovery stage, dataset size, and goal.

The Clustering Ethos: The Tiered Approach

The clustering strategy follows a “tiered clustering” philosophy, specifically designed to tame unfeasibly large hit lists and reduce output from tools like Gigadock and FastROCS to a manageable, “sensible” size within Orion.

The clustering method is selected by molecule count, following a tiered approach to data reduction:

Tier I Clustering (Clustering: 2D Scaffold Floe): Reduces 10⁶–10⁷ hits to 10² cluster heads.
Tier II Clustering (Clustering: 2D Fingerprint, up to 100K compounds Floe): For input of up to 100K molecules.
Tier III Clustering (Clustering: 3D Shape/Color, ~10K compounds Floe): For input of up to 10K molecules.
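
The size limits above can be sketched as a small selection helper. This is only an illustration: the function name choose_floe is hypothetical, and it assumes you want the most detailed tier whose size limit the input fits, which may not match every use case (a 5K set can also be run through the 2D fingerprint floe).

```python
def choose_floe(n_molecules: int) -> str:
    """Return the floe name for the most detailed tier that fits the input size.

    Thresholds follow the tier limits in this guide: up to 10K molecules for
    3D shape/color, up to 100K for 2D fingerprints, and scaffold clustering
    for anything larger.
    """
    if n_molecules <= 0:
        raise ValueError("n_molecules must be positive")
    if n_molecules <= 10_000:
        return "Clustering: 3D Shape/Color, ~10K compounds"
    if n_molecules <= 100_000:
        return "Clustering: 2D Fingerprint, up to 100K compounds"
    return "Clustering: 2D Scaffold"


for size in (5_000, 80_000, 5_000_000):
    print(size, "->", choose_floe(size))
```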

For the Tier I scale, we use the Clustering: 2D Scaffold (~25M compounds) Floe to:

Preprocess molecules (remove non-drug-like molecules and optionally replace heavy atoms).
Use Bemis-Murcko-based scaffolds to cluster molecules together.
Reserve subsamples from each cluster (molecule core) to provide reliable and diverse subsets.
Create a statistical floe report to provide more insight into the data.
Shuffle and arrange molecules so that those with a common scaffold land in the same shards.
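
The grouping and subsampling steps above can be illustrated with a minimal sketch. It does not compute Bemis-Murcko scaffolds: the scaffold SMILES strings are assumed to be precomputed by a cheminformatics toolkit, and the function names, molecule IDs, and the per_cluster setting are all hypothetical. It is not the floe's implementation.

```python
import random
from collections import defaultdict


def cluster_by_scaffold(records):
    """Group (molecule_id, scaffold) pairs so that each scaffold is one cluster."""
    clusters = defaultdict(list)
    for mol_id, scaffold in records:
        clusters[scaffold].append(mol_id)
    return dict(clusters)


def reserve_subsamples(clusters, per_cluster, seed=0):
    """Randomly keep at most per_cluster members from every scaffold cluster."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = {}
    for scaffold, members in clusters.items():
        k = min(per_cluster, len(members))
        subset[scaffold] = sorted(rng.sample(members, k))
    return subset


# Hypothetical input: IDs paired with precomputed scaffold SMILES.
records = [
    ("mol-1", "c1ccccc1"),
    ("mol-2", "c1ccccc1"),
    ("mol-3", "c1ccccc1"),
    ("mol-4", "c1ccncc1"),
    ("mol-5", "C1CCNCC1"),
]
clusters = cluster_by_scaffold(records)
subset = reserve_subsamples(clusters, per_cluster=2)
print(subset)
```

Sampling per scaffold rather than over the whole list keeps every core represented in the subset, even when a few scaffolds dominate the hit list.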

For the Tier II and III scales, we use the Clustering: 2D Fingerprint (≤100K compounds) Floe and the Clustering: 3D Shape/Color (~10K compounds) Floe, respectively, to:

By default, perform K-medoids, DBSCAN, and hierarchical clustering (with the option to remove one or more techniques).
Make calculated distance matrices reusable for future clustering iterations.
Subsample either randomly or based on a given hit list.
Create statistical floe reports to provide more insight into the data.
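
To show why a reusable distance matrix is convenient, here is a minimal sketch of K-medoids running on a precomputed matrix. It uses a simple alternating assign/update scheme with a naive initialization (the first k points), not k-medoids++ and not the floe's own implementation. The function names and the toy one-dimensional data are hypothetical.

```python
def assign(dist, medoids):
    """Label every point with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, k, max_iter=100):
    """Alternating K-medoids on a precomputed, symmetric distance matrix."""
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    medoids = list(range(k))  # naive initialization: first k points
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # empty cluster: keep the previous medoid
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimizes the total distance to its cluster members.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy data: six points on a line, with the distance matrix computed once.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]

# The same matrix is reused for two different cluster counts.
medoids2, labels2 = k_medoids(dist, k=2)
medoids3, labels3 = k_medoids(dist, k=3)
print(medoids2, labels2)
print(medoids3, labels3)
```

Because the clustering only reads the matrix, changing k or swapping in another method does not require recomputing any pairwise distance.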

Tier II and Tier III share the same clustering methods; they differ in how the pairwise distances between molecules are computed.

Tier II uses 2D fingerprints and offers a choice to change the fingerprint type (tree, circular, path, MACCS, or Lingo).
Tier III uses 3D ROCS-based distance calculations and offers a choice to calculate distance based on shape, color, or a combination of both. By default, it calculates the distance using a single conformer. However, if multiple conformers are provided, the distance is the minimum distance over all conformer pairs. Users can also choose between the overlap and overlay options for calculating molecular similarity.
Both floes output a collection that contains the computed distance matrix. Users can reuse this matrix to retune clustering parameters and improve results without recomputing distances.
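
The two distance ideas above can be sketched in a few lines. The first function assumes a Tanimoto-based distance over fingerprints stored as sets of on-bit indices; Tanimoto is a common choice for 2D fingerprints, but this guide does not state which measure the floe uses. The second function shows the multi-conformer rule of taking the minimum over all conformer pairs. All names and numbers are hypothetical.

```python
def tanimoto_distance(fp_a, fp_b):
    """Return 1 - Tanimoto similarity for two sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as identical
    return 1.0 - len(fp_a & fp_b) / union


def min_conformer_distance(pair_distances):
    """Collapse a conformer-by-conformer distance table to a single value.

    pair_distances[i][j] is the distance between conformer i of molecule A
    and conformer j of molecule B; the molecule-level distance is the minimum.
    """
    return min(min(row) for row in pair_distances)


# Tier II style: two toy fingerprints sharing 2 of 6 distinct bits.
fp_a = frozenset({1, 2, 3, 4})
fp_b = frozenset({3, 4, 5, 6})
d_2d = tanimoto_distance(fp_a, fp_b)

# Tier III style: two conformers per molecule, four conformer pairs.
d_3d = min_conformer_distance([[0.8, 0.5], [0.3, 0.9]])
print(round(d_2d, 4), d_3d)
```

Taking the minimum means two molecules count as similar if any pair of their conformers matches well, at the price of scoring every conformer pair.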

This table shows additional details for the use cases of the three floes.

Floe Name	Clustering Type	Max Size (molecules)	Input Type	Target Stage
Clustering: 2D Scaffold	Scaffold	10⁷	Collection or File	HTS/VS
Clustering: 2D Fingerprint, up to 100K compounds	Fingerprint	10⁵	Data Record	Hit-to-Lead
Clustering: 3D Shape/Color, ~10K compounds	3D Fixed Pose	10⁴	Data Record	Hit-to-Lead

Cost Estimates

The following tables show estimates for cost and time based on input size.

Clustering: 2D Scaffold (~25M)

Input Size	Cost ($)	Time (hr)
1M	4	1.5
5M	13	3.5
10M	26	6
23M	65	22

Clustering: 2D Fingerprint (≤100K)

Input Size	Cost ($)	Time
1K	2	3 min
2.5K	2	12 min
10K	3	15 min
20K	3	50 min
40K	3	50 min
50K	4	1.5 hr
60K	5	2 hr
70K	15	7 hr
80K	20	8 hr

Clustering: 3D Shape/Color (~10K)

Input Size	Cost ($)	Time
1K	2	36 min
2K	5	44 min
5K	6	3 hr
10K	150	7 hr
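
A full pairwise distance matrix over n molecules has n(n − 1)/2 unique entries, which may help explain why cost and time in the fingerprint and 3D tables above rise faster than input size. The short sketch below only counts pairs. It is not a cost model, and the floes may tile or parallelize the work in ways not shown here.

```python
def pair_count(n: int) -> int:
    """Number of unique unordered molecule pairs in an n-molecule set."""
    return n * (n - 1) // 2


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} molecules -> {pair_count(n):>13,} pairwise distances")
```

Going from 10K to 100K molecules multiplies the input by 10 but the number of pairs by roughly 100.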

Scaling Up

The following tips can make the clustering step run faster, especially for larger datasets.

K-medoids: Keep the Number of K-Medoids Clusters parameter low. For example, a 10K-molecule dataset with 50 K-medoids cluster heads takes about 8 hours to run and costs about $10. In the Advanced: K-medoid parameter group, setting the Medoid Initialization Method parameter to k-medoids++ speeds up the floe.
Hierarchical: To increase speed, remove the Distance Threshold parameter. In the Advanced: Hierarchical parameter group, set Compute Full Tree to False and set an appropriate value for Number of Hierarchical Clusters. As with K-medoids, 3–5 is a good default, but the best value may vary with your data.

To make the distance matrix computation faster:

Choose the MACCS fingerprint type.
For the Clustering: 3D Shape/Color Floe, computing the full conformer distance matrix when the Calculate Minimum Similarity for Multi-conformer parameter is enabled can be computationally expensive. We recommend setting Auto Tile Size to Off and setting the tile size parameter of the Make 3D Similarity Tiles Cube to 50.