How-to Guide: Molecular Clustering with Three Specialized Floes

This guide explains the philosophy behind the molecular clustering floes (the ethos) and provides a framework for selecting the appropriate floe based on your drug discovery stage, dataset size, and goal (when to use each).

The Clustering Ethos: The Tiered Approach

The clustering strategy follows a “tiered clustering” philosophy, specifically designed to tame unfeasibly large hit lists and reduce output from tools like Gigadock and FastROCS to a manageable, “sensible” size within Orion.

The clustering method is selected by molecule count, and each tier handles a progressively smaller input:

Tier I Clustering

Clustering: 2D Scaffold Floe: The process is designed to reduce 10^6–10^7 hits down to 10^2 cluster heads.

Tier II Clustering

Clustering: 2D Fingerprint, up to 100K compounds Floe: For input up to 100K molecules.

Tier III Clustering

Clustering: 3D Shape/Color, ~10K compounds Floe: For input less than or equal to 10K molecules.
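
The tier thresholds above can be expressed as a small helper. This is only a sketch: `suggest_floe` and its `prefer_3d` flag are hypothetical names, not part of Orion, and the cutoffs are simply the molecule counts quoted in this guide.

```python
def suggest_floe(n_molecules: int, prefer_3d: bool = True) -> str:
    """Suggest a clustering floe from the molecule count.

    Thresholds follow the tiers in this guide; `prefer_3d` is an
    illustrative switch for small inputs where either Tier II or
    Tier III would fit.
    """
    if n_molecules <= 0:
        raise ValueError("n_molecules must be positive")
    if n_molecules <= 10_000 and prefer_3d:
        return "Clustering: 3D Shape/Color"   # Tier III
    if n_molecules <= 100_000:
        return "Clustering: 2D Fingerprint"   # Tier II
    return "Clustering: 2D Scaffold"          # Tier I


if __name__ == "__main__":
    for count in (8_000, 60_000, 5_000_000):
        print(count, "->", suggest_floe(count))
```

In practice the choice between Tier II and Tier III for a small set also depends on whether 3D poses are available, which is why the flag is exposed.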

For the Tier I scale, we use the Clustering: 2D Scaffold (~25M compounds) Floe to:

  1. Preprocess molecules (remove non-drug-like molecules and optionally replace heavy atoms).

  2. Use Bemis-Murcko-based scaffolds to cluster molecules together.

  3. Reserve subsamples from each cluster (molecule core) to provide reliable and diverse subsets.

  4. Create a statistical floe report to provide more insight into the data.

  5. Shuffle and arrange molecules with common scaffolds into the same shards.
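
Steps 2, 3, and 5 can be illustrated with a minimal sketch. It assumes the scaffold string for each molecule has already been computed by a cheminformatics toolkit; the function names, the CRC-based shard assignment, and the sample records are all hypothetical and are not the floe's actual implementation.

```python
import random
import zlib
from collections import defaultdict


def cluster_by_scaffold(records):
    """Group molecule IDs by their precomputed scaffold string."""
    clusters = defaultdict(list)
    for mol_id, scaffold in records:
        clusters[scaffold].append(mol_id)
    return dict(clusters)


def subsample(clusters, per_cluster, seed=0):
    """Reserve up to `per_cluster` randomly chosen members from every cluster."""
    rng = random.Random(seed)
    return {
        scaffold: rng.sample(ids, min(per_cluster, len(ids)))
        for scaffold, ids in clusters.items()
    }


def shard_for(scaffold, n_shards):
    """Stable shard index, so molecules sharing a scaffold share a shard."""
    return zlib.crc32(scaffold.encode("utf-8")) % n_shards


if __name__ == "__main__":
    # (molecule ID, scaffold SMILES) pairs; illustrative data only.
    records = [
        ("m1", "c1ccccc1"),
        ("m2", "c1ccncc1"),
        ("m3", "c1ccccc1"),
        ("m4", "c1ccccc1"),
    ]
    clusters = cluster_by_scaffold(records)
    print(clusters)
    print(subsample(clusters, per_cluster=2))
    print({s: shard_for(s, n_shards=8) for s in clusters})
```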

For the Tier II and III scales, we use the Clustering: 2D Fingerprint (≤100K compounds) Floe and the Clustering: 3D Shape/Color (~10K compounds) Floe, respectively, to:

  1. By default, perform K-medoids, DBSCAN, and hierarchical clustering (with the option to remove one or more techniques).

  2. Make calculated distance matrices reusable for future clustering iterations.

  3. Subsample either randomly or based on a given hit list.

  4. Create statistical floe reports to provide more insight into the data.
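
To show why a reusable distance matrix is valuable, here is a minimal sketch of K-medoids that works purely from a precomputed matrix, so the same matrix can be clustered again with different settings. It uses the simple alternating (assign, then update) variant; which variant the floes use is not stated in this guide, and all names here are hypothetical.

```python
def assign(dist, medoids):
    """Label each point with the position of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternating K-medoids on a square, precomputed distance matrix."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the old medoid if its cluster emptied out.
                updated.append(old)
                continue
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            updated.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(dist, medoids)


if __name__ == "__main__":
    points = [0, 1, 2, 10, 11, 12]  # toy 1D data
    dist = [[abs(a - b) for b in points] for a in points]
    # The matrix is computed once and reused for several values of k.
    for start in ([0, 1], [0, 1, 2]):
        print(k_medoids(dist, start))
```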

While the fundamental clustering aspects of both Tier II and III are the same, they differ in pairwise molecule distance computation.

  • Tier II uses 2D fingerprints and offers a choice to change the fingerprint type (tree, circular, path, MACCS, or Lingo).

  • Tier III uses 3D ROCS-based distance calculations and offers a choice to calculate distance based on shape, color, or a combination of both. By default, the distance is calculated using a single conformer per molecule. If multiple conformers are provided, the distance between two molecules is the minimum distance over all pairs of their conformers. Users can also choose between the overlap and overlay options for calculating molecule similarity.

  • Both floes output a collection that contains the computed distance matrix. Users can reuse this matrix to retune clustering parameters and improve results without recomputing distances.
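
The two kinds of pairwise distance described above can be sketched as follows. Two assumptions are made here and are not confirmed by this guide: the 2D distance is taken to be one minus the Tanimoto similarity of the fingerprint on-bits, and the 3D per-conformer distance is passed in as a caller-supplied function because the ROCS scoring itself is not reproduced. All names are hypothetical.

```python
def tanimoto_distance(bits_a, bits_b):
    """1 - Tanimoto similarity between two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def min_conformer_distance(confs_a, confs_b, conf_distance):
    """Molecule-level distance: the minimum over all conformer pairs."""
    return min(conf_distance(ca, cb) for ca in confs_a for cb in confs_b)


if __name__ == "__main__":
    print(tanimoto_distance({1, 2, 3}, {2, 3, 4}))

    # Stand-in conformers (plain numbers) and a stand-in distance function;
    # a real run would score 3D conformers by shape and/or color.
    print(min_conformer_distance([1.0, 5.0], [4.5, 9.0], lambda x, y: abs(x - y)))
```

The nested minimum evaluates every conformer pair, which is why multi-conformer input raises the cost of the Tier III distance matrix.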

This table shows additional details for the use cases of the three floes.

| Floe Name | Clustering Type | Max Size | Input Type | Target |
|---|---|---|---|---|
| Clustering: 2D Scaffold | Scaffold | 10^7 | Collection or File | HTS/VS |
| Clustering: 2D Fingerprint, up to 100K compounds | Fingerprint | 10^5 | Data Record | Hit-to-Lead |
| Clustering: 3D Shape/Color, ~10K compounds | 3D Fixed Pose | 10^4 | Data Record | Hit-to-Lead |

Cost Estimates

The following tables show estimates for cost and time based on input size.

Clustering: 2D Scaffold (~25M)

| Input Size | Cost ($) | Time (hr) |
|---|---|---|
| 1M | 4 | 1.5 |
| 5M | 13 | 3.5 |
| 10M | 26 | 6 |
| 23M | 65 | 22 |
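
For input sizes that fall between the tabulated 2D Scaffold rows, a rough figure can be obtained by linear interpolation. This is a sketch under a strong assumption, namely that cost and time vary linearly between neighboring rows, which this guide does not claim; the function name is hypothetical and the numbers are copied from the 2D Scaffold estimates above.

```python
# (input size in molecules, cost in $, time in hours) from the 2D Scaffold estimates.
SCAFFOLD_ESTIMATES = [
    (1e6, 4.0, 1.5),
    (5e6, 13.0, 3.5),
    (10e6, 26.0, 6.0),
    (23e6, 65.0, 22.0),
]


def estimate(n_molecules, table=SCAFFOLD_ESTIMATES):
    """Linearly interpolate (cost, hours) between tabulated rows."""
    if not table[0][0] <= n_molecules <= table[-1][0]:
        raise ValueError("input size is outside the tabulated range")
    for (n0, c0, t0), (n1, c1, t1) in zip(table, table[1:]):
        if n0 <= n_molecules <= n1:
            frac = (n_molecules - n0) / (n1 - n0)
            return c0 + frac * (c1 - c0), t0 + frac * (t1 - t0)


if __name__ == "__main__":
    cost, hours = estimate(7.5e6)
    print(f"~${cost:.2f}, ~{hours:.2f} hr")
```

No extrapolation is attempted beyond 23M molecules, since the last two rows suggest the time grows faster than linearly at that scale.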

Clustering: 2D Fingerprint (≤100K)

| Input Size | Cost ($) | Time |
|---|---|---|
| 1K | 2 | 3 min |
| 2.5K | 2 | 12 min |
| 10K | 3 | 15 min |
| 20K | 3 | 50 min |
| 40K | 3 | 50 min |
| 50K | 4 | 1.5 hr |
| 60K | 5 | 2 hr |
| 70K | 15 | 7 hr |
| 80K | 20 | 8 hr |

Clustering: 3D Shape/Color (~10K)

| Input Size | Cost ($) | Time |
|---|---|---|
| 1K | 2 | 36 min |
| 2K | 5 | 44 min |
| 5K | 6 | 3 hr |
| 10K | 150 | 7 hr |

Scaling Up

The following tips make the clustering step run faster, especially for larger datasets.

  1. K-medoids: Keep the Number of K-Medoids Clusters parameter low. For a dataset of 10K molecules with 50 K-medoids cluster heads, the floe takes about 8 hours to run and costs about $10. In the Advanced: K-medoid parameter group, setting the Medoid Initialization Method parameter to k-medoids++ increases the speed of the floe.

  2. Hierarchical: To increase the speed, remove the Distance Threshold parameter. In the Advanced: Hierarchical parameter group, set Compute Full Tree to False and set Number of Hierarchical Clusters to an appropriate value (as with K-medoids, 3–5 is a good default, but the best value varies with the data).
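
The k-medoids++ option named in the first tip is commonly understood as k-means++-style seeding applied to medoids: each new seed is drawn with probability proportional to the squared distance to the nearest seed already chosen, which spreads the starting medoids out and usually shortens convergence. The sketch below follows that common definition; whether the floe implements it exactly this way is an assumption, and the function name is hypothetical.

```python
import random


def kmedoids_pp_init(dist, k, seed=0):
    """Pick k spread-out starting medoids from a precomputed distance matrix."""
    rng = random.Random(seed)
    n = len(dist)
    medoids = [rng.randrange(n)]  # first seed is uniform at random
    while len(medoids) < k:
        # Squared distance from every point to its nearest chosen medoid.
        # Already-chosen medoids get weight 0 and cannot be drawn again.
        weights = [min(dist[i][m] for m in medoids) ** 2 for i in range(n)]
        if sum(weights) == 0:
            break  # every remaining point coincides with a chosen medoid
        medoids.append(rng.choices(range(n), weights=weights, k=1)[0])
    return medoids


if __name__ == "__main__":
    points = [0, 1, 2, 10, 11, 12]  # toy 1D data
    dist = [[abs(a - b) for b in points] for a in points]
    print(kmedoids_pp_init(dist, k=2, seed=1))
```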

To make the distance matrix computation faster:

  1. For the Clustering: 2D Fingerprint Floe, choose the MACCS fingerprint type.

  2. For the Clustering: 3D Shape/Color Floe, computing the full conformer distance matrix with the Calculate Minimum Similarity for Multi-conformer parameter can get computationally expensive. We recommend turning Auto Tile Size off and setting the tile size parameter to 50 in the Make 3D Similarity Tiles Cube.
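
To give a feel for what a tile size of 50 means, the sketch below splits an n x n distance matrix into square tiles and enumerates only the upper triangle, since a distance matrix is symmetric. How the Make 3D Similarity Tiles Cube actually partitions the work is not described in this guide, so the layout and the function name are assumptions for illustration only.

```python
def tile_ranges(n, tile_size=50):
    """Yield (row_start, row_stop, col_start, col_stop) for upper-triangle tiles."""
    for row in range(0, n, tile_size):
        for col in range(row, n, tile_size):
            yield row, min(row + tile_size, n), col, min(col + tile_size, n)


if __name__ == "__main__":
    tiles = list(tile_ranges(120, tile_size=50))
    for tile in tiles:
        print(tile)
    print(len(tiles), "tiles")
```

Under this layout, 10K molecules with a tile size of 50 give 200 row blocks and 200 × 201 / 2 = 20,100 tiles, each small enough to be processed as an independent unit of work.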