Clustering: 2D Fingerprint (<=100K compounds)

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

Description

2D Based Fingerprint Clustering

The process begins by checking whether a 2D fingerprint already exists for each molecule. If not, a 2D fingerprint is generated. A 2D fingerprint is a binary representation of a molecule’s structure, encoding features like substructures and atom types. It’s a fundamental step for many cheminformatics tasks, including similarity searching and clustering.
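
As a rough illustration of how binary fingerprints become the pairwise distances that clustering works on, the sketch below represents each fingerprint as a Python set of on-bit indices and uses the Tanimoto coefficient, a common similarity measure for 2D fingerprints. The representation, the function names, and the bit values are assumptions made for this example; the floe's own fingerprint generation and similarity measure are not shown here.

```python
def tanimoto_similarity(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits divided by all distinct on-bits."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return len(fp_a & fp_b) / union


def tanimoto_distance(fp_a, fp_b):
    """Distance in [0, 1]; 0 means identical fingerprints."""
    return 1.0 - tanimoto_similarity(fp_a, fp_b)


def distance_matrix(fingerprints):
    """Full symmetric matrix of pairwise Tanimoto distances."""
    n = len(fingerprints)
    return [[tanimoto_distance(fingerprints[i], fingerprints[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical fingerprints, given as sets of on-bit positions.
fingerprints = [{1, 4, 7, 9}, {1, 4, 7, 12}, {2, 3, 20}]
dist = distance_matrix(fingerprints)
```

The first two fingerprints share three of five distinct bits, so their distance is 1 - 3/5; the third shares no bits with either, so its distance to both is 1.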

Clustering: Once the fingerprints are available, the floe employs several clustering algorithms to group compounds. These methods—K-medoids, DBSCAN, and Hierarchical clustering—each offer different approaches for finding patterns in the data:

K-medoids partitions the data into a predefined number of clusters, with each molecule belonging to the cluster with the nearest medoid (the cluster's most central member, which is always an actual molecule from the dataset).
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are closely packed, marking as outliers points that lie alone in low-density regions.
Hierarchical clustering builds a tree of clusters, with the data points at the bottom and a single cluster at the top, allowing for analysis at various levels of granularity.
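
To make the first of these methods concrete, the sketch below runs a simple alternating K-medoids heuristic on a precomputed distance matrix: assign every point to its nearest medoid, then move each medoid to the cluster member with the smallest total distance to the other members. This is an illustrative pure-Python version, not the floe's implementation (the parameter descriptions on this page point to scikit-learn-extra); the function names and the toy data are assumptions.

```python
def assign_to_medoids(dist, medoids):
    """Label each point with the position (in `medoids`) of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda k: dist[i][medoids[k]])
            for i in range(len(dist))]


def update_medoids(dist, labels, medoids):
    """Move each medoid to the member with the smallest summed distance to its cluster."""
    new_medoids = []
    for k, old in enumerate(medoids):
        members = [i for i, lab in enumerate(labels) if lab == k]
        if not members:
            new_medoids.append(old)  # keep the old medoid if a cluster is empty
            continue
        new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
    return new_medoids


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternate assignment and update steps until the medoids stop changing."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign_to_medoids(dist, medoids)
        new_medoids = update_medoids(dist, labels, medoids)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign_to_medoids(dist, medoids)


# Toy data: six points on a line forming two obvious groups.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
medoids, labels = k_medoids(dist, initial_medoids=[0, 1])
```

Starting from two poorly chosen medoids (both in the left group), the loop converges to the middle point of each group, indices 1 and 4.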

Sampling from Clusters: The next critical step is to sample from each cluster core. This isn’t random sampling; it’s a strategic process. Molecules are selected based on “subsample criteria,” which could include scores from a previous screening step (e.g., docking score, experimental activity). This ensures that the final, smaller sample represents the most promising candidates from each structural group, a process that balances chemical diversity with therapeutic relevance.
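
The selection step can be sketched as follows, assuming each record is a hypothetical `(molecule_id, cluster_id, score)` tuple. The function name, the record layout, and the example scores are invented for illustration, and the sketch does not model any restriction to cluster cores.

```python
import random
from collections import defaultdict


def subsample_clusters(records, n, mode="Highest", seed=0):
    """Pick up to n molecule ids per cluster by score (Highest/Lowest) or at random."""
    by_cluster = defaultdict(list)
    for rec in records:
        by_cluster[rec[1]].append(rec)

    rng = random.Random(seed)  # fixed seed so random picks are repeatable
    picked = {}
    for cluster_id, members in by_cluster.items():
        if mode == "Random":
            chosen = rng.sample(members, min(n, len(members)))
        else:
            ranked = sorted(members, key=lambda r: r[2], reverse=(mode == "Highest"))
            chosen = ranked[:n]
        picked[cluster_id] = [r[0] for r in chosen]
    return picked


# Hypothetical docking scores, where lower (more negative) is better.
records = [
    ("mol_a", 0, -9.1), ("mol_b", 0, -7.4), ("mol_c", 0, -8.2),
    ("mol_d", 1, -6.0), ("mol_e", 1, -10.3),
]
best_docked = subsample_clusters(records, n=2, mode="Lowest")
```

For scores where lower is better, such as docking scores, "Lowest" is the mode that keeps the most promising members of each cluster.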

Note: This floe can run on up to ~100K molecules. For datasets near that size, modifications such as setting the hierarchical Distance Threshold to None and Compute Full Tree to False can speed up the process. Please refer to the documentation on how to run at this scale of data.

Promoted Parameters

Title in user interface (promoted name)

1. Input

Input Dataset (data_in): Input molecule dataset for first-time clustering.

Type: data_source

Rerun Clustering: Input Distance Matrix (cin): Input distance matrix and dataset from a previous run of this floe. This avoids computing the distance matrix multiple times. Reuse it to try different clustering parameters. It cannot be used if the distance calculation settings (score function, alignment, etc.) are changed.

Type: collection_source

2. Primary Clustering Options

Clustering Variants (ch): Types of clustering to run on the dataset. All three variants are run by default.

Type: string

Default: [‘DBScan’, ‘Hierarchical’, ‘K-Medoid’]

Choices: [‘DBScan’, ‘Hierarchical’, ‘K-Medoid’]

Fingerprint Type (fingerprint_type): The fingerprint type generated for similarity calculation.

Type: string

Default: Tree

Choices: [‘Circular’, ‘Path’, ‘Tree’, ‘MACCS’]

Auto tile size (auto_tile_size): Automatically choose tile size

Type: boolean

Default: True

Choices: [True, False]

Number of K-Medoids Clusters (n_clusters_kmed): Computation increases steeply with data size. Especially for large datasets (>5k molecules), choosing more than 100 clusters leads to long run times.

Required

Type: integer

Default: 3

Distance Threshold (distance_threshold): The linkage distance threshold above which clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True.

Type: decimal

Default: 0.3

Epsilon (eps) (eps):

The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. Increase eps to cluster more molecules together in fewer clusters and decrease eps to cluster fewer molecules together in more clusters. Scores are normalized from 0 to 1; for example, a TanimotoCombo similarity score of 1.5 is normalized to a score of 0.75 and an eps/distance of 0.25. For more details: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

Type: decimal

Default: 0.6
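
The normalization in the Epsilon description can be written out as a small worked example. The sketch assumes that a distance is simply one minus the normalized similarity, which matches the TanimotoCombo numbers quoted above; the function name and the `max_score` argument are hypothetical.

```python
def similarity_to_distance(score, max_score=1.0):
    """Scale a similarity score to [0, 1], then convert it to a distance."""
    normalized = score / max_score
    return 1.0 - normalized


# TanimotoCombo ranges from 0 to 2, so 1.5 normalizes to 0.75 similarity.
combo_distance = similarity_to_distance(1.5, max_score=2.0)

# A plain Tanimoto score is already in [0, 1]; under the same assumption,
# a similarity of 0.4 corresponds to a distance of 0.6.
tanimoto_distance_example = similarity_to_distance(0.4)
```

Under this convention, raising eps is the same as lowering the similarity a neighbor must reach.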

Advanced: K-Medoid

Maximum K-Medoids Iterations (max_iter):

Type: integer

Default: 100

Random State (random_state): Specify random state for the random number generator. Used to initialise medoids when init=’random’.

Type: integer

Medoid Initialization Method (init_kmedoids): From scikit-learn-extra docs: Specify medoid initialization method. ‘random’ selects n_clusters elements from the dataset. ‘heuristic’ picks the n_clusters points with the smallest sum distance to every other point. ‘k-medoids++’ follows an approach based on k-means++ and, in general, gives initial medoids that are more separated than those generated by the other methods. ‘build’ is a greedy initialization of the medoids used in the original PAM algorithm. ‘build’ is often more efficient but slower than the other initializations on big datasets. It is also very non-robust: if there are outliers in the dataset, use another initialization.

Type: string

Default: k-medoids++

Choices: [‘random’, ‘heuristic’, ‘k-medoids++’, ‘build’]

Advanced: DBSCAN

Minimum Samples (min_samples):

This is a control parameter for DBSCAN that has little effect on the clustering in most cases. Across a wide range of literature datasets, a value of 5 (or possibly a value in the range 5-10) is quite effective.

Type: integer

Default: 5
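
To show what Minimum Samples controls together with Epsilon, the sketch below finds DBSCAN core points in a precomputed distance matrix. It follows the scikit-learn convention in which a point's neighborhood includes the point itself; the function name and the toy distances are illustrative assumptions, and a full DBSCAN would go on to expand clusters from these core points.

```python
def core_points(dist, eps, min_samples):
    """Indices whose eps-neighborhood (the point itself included) holds >= min_samples points."""
    cores = []
    for i, row in enumerate(dist):
        n_neighbors = sum(1 for d in row if d <= eps)
        if n_neighbors >= min_samples:
            cores.append(i)
    return cores


# Toy data: three nearby points and one isolated point on a line.
positions = [0.0, 0.1, 0.2, 0.9]
dist = [[abs(a - b) for b in positions] for a in positions]

strict = core_points(dist, eps=0.15, min_samples=3)
loose = core_points(dist, eps=0.15, min_samples=2)
```

With `min_samples=3` only the middle point of the dense group qualifies as a core point; lowering it to 2 makes all three dense points cores, while the isolated point is never one.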

Advanced: Hierarchical

Number of Hierarchical Clusters (n_clusters_heir):

Type: integer

Compute Full Tree (compute_full_tree): Controls whether construction of the tree stops early at n_clusters (False) or the full tree is computed (True). Stopping early is useful to decrease computation time if the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None. By default compute_full_tree is “auto”, which is equivalent to True when distance_threshold is not None or when n_clusters is less than the maximum of 100 and 0.02 * n_samples. For more than 50k samples, it is recommended to set this to False, set Distance Threshold to None, and set the number of clusters to the desired count.

Type: string

Default: auto

Choices: [‘auto’, ‘True’, ‘False’]
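
The “auto” rule and the constraints described above can be summarized in a few lines. This is a sketch of the stated rule only, written as a hypothetical helper; it is not code taken from the floe or from scikit-learn.

```python
def resolve_compute_full_tree(setting, n_samples, n_clusters=None, distance_threshold=None):
    """Return the effective boolean for a 'auto' / 'True' / 'False' setting."""
    if (n_clusters is None) == (distance_threshold is None):
        raise ValueError("set exactly one of n_clusters and distance_threshold")
    if distance_threshold is not None:
        if setting == "False":
            raise ValueError("compute_full_tree must be True when distance_threshold is set")
        return True
    if setting == "auto":
        # Full tree only when n_clusters is small relative to the data size.
        return n_clusters < max(100, 0.02 * n_samples)
    return setting == "True"


# 60k molecules: the cutoff is max(100, 1200) = 1200 clusters.
small_k = resolve_compute_full_tree("auto", 60_000, n_clusters=500)
large_k = resolve_compute_full_tree("auto", 60_000, n_clusters=2_000)
with_threshold = resolve_compute_full_tree("auto", 60_000, distance_threshold=0.3)
```

Under “auto”, a distance threshold therefore always forces the full tree, which is why the large-data recommendation is to clear the threshold and give an explicit cluster count.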

Linkage (linkage): Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm merges the pair of clusters that minimizes this criterion. ‘average’ uses the average of the distances between all observations of the two sets. ‘complete’ or ‘maximum’ linkage uses the maximum distance between all observations of the two sets. ‘single’ uses the minimum distance between all observations of the two sets. ‘auto’ selects ‘average’ if the dataset has fewer than 20k molecules and ‘single’ otherwise.

Type: string

Default: auto

Choices: [‘complete’, ‘average’, ‘single’, ‘auto’]
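
The three criteria differ only in how they reduce the pairwise distances between two clusters to a single number, which the sketch below makes explicit. The helper names and the small distance matrix are illustrative assumptions; the ‘auto’ rule mirrors the 20k cutoff stated in the Linkage description.

```python
def linkage_distance(dist, cluster_a, cluster_b, linkage="average"):
    """Distance between two clusters under the given linkage criterion."""
    pair_distances = [dist[i][j] for i in cluster_a for j in cluster_b]
    if linkage == "single":
        return min(pair_distances)
    if linkage == "complete":
        return max(pair_distances)
    if linkage == "average":
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")


def resolve_auto_linkage(n_molecules):
    """'auto' rule as described: average below 20k molecules, single otherwise."""
    return "average" if n_molecules < 20_000 else "single"


# Hypothetical symmetric distance matrix for four molecules.
dist = [
    [0.0,   0.125, 0.5,   0.75],
    [0.125, 0.0,   0.25,  1.0],
    [0.5,   0.25,  0.0,   0.125],
    [0.75,  1.0,   0.125, 0.0],
]
a, b = [0, 1], [2, 3]
single = linkage_distance(dist, a, b, "single")
complete = linkage_distance(dist, a, b, "complete")
average = linkage_distance(dist, a, b, "average")
```

For the same pair of clusters, single linkage reports the closest pair (0.25), complete linkage the farthest pair (1.0), and average linkage the mean of all four cross-cluster distances (0.625).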

Advanced: Subsampling for Diverse Subset

Field to Sub Sample (subsampleparam): If this field is set, the floe emits, for each cluster, the n molecules with the highest or lowest value of this field.

Type: field_parameter::float

How many top Hit list samples to keep (num_subsample): Used together with the previous parameter. A value of 5 picks the 5 elements of each cluster with the highest values of the previous field. A value of -5 picks the 5 elements with the lowest values.

Type: integer

Default: 0

How to subsample from each cluster (subsample_type): Random: picks randomly from each cluster. Highest or Lowest: picks by the value of the Field to Sub Sample.

Type: string

Default: Random

Choices: [‘Highest’, ‘Lowest’, ‘Random’]

Output

Output Collection Name for Storing Distance Matrix and Input Record (outc): Collection for the distance matrix and input file. Use this instead of the input dataset for subsequent runs to avoid recalculating the distance matrix.

Required

Type: collection_sink

Default: 2D Fingerprint Distance Matrix and Metadata

Output Dataset (m1out): Output data record for each input molecule with four added fields for each cluster: ClusterID, isCore, distToCore, isSingleton.

Required

Type: dataset_out

Default: 2d_cluster_label_and_distance_to_core
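
One plausible reading of the four added fields is sketched below, given cluster labels and a precomputed distance matrix. The sketch assumes the core of a cluster is its medoid (as stated for the DBSCAN output on this page) and that a singleton is a cluster with exactly one member; both are interpretations rather than documented definitions, the function name is hypothetical, and DBSCAN noise points are not modeled.

```python
def annotate_clusters(dist, labels):
    """Build ClusterID / isCore / distToCore / isSingleton for each molecule."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    # Assumption: the core is the medoid, i.e. the member with the
    # smallest summed distance to the other members of its cluster.
    core_of = {
        lab: min(members, key=lambda m: sum(dist[m][j] for j in members))
        for lab, members in clusters.items()
    }

    rows = []
    for i, lab in enumerate(labels):
        core = core_of[lab]
        rows.append({
            "ClusterID": lab,
            "isCore": i == core,
            "distToCore": dist[i][core],
            "isSingleton": len(clusters[lab]) == 1,
        })
    return rows


# Toy data: three close points and one isolated point.
positions = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in positions] for a in positions]
rows = annotate_clusters(dist, labels=[0, 0, 0, 1])
```

In this toy case the middle point of the first cluster is its core, and the isolated point is both the core and the only member of its own cluster.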

Output Dataset (h1out): Output of molecules labeled with DBSCAN clusters. The core of each cluster is its medoid.

Required

Type: dataset_out

Default: 2d_dbscan_subsample_out

Output Dataset (h3out): Output of molecules labeled with K-medoid clusters.

Required

Type: dataset_out

Default: 2d_kmedoid_subsample_out

Output Dataset (h2out): Output of molecules labeled with Hierarchical clusters.

Required

Type: dataset_out

Default: 2d_hier_subsample_out