Clustering: 3D Shape/Color (~10K compounds)

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

Description

3D-Based Fingerprint Clustering

The floe begins by checking whether a 3D conformer already exists for each molecule. If not, a lowest-energy 3D conformer is generated. The user can then choose between overlap (molecules keep their input coordinates) and overlay (molecules are aligned before the similarity calculation).

Clustering: Once the conformers are available, the floe groups compounds with up to three clustering algorithms: K-medoids, DBSCAN, and hierarchical clustering. Each takes a different approach to finding patterns in the data:

  • K-medoids partitions the data into a predefined number of clusters, with each molecule belonging to the cluster with the nearest medoid. A medoid is an actual molecule from the dataset, not a computed mean.

  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are closely packed, marking points that lie alone in low-density regions as outliers.

  • Hierarchical clustering builds a tree of clusters, with the data points at the bottom and a single cluster at the top, allowing for analysis at various levels of granularity.
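The K-medoids idea from the list above can be sketched in a few lines of plain Python. This is an illustrative sketch on a precomputed distance matrix, not the floe's implementation; the function names `assign` and `k_medoids` are hypothetical, and the alternating assign/update scheme is assumed for simplicity.

```python
def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, medoids, max_iter=100):
    """Alternate assignment and medoid update on a precomputed distance matrix.

    dist    -- square list of lists, dist[i][j] = distance between points i and j
    medoids -- initial medoid indices (hypothetical input for this sketch)
    """
    medoids = list(medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                new_medoids.append(old)  # keep the old medoid for an empty cluster
                continue
            # The new medoid is the member with the smallest total distance
            # to the other members, so it is always a real data point.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break  # converged
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy data: six points on a line, distances are absolute differences.
positions = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
final_medoids, final_labels = k_medoids(toy_dist, medoids=[0, 1])
print(final_medoids, final_labels)
```

Because each medoid is a dataset member, only pairwise distances are needed, which is why a 3D similarity score can be used directly.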

Sampling from Clusters: The next step is to sample from each cluster. Sampling can be random (the default), or it can be targeted: molecules are selected by the highest or lowest value of a chosen field, such as a score from a previous screening step (e.g., docking score or experimental activity). Targeted sampling ensures that the final, smaller sample contains the most promising candidates from each structural group, balancing chemical diversity with therapeutic relevance.

Promoted Parameters

Title in user interface (promoted name)

1. Input

Input Dataset (data_in): Input molecule dataset for first-time clustering.

  • Type: data_source

Rerun Clustering: Input Distance Matrix (cin): Distance matrix and dataset from a previous run of this floe. Reusing them avoids recomputing the distance matrix when you only want to try different clustering parameters. The collection cannot be reused if the distance calculation settings (fingerprint type) have changed.

  • Type: collection_source

2. Primary Options

Clustering Variants (clusterchoice): Types of clustering to run on the dataset. All three variants are run by default.

  • Type: string

  • Default: [‘DBScan’, ‘Hierarchical’, ‘K-Medoid’]

  • Choices: [‘DBScan’, ‘Hierarchical’, ‘K-Medoid’]

Align Molecules (use_align): If set to True, molecules will be aligned before similarity calculation; otherwise, they will retain input coordinates.

  • Type: boolean

  • Default: True

  • Choices: [True, False]

3D Similarity Score Function (score_type):

  • Type: string

  • Default: Tanimoto Combo

  • Choices: [‘Shape Tanimoto’, ‘Color Tanimoto’, ‘Tanimoto Combo’]

Auto tile size (auto_tile_size): Automatically choose tile size

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Number of K-Medoids Clusters (n_clusters_kmed): Computation time increases steeply with dataset size. For large datasets in particular (>5k molecules), choosing more than 100 clusters leads to long run times.

  • Required

  • Type: integer

  • Default: 3

Distance Threshold (distance_threshold): The linkage distance threshold above which clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True.

  • Type: decimal

  • Default: 0.3

Epsilon (eps) (eps):

The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. Increase eps to cluster more molecules together in fewer clusters and decrease eps to cluster fewer molecules together in more clusters. Scores are normalized from 0 to 1; for example, a TanimotoCombo similarity score of 1.5 is normalized to a score of 0.75 and an eps/distance of 0.25. For more details: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

  • Type: decimal

  • Default: 0.6
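The normalization arithmetic described for eps can be written out as a small helper. This is a sketch of the conversion as inferred from the example above (a TanimotoCombo of 1.5 becomes 0.75, then a distance of 0.25); the function name and lookup table are hypothetical, and the maximum of 1 for Shape Tanimoto and Color Tanimoto follows from the Tanimoto definition rather than from this page.

```python
# Maximum raw value for each score type. Tanimoto Combo is the sum of the
# shape and color Tanimoto scores, so it ranges from 0 to 2.
MAX_SCORE = {
    "Shape Tanimoto": 1.0,
    "Color Tanimoto": 1.0,
    "Tanimoto Combo": 2.0,
}


def similarity_to_distance(score, score_type="Tanimoto Combo"):
    """Normalize a similarity score to 0..1 and convert it to a distance."""
    normalized = score / MAX_SCORE[score_type]
    return 1.0 - normalized


print(similarity_to_distance(1.5))  # Tanimoto Combo 1.5 -> distance 0.25
```

Under this reading, the default eps of 0.6 corresponds to a normalized similarity of 0.4, or a TanimotoCombo of 0.8.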

Advanced: Subsampling for Diverse Subset

Field to Sub Sample (hl): If this field is set, the floe emits, for each cluster, the n molecules with the highest or lowest value of this field.

  • Type: field_parameter::float

How many top Hit list samples to keep (nhl): Used together with the previous parameter. A value of 5 picks the 5 molecules in each cluster with the highest value of that field; a value of -5 picks the 5 with the lowest.

  • Type: integer

  • Default: 0

How to subsample from each cluster (subsample_type): Random: Picks randomly from each cluster. Highest or Lowest: Picks from Sub Sample field

  • Type: string

  • Default: Random

  • Choices: [‘Highest’, ‘Lowest’, ‘Random’]
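A minimal sketch of per-cluster subsampling, assuming clusters are given as a mapping from cluster ID to member IDs and the Sub Sample field as a mapping from member ID to value. The function name, argument names, and data layout are hypothetical, and how the floe combines the sign of nhl with subsample_type is not specified here, so the sketch takes a positive count and an explicit mode.

```python
import random


def subsample(clusters, values, n, subsample_type="Random", seed=None):
    """Pick up to n members from each cluster.

    clusters       -- {cluster_id: [member_id, ...]}
    values         -- {member_id: float}, the field used for ranking
    subsample_type -- 'Highest', 'Lowest', or 'Random'
    """
    rng = random.Random(seed)
    picked = {}
    for cluster_id, members in clusters.items():
        if subsample_type == "Random":
            picked[cluster_id] = rng.sample(members, min(n, len(members)))
        else:
            # Sort descending for 'Highest', ascending for 'Lowest'.
            ranked = sorted(members, key=lambda m: values[m],
                            reverse=(subsample_type == "Highest"))
            picked[cluster_id] = ranked[:n]
    return picked


# Toy example: docking scores, where lower (more negative) is better.
toy_clusters = {0: ["a", "b", "c"], 1: ["d", "e"]}
toy_scores = {"a": -7.1, "b": -9.3, "c": -8.0, "d": -6.5, "e": -10.2}
print(subsample(toy_clusters, toy_scores, n=1, subsample_type="Lowest"))
```

For a field such as a docking score, where lower is better, 'Lowest' is the choice that keeps the most promising member of each cluster.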

Advanced: K-Medoid

Maximum K-Medoids Iterations (max_iter):

  • Type: integer

  • Default: 100

Random State (random_state): Specify random state for the random number generator. Used to initialise medoids when init=’random’.

  • Type: integer

Medoid Initialization Method (init_kmedoids): From the scikit-learn-extra docs: Specify medoid initialization method. ‘random’ selects n_clusters elements from the dataset. ‘heuristic’ picks the n_clusters points with the smallest sum distance to every other point. ‘k-medoids++’ follows an approach based on k-means++ and, in general, gives initial medoids that are more separated than those generated by the other methods. ‘build’ is a greedy initialization of the medoids used in the original PAM algorithm. Often ‘build’ is more efficient but slower than other initializations on big datasets. It is also not robust to outliers; if there are outliers in the dataset, use another initialization.

  • Type: string

  • Default: k-medoids++

  • Choices: [‘random’, ‘heuristic’, ‘k-medoids++’, ‘build’]
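To make the ‘heuristic’ option concrete, the sketch below picks the points with the smallest summed distance to all other points from a precomputed distance matrix. It is plain Python written from the description above, not the scikit-learn-extra source; the function name is hypothetical.

```python
def heuristic_init(dist, n_clusters):
    """Return the indices of the n_clusters points with the smallest
    sum of distances to every other point."""
    totals = [sum(row) for row in dist]
    return sorted(range(len(dist)), key=lambda i: totals[i])[:n_clusters]


# Toy data: five points on a line, distances are absolute differences.
positions = [0, 1, 3, 7, 15]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
print(heuristic_init(toy_dist, n_clusters=2))
```

On this toy data the two selected points sit next to each other near the middle of the line, which illustrates why ‘k-medoids++’ tends to give better-separated starting medoids.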

Advanced: DBScan

Minimum Samples (min_samples):

This is a control parameter for DBSCAN that has little effect on the clustering in most cases. Across a wide range of literature datasets, a value of 5 (or possibly a value in the range 5-10) is quite effective.

  • Type: integer

  • Default: 5
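The roles of eps and min_samples can be seen in a compact DBSCAN sketch on a precomputed distance matrix. This is an illustrative plain-Python version, not the scikit-learn code the floe refers to; it assumes, as scikit-learn does, that a point counts toward its own neighborhood and that noise is labeled -1. The function name is hypothetical.

```python
def dbscan(dist, eps=0.6, min_samples=5):
    """Label points by density-connected cluster; noise points get -1."""
    n = len(dist)
    # Neighborhood of i: all points within eps, including i itself.
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue  # already assigned, or cannot seed a cluster
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels


# Toy data on an integer scale (real distances in this floe lie in 0..1).
positions = [0, 1, 2, 3, 50, 51, 52, 90]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
print(dbscan(toy_dist, eps=1, min_samples=3))
```

Raising eps enlarges every neighborhood, so more points become core points and clusters merge; points that never fall inside a core point's neighborhood stay labeled as noise.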

Advanced: Hierarchical

Number of Hierarchical Clusters (n_clusters_heir):

  • Type: integer

Compute Full Tree (compute_full_tree): Whether to compute the full tree or stop construction early at n_clusters. Stopping early reduces computation time when the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None. The default, “auto”, is equivalent to True when distance_threshold is not None or when n_clusters is less than the maximum of 100 and 0.02 * n_samples. For more than 50k samples, it is recommended to set this to False, set the distance threshold to None, and set the number of clusters to the desired count.

  • Type: string

  • Default: auto

  • Choices: [‘auto’, ‘True’, ‘False’]
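The “auto” rule can be spelled out as a short function. This is a sketch of the rule as stated in the description, with a hypothetical function name; the string choices mirror the parameter's choices list.

```python
def resolve_compute_full_tree(setting, n_samples, n_clusters=None,
                              distance_threshold=None):
    """Return True if the full tree would be computed under the stated rule."""
    if setting != "auto":
        return setting == "True"
    if distance_threshold is not None:
        return True  # a distance threshold always needs the full tree
    if n_clusters is None:
        raise ValueError("set either n_clusters or distance_threshold")
    return n_clusters < max(100, 0.02 * n_samples)


print(resolve_compute_full_tree("auto", n_samples=10000, n_clusters=50))
```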

Linkage (linkage): Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm merges the pair of clusters that minimizes this criterion. ‘average’ uses the average of the distances between all observations of the two sets. ‘complete’ (or ‘maximum’) linkage uses the maximum of the distances between all observations of the two sets. ‘single’ uses the minimum of the distances between all observations of the two sets. ‘auto’ selects ‘average’ if the dataset has fewer than 20k molecules and ‘single’ otherwise.

  • Type: string

  • Default: auto

  • Choices: [‘complete’, ‘average’, ‘single’, ‘auto’]
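The three linkage criteria, and their interaction with the distance threshold, can be illustrated with a naive agglomerative loop over a precomputed distance matrix. This is a sketch for small toy inputs, not the scikit-learn implementation; the function names are hypothetical, and stopping when the smallest linkage distance is at or above the threshold is an assumption modeled on scikit-learn's behavior.

```python
def linkage_distance(dist, a, b, linkage="average"):
    """Distance between clusters a and b (lists of point indices)."""
    pairwise = [dist[i][j] for i in a for j in b]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    return sum(pairwise) / len(pairwise)  # 'average'


def agglomerate(dist, distance_threshold=0.3, linkage="average"):
    """Merge the closest pair of clusters until none is below the threshold."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        d, a, b = min(
            (linkage_distance(dist, clusters[a], clusters[b], linkage), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters)))
        if d >= distance_threshold:
            break  # assumed: no merge at or above the threshold
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters


# Toy data on an integer scale (real distances in this floe lie in 0..1).
positions = [0, 1, 2, 10, 11, 30]
toy_dist = [[abs(x - y) for y in positions] for x in positions]
print(agglomerate(toy_dist, distance_threshold=9, linkage="average"))
print(agglomerate(toy_dist, distance_threshold=9, linkage="single"))
```

With the same threshold, ‘single’ merges the first two groups because their closest members are 8 apart, while ‘average’ keeps them separate; this chaining tendency is the usual trade-off for the lower cost of single linkage on large datasets.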

Output

Output Collection Name for Storing Distance Matrix and Input Record (outc): Collection for the distance matrix and input file. Use this collection instead of the input dataset the next time you run this clustering floe to avoid recalculating the distance matrix.

  • Required

  • Type: collection_sink

  • Default: 3D Conformer Distance Matrix and Metadata

Output Dataset (m1out): Output data record for each input molecule with four added fields for each clustering: ClusterID, isCore, dist to Core, isSingleton.

  • Required

  • Type: dataset_out

  • Default: 3d_cluster_label_and_distance_to_core

Output Dataset (h1out): Output of molecules subsampled from each DBSCAN cluster.

  • Required

  • Type: dataset_out

  • Default: 3d_dbscan_subsample_out

Output Dataset (h3out): Output of molecules subsampled from each K-medoid cluster.

  • Required

  • Type: dataset_out

  • Default: 3d_kmedoid_subsample_out

Output Dataset (h2out): Output of molecules subsampled from each Hierarchical cluster.

  • Required

  • Type: dataset_out

  • Default: 3d_hier_subsample_out