Hierarchical 3D Similarity Clustering

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

  • Role-based/Medicinal Chemist

  • Task-based/Data Science/Clustering

  • Solution-based/Virtual-screening/Analysis/Clustering

Description

This floe clusters datasets using 3D OEShape similarity calculations and sklearn agglomerative clustering, also known as hierarchical clustering.

For details, see sklearn docs at https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
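
As a rough illustration of the clustering step, the sketch below runs scikit-learn's AgglomerativeClustering on a precomputed distance matrix. The similarity values are made up, and converting similarity to distance as 1 - similarity is an assumption for illustration; it is not necessarily how the floe builds its matrix.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical symmetric similarity matrix for five molecules, scaled to [0, 1].
similarity = np.array([
    [1.00, 0.90, 0.80, 0.10, 0.20],
    [0.90, 1.00, 0.85, 0.15, 0.10],
    [0.80, 0.85, 1.00, 0.20, 0.10],
    [0.10, 0.15, 0.20, 1.00, 0.95],
    [0.20, 0.10, 0.10, 0.95, 1.00],
])

# Assumption: distance is taken as 1 - similarity.
distance = 1.0 - similarity

# Newer scikit-learn releases name this argument `metric`;
# older releases call it `affinity`. Pick whichever is available.
params = inspect.signature(AgglomerativeClustering.__init__).parameters
precomputed_arg = "metric" if "metric" in params else "affinity"

model = AgglomerativeClustering(
    n_clusters=2,
    linkage="average",
    **{precomputed_arg: "precomputed"},
)
labels = model.fit_predict(distance)
print(labels)  # one cluster ID per molecule
```

With these values, the first three molecules should share one cluster ID and the last two another. The numeric IDs themselves are arbitrary.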

This floe uses OEShape for 3D similarity calculations. If (and only if) the molecule in the primary molecule field of an input record does not have 3D coordinates, the floe uses OMEGA to generate a single conformer for that molecule and uses this conformer in the similarity calculations.

This floe must cache all records and generate an NxN similarity matrix, so it is expensive. It is not recommended for input datasets with more than 20K records, and it fails automatically for input datasets larger than 30K records.
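
To see why the record limits matter, the following sketch estimates the size of a dense NxN matrix. The 8 bytes per value (double precision) is an assumption; the floe's actual storage format is not described here.

```python
def similarity_matrix_gb(n_records, bytes_per_value=8):
    """Return the size, in GB (10**9 bytes), of a dense n x n matrix.

    bytes_per_value=8 assumes double-precision floats (an assumption).
    """
    return n_records * n_records * bytes_per_value / 1e9


for n in (1_000, 20_000, 30_000):
    print(f"{n:>6} records -> {similarity_matrix_gb(n):.2f} GB")
```

Under this assumption, 20,000 records need about 3.2 GB for the matrix alone and 30,000 records about 7.2 GB, because the cost grows with the square of the record count.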

The floe can generate the following datasets:

  • A members dataset that will contain each molecule from the input dataset with its corresponding cluster ID.

  • A cores dataset that contains one representative from each cluster.

  • A singletons dataset that contains any clusters with only one member.

Promoted Parameters

Title in user interface (promoted name)

3D Similarity Calculation

3D Similarity Score Function (score_type):

  • Type: string

  • Default: Tanimoto Combo

  • Choices: [‘Shape Tanimoto’, ‘Color Tanimoto’, ‘Tanimoto Combo’]

Align Molecules (use_align): If set to True, molecules will be aligned before similarity calculation; otherwise, they will retain input coordinates.

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Similarity Score Cutoff (sim_cutoff): Similarity scores below this value are set to 0.

  • Type: decimal

  • Default: 0.05
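
The effect of the cutoff can be sketched as a simple thresholding pass over the similarity matrix. The function name and the example scores are hypothetical; only the rule (scores below the cutoff become 0) comes from the parameter description.

```python
def apply_cutoff(scores, cutoff=0.05):
    """Return a copy of the matrix with every score below `cutoff` set to 0."""
    return [[s if s >= cutoff else 0.0 for s in row] for row in scores]


# Hypothetical 3 x 3 similarity matrix.
scores = [
    [1.00, 0.04, 0.30],
    [0.04, 1.00, 0.05],
    [0.30, 0.05, 1.00],
]
filtered = apply_cutoff(scores)
for row in filtered:
    print(row)
```

Here the 0.04 entries drop to 0, while scores equal to the cutoff (0.05) are kept, since only values strictly below it are zeroed.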

Outputs

Cluster Cores (cores): Representative members, one for each cluster

  • Type: dataset_out

  • Default: 3D_hierarchical_cluster_cores

Failed Records (failed): Dataset with failed records.

  • Required

  • Type: dataset_out

  • Default: 3D_hierarchical_cluster_failed

Cluster Members (members): Name of output member dataset, containing all cluster members.

  • Required

  • Type: dataset_out

  • Default: 3D_hierarchical_cluster_members

Floe Report Name (cluster_report_name):

  • Type: string

  • Default: 3D_hierarchical_cluster_report

Singletons (singletons): Name of output singletons dataset, containing clusters with only one member.

  • Type: dataset_out

  • Default: 3D_hierarchical_cluster_singletons

Inputs

Input Dataset (in): Dataset to cluster

  • Required

  • Type: data_source

Advanced: Sort Input Dataset By Score

Use Score (use_rank): Use rank to sort hits in the directed sphere exclusion algorithm and include rank information in the clustering report. YOU MUST SELECT A SCORE FIELD IN THE SCORE FIELD PARAMETER, BELOW, IF THIS IS SET TO TRUE.

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Score Field (rank_field): Score field used for sorting during sphere exclusion and analyzed in the floe report.

  • Type: field_parameter

Score Sort Order (sort_order): Sort order for scores. Descending means higher scores are more desirable. Ascending means lower scores are more desirable.

  • Type: string

  • Default: Descending

  • Choices: [‘Descending’, ‘Ascending’]
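
The meaning of the two sort orders can be illustrated with a small sketch. The record layout (dictionaries with a "score" key) and the function name are hypothetical stand-ins for dataset records and the selected score field.

```python
def sort_by_score(records, score_field, sort_order="Descending"):
    """Sort records so that the most desirable score comes first."""
    if sort_order not in ("Descending", "Ascending"):
        raise ValueError(f"unknown sort order: {sort_order!r}")
    # Descending: higher scores are better, so they go first.
    return sorted(
        records,
        key=lambda record: record[score_field],
        reverse=(sort_order == "Descending"),
    )


# Hypothetical records with a score field.
records = [
    {"name": "mol_a", "score": 0.2},
    {"name": "mol_b", "score": 0.9},
    {"name": "mol_c", "score": 0.5},
]
best_high = sort_by_score(records, "score", "Descending")
best_low = sort_by_score(records, "score", "Ascending")
print([r["name"] for r in best_high])
print([r["name"] for r in best_low])
```

Descending suits scores where larger is better (for example, a similarity score), and Ascending suits scores where smaller is better (for example, an energy).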

Advanced: Memory, Optimization, and Matrix File Output or Input

Output Distance Matrix (output_similarity_matrix): If True, writes the distance matrix used in clustering to a file.

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Distance Matrix Filename (similarity_matrix_filename): Filename of distance matrix.

  • Type: string

  • Default: clustering_distance_matrix.txt

Memory for Clustering Cubes (clustering_memory): For large datasets, try increasing the memory limit.

  • Type: decimal

  • Default: 8000

Similarity Tile Size (tile_size): Advanced parameter for tuning parallel speed.

  • Type: integer

  • Default: 5

Hierarchical Clustering

Number of Clusters (n_clusters): The number of clusters to find.

  • Required

  • Type: integer

Advanced: Hierarchical Clustering

Compute Full Tree (compute_full_tree): Controls whether construction of the tree stops early at n_clusters. Stopping early decreases computation time when the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. When varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None. The default, “auto”, is equivalent to True when distance_threshold is not None or when n_clusters is less than the maximum of 100 and 0.02 * n_samples. Otherwise, “auto” is equivalent to False.

  • Required

  • Type: string

  • Default: auto

  • Choices: [‘auto’, ‘True’, ‘False’]
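
The “auto” rule described above can be written out as a short sketch. The function name is hypothetical, and the logic simply restates the rule in the parameter description; it is not the library's own code.

```python
def resolve_compute_full_tree(setting, n_clusters, n_samples,
                              distance_threshold=None):
    """Return True if the full tree would be computed for these settings.

    `setting` is one of the strings 'auto', 'True', or 'False'.
    """
    if distance_threshold is not None:
        # A distance threshold requires the full tree.
        if setting == "False":
            raise ValueError(
                "compute_full_tree must be True if distance_threshold is set"
            )
        return True
    if setting == "auto":
        return n_clusters < max(100, 0.02 * n_samples)
    return setting == "True"


print(resolve_compute_full_tree("auto", n_clusters=50, n_samples=1_000))
print(resolve_compute_full_tree("auto", n_clusters=500, n_samples=10_000))
```

In the first call, 50 is less than max(100, 20), so the full tree is computed. In the second, 500 is not less than max(100, 200), so construction stops early.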

Linkage (linkage): Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm merges the pair of clusters that minimizes this criterion. ‘average’ uses the average of the distances between each observation of the two sets. ‘complete’ (or ‘maximum’) linkage uses the maximum distance between all observations of the two sets. ‘single’ uses the minimum of the distances between all observations of the two sets.

  • Required

  • Type: string

  • Default: average

  • Choices: [‘complete’, ‘average’, ‘single’]
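
The three linkage criteria can be compared on a small example. The sketch below computes the distance between two clusters from a pairwise distance matrix; the function name and the distance values are hypothetical.

```python
def linkage_distance(dist, cluster_a, cluster_b, linkage="average"):
    """Distance between two clusters under the given linkage criterion.

    `dist` is a full pairwise distance matrix; clusters are lists of indices.
    """
    pairs = [dist[i][j] for i in cluster_a for j in cluster_b]
    if linkage == "average":
        return sum(pairs) / len(pairs)
    if linkage == "complete":
        return max(pairs)
    if linkage == "single":
        return min(pairs)
    raise ValueError(f"unknown linkage: {linkage!r}")


# Hypothetical distances between four observations.
dist = [
    [0.0, 0.2, 0.7, 0.9],
    [0.2, 0.0, 0.6, 0.8],
    [0.7, 0.6, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]
for name in ("average", "complete", "single"):
    print(name, linkage_distance(dist, [0, 1], [2, 3], name))
```

For clusters {0, 1} and {2, 3}, the cross-cluster distances are 0.7, 0.9, 0.6, and 0.8, so ‘single’ reports the smallest (0.6), ‘complete’ the largest (0.9), and ‘average’ their mean (about 0.75).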