Dataset Similarity and Clustering – Hierarchical

Category Paths

  • Role-based/Medicinal Chemist

  • Task-based/Data Science/Clustering

  • Solution-based/Virtual-screening/Analysis/Clustering

Description

This Floe clusters the records in a dataset based on pre-generated fingerprints. It calculates pairwise Tanimoto similarities and then applies sklearn agglomerative clustering, also known as hierarchical clustering.

For details, see sklearn docs at https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
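
As a rough illustration of the similarity step, the sketch below computes Tanimoto scores and a dense NxN matrix in plain Python. It assumes fingerprints are represented as sets of on-bit positions. The helper names (tanimoto, similarity_matrix) and the toy fingerprints are hypothetical and are not taken from the Floe's implementation.

```python
from itertools import combinations


def tanimoto(fp_a: frozenset, fp_b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union


def similarity_matrix(fps):
    """Dense, symmetric NxN matrix of pairwise Tanimoto scores."""
    n = len(fps)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0  # every record is identical to itself
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = tanimoto(fps[i], fps[j])
    return sim


# Hypothetical toy fingerprints (sets of on-bit positions).
fingerprints = [
    frozenset({1, 2, 3, 4}),
    frozenset({2, 3, 4, 5}),
    frozenset({10, 11}),
]
matrix = similarity_matrix(fingerprints)
```

The first two toy fingerprints share three of five distinct bits, so their score is 3/5; the third shares none with either and scores 0.0.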

The fingerprint input for this floe can be pre-generated using the following two Floes in this package.

  • Fingerprint Generation

  • Fingerprint Generation (User Defined) Floe

This Floe must cache all records and generate an NxN similarity matrix, so it is not suitable for very large datasets (more than 100K records).

To run this Floe on large datasets (more than 10K records), try increasing the “Memory for TSNE and Clustering Cubes” parameter.
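
To see why dataset size matters, the back-of-the-envelope sketch below estimates the size of a dense NxN matrix. It assumes 8 bytes per value (float64) and no sparsity; how the Floe actually stores the matrix is not stated here, so treat the numbers as an illustration only. The function name dense_matrix_bytes is hypothetical.

```python
def dense_matrix_bytes(n_records: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense NxN matrix (8 bytes per value assumes float64)."""
    return n_records * n_records * bytes_per_value


# Estimated size in GiB for the two dataset sizes mentioned above.
estimates_gib = {n: dense_matrix_bytes(n) / 2**30 for n in (10_000, 100_000)}
```

Under these assumptions, 10K records need roughly 0.75 GiB and 100K records roughly 74.5 GiB, because the matrix grows with the square of the record count.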

If enabled, TSNE (t-distributed Stochastic Neighbor Embedding), a tool for visualizing high-dimensional data, runs after clustering. This step is optional, and the Floe may run faster if it is disabled.

The Floe can generate the following datasets:

  • Members dataset that will contain each molecule from the input dataset with cluster id, cluster size, and cluster member id data fields.

  • Cores dataset that contains one representative from each cluster with the cluster id and cluster size data fields.

  • Singletons dataset that contains the records from clusters with only one member.

When the singletons dataset is written, neither the members nor the cores output dataset contains the singleton records.
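
The relationship between the three outputs can be sketched as follows. This is a minimal illustration, not the Floe's code: the function name split_outputs, the dictionary keys, and the choice of the first record as the cluster representative are all assumptions. It also assumes that singleton clusters stay in the members and cores outputs when no singletons dataset is written.

```python
from collections import defaultdict


def split_outputs(labels, write_singletons=True):
    """Split per-record cluster labels into members, cores, and singletons."""
    clusters = defaultdict(list)
    for record_idx, cluster_id in enumerate(labels):
        clusters[cluster_id].append(record_idx)

    members, cores, singletons = [], [], []
    for cluster_id, idxs in sorted(clusters.items()):
        size = len(idxs)
        if size == 1 and write_singletons:
            # Singletons go to their own output and nowhere else.
            singletons.append({"record": idxs[0], "cluster_id": cluster_id})
            continue
        # Assumption: the first record stands in as the representative.
        cores.append({"record": idxs[0], "cluster_id": cluster_id,
                      "cluster_size": size})
        for member_id, record_idx in enumerate(idxs):
            members.append({"record": record_idx, "cluster_id": cluster_id,
                            "cluster_size": size,
                            "cluster_member_id": member_id})
    return members, cores, singletons


# Hypothetical labels for six records: cluster 2 has a single member.
members, cores, singletons = split_outputs([0, 1, 0, 2, 1, 0])
```

With these toy labels, the singleton record appears only in the singletons list, while the other five records appear in members and two representatives appear in cores.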

Promoted Parameters

Title in user interface (promoted name)

Inputs

Input Dataset (in): Dataset to cluster

  • Required

  • Type: data_source

Outputs

Cluster Members (members): Name of output member dataset, containing all cluster members.

  • Required

  • Type: dataset_out

  • Default: Hierarchical_clustering_members

Cluster Cores (cores): Representative members, one for each cluster

  • Type: dataset_out

  • Default: Hierarchical_clustering_cores

Singletons (singletons): Name of output singletons dataset

  • Type: dataset_out

  • Default: Hierarchical_clustering_singletons

Floe Report Name (cluster_report_name):

  • Required

  • Type: string

  • Default: Hierarchical_clustering_report

Failed Records (failed): Dataset with failed records.

  • Required

  • Type: dataset_out

  • Default: Hierarchical_clustering_failed

Cluster Settings

Minimum Similarity Score Cutoff (Similarity Minimum Cutoff): This cutoff controls the sparsity of the similarity matrix used for clustering. It is the minimum similarity score required to be included in the matrix: scores below this will be set to 0.0 in the matrix.

  • Type: decimal

  • Default: 0.05
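
The effect of the cutoff can be sketched in a few lines of plain Python. The function name apply_cutoff and the toy matrix are hypothetical; the sketch only shows the thresholding rule described above, not how the Floe stores the result.

```python
def apply_cutoff(sim, cutoff=0.05):
    """Return a copy of the matrix with scores below the cutoff set to 0.0."""
    return [[score if score >= cutoff else 0.0 for score in row] for row in sim]


# Hypothetical 3x3 similarity matrix.
sim = [
    [1.00, 0.60, 0.03],
    [0.60, 1.00, 0.04],
    [0.03, 0.04, 1.00],
]
thresholded = apply_cutoff(sim, cutoff=0.05)
nonzero = sum(score != 0.0 for row in thresholded for score in row)
```

Raising the cutoff zeroes out more entries, which makes the matrix sparser.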

Fingerprint Field (fingerprint_field): Tag name for the field that stores fingerprints.

  • Required

  • Type: field_parameter

Use TSNE (tsne_switch): Turn off TSNE to reduce memory usage for very large datasets.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Memory for TSNE and Clustering Cubes (clustering_memory): For large datasets, try increasing the memory limit.

  • Required

  • Type: decimal

  • Default: 2800

Hierarchical Clustering Settings

Number of Clusters (n_clusters): The number of clusters to find.

  • Required

  • Type: integer

  • Default: 10

Hierarchical Clustering Advanced Settings

Compute Full Tree (compute_full_tree): Controls whether construction of the tree stops early at n_clusters (False) or continues until the full tree is built (True). Stopping early decreases computation time when the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None. By default compute_full_tree is “auto”, which is equivalent to True when distance_threshold is not None or when n_clusters is less than the maximum of 100 and 0.02 * n_samples. Otherwise, “auto” is equivalent to False.

  • Required

  • Type: string

  • Default: auto

  • Choices: [‘auto’, ‘True’, ‘False’]
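
The “auto” rule can be written out as a small function. This is a sketch of the rule as described above, with a hypothetical function name (resolve_compute_full_tree). The distance_threshold argument is included only because the description mentions it.

```python
def resolve_compute_full_tree(setting, n_clusters, n_samples,
                              distance_threshold=None):
    """Resolve 'auto', 'True', or 'False' to a boolean."""
    if setting == "auto":
        if distance_threshold is not None:
            return True
        # Full tree only when n_clusters is small relative to the data size.
        return n_clusters < max(100, 0.02 * n_samples)
    return setting in (True, "True")


# With the default of 10 clusters, 'auto' resolves to True for 5,000 records.
default_case = resolve_compute_full_tree("auto", n_clusters=10, n_samples=5_000)
```

For example, with 5,000 records the threshold is max(100, 100) = 100, so asking for 150 clusters resolves “auto” to False, while with 10,000 records the threshold is 200 and the same request resolves to True.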

Linkage (linkage): Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion. ‘average’ uses the average of the distances of each observation of the two sets. ‘complete’ or ‘maximum’ linkage uses the maximum distance between all observations of the two sets. ‘single’ uses the minimum of the distances between all observations of the two sets.

  • Required

  • Type: string

  • Default: average

  • Choices: [‘complete’, ‘average’, ‘single’]
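
To make the three linkage criteria concrete, the sketch below is a deliberately naive agglomerative clustering over a precomputed distance matrix. It is an illustration in plain Python, not sklearn's implementation and not the Floe's code. The function name agglomerative and the toy distances are hypothetical, and how the Floe converts Tanimoto similarities to distances (for example, 1 - similarity) is an assumption that is not confirmed here.

```python
def agglomerative(dist, n_clusters, linkage="average"):
    """Merge the closest pair of clusters until n_clusters remain."""
    combine = {
        "average": lambda ds: sum(ds) / len(ds),  # mean pairwise distance
        "complete": max,                          # largest pairwise distance
        "single": min,                            # smallest pairwise distance
    }[linkage]

    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                score = combine(ds)
                if best is None or score < best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid

    labels = [0] * len(dist)
    for cluster_id, idxs in enumerate(clusters):
        for i in idxs:
            labels[i] = cluster_id
    return labels


# Toy distances from hypothetical 1-D positions: two well-separated groups.
positions = [0.0, 0.1, 0.15, 0.7, 0.9]
dist = [[abs(p - q) for q in positions] for p in positions]
labels = agglomerative(dist, n_clusters=2, linkage="average")
```

On data this cleanly separated, all three criteria produce the same two clusters. They differ mainly on elongated or noisy data, where ‘single’ tends to chain neighboring points together and ‘complete’ favors compact clusters.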