Dataset Similarity and Clustering – Hierarchical
Category Paths
Role-based/Medicinal Chemist
Task-based/Data Science/Clustering
Solution-based/Virtual-screening/Analysis/Clustering
Description
This floe clusters the records of a dataset from pre-generated fingerprints, using Tanimoto similarity and sklearn agglomerative clustering, also known as hierarchical clustering.
For details, see sklearn docs at https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
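As a rough illustration of the similarity measure named above, the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below represents fingerprints as Python sets of on-bit indices, which is an assumption made for illustration; the function names are hypothetical and the floe's own fingerprint type and implementation are not described here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        # Two empty fingerprints: treated as dissimilar here (an assumption).
        return 0.0
    return len(fp_a & fp_b) / union


def similarity_matrix(fingerprints):
    """Build the symmetric NxN Tanimoto matrix as nested lists."""
    n = len(fingerprints)
    matrix = [[1.0] * n for _ in range(n)]  # each record is identical to itself
    for i in range(n):
        for j in range(i + 1, n):
            score = tanimoto(fingerprints[i], fingerprints[j])
            matrix[i][j] = matrix[j][i] = score
    return matrix


# Three toy fingerprints: the first two share bits 3 and 4, the third shares none.
fps = [{1, 2, 3, 4}, {3, 4, 5, 6}, {10, 11}]
sim = similarity_matrix(fps)
```

Here the first two fingerprints share 2 of 6 distinct bits, so their score is 1/3, while the third has no bits in common with either and scores 0.0.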
The fingerprint input for this floe can be pre-generated using either of the following two Floes in this package:
Fingerprint Generation
Fingerprint Generation (User Defined) Floe
This Floe has to cache records and generate an NxN similarity matrix, so it is not suitable for very large datasets (more than 100K records).
To run this floe on large datasets (more than 10K records), try increasing the “Memory for TSNE and Clustering Cubes” parameter.
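A back-of-the-envelope estimate shows why the NxN matrix limits dataset size. The sketch assumes a dense matrix of 8-byte floats; this is an assumption, and the floe's internal storage may differ (for example, it may be sparse). The function name is hypothetical.

```python
def dense_matrix_bytes(n_records, bytes_per_value=8):
    """Bytes needed for a dense n x n matrix (assumed 8-byte floats)."""
    return n_records * n_records * bytes_per_value


for n in (10_000, 100_000):
    gib = dense_matrix_bytes(n) / 1024**3
    print(f"{n:>7} records -> {gib:,.2f} GiB")
```

Under that assumption, 10,000 records need about 0.75 GiB and 100,000 records about 75 GiB for the matrix alone: memory grows with the square of the record count.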
If enabled, TSNE (t-distributed Stochastic Neighbor Embedding) runs after clustering. TSNE is a tool for visualizing high-dimensional data. This step is optional, and the floe may run faster with it disabled.
The Floe can generate the following datasets:
Members dataset that will contain each molecule from the input dataset with cluster id, cluster size, and cluster member id data fields.
Cores dataset that contains one representative from each cluster with the cluster id and cluster size data fields.
Singletons dataset that contains clusters with only one member.
When the singletons dataset is written, neither the members nor the cores output dataset will contain the singleton records.
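The routing of records to the three outputs can be sketched as follows. This is an illustrative sketch, not the floe's code: the function name, the dictionary keys, and the `write_singletons` flag are hypothetical, and picking the first member as the cluster representative is an assumption, since the selection rule for cores is not stated here.

```python
from collections import defaultdict


def split_outputs(labels, write_singletons=True):
    """Route records to (members, cores, singletons) from per-record cluster ids.

    labels[i] is the cluster id assigned to record i.
    """
    clusters = defaultdict(list)
    for record_index, cluster_id in enumerate(labels):
        clusters[cluster_id].append(record_index)

    members, cores, singletons = [], [], []
    for cluster_id, indices in sorted(clusters.items()):
        size = len(indices)
        if size == 1 and write_singletons:
            # Singleton records go only to the singletons output.
            singletons.append(
                {"record": indices[0], "cluster_id": cluster_id, "cluster_size": 1}
            )
            continue
        # Assumption: the first member stands in as the representative.
        cores.append(
            {"record": indices[0], "cluster_id": cluster_id, "cluster_size": size}
        )
        for member_id, record_index in enumerate(indices):
            members.append(
                {
                    "record": record_index,
                    "cluster_id": cluster_id,
                    "cluster_size": size,
                    "cluster_member_id": member_id,
                }
            )
    return members, cores, singletons


# Six records in three clusters; cluster 2 has a single member (record 3).
members, cores, singletons = split_outputs([0, 1, 0, 2, 1, 0])
```

With these labels, record 3 appears only in the singletons list, while the members and cores lists cover clusters 0 and 1.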
Promoted Parameters
Title in user interface (promoted name)
Inputs
Input Dataset (in): Dataset to cluster
Required
Type: data_source
Outputs
Cluster Members (members): Name of output member dataset, containing all cluster members.
Required
Type: dataset_out
Default: Hierarchical_clustering_members
Cluster Cores (cores): Representative members, one for each cluster
Type: dataset_out
Default: Hierarchical_clustering_cores
Singletons (singletons): Name of output singletons dataset
Type: dataset_out
Default: Hierarchical_clustering_singletons
Floe Report Name (cluster_report_name):
Required
Type: string
Default: Hierarchical_clustering_report
Failed Records (failed): Dataset with failed records.
Required
Type: dataset_out
Default: Hierarchical_clustering_failed
Cluster Settings
Minimum Similarity Score Cutoff (Similarity Minimum Cutoff): This cutoff controls the sparsity of the similarity matrix used for clustering. It is the minimum similarity score required to be included in the matrix: scores below this will be set to 0.0 in the matrix.
Type: decimal
Default: 0.05
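The effect of the cutoff can be shown on a small matrix. This is a minimal sketch with a hypothetical function name and nested lists standing in for whatever matrix type the floe uses; scores equal to the cutoff are kept, since only scores below it are zeroed.

```python
def apply_similarity_cutoff(matrix, cutoff=0.05):
    """Return a copy of the matrix with every score below the cutoff set to 0.0."""
    return [[score if score >= cutoff else 0.0 for score in row] for row in matrix]


scores = [
    [1.0, 0.33, 0.02],
    [0.33, 1.0, 0.05],
    [0.02, 0.05, 1.0],
]
sparse = apply_similarity_cutoff(scores)
```

Raising the cutoff zeroes more entries, which makes the matrix sparser at the cost of discarding weak similarities.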
Fingerprint Field (fingerprint_field): Tag name for the field that stores fingerprints.
Required
Type: field_parameter
Use TSNE (tsne_switch): Turn off TSNE to reduce memory usage for very large datasets.
Required
Type: boolean
Default: True
Choices: [True, False]
Memory for TSNE and Clustering Cubes (clustering_memory): For large datasets, try increasing the memory limit.
Required
Type: decimal
Default: 2800
Hierarchical Clustering Settings
Number of Clusters (n_clusters): The number of clusters to find.
Required
Type: integer
Default: 10
Hierarchical Clustering Advanced Settings
Compute Full Tree (compute_full_tree): Whether to stop construction of the tree early at n_clusters. Stopping early decreases computation time when the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None. The default, “auto”, is equivalent to True when distance_threshold is not None or when n_clusters is less than the maximum of 100 and 0.02 * n_samples; otherwise, “auto” is equivalent to False.
Required
Type: string
Default: auto
Choices: [‘auto’, ‘True’, ‘False’]
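The “auto” rule described above can be restated as a small function. This is a sketch of the rule as worded in the description, not sklearn's actual code; the function name is hypothetical, and the string values mirror this parameter's choices.

```python
def resolve_compute_full_tree(setting, n_clusters, n_samples, distance_threshold=None):
    """Resolve 'auto', 'True', or 'False' to a boolean, following the described rule."""
    if setting != "auto":
        return setting == "True"
    if distance_threshold is not None:
        # A distance threshold always requires the full tree.
        return True
    return n_clusters < max(100, 0.02 * n_samples)


# 10 clusters, 1,000 samples: 10 < max(100, 20), so the full tree is built.
small = resolve_compute_full_tree("auto", n_clusters=10, n_samples=1_000)
# 150 clusters, 5,000 samples: 150 is not below max(100, 100), so the tree stops early.
large = resolve_compute_full_tree("auto", n_clusters=150, n_samples=5_000)
```

With the default of 10 clusters, “auto” resolves to True for any dataset size, because 10 is always below the floor of 100.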
Linkage (linkage): Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pair of clusters that minimizes this criterion. ‘average’ uses the average of the distances between each observation of the two sets. ‘complete’ (or ‘maximum’) linkage uses the maximum of the distances between all observations of the two sets. ‘single’ uses the minimum of the distances between all observations of the two sets.
Required
Type: string
Default: average
Choices: [‘complete’, ‘average’, ‘single’]
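To make the three linkage criteria concrete, the sketch below computes each one between two clusters and uses it in a naive merge loop that stops at a requested number of clusters. It is for illustration only: the function names are hypothetical, the distance matrix is hand-written (assumed here to be 1 minus Tanimoto similarity), and sklearn uses a far more efficient algorithm than this brute-force loop.

```python
def linkage_distance(cluster_a, cluster_b, dist, linkage="average"):
    """Distance between two clusters of record indices under a linkage criterion."""
    pairwise = [dist[i][j] for i in cluster_a for j in cluster_b]
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    if linkage == "complete":
        return max(pairwise)
    if linkage == "single":
        return min(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(dist, n_clusters, linkage="average"):
    """Naive agglomerative clustering: merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], dist, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy distances: records 0 and 1 are close, as are records 2 and 3.
dist = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.3],
    [0.8, 0.9, 0.3, 0.0],
]
result = agglomerate(dist, n_clusters=2)
```

On this toy matrix all three criteria produce the same two clusters; they tend to diverge on data with elongated or unevenly sized groups, where ‘single’ favors chaining and ‘complete’ favors compact clusters.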