Dataset Similarity and Clustering – Sphere Exclusion
Category Paths
Role-based/Medicinal Chemist
Task-based/Data Science/Clustering
Solution-based/Virtual-screening/Analysis/Clustering
Description
Cluster datasets based on pre-generated fingerprints using Tanimoto similarity calculation and the sphere exclusion clustering method.
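As a rough illustration of the Tanimoto similarity mentioned above, the sketch below computes it for two fingerprints represented as sets of on-bit indices. The set representation, the function name `tanimoto`, and the empty-fingerprint convention are assumptions made for this example; the Floe's own fingerprint format and implementation are not shown here.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch only
    common = len(a & b)
    # Shared on-bits divided by the number of distinct on-bits in either fingerprint.
    return common / (len(a) + len(b) - common)


fp1 = {1, 4, 7, 9}   # hypothetical fingerprint with 4 on-bits
fp2 = {1, 4, 8}      # hypothetical fingerprint with 3 on-bits
score = tanimoto(fp1, fp2)
print(score)  # 2 shared bits out of 5 distinct bits
```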
The sphere exclusion clustering method is implemented to be deterministic. In each iteration, the molecule with the highest accumulated similarity score is selected to be the head for the next cluster.
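The head-selection rule described above can be sketched as follows. This is a minimal illustration, not the Floe's implementation: the `threshold` argument that decides cluster membership, the choice to accumulate similarity only over still-unassigned molecules, and the lowest-index tie-break are all assumptions made for the example.

```python
def sphere_exclusion(sim, threshold):
    """Cluster indices 0..n-1; the first index in each returned list is the cluster head."""
    unassigned = list(range(len(sim)))
    clusters = []
    while unassigned:
        # Accumulated similarity of each candidate to the other unassigned molecules.
        scores = {i: sum(sim[i][j] for j in unassigned if j != i) for i in unassigned}
        # Highest score wins; ties go to the lowest index (an assumption) for reproducibility.
        head = min(unassigned, key=lambda i: (-scores[i], i))
        # Everything still unassigned and similar enough to the head joins its cluster.
        members = [j for j in unassigned if j != head and sim[head][j] >= threshold]
        clusters.append([head] + members)
        taken = {head, *members}
        unassigned = [i for i in unassigned if i not in taken]
    return clusters


# Hypothetical 4-molecule similarity matrix; molecule 1 is the most connected.
sim = [
    [1.0, 0.8, 0.3, 0.1],
    [0.8, 1.0, 0.7, 0.1],
    [0.3, 0.7, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
clusters = sphere_exclusion(sim, threshold=0.5)
print(clusters)  # [[1, 0, 2], [3]]
```

Because every choice depends only on the matrix and a fixed tie-break, repeated runs on the same input give the same clusters.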
The fingerprint input for this floe can be pre-generated using the following two Floes in this package.
Fingerprint Generation
Fingerprint Generation (User Defined) Floe
This Floe has to cache records and generate an NxN similarity matrix, so it is not suitable for very large datasets (more than 100K records).
To run this Floe on large datasets (more than 10K records), try increasing the “Memory for TSNE and Clustering Cubes” parameter.
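To see why dataset size matters, the back-of-the-envelope sketch below estimates the size of a dense NxN matrix. It assumes 8 bytes per score and dense storage, neither of which is stated for this Floe; the actual memory use is likely different, since scores below the similarity cutoff are set to 0.0 and the matrix may be stored sparsely.

```python
def dense_matrix_bytes(n_records, bytes_per_score=8):
    """Bytes needed for a dense n x n matrix (assumed 8-byte scores)."""
    return n_records * n_records * bytes_per_score


for n in (10_000, 100_000):
    gigabytes = dense_matrix_bytes(n) / 1e9
    print(f"{n:>7} records: {gigabytes:.1f} GB")
# 10,000 records gives 0.8 GB; 100,000 records gives 80.0 GB
```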
If enabled, TSNE (t-distributed Stochastic Neighbor Embedding) runs after clustering. TSNE is a tool for visualizing high-dimensional data. This step is optional, and the Floe may run faster if it is disabled.
The Floe can generate the following datasets:
Members dataset that will contain each molecule from the input dataset with cluster id, cluster size, and cluster member id data fields.
Cores dataset that contains one representative from each cluster with the cluster id and cluster size data fields.
Singletons dataset that contains clusters with only one member.
When the singletons dataset is written, neither the members nor the cores output dataset contains the singleton records.
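The way clusters are routed to the three datasets can be sketched as below. The dictionary keys, the `write_singletons` flag, and the choice of the first molecule in each cluster as its representative are hypothetical names and assumptions for illustration; the Floe's actual field names and record layout may differ.

```python
def split_outputs(clusters, write_singletons=True):
    """Route clusters (lists of molecule names, representative first) to three outputs."""
    members, cores, singletons = [], [], []
    for cluster_id, cluster in enumerate(clusters):
        size = len(cluster)
        if size == 1 and write_singletons:
            # Singletons go only to their own dataset, not to members or cores.
            singletons.append({"mol": cluster[0], "cluster_id": cluster_id, "cluster_size": 1})
            continue
        # One representative per cluster (assumed here to be the first entry).
        cores.append({"mol": cluster[0], "cluster_id": cluster_id, "cluster_size": size})
        for member_id, mol in enumerate(cluster):
            members.append({
                "mol": mol,
                "cluster_id": cluster_id,
                "cluster_size": size,
                "cluster_member_id": member_id,
            })
    return members, cores, singletons


members, cores, singletons = split_outputs([["A", "B", "C"], ["D"]])
print([m["mol"] for m in members])     # ['A', 'B', 'C']
print([c["mol"] for c in cores])       # ['A']
print([s["mol"] for s in singletons])  # ['D']
```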
Promoted Parameters
Title in user interface (promoted name)
Inputs
Input Dataset (in): Dataset to cluster
Required
Type: data_source
Outputs
Cluster Members (members): Name of output member dataset, containing all cluster members.
Required
Type: dataset_out
Default: Sphere_exclusion_clustering_members
Cluster Cores (cores): Representative members, one for each cluster
Type: dataset_out
Default: Sphere_exclusion_clustering_cores
Singletons (singletons): Name of output singletons dataset
Type: dataset_out
Default: Sphere_exclusion_clustering_singletons
Floe Report Name (cluster_report_name):
Required
Type: string
Default: Sphere_exclusion_clustering_report
Failed Records (failed): Dataset with failed records.
Required
Type: dataset_out
Default: Sphere_exclusion_clustering_failed
Cluster Settings
Minimum Similarity Score Cutoff (Similarity Minimum Cutoff): This cutoff controls the sparsity of the similarity matrix used for clustering. It is the minimum similarity score required to be included in the matrix: scores below this will be set to 0.0 in the matrix.
Type: decimal
Default: 0.05
Fingerprint Field (fingerprint_field): Tag name for the field that stores fingerprints.
Required
Type: field_parameter
Use TSNE (tsne_switch): Turn off TSNE to reduce memory usage for very large datasets.
Required
Type: boolean
Default: True
Choices: [True, False]
Memory for TSNE and Clustering Cubes (clustering_memory): For large datasets, try increasing the memory limit.
Required
Type: decimal
Default: 2800
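The effect of the Minimum Similarity Score Cutoff parameter on the similarity matrix can be sketched as follows. The function name `apply_cutoff` and the nested-list matrix are assumptions for this example; only the rule itself (scores below the cutoff become 0.0, default cutoff 0.05) comes from the parameter description.

```python
def apply_cutoff(sim, cutoff=0.05):
    """Return a copy of the matrix with scores below the cutoff set to 0.0."""
    return [[score if score >= cutoff else 0.0 for score in row] for row in sim]


# Hypothetical 3-molecule similarity matrix.
sim = [
    [1.0, 0.04, 0.30],
    [0.04, 1.0, 0.05],
    [0.30, 0.05, 1.0],
]
sparse = apply_cutoff(sim)
print(sparse)  # the two 0.04 entries become 0.0; 0.05 is kept because it is not below the cutoff
```

A higher cutoff zeroes more entries, which makes the matrix sparser at the cost of discarding weak similarities.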