Dataset Similarity and Clustering – Sphere Exclusion

Category Paths

  • Role-based/Medicinal Chemist

  • Task-based/Data Science/Clustering

  • Solution-based/Virtual-screening/Analysis/Clustering

Description

Cluster datasets based on pre-generated fingerprints using Tanimoto similarity calculation and the sphere exclusion clustering method.
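As a rough illustration of the similarity measure named above, the following minimal sketch computes the Tanimoto score for two fingerprints. It assumes each fingerprint is represented as a Python set of on-bit indices; the function name `tanimoto` and the handling of two empty fingerprints are illustrative assumptions, not the Floe's actual implementation.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity: shared on-bits divided by distinct on-bits."""
    if not fp_a and not fp_b:
        # Assumption: two empty fingerprints are treated as dissimilar.
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical fingerprints, given as sets of on-bit positions.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
print(tanimoto(fp1, fp2))  # 2 shared bits out of 5 distinct bits
```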

The sphere exclusion clustering method is implemented to be deterministic. In each iteration, the molecule with the highest accumulated similarity score is selected to be the head for the next cluster.
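The head-selection rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Floe's code: the function name `sphere_exclusion`, the `threshold` membership parameter, and the lowest-index tie-break are all hypothetical choices made to keep the example deterministic.

```python
def sphere_exclusion(sim, threshold):
    """Cluster items from a symmetric similarity matrix (list of lists).

    `threshold` is an assumed membership cutoff: an unassigned item joins
    the current cluster if its similarity to the head is at least this value.
    """
    unassigned = set(range(len(sim)))
    clusters = []
    while unassigned:
        # Accumulated similarity of each candidate to the other unassigned items.
        scores = {
            i: sum(sim[i][j] for j in unassigned if j != i)
            for i in unassigned
        }
        # Highest score wins; ties go to the lowest index (assumed tie-break).
        head = min(unassigned, key=lambda i: (-scores[i], i))
        members = [head] + sorted(
            j for j in unassigned if j != head and sim[head][j] >= threshold
        )
        clusters.append(members)
        unassigned -= set(members)
    return clusters


# Hypothetical 5x5 similarity matrix.
sim = [
    [1.0, 0.8, 0.7, 0.1, 0.0],
    [0.8, 1.0, 0.5, 0.2, 0.0],
    [0.7, 0.5, 1.0, 0.1, 0.0],
    [0.1, 0.2, 0.1, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
print(sphere_exclusion(sim, threshold=0.5))
```

In this example, item 0 has the highest accumulated similarity and becomes the first head. Each cluster lists its head first, so the heads correspond to what the Floe calls cores, and one-member clusters correspond to singletons.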

The fingerprint input for this Floe can be pre-generated using either of the following two Floes in this package:

  • Fingerprint Generation

  • Fingerprint Generation (User Defined)

This Floe must cache records and generate an N×N similarity matrix; it is therefore not suitable for very large datasets (more than 100K records).

To run this Floe on large datasets (more than 10K records), try increasing the “Memory for TSNE and Clustering Cubes” parameter.
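To see why an N×N matrix limits dataset size, the sketch below estimates the memory a fully dense matrix would need. It assumes 8 bytes per score and no sparsity, so it is an upper bound only; the Floe's actual storage format is not documented here, and the function name `dense_matrix_gib` is hypothetical.

```python
def dense_matrix_gib(n_records: int, bytes_per_score: int = 8) -> float:
    """Memory in GiB for a dense n x n matrix (assumed 8-byte scores)."""
    return n_records ** 2 * bytes_per_score / 2 ** 30


for n in (10_000, 100_000):
    print(f"{n:>7} records: {dense_matrix_gib(n):.1f} GiB")
```

Because the cost grows with the square of the record count, a tenfold increase in records requires roughly a hundredfold increase in matrix memory.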

If enabled, TSNE (t-distributed Stochastic Neighbor Embedding), a tool for visualizing high-dimensional data, runs after clustering. This step is optional, and the Floe may run faster if it is disabled.

The Floe can generate the following datasets:

  • Members dataset that contains each molecule from the input dataset with cluster id, cluster size, and cluster member id data fields.

  • Cores dataset that contains one representative from each cluster with the cluster id and cluster size data fields.

  • Singletons dataset that contains clusters with only one member.

When the singletons dataset is written, neither the members nor the cores output dataset contains the singleton records.

Promoted Parameters

Title in user interface (promoted name)

Inputs

Input Dataset (in): Dataset to cluster

  • Required

  • Type: data_source

Outputs

Cluster Members (members): Name of output member dataset, containing all cluster members.

  • Required

  • Type: dataset_out

  • Default: Sphere_exclusion_clustering_members

Cluster Cores (cores): Representative members, one for each cluster

  • Type: dataset_out

  • Default: Sphere_exclusion_clustering_cores

Singletons (singletons): Name of output singletons dataset

  • Type: dataset_out

  • Default: Sphere_exclusion_clustering_singletons

Floe Report Name (cluster_report_name):

  • Required

  • Type: string

  • Default: Sphere_exclusion_clustering_report

Failed Records (failed): Dataset with failed records.

  • Required

  • Type: dataset_out

  • Default: Sphere_exclusion_clustering_failed

Cluster Settings

Minimum Similarity Score Cutoff (Similarity Minimum Cutoff): This cutoff controls the sparsity of the similarity matrix used for clustering. It is the minimum similarity score required to be included in the matrix: scores below this will be set to 0.0 in the matrix.

  • Type: decimal

  • Default: 0.05
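The effect of this cutoff can be sketched as follows: scores below the cutoff are replaced with 0.0, which makes the matrix sparser. The function name `apply_cutoff` and the list-of-lists matrix are illustrative assumptions, not the Floe's internal representation.

```python
def apply_cutoff(sim, cutoff=0.05):
    """Return a copy of the matrix with scores below `cutoff` set to 0.0."""
    return [[s if s >= cutoff else 0.0 for s in row] for row in sim]


# Hypothetical 3x3 similarity matrix.
scores = [
    [1.00, 0.03, 0.40],
    [0.03, 1.00, 0.05],
    [0.40, 0.05, 1.00],
]
sparse = apply_cutoff(scores)
zeroed = sum(s == 0.0 for row in sparse for s in row)
print(sparse)
print(f"{zeroed} of 9 entries zeroed")
```

A higher cutoff zeroes more entries, trading some similarity detail for a sparser matrix.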

Fingerprint Field (fingerprint_field): Tag name for the field that stores fingerprints.

  • Required

  • Type: field_parameter

Use TSNE (tsne_switch): Turn off TSNE to reduce memory usage for very large datasets.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Memory for TSNE and Clustering Cubes (clustering_memory): For large datasets, try increasing the memory limit.

  • Required

  • Type: decimal

  • Default: 2800