Dataset Similarity and Clustering – DBSCAN

Category Paths

  • Role-based/Medicinal Chemist

  • Task-based/Data Science/Clustering

  • Solution-based/Virtual-screening/Analysis/Clustering

Description

This Floe clusters a dataset based on pre-generated fingerprints, using Tanimoto similarity and scikit-learn (sklearn) DBSCAN clustering.

The fingerprint input for this Floe can be pre-generated using either of the following two Floes in this package:

  • Fingerprint Generation

  • Fingerprint Generation (User Defined) Floe

This Floe must cache all records and generate an NxN similarity matrix, so it is not suitable for very large datasets (more than 100K records).

To run this Floe on large datasets (more than 10K records), try increasing the “Memory for TSNE and Clustering Cubes” parameter.
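The scaling concern above can be made concrete with a rough back-of-the-envelope calculation. This is a minimal sketch assuming a dense matrix of 8-byte floats. The Floe's actual storage format is not documented here, and the similarity cutoff described under Cluster Settings suggests the real matrix is sparser. The function name `dense_matrix_gib` is hypothetical.

```python
def dense_matrix_gib(n_records: int, bytes_per_value: int = 8) -> float:
    """Approximate size in GiB of a dense n x n matrix (assumed float64)."""
    return n_records * n_records * bytes_per_value / 2**30


# Compare the two dataset sizes mentioned in the description.
for n in (10_000, 100_000):
    print(f"{n:>7} records: {dense_matrix_gib(n):8.2f} GiB")
```

Under these assumptions, 10K records need roughly 0.75 GiB and 100K records roughly 75 GiB, a 100-fold increase for a 10-fold larger dataset.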

If enabled, TSNE (t-distributed Stochastic Neighbor Embedding), a tool for visualizing high-dimensional data, runs after clustering. This step is optional, and the Floe may run faster if it is disabled.
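As an illustration of what a TSNE embedding step can look like, the sketch below projects a precomputed Tanimoto distance matrix to two dimensions with scikit-learn. It is not the Floe's implementation: the random bit vectors stand in for real fingerprints, and feeding TSNE a precomputed distance (1 - similarity), the perplexity value, and the random seed are all assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical stand-in for fingerprints: 30 random 64-bit vectors.
fps = rng.integers(0, 2, size=(30, 64)).astype(bool)

# Pairwise Tanimoto similarity: |A and B| / |A or B|.
inter = (fps[:, None, :] & fps[None, :, :]).sum(axis=2)
union = (fps[:, None, :] | fps[None, :, :]).sum(axis=2)
sim = inter / np.maximum(union, 1)

# Assumed conversion from similarity to distance.
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)

# init="random" is required when metric="precomputed".
tsne = TSNE(n_components=2, metric="precomputed", init="random",
            perplexity=5, random_state=0)
coords = tsne.fit_transform(dist)
print(coords.shape)  # (30, 2)
```

Each row of `coords` is a 2D point that can be plotted and colored by cluster id.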

The Floe can generate the following datasets:

  • Members dataset that will contain each molecule from the input dataset with cluster id, cluster size, and cluster member id data fields.

  • Cores dataset that contains one representative from each cluster with the cluster id and cluster size data fields.

  • Singletons dataset that contains clusters with only one member.

When the singletons dataset is written, neither the members nor the cores output dataset contains the singleton records.
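The relationship between the three output datasets can be sketched as follows. This is an illustration only: the function name `split_outputs`, the dictionary keys, and the choice of the first record as the cluster representative are assumptions, and the sketch presumes every record already carries a cluster id.

```python
from collections import defaultdict


def split_outputs(labels, write_singletons=True):
    """Partition record indices into members, cores, and singletons.

    labels[i] is the cluster id of record i. Assumption: the first
    record seen in a cluster serves as its representative.
    """
    clusters = defaultdict(list)
    for idx, cid in enumerate(labels):
        clusters[cid].append(idx)

    members, cores, singletons = [], [], []
    for cid, idxs in clusters.items():
        size = len(idxs)
        if size == 1 and write_singletons:
            # Singletons are kept out of both members and cores.
            singletons.append(
                {"record": idxs[0], "cluster_id": cid, "cluster_size": 1})
            continue
        cores.append(
            {"record": idxs[0], "cluster_id": cid, "cluster_size": size})
        for member_id, idx in enumerate(idxs):
            members.append({"record": idx, "cluster_id": cid,
                            "cluster_size": size,
                            "cluster_member_id": member_id})
    return members, cores, singletons


members, cores, singletons = split_outputs([0, 0, 1, 2, 2, 2])
print(len(members), len(cores), len(singletons))
```

With `write_singletons=False`, the single-member cluster stays in the members and cores lists instead.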

Promoted Parameters

Title in user interface (promoted name)

Inputs

Input Dataset (in): Dataset to cluster

  • Required

  • Type: data_source

Outputs

Cluster Members (members): Name of output member dataset, containing all cluster members.

  • Required

  • Type: dataset_out

  • Default: DBScan_clustering_members

Cluster Cores (cores): Representative members, one for each cluster

  • Type: dataset_out

  • Default: DBScan_clustering_cores

Singletons (singletons): Name of output singletons dataset

  • Type: dataset_out

  • Default: DBScan_clustering_singletons

Floe Report Name (cluster_report_name):

  • Required

  • Type: string

  • Default: DBScan_clustering_report

Failed Records (failed): Dataset with failed records.

  • Required

  • Type: dataset_out

  • Default: DBScan_clustering_failed

Cluster Settings

Minimum Similarity Score Cutoff (Similarity Minimum Cutoff): This cutoff controls the sparsity of the similarity matrix used for clustering. It is the minimum similarity score required to be included in the matrix; scores below this will be set to 0.0 in the matrix.

  • Type: decimal

  • Default: 0.05
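The effect of the cutoff can be illustrated with a small pure-Python sketch. Fingerprints are represented here as sets of on-bit indices, and the helper names `tanimoto` and `sparse_similarity` are hypothetical. The Floe's own fingerprint and matrix representations are not described in this document.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sparse_similarity(fps, cutoff=0.05):
    """Return {(i, j): score} for i < j, keeping only scores >= cutoff.

    Pairs that are absent from the result are treated as 0.0.
    """
    kept = {}
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            score = tanimoto(fps[i], fps[j])
            if score >= cutoff:
                kept[(i, j)] = score
    return kept


# Four toy fingerprints; the third shares no bits with the others.
fps = [{1, 2, 3, 4}, {2, 3, 4, 5}, {40, 41}, {1, 2, 3, 4}]
sim = sparse_similarity(fps, cutoff=0.05)
print(sim)
```

Raising the cutoff drops more pairs from the matrix, which reduces memory use at the cost of discarding weak similarities.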

Fingerprint Field (fingerprint_field): Tag name for the field that stores fingerprints.

  • Required

  • Type: field_parameter

Use TSNE (tsne_switch): Turn off TSNE to reduce memory usage for very large datasets.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Memory for TSNE and Clustering Cubes (clustering_memory): For large datasets, try increasing the memory limit.

  • Required

  • Type: decimal

  • Default: 2800

DBSCAN Cluster Settings

Epsilon (eps) [OPTIONAL] (eps): If not provided, an eps value will be estimated based on the input data. The epsilon value controls DBSCAN clustering. This is the maximum DISTANCE between core cluster members and the maximum single-linkage distance for non-core cluster members. Increase eps to cluster more molecules together and decrease eps to cluster fewer molecules together.

  • Type: decimal
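The influence of eps can be seen in a small scikit-learn example. This is a sketch under stated assumptions, not the Floe's code: the similarity values are invented, distance is taken as 1 - Tanimoto similarity, and `min_samples=2` is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical similarities for 6 molecules: a group of three (0, 1, 2),
# a pair (3, 4), and one outlier (5).
n = 6
sim = np.full((n, n), 0.1)
sim[5, :] = sim[:, 5] = 0.05
for i, j, s in [(0, 1, 0.8), (0, 2, 0.7), (1, 2, 0.75), (3, 4, 0.9)]:
    sim[i, j] = sim[j, i] = s
np.fill_diagonal(sim, 1.0)

# Assumed conversion from similarity to distance.
dist = 1.0 - sim

# Small eps: only close neighbors are joined.
tight = DBSCAN(eps=0.35, min_samples=2, metric="precomputed").fit_predict(dist)
# Large eps: the two groups merge into one cluster.
loose = DBSCAN(eps=0.92, min_samples=2, metric="precomputed").fit_predict(dist)

print(tight.tolist())  # [0, 0, 0, 1, 1, -1]
print(loose.tolist())  # [0, 0, 0, 0, 0, -1]
```

In scikit-learn, a label of -1 marks a point that belongs to no cluster. Molecule 5 stays unclustered in both runs because its distance to every other molecule exceeds eps.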

Minimum Largest Cluster Percentage [OPTIONAL] (min_largest_cluster_percentage): If an eps value is not provided, this will control DBSCAN clustering by setting the minimum percentage of molecules in the largest allowed cluster.

  • Type: decimal

  • Default: 10

Maximum Largest Cluster Percentage [OPTIONAL] (max_largest_cluster_percentage): If an eps value is not provided, this will control DBSCAN clustering by setting the maximum percentage of molecules in the largest allowed cluster.

  • Type: decimal

  • Default: 80
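This document does not describe how the Floe estimates eps. One plausible way to use the two percentage bounds is to scan candidate eps values and accept the first one whose largest cluster falls inside the allowed window. The sketch below is purely illustrative: the function names, the candidate grid, `min_samples=2`, and the scan order are all assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def largest_cluster_pct(labels):
    """Percentage of all records that fall in the largest cluster."""
    clustered = labels[labels >= 0]  # drop unclustered points (-1)
    if clustered.size == 0:
        return 0.0
    return 100.0 * np.bincount(clustered).max() / labels.size


def estimate_eps(dist, min_pct=10.0, max_pct=80.0,
                 candidates=(0.05, 0.15, 0.35, 0.6, 0.95), min_samples=2):
    """Return the first candidate eps whose largest cluster lies within
    [min_pct, max_pct], or None if no candidate qualifies."""
    for eps in candidates:
        labels = DBSCAN(eps=eps, min_samples=min_samples,
                        metric="precomputed").fit_predict(dist)
        if min_pct <= largest_cluster_pct(labels) <= max_pct:
            return eps
    return None


# Hypothetical similarities: a group of three, a pair, and one outlier.
n = 6
sim = np.full((n, n), 0.1)
sim[5, :] = sim[:, 5] = 0.02
for i, j, s in [(0, 1, 0.8), (0, 2, 0.7), (1, 2, 0.75), (3, 4, 0.9)]:
    sim[i, j] = sim[j, i] = s
np.fill_diagonal(sim, 1.0)
dist = 1.0 - sim  # assumed conversion from similarity to distance

print(estimate_eps(dist))              # default window of 10 to 80 percent
print(estimate_eps(dist, min_pct=40))  # demands a larger top cluster
```

Raising the minimum percentage pushes the scan toward a larger eps, while lowering the maximum rejects eps values that merge most molecules into one cluster.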