DBSCAN 3D Similarity Clustering

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

Role-based/Medicinal Chemist

Task-based/Data Science/Clustering

Solution-based/Virtual-screening/Analysis/Clustering

Description

This floe clusters datasets using OEShape 3D similarity calculations and sklearn DBSCAN clustering.

Note that if the advanced parameter min_samples is set to a value other than 1, some input records may be treated as noise and left out of the clustering, so the members dataset may contain fewer records than the input dataset.
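The effect of min_samples can be illustrated with scikit-learn directly. The following is a minimal sketch on a made-up four-record distance matrix, not the floe's own code; the matrix values and the helper name cluster_labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy precomputed distance matrix (hypothetical values):
# records 0-2 are close to each other, record 3 is far from everything.
dist = np.array([
    [0.0, 0.1, 0.1, 0.9],
    [0.1, 0.0, 0.1, 0.9],
    [0.1, 0.1, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
])

def cluster_labels(distances, eps, min_samples):
    """Run DBSCAN on a precomputed distance matrix and return the labels."""
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed")
    return model.fit(distances).labels_

# With min_samples=2 the isolated record has no neighbor within eps,
# so scikit-learn labels it -1 (noise).
labels_strict = cluster_labels(dist, eps=0.25, min_samples=2)

# With min_samples=1 every record is a core point, so the isolated
# record becomes its own one-member cluster instead of noise.
labels_loose = cluster_labels(dist, eps=0.25, min_samples=1)

print(labels_strict.tolist())
print(labels_loose.tolist())
```

In scikit-learn, a record counts itself toward min_samples, which is why a value of 1 leaves no record labeled as noise.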

This Floe uses OEShape for 3D similarity calculation. If (and only if) the primary molecule field on the input dataset does not have 3D coordinates, the floe will use OMEGA to generate a single conformer for each record in the primary molecule field and use that conformer for the similarity calculations.

This Floe has to cache all records and generate an NxN similarity matrix, so it is expensive. It is not recommended for input datasets with more than 20K records, and it will fail automatically for input datasets with more than 30K records.
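To give a feel for why the NxN matrix becomes expensive, here is a rough back-of-the-envelope sketch. It assumes a dense matrix of 8-byte floating-point values, which is an assumption for illustration; how the floe actually stores the matrix is not documented here.

```python
def dense_matrix_gigabytes(n_records, bytes_per_value=8):
    """Size in GB (10**9 bytes) of a dense n x n matrix.

    bytes_per_value=8 assumes 64-bit floats; this is an illustrative
    assumption, not a statement about the floe's internal storage.
    """
    return n_records * n_records * bytes_per_value / 1e9

for n in (5_000, 20_000, 30_000):
    print(f"{n} records -> {dense_matrix_gigabytes(n):.1f} GB")
```

Because the size grows with the square of the record count, going from 20K to 30K records more than doubles the matrix.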

The Floe can generate the following datasets:

A members dataset that will contain each molecule from the input dataset with its corresponding cluster ID.
A cores dataset that contains one representative from each cluster.
A singletons dataset that contains any clusters with only one member.
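The relationship between the three datasets can be sketched from a list of cluster labels. This is a hypothetical illustration only: the function name split_outputs is invented, and picking the first record of each cluster as its representative is an assumption, since the floe's actual rule for choosing cores is not described here.

```python
from collections import defaultdict

def split_outputs(labels):
    """Group record indices by cluster label and derive the three outputs.

    labels: one cluster ID per input record, with -1 marking noise
    (the scikit-learn DBSCAN convention).
    """
    clusters = defaultdict(list)
    for record_index, label in enumerate(labels):
        if label >= 0:  # skip noise records
            clusters[label].append(record_index)

    ordered = sorted(clusters.items())
    # Members: every clustered record paired with its cluster ID.
    members = [(i, label) for label, idxs in ordered for i in idxs]
    # Cores: one representative per cluster (ASSUMPTION: the first record).
    cores = {label: idxs[0] for label, idxs in ordered}
    # Singletons: records whose cluster has exactly one member.
    singletons = [idxs[0] for label, idxs in ordered if len(idxs) == 1]
    return members, cores, singletons

members, cores, singletons = split_outputs([0, 0, 1, -1, 0])
print(members)
print(cores)
print(singletons)
```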

Promoted Parameters

Title in user interface (promoted name)

3D Similarity Calculation

3D Similarity Score Function (score_type):

Type: string

Default: Tanimoto Combo

Choices: [‘Shape Tanimoto’, ‘Color Tanimoto’, ‘Tanimoto Combo’]

Align Molecules (use_align): If set to True, molecules will be aligned before similarity calculation; otherwise, they will retain input coordinates.

Type: boolean

Default: True

Choices: [True, False]

Similarity Score Cutoff (sim_cutoff): Similarity scores below this value will be treated as 0.

Type: decimal

Default: 0.05
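The cutoff behavior described for sim_cutoff can be sketched as a simple threshold. This is a minimal illustration with NumPy, not the floe's implementation; the function name apply_similarity_cutoff is hypothetical, and whether the floe applies the cutoff to raw or normalized scores is not stated here.

```python
import numpy as np

def apply_similarity_cutoff(similarities, cutoff=0.05):
    """Zero out similarity scores strictly below the cutoff."""
    sims = np.asarray(similarities, dtype=float)
    return np.where(sims < cutoff, 0.0, sims)

# A score equal to the cutoff is kept; only scores below it become 0.
print(apply_similarity_cutoff([0.02, 0.05, 0.6]).tolist())
```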

Outputs

Cluster Cores (cores): Representative members, one for each cluster

Type: dataset_out

Default: 3D_DBSCAN_cluster_cores

Failed Records (failed): Dataset with failed records.

Required

Type: dataset_out

Default: 3D_DBSCAN_cluster_failed

Cluster Members (members): Name of output member dataset, containing all cluster members.

Required

Type: dataset_out

Default: 3D_DBSCAN_cluster_members

Floe Report Name (cluster_report_name):

Type: string

Default: 3D_DBSCAN_cluster_report

Singletons (singletons): Name of output singletons dataset, containing clusters with only one member.

Type: dataset_out

Default: 3D_DBSCAN_cluster_singletons

Inputs

Input Dataset (in): Dataset to cluster

Required

Type: data_source

Advanced: Sort Input Dataset By Score

Use Score (use_rank): Use rank to sort hits in directed sphere exclusion algorithm and include rank information in clustering report. YOU MUST SELECT A SCORE FIELD IN THE SCORE FIELD PARAMETER, BELOW, IF THIS IS SET TO TRUE.

Type: boolean

Default: False

Choices: [True, False]

Score Field (rank_field): Score field to be used for sorting during sphere exclusion and analyzed in floe report.

Type: field_parameter

Score Sort Order (sort_order): Sort order for scores. Descending means higher scores are more desirable. Ascending means lower scores are more desirable.

Type: string

Default: Descending

Choices: [‘Descending’, ‘Ascending’]

Advanced: Memory, Optimization, and Matrix File Output or Input

Output Distance Matrix (output_similarity_matrix): If True, writes the distance matrix used in clustering to a file.

Type: boolean

Default: False

Choices: [True, False]

Distance Matrix Filename (similarity_matrix_filename): Filename of distance matrix.

Type: string

Default: clustering_distance_matrix.txt

Memory for Clustering Cubes (clustering_memory): For large datasets, try increasing the memory limit.

Type: decimal

Default: 8000

Similarity Tile Size (tile_size): Advanced parameter for tuning parallel speed.

Type: integer

Default: 5

DBSCAN

Epsilon (eps) [OPTIONAL] (eps): If not provided, an eps value will be estimated based on the input data. The epsilon value controls DBSCAN clustering. It is the maximum DISTANCE between core cluster members and the maximum single-linkage distance for non-core cluster members. Increase eps to group more molecules into fewer clusters, and decrease eps to group fewer molecules into more clusters. Scores are normalized to the range 0 to 1 before they are converted to distances; for example, a TanimotoCombo similarity score of 1.5 is normalized to a score of 0.75, which corresponds to an eps/distance of 0.25.

Type: decimal
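The normalization in the eps description can be worked through in a few lines. This sketch assumes that Tanimoto Combo is divided by its maximum of 2 and that distance is one minus the normalized score, which matches the 1.5 to 0.75 to 0.25 example; the names MAX_SCORE and similarity_to_distance are hypothetical.

```python
# Maximum raw score for each score_type choice. Tanimoto Combo is the sum
# of shape and color Tanimoto, so it ranges from 0 to 2.
MAX_SCORE = {
    "Shape Tanimoto": 1.0,
    "Color Tanimoto": 1.0,
    "Tanimoto Combo": 2.0,
}

def similarity_to_distance(score, score_type="Tanimoto Combo"):
    """Normalize a raw similarity score to 0-1 and convert it to a distance."""
    normalized = score / MAX_SCORE[score_type]
    return 1.0 - normalized

# Tanimoto Combo 1.5 -> normalized 0.75 -> distance 0.25
print(similarity_to_distance(1.5))
```

To cluster together molecules with a Tanimoto Combo of at least 1.5, an eps of about 0.25 would be the corresponding setting under these assumptions.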

Minimum Largest Cluster Percentage [OPTIONAL] (min_largest_cluster_percentage): If an eps value is not provided, this will control DBSCAN clustering by setting the minimum percentage of molecules in the largest allowed cluster.

Type: decimal

Default: 1.0

Maximum Largest Cluster Percentage [OPTIONAL] (max_largest_cluster_percentage): If an eps value is not provided, this will control DBSCAN clustering by setting the maximum percentage of molecules in the largest allowed cluster.

Type: decimal

Default: 90.0
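One way to picture how the two percentage bounds could steer an automatic eps choice is a simple scan over candidate values. The sketch below is purely hypothetical: the floe's actual estimation procedure is not documented here, and the function names, candidate list, and toy distance matrix are all assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def largest_cluster_percentage(labels):
    """Percentage of all records that fall in the largest cluster."""
    clustered = labels[labels >= 0]  # drop noise (-1)
    if clustered.size == 0:
        return 0.0
    return 100.0 * np.bincount(clustered).max() / labels.size

def pick_eps(dist, candidates, min_pct=1.0, max_pct=90.0, min_samples=1):
    """Return the largest candidate eps whose largest cluster is within bounds.

    Hypothetical strategy for illustration; returns None if no candidate fits.
    """
    for eps in sorted(candidates, reverse=True):
        model = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed")
        labels = model.fit(dist).labels_
        if min_pct <= largest_cluster_percentage(labels) <= max_pct:
            return eps
    return None

# Toy data: two tight pairs of records that are far from each other.
dist = np.array([
    [0.0, 0.1, 0.6, 0.6],
    [0.1, 0.0, 0.6, 0.6],
    [0.6, 0.6, 0.0, 0.1],
    [0.6, 0.6, 0.1, 0.0],
])
chosen = pick_eps(dist, candidates=[0.05, 0.3, 0.8])
print(chosen)
```

Here eps=0.8 merges everything into one cluster (100% of records, above the 90% maximum), so the scan falls back to 0.3, where the largest cluster holds 50% of the records.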