Large Scale 2D Similarity Clustering

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

Description

This Floe clusters datasets using 2D similarity calculations and directed sphere exclusion.
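The general idea of sphere exclusion can be sketched in a few lines of Python. This is a minimal illustration, not the floe's implementation: fingerprints are modeled as sets of "on" bits, the helper names `tanimoto` and `sphere_exclusion` are hypothetical, and it assumes that "directed" means records are visited in a fixed order (for example, a score ranking) and that each record joins the first head within the radius.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def sphere_exclusion(fingerprints, radius):
    """Greedy sphere exclusion over fingerprints in the order given.

    Returns a dict mapping each cluster head index to its member indices.
    """
    clusters = {}
    for idx, fp in enumerate(fingerprints):
        for head in clusters:
            # Distance is 1 - similarity; join the first head within the radius.
            if 1.0 - tanimoto(fingerprints[head], fp) <= radius:
                clusters[head].append(idx)
                break
        else:
            # No existing head is close enough, so this record becomes a new head.
            clusters[idx] = [idx]
    return clusters


# Toy input (assumed data): five fingerprints as bit sets.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}, {7, 8, 10}, {20, 21}]
clusters = sphere_exclusion(fps, radius=0.5)

# Heads whose cluster contains only themselves are singletons.
singletons = [head for head, members in clusters.items() if len(members) == 1]
```

In this toy run the heads play the role of the Cluster Cores output, the member lists correspond to Cluster Members, and the one-member clusters correspond to Singletons.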

Promoted Parameters

Title in user interface (promoted name)

Advanced: Cube Memory Allocation

Memory for Report Cube (report_memory): For large datasets, try increasing the memory limit.

Type: decimal

Default: 8000

Outputs

Cluster Cores (cores): Representative members, one for each cluster

Type: dataset_out

Default: 2D_large_scale_cluster_cores

Failed Records (failed): Dataset with failed records.

Required

Type: dataset_out

Default: 2D_large_scale_cluster_failed

Cluster Members (members): Name of output member dataset, containing all cluster members.

Required

Type: dataset_out

Default: 2D_large_scale_cluster_members

Floe Report Name (cluster_report_name):

Type: string

Default: 2D_large_scale_cluster_report

Singletons (singletons): Name of output singletons dataset, containing clusters with only one member.

Type: dataset_out

Default: 2D_large_scale_cluster_singletons

Advanced: Large Scale Clustering Cubes

Memory (MB) for Cluster Head Generation (gen_heads_memory_mb): Memory (in MB) allocated to the serial cube that generates cluster heads.

Type: decimal

Default: 8000

Memory (MB) for Parallel Sphere Exclusion (dse_memory_mb): Memory (in MB) allocated to the parallel sphere exclusion cube that assigns members to clusters.

Type: decimal

Default: 8000

Sphere Exclusion Item Count (dse_item_count): Number of records processed at a time by the parallel sphere exclusion cube. Each record creates one unit of work: comparing the single member to be assigned against all cluster heads found so far.

Type: integer

Default: 1000
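As a rough illustration of the item count described above, the sketch below splits a list of records into groups of `dse_item_count`. The function name `work_items` and the list-of-records input are hypothetical; how Orion actually schedules parallel work is not shown here.

```python
def work_items(records, dse_item_count=1000):
    """Yield consecutive groups of at most dse_item_count records."""
    for start in range(0, len(records), dse_item_count):
        yield records[start:start + dse_item_count]


# 2500 placeholder records split with the default item count.
records = list(range(2500))
group_sizes = [len(group) for group in work_items(records)]
```

A larger item count means fewer, longer-running groups; a smaller one means more groups with less work each.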

Cluster Head Percentage (head_percentage): Ratio of cluster heads per cycle to batch size.

Type: decimal

Default: 0.03

Minimum Batch Size (batch_size_floor): Minimum batch size for the generate heads cube.

Type: integer

Default: 200

Minimum Cluster Heads Per Cycle (num_clusters_per_cycle_floor): Minimum number of cluster heads found per cycle by the generate heads cube.

Type: integer

Default: 20

Batch Size Percentage (starting_batch_percentage): Starting ratio of batch size to total input size

Type: decimal

Default: 0.1
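The arithmetic below shows one way the four batching parameters in this section could combine. It is an assumption for illustration only: the page does not document the exact formula, so the function `starting_plan` and the use of `max` with rounding are hypothetical.

```python
def starting_plan(n_records,
                  starting_batch_percentage=0.1,
                  batch_size_floor=200,
                  head_percentage=0.03,
                  num_clusters_per_cycle_floor=20):
    """Return (batch_size, heads_per_cycle) under the assumed formula."""
    # Assumed: batch size is a fraction of the input, never below the floor.
    batch_size = max(batch_size_floor,
                     round(starting_batch_percentage * n_records))
    # Assumed: heads per cycle is a fraction of the batch, never below the floor.
    heads_per_cycle = max(num_clusters_per_cycle_floor,
                          round(head_percentage * batch_size))
    return batch_size, heads_per_cycle


large = starting_plan(50_000)  # (5000, 150)
small = starting_plan(1_000)   # (200, 20): both floors apply
```

Under this reading, the floors only matter for small inputs; for large inputs the two ratios determine the values.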

Fingerprint Generation

Use Pregenerated fingerprints (switch): If set to True, the floe will not generate fingerprints and will instead read pregenerated fingerprints for each molecule from the specified fingerprint field.

Required

Type: boolean

Default: False

Choices: [True, False]

Fingerprint Field (fingerprint_field): If fingerprints are generated within the Floe, this is the name of the fingerprint field that will contain the generated fingerprints. If fingerprints are pregenerated, this should be the field name containing the pregenerated fingerprints.

Required

Type: field_parameter

Default: Fingerprint

Fingerprint Type (fingerprint_type): If Use Pregenerated is set to False, the type of fingerprint to be generated and used in the similarity calculation.

Type: string

Default: Circular

Choices: ['Circular', 'Lingo', 'MACCS', 'Path', 'Tree']

2D Similarity Calculation

2D Similarity Score Function (sim_type): The similarity measure used for the 2D similarity calculation.

Type: string

Default: OETanimoto

Choices: ['OECosine', 'OEDice', 'OEEuclid', 'OEManhattan', 'OETanimoto']

Similarity Score Cutoff (sim_cutoff): Similarity scores below this value will be set to 0.

Type: decimal

Default: 0.05
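For orientation, the sketch below computes the textbook Tanimoto, Dice, and cosine coefficients on fingerprints modeled as sets of "on" bits, then applies a cutoff. It is not the toolkit implementation behind `sim_type`; the function names are hypothetical, and the Euclid and Manhattan measures are left out because their exact definitions are not given on this page.

```python
import math


def tanimoto(a, b):
    """Shared bits divided by bits set in either fingerprint."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a, b):
    """Twice the shared bits divided by the total bits set in both."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """Shared bits divided by the geometric mean of the two bit counts."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def apply_cutoff(score, sim_cutoff=0.05):
    """Scores strictly below the cutoff become 0; others pass through."""
    return score if score >= sim_cutoff else 0.0


fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5, 6}
scores = {
    "tanimoto": tanimoto(fp_a, fp_b),  # 2 shared / 6 in union
    "dice": dice(fp_a, fp_b),          # 4 / 8
    "cosine": cosine(fp_a, fp_b),      # 2 / sqrt(16)
}
```

The three measures generally give different values for the same pair, so a cutoff or radius tuned for one measure does not carry over unchanged to another.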

Sphere Exclusion Radius (sphere_exclusion_radius): Radius from a cluster head, used to determine that head’s cluster members. The radius is a distance, defined as 1.0 - similarity_score, where scores are normalized to the range 0 to 1. For example, a TanimotoCombo similarity score of 1.5 is normalized to a score of 0.75 and a distance of 0.25. A larger radius will generally result in fewer clusters with more members, and a smaller radius will result in more clusters with fewer members.

Required

Type: decimal

Default: 0.8
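The conversion from similarity to distance can be written out as a short worked example. The function names and the `max_score` argument are hypothetical, introduced only to express the normalization described above.

```python
def similarity_to_distance(score, max_score=1.0):
    """Normalize a similarity score to [0, 1] and convert it to a distance."""
    normalized = score / max_score
    return 1.0 - normalized


def within_radius(score, radius=0.8, max_score=1.0):
    """True if a record with this similarity to a head falls inside its sphere."""
    return similarity_to_distance(score, max_score) <= radius


# The example from the text: a score of 1.5 on a 0 to 2 scale.
combo_distance = similarity_to_distance(1.5, max_score=2.0)  # 0.25

# With the default radius of 0.8, a 2D similarity of 0.3 (distance 0.7)
# falls inside the sphere, while a similarity of 0.1 (distance 0.9) does not.
close_enough = within_radius(0.3)
too_far = within_radius(0.1)
```

Because the default radius is large, even weakly similar records can be assigned to the same cluster; lowering the radius tightens the clusters.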

Advanced: Sort Input Dataset By Score

Use Score (use_rank): Use rank to sort hits in the directed sphere exclusion algorithm and include rank information in the clustering report. If this is set to True, you must select a score field in the Score Field parameter below.

Type: boolean

Default: False

Choices: [True, False]

Score Field (rank_field): Score field used for sorting during sphere exclusion and analyzed in the floe report.

Type: field_parameter

Score Sort Order (sort_order): Sort order for scores. Descending means higher scores are more desirable. Ascending means lower scores are more desirable.

Type: string

Default: Descending

Choices: ['Descending', 'Ascending']
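The effect of the sort order can be illustrated with plain Python. This sketch assumes records are dictionaries and that the most desirable record should come first; the function `order_for_clustering` and the field name `"Score"` are hypothetical.

```python
def order_for_clustering(records, score_field, sort_order="Descending"):
    """Return records sorted so the most desirable score comes first."""
    # Descending: higher scores are better, so sort from high to low.
    # Ascending: lower scores are better, so sort from low to high.
    reverse = sort_order == "Descending"
    return sorted(records, key=lambda rec: rec[score_field], reverse=reverse)


records = [
    {"name": "mol_a", "Score": 0.42},
    {"name": "mol_b", "Score": 0.91},
    {"name": "mol_c", "Score": 0.10},
]

best_high = [r["name"] for r in order_for_clustering(records, "Score")]
best_low = [r["name"] for r in order_for_clustering(records, "Score", "Ascending")]
```

If records visited earlier are the ones that become cluster heads, as assumed here, this ordering determines which records represent each cluster.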

Inputs

Input Dataset (data_in): The dataset(s) to read records from

Required

Type: data_source

Default: data_in

Cluster Batch Size (batch_size): Batch size for clustering. Set to 10% of the number of input records for optimal results.

Required

Type: integer