3D Hitlist Clustering
Category Paths
Follow one of these paths in the Orion user interface to find the floe.
Description
This floe clusters datasets using 3D OEShape similarity calculations and a modified directed sphere exclusion algorithm. The user-provided score field is used to sort the input and to rank the output clusters.
- You must provide 3D molecules on the input records, or the floe will fail. If multiple conformers are provided, only the active conformer is used.
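The clustering idea described above can be sketched in a few lines of plain Python. This is a simplified illustration of score-directed sphere exclusion, not the floe's actual implementation: the function name `directed_sphere_exclusion`, its arguments, and the inclusive radius test are all assumptions, and the `similarity` callable stands in for the 3D shape comparison.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def directed_sphere_exclusion(
    items: Sequence[T],
    scores: Sequence[float],
    similarity: Callable[[T, T], float],
    radius: float = 0.5,
    descending: bool = True,
) -> Dict[int, List[int]]:
    """Return {head_index: [member_indices]}, with the head listed first.

    Hypothetical helper: `similarity` must return a value in [0, 1].
    """
    # Visit items from most to least desirable score.
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=descending)
    clusters: Dict[int, List[int]] = {}
    for i in order:
        # Heads are iterated in the order they were created (best score first).
        for head in clusters:
            distance = 1.0 - similarity(items[head], items[i])
            if distance <= radius:  # inclusive boundary is an assumption
                clusters[head].append(i)
                break
        else:
            # No existing head is close enough, so this item becomes a new head.
            clusters[i] = [i]
    return clusters
```

Because items are visited in score order, each cluster head is the best-scoring member of its cluster. Clusters whose member list has length 1 correspond to what the floe calls singletons.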
Promoted Parameters
Title in user interface (promoted name)
Advanced: Cube Memory Allocation
Memory for Report Cube (report_memory): For large datasets, try increasing the memory limit.
Type: decimal
Default: 8000
Outputs
Cluster Cores (cores): Representative members, one for each cluster
Type: dataset_out
Default: 3D_hitlist_cluster_cores
Failed Records (failed): Dataset with failed records.
Required
Type: dataset_out
Default: 3D_hitlist_cluster_failed
Cluster Members (members): Name of output member dataset, containing all cluster members.
Required
Type: dataset_out
Default: 3D_hitlist_cluster_members
Floe Report Name (cluster_report_name):
Type: string
Default: 3D_hitlist_cluster_report
Singletons (singletons): Name of output singletons dataset, containing clusters with only one member.
Type: dataset_out
Default: 3D_hitlist_cluster_singletons
Advanced: Large Scale Clustering Cubes
Memory (MB) for Cluster Head Generation (gen_heads_memory_mb): Memory (in MB) allocated to the serial cube that generates cluster heads.
Type: decimal
Default: 8000
Memory (MB) for Parallel Sphere Exclusion (dse_memory_mb): Memory (in MB) allocated to the parallel sphere exclusion cube that assigns members to clusters.
Type: decimal
Default: 8000
Sphere Exclusion Item Count (dse_item_count): Number of records processed at a time by the parallel sphere exclusion cube. Each record is one unit of work: the similarity of a single candidate member is compared against all cluster heads found so far.
Type: integer
Default: 10
Cluster Head Percentage (head_percentage): Ratio of cluster heads found per cycle to the batch size.
Type: decimal
Default: 0.01
Minimum Batch Size (batch_size_floor): Minimum batch size for the generate heads cube.
Type: integer
Default: 20
Minimum Cluster Heads Per Cycle (num_clusters_per_cycle_floor): Minimum number of cluster heads found per cycle by the generate heads cube.
Type: integer
Default: 1
Batch Size Percentage (starting_batch_percentage): Starting ratio of batch size to total input size
Type: decimal
Default: 0.01
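To get a feel for how the percentage and floor parameters in this group might interact, the sketch below computes a first-cycle batch size and a heads-per-cycle target from the documented defaults. How the floe actually combines these values is not stated here, so the `max(floor, ratio * size)` rule, the rounding, and the function name `plan_first_cycle` are assumptions for illustration only.

```python
from typing import Tuple


def plan_first_cycle(
    total_records: int,
    starting_batch_percentage: float = 0.01,
    batch_size_floor: int = 20,
    head_percentage: float = 0.01,
    num_clusters_per_cycle_floor: int = 1,
) -> Tuple[int, int]:
    """Return (batch_size, heads_per_cycle) under the assumed max(floor, ratio) rule."""
    # Assumed: the starting batch is a fraction of the input, never below the floor.
    batch_size = max(batch_size_floor, round(total_records * starting_batch_percentage))
    # Assumed: heads per cycle is a fraction of the batch, never below the floor.
    heads_per_cycle = max(num_clusters_per_cycle_floor, round(batch_size * head_percentage))
    return batch_size, heads_per_cycle
```

Under these assumptions, small inputs are governed by the floors and large inputs by the ratios. Note that both "percentage" parameters are expressed as fractions, so the default 0.01 means 1%.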
3D Similarity Calculation
3D Similarity Score Function (score_type):
Type: string
Default: Shape Tanimoto
Choices: ['Shape Tanimoto', 'Color Tanimoto', 'Tanimoto Combo']
Align Molecules (use_align): If set to true, molecules will be aligned before similarity calculation; otherwise they will retain input coordinates.
Type: boolean
Default: True
Choices: [True, False]
Similarity Score Cutoff (sim_cutoff): Similarity scores below this value are treated as 0.
Type: decimal
Default: 0.05
Sphere Exclusion Radius (sphere_exclusion_radius): Radius from a cluster head, used to determine that head's cluster members. The radius is a distance, defined as 1.0 - similarity_score, with scores normalized to the range 0 to 1. For example, a Tanimoto Combo similarity score of 1.5 is normalized to a score of 0.75 and a distance of 0.25. A larger radius generally results in fewer clusters with more members, and a smaller radius results in more clusters with fewer members.
Required
Type: decimal
Default: 0.5
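The normalization and distance arithmetic for the radius can be checked with a short sketch. The maximum raw score of 2.0 for Tanimoto Combo follows from the worked example above (1.5 normalizes to 0.75); the maxima of 1.0 for the other two score types, the helper names, and the inclusive boundary test are assumptions.

```python
# Assumed maximum raw score for each score_type choice.
SCORE_MAX = {
    "Shape Tanimoto": 1.0,
    "Color Tanimoto": 1.0,
    "Tanimoto Combo": 2.0,
}


def to_distance(raw_score: float, score_type: str = "Shape Tanimoto") -> float:
    """Normalize a raw similarity score to [0, 1] and convert it to a distance."""
    normalized = raw_score / SCORE_MAX[score_type]
    return 1.0 - normalized


def within_radius(raw_score: float, score_type: str, radius: float = 0.5) -> bool:
    """True if a molecule with this score falls inside the head's exclusion sphere."""
    return to_distance(raw_score, score_type) <= radius  # inclusive test is assumed
```

With the default radius of 0.5, a Tanimoto Combo score of 1.5 gives a distance of 0.25, so the molecule would join the head's cluster, while a Shape Tanimoto score of 0.4 gives a distance of 0.6 and would not.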
Required: Sort Input Using Score Field
Score Field (rank_field): Score field used for sorting during sphere exclusion and analyzed in the floe report.
Required
Type: field_parameter
Score Sort Order (sort_order): Sort order for scores. Descending means higher scores are more desirable. Ascending means lower scores are more desirable.
Type: string
Default: Descending
Choices: ['Descending', 'Ascending']
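The effect of the two sort orders can be illustrated with a minimal sketch that ranks records so the most desirable score comes first. Records are modeled here as plain dictionaries, which is an assumption made for illustration and not the Orion record API; `rank_records` is a hypothetical helper.

```python
from typing import Any, Dict, List


def rank_records(
    records: List[Dict[str, Any]],
    rank_field: str,
    sort_order: str = "Descending",
) -> List[Dict[str, Any]]:
    """Return records ordered from most to least desirable score."""
    if sort_order not in ("Descending", "Ascending"):
        raise ValueError(f"unknown sort_order: {sort_order!r}")
    # Descending: higher is better, so the largest values come first.
    # Ascending: lower is better, so the smallest values come first.
    return sorted(
        records,
        key=lambda record: record[rank_field],
        reverse=(sort_order == "Descending"),
    )
```

For a field where higher is better, such as a similarity score, Descending fits; for a field where lower is better, such as a docking energy, Ascending puts the best records first.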
Inputs
Input Dataset (data_in): The dataset(s) to read records from
Required
Type: data_source
Default: data_in
Cluster Batch Size (batch_size): Batch size for clustering. Set to 1% of the number of input records for optimal results.
Required
Type: integer