2D Diverse Subset

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

  • Role-based/Medicinal Chemist

  • Task-based/Data Science/Clustering

  • Solution-based/Virtual-screening/Analysis/Clustering

Description

This Floe generates a diverse subset of the input dataset based on OEGraphSim 2D similarity calculations.

If the parameter Use Large Scale Version is set to True, the floe uses sphere exclusion clustering to find the diverse subset. Otherwise, the floe uses k-medoids clustering. Do not attempt to cluster datasets larger than 10,000 records with k-medoids clustering.
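
To see why k-medoids is limited to small inputs, it can help to estimate the size of a full pairwise distance matrix. The sketch below assumes a dense matrix of 8-byte values; that layout is an illustrative assumption, not a statement about how the floe stores distances, and `dense_matrix_mb` is a hypothetical helper.

```python
def dense_matrix_mb(n_records: int, bytes_per_value: int = 8) -> float:
    """Approximate size in MB (10**6 bytes) of a dense n x n distance matrix.

    Assumes one stored value per ordered pair of records. This is an
    illustrative assumption, not the floe's actual storage layout.
    """
    return n_records * n_records * bytes_per_value / 1e6


# The size grows with the square of the record count.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} records -> {dense_matrix_mb(n):>10.0f} MB")
```

Under these assumptions, 10,000 records already need about 800 MB, and every tenfold increase in records multiplies that by one hundred.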

Promoted Parameters

Title in user interface (promoted name)

Outputs

Diverse Subset Name (subset): Name of output diverse subset file.

  • Required

  • Type: dataset_out

  • Default: 2D_diverse_subset

Failed Records (failed): Dataset with failed records.

  • Required

  • Type: dataset_out

  • Default: 2D_diverse_subset_failed

Advanced: Memory, Optimization, and Matrix File Output or Input

Output Distance Matrix (output_similarity_matrix): If True, writes the distance matrix used in clustering to a file.

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Distance Matrix Filename (similarity_matrix_filename): Filename of the distance matrix.

  • Type: string

  • Default: clustering_distance_matrix.txt

Memory for Clustering Cubes (clustering_memory): For large datasets, try increasing the memory limit.

  • Type: decimal

  • Default: 8000

Similarity Tile Size (tile_size): Advanced parameter for tuning parallel speed.

  • Type: integer

  • Default: 100

Advanced: Large Scale Clustering Cubes

Memory (MB) for Cluster Head Generation (gen_heads_memory_mb): Memory (in MB) allocated to the serial cube that generates cluster heads.

  • Type: decimal

  • Default: 8000

Memory (MB) for Parallel Sphere Exclusion (dse_memory_mb): Memory (in MB) allocated to the parallel sphere exclusion cube that assigns members to clusters.

  • Type: decimal

  • Default: 8000

Sphere Exclusion Item Count (dse_item_count): Number of records processed at a time by the parallel sphere exclusion cube. Each record creates one unit of work: comparing the single member to be assigned against all cluster heads found so far.

  • Type: integer

  • Default: 1000

Cluster Head Percentage (head_percentage): Ratio of cluster heads per cycle to batch size.

  • Type: decimal

  • Default: 0.03

Minimum Batch Size (batch_size_floor): Minimum batch size for the generate heads cube.

  • Type: integer

  • Default: 200

Minimum Cluster Heads Per Cycle (num_clusters_per_cycle_floor): Minimum number of cluster heads found per cycle by the generate heads cube.

  • Type: integer

  • Default: 20

Batch Size Percentage (starting_batch_percentage): Starting ratio of batch size to total input size.

  • Type: decimal

  • Default: 0.1
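
The four batch parameters above (two ratios and two floors) can be read together as in the sketch below. How the floe actually combines them is not documented here, so the formula is an assumption: each ratio gives a target and the matching floor is applied as a lower bound. `batch_plan` is a hypothetical helper; only the parameter names and defaults come from this page.

```python
def batch_plan(n_records: int,
               starting_batch_percentage: float = 0.1,
               batch_size_floor: int = 200,
               head_percentage: float = 0.03,
               num_clusters_per_cycle_floor: int = 20):
    """Return (batch_size, heads_per_cycle) for a given input size.

    Assumed combination rule (not confirmed by the floe documentation):
    ratio * size, rounded, but never below the corresponding floor.
    """
    batch_size = max(batch_size_floor,
                     round(starting_batch_percentage * n_records))
    heads_per_cycle = max(num_clusters_per_cycle_floor,
                          round(head_percentage * batch_size))
    return batch_size, heads_per_cycle


print(batch_plan(1_000))    # both floors apply
print(batch_plan(10_000))   # both ratios apply
```

With the defaults, a 1,000-record input falls back to both floors (200 and 20), while a 10,000-record input uses the ratios (1000 and 30).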

Fingerprint Generation

Use Pregenerated fingerprints (switch): If set to True, the floe will not generate fingerprints, and instead use the fingerprint field specified to provide pregenerated fingerprints for each molecule.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Fingerprint Field (fingerprint_field): If fingerprints are generated within the Floe, this is the name of the fingerprint field that will contain the generated fingerprints. If fingerprints are pregenerated, this should be the field name containing the pregenerated fingerprints.

  • Required

  • Type: field_parameter

  • Default: Fingerprint

Fingerprint Type (fingerprint_type): If Use Pregenerated fingerprints is set to False, the type of fingerprint to be generated and used in the similarity calculation.

  • Type: string

  • Default: Circular

  • Choices: [‘Circular’, ‘Lingo’, ‘MACCS’, ‘Path’, ‘Tree’]

2D Similarity Calculation

2D Similarity Score Function (sim_type): The similarity measure used for the 2D similarity calculation.

  • Type: string

  • Default: OETanimoto

  • Choices: [‘OECosine’, ‘OEDice’, ‘OEEuclid’, ‘OEManhattan’, ‘OETanimoto’]

Similarity Score Cutoff (sim_cutoff): Similarity scores below this value are treated as 0.

  • Type: decimal

  • Default: 0.05
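
As a rough illustration of the cutoff, the sketch below computes a Tanimoto score on fingerprints represented as plain Python sets of "on" bits and then zeroes scores below `sim_cutoff`. It does not use OEGraphSim; the set representation, the helper names, and the handling of two empty fingerprints are assumptions made for this example.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch only
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def apply_cutoff(score: float, sim_cutoff: float = 0.05) -> float:
    """Scores below the cutoff are replaced by 0; others pass through."""
    return score if score >= sim_cutoff else 0.0


close = tanimoto({1, 2, 3, 4}, {3, 4, 5, 6})              # 2 shared / 6 total
far = tanimoto(set(range(1, 21)), set(range(20, 41)))     # 1 shared / 40 total
print(apply_cutoff(close), apply_cutoff(far))
```

Here the first pair keeps its score of about 0.33, while the second pair scores 0.025, falls below the default cutoff of 0.05, and is reported as 0.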

Sphere Exclusion Radius (sphere_exclusion_radius): Radius from cluster head, used to determine that head’s cluster members. This corresponds to a distance metric, or 1.0 - similarity_score. Scores are normalized from 0 to 1. For example, a TanimotoCombo similarity score of 1.5 is normalized to a score of 0.75 and a distance of 0.25. A larger radius will generally result in fewer clusters with more members, and a smaller radius will result in more clusters with fewer members.

  • Required

  • Type: decimal

  • Default: 0.8
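
The relationship between score, distance, and radius can be sketched as follows. This is a minimal greedy sphere exclusion on toy data, not the floe's directed sphere exclusion implementation; the function names, the inclusive radius test, and the nearest-head assignment rule are assumptions of this example.

```python
def normalized_distance(score: float, max_score: float = 1.0) -> float:
    """Normalize a similarity score to [0, 1] and convert it to a distance."""
    return 1.0 - score / max_score


def sphere_exclusion(distance, n_items: int, radius: float) -> dict:
    """Greedy sphere exclusion over items 0..n_items-1, visited in order.

    An item within `radius` of its nearest existing cluster head joins that
    head's cluster; otherwise it becomes a new head.
    Returns a dict mapping head index -> list of member indices.
    """
    clusters = {}
    for i in range(n_items):
        nearest = min(clusters, key=lambda h: distance(h, i), default=None)
        if nearest is not None and distance(nearest, i) <= radius:
            clusters[nearest].append(i)
        else:
            clusters[i] = [i]
    return clusters


# A score of 1.5 on a 0-2 scale normalizes to 0.75, i.e. a distance of 0.25.
print(normalized_distance(1.5, max_score=2.0))

# Toy items placed on a line; distance is the absolute difference.
points = [0.0, 0.1, 0.5, 0.55, 0.9]
dist = lambda i, j: abs(points[i] - points[j])
print(sphere_exclusion(dist, len(points), radius=0.2))
print(sphere_exclusion(dist, len(points), radius=0.6))
```

On this toy input, a radius of 0.2 yields three clusters and a radius of 0.6 yields two, which matches the stated trend: a larger radius gives fewer clusters with more members.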

Advanced: Sort Input Dataset By Score

Use Score (use_rank): Use rank to sort hits in the directed sphere exclusion algorithm and include rank information in the clustering report. If this is set to True, you must select a score field in the Score Field parameter below.

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Score Field (rank_field): Score field used for sorting during sphere exclusion and analyzed in the floe report.

  • Type: field_parameter

Score Sort Order (sort_order): Sort order for scores. Descending means higher scores are more desirable. Ascending means lower scores are more desirable.

  • Type: string

  • Default: Descending

  • Choices: [‘Descending’, ‘Ascending’]
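
The effect of the sort order can be sketched with plain dictionaries standing in for records. The record layout, the field name `score`, and the helper `sort_by_score` are hypothetical; the sketch only shows that the more desirable records come first in either mode.

```python
def sort_by_score(records, rank_field, sort_order="Descending"):
    """Return records ordered so that the most desirable score comes first.

    'Descending' puts higher scores first; 'Ascending' puts lower scores first.
    """
    if sort_order not in ("Descending", "Ascending"):
        raise ValueError(f"unknown sort order: {sort_order!r}")
    return sorted(records,
                  key=lambda rec: rec[rank_field],
                  reverse=(sort_order == "Descending"))


records = [
    {"name": "mol_a", "score": 0.42},
    {"name": "mol_b", "score": 0.91},
    {"name": "mol_c", "score": 0.17},
]
print([r["name"] for r in sort_by_score(records, "score", "Descending")])
print([r["name"] for r in sort_by_score(records, "score", "Ascending")])
```

With Descending, mol_b (the highest score) is visited first; with Ascending, mol_c (the lowest score) is visited first.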

Inputs

Input Dataset (data_in): The dataset(s) to read records from

  • Required

  • Type: data_source

  • Default: data_in

Cluster Batch Size (batch_size): Batch size for clustering. Set to 10% of the number of input records for optimal results.

  • Required

  • Type: integer

Diverse Subset

Number in Subset (n_subset):

The number of records in the resulting subset. If Use Large Scale Version is set to True, the number of records output may be below the number requested; in this case, try adjusting the sphere exclusion radius. If Use Large Scale Version is set to False, the number of records output should match this parameter exactly.

  • Required

  • Type: integer

Use Large Scale Version (use_large): Set to True if running clustering on a dataset with more than 10,000 records. Large scale clustering uses directed sphere exclusion, which is much more suitable for large datasets.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Advanced: K-Medoids

Medoid Initialization Method (init_method): From the scikit-learn-extra docs: Specify medoid initialization method. ‘random’ selects n_clusters elements from the dataset. ‘heuristic’ picks the n_clusters points with the smallest sum distance to every other point. ‘k-medoids++’ follows an approach based on k-means++ and, in general, gives initial medoids which are more separated than those generated by the other methods. ‘build’ is a greedy initialization of the medoids used in the original PAM algorithm. ‘build’ is often more efficient but slower than other initializations on big datasets, and it is not robust to outliers; if there are outliers in the dataset, use another initialization.

  • Required

  • Type: string

  • Default: heuristic

  • Choices: [‘random’, ‘heuristic’, ‘k-medoids++’, ‘build’]
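
The ‘heuristic’ rule quoted above (pick the n_clusters points with the smallest sum of distances to every other point) can be illustrated in a few lines. This is a plain-Python sketch of that one sentence, not the scikit-learn-extra implementation; `heuristic_init` and the toy distance matrix are made up for the example.

```python
def heuristic_init(dist, n_clusters):
    """Pick the n_clusters items with the smallest total distance to all others.

    `dist` is a square distance matrix given as a list of lists.
    Returns the chosen item indices, most central first.
    """
    totals = [sum(row) for row in dist]
    by_centrality = sorted(range(len(dist)), key=lambda i: totals[i])
    return by_centrality[:n_clusters]


# Toy items on a line; item 4 (at 10) is an outlier.
points = [0, 1, 2, 4, 10]
dist = [[abs(p - q) for q in points] for p in points]
print(heuristic_init(dist, n_clusters=2))
```

On this input the two selected medoids are the adjacent central items (indices 2 and 1), which shows why the quoted text describes ‘k-medoids++’ as generally giving more separated starting medoids.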

Algorithm (method): ‘alternate’ is faster; ‘pam’ is more accurate.

  • Required

  • Type: string

  • Default: pam

  • Choices: [‘alternate’, ‘pam’]

Maximum K-Medoids Iterations (max_iter):

  • Required

  • Type: integer

  • Default: 100000