K-Medoids 2D Similarity Clustering

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

  • Role-based/Medicinal Chemist

  • Task-based/Data Science/Clustering

  • Solution-based/Virtual-screening/Analysis/Clustering

Description

This floe clusters datasets based on pre-generated fingerprints using OEGraphSim similarity calculations and sklearn K-Medoids clustering.

One benefit of K-Medoids is that it allows the user to select the number of clusters. See the “Number of K-Medoids Clusters” parameter to control this.

This Floe uses fingerprints for 2D similarity calculations with OEGraphSim. If a Fingerprint field is not specified in the input parameters, then the floe will generate fingerprints using the settings specified in the Fingerprint parameters sections.

The fingerprint input for this floe can be pre-generated using the following two Floes in this package:

  • Fingerprint Generation

  • Fingerprint Generation (User Defined) Floe

This Floe has to cache records and generate an NxN similarity matrix, which is expensive. It is not recommended for input datasets with more than 20K records, and it fails automatically for input datasets larger than 30K records.
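To give a sense of why these record limits exist, the sketch below estimates the size of a dense N x N matrix. It assumes 8-byte floating-point values stored densely. The floe's internal storage format is not stated here, so treat the numbers as rough orders of magnitude only.

```python
def dense_matrix_bytes(n_records: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense n x n matrix (assumption: 8-byte floats)."""
    return n_records * n_records * bytes_per_value


for n in (1_000, 20_000, 30_000):
    gigabytes = dense_matrix_bytes(n) / 1e9
    print(f"{n:>6} records -> {gigabytes:.3f} GB")
```

Under that assumption, 20K records need about 3.2 GB and 30K records need about 7.2 GB, because the cost grows with the square of the record count.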

The Floe can generate the following datasets:

  • A members dataset that will contain each molecule from the input dataset with its corresponding cluster ID.

  • A cores dataset that contains one representative from each cluster.

  • A singletons dataset that contains any clusters with only one member.
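The relationship between the three datasets can be sketched in plain Python. This is an illustration only, not the floe's code: the function name and record names are hypothetical, and it assumes that the representative of each cluster is its medoid.

```python
from collections import defaultdict


def split_outputs(names, labels, medoid_indices):
    """Group records into members, cores, and singletons.

    names: record identifiers (hypothetical example data)
    labels: cluster ID for each record
    medoid_indices: index of the representative record of each cluster
    """
    # Members: every input record paired with its cluster ID.
    members = list(zip(names, labels))
    # Cores: one representative per cluster (assumed to be the medoid).
    cores = [(names[i], labels[i]) for i in medoid_indices]
    # Singletons: clusters that contain exactly one record.
    by_cluster = defaultdict(list)
    for name, label in members:
        by_cluster[label].append(name)
    singletons = [
        (records[0], label)
        for label, records in by_cluster.items()
        if len(records) == 1
    ]
    return members, cores, singletons


names = ["mol_a", "mol_b", "mol_c", "mol_d"]
labels = [0, 0, 1, 0]
medoid_indices = [1, 2]
members, cores, singletons = split_outputs(names, labels, medoid_indices)
print(cores)       # representatives of clusters 0 and 1
print(singletons)  # cluster 1 has a single member
```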

Promoted Parameters

Title in user interface (promoted name)

Outputs

Cluster Medoids (medoids): Name of output medoids dataset

  • Type: dataset_out

  • Default: 2D_KMedoids_cluster_medoids

Failed Records (failed): Dataset with failed records.

  • Required

  • Type: dataset_out

  • Default: 2D_KMedoids_cluster_failed

Cluster Members (members): Name of output member dataset, containing all cluster members.

  • Required

  • Type: dataset_out

  • Default: 2D_KMedoids_cluster_members

Floe Report Name (cluster_report_name):

  • Type: string

  • Default: 2D_KMedoids_cluster_report

Singletons (singletons): Name of output singletons dataset, containing clusters with only one member.

  • Type: dataset_out

  • Default: 2D_KMedoids_cluster_singletons

Fingerprint Generation

Use Pregenerated fingerprints (switch): If set to True, the floe will not generate fingerprints, and instead use the fingerprint field specified to provide pregenerated fingerprints for each molecule.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Fingerprint Field (fingerprint_field): If fingerprints are generated within the Floe, this is the name of the fingerprint field that will contain the generated fingerprints. If fingerprints are pregenerated, this should be the field name containing the pregenerated fingerprints.

  • Required

  • Type: field_parameter

  • Default: Fingerprint

Fingerprint Type (fingerprint_type): If Use Pregenerated is set to False, the type of fingerprint to be generated and used in the similarity calculation.

  • Type: string

  • Default: Circular

  • Choices: [‘Circular’, ‘Lingo’, ‘MACCS’, ‘Path’, ‘Tree’]

2D Similarity Calculation

2D Similarity Score Function (sim_type): The similarity measure used for the 2D similarity calculation.

  • Type: string

  • Default: OETanimoto

  • Choices: [‘OECosine’, ‘OEDice’, ‘OEEuclid’, ‘OEManhattan’, ‘OETanimoto’]

Similarity Score Cutoff (sim_cutoff): Similarity scores below this value are treated as 0.

  • Type: decimal

  • Default: 0.05
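As a rough illustration of how a Tanimoto score and the cutoff interact, the sketch below works on fingerprints represented as Python sets of "on" bit positions. It is a plain-Python stand-in, not OEGraphSim, and it assumes that a distance is derived as 1 minus the similarity. The function names are hypothetical.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto similarity: shared bits divided by bits set in either."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0
    return len(bits_a & bits_b) / union


def distance_with_cutoff(bits_a: set, bits_b: set, sim_cutoff: float = 0.05) -> float:
    """Zero out similarities below the cutoff, then convert to a distance.

    Assumption: distance = 1 - similarity.
    """
    similarity = tanimoto(bits_a, bits_b)
    if similarity < sim_cutoff:
        similarity = 0.0
    return 1.0 - similarity


close_pair = ({1, 2, 3, 4}, {3, 4, 5, 6})      # 2 shared bits out of 6
far_pair = (set(range(1, 26)), {25, 30})       # 1 shared bit out of 26
print(distance_with_cutoff(*close_pair))
print(distance_with_cutoff(*far_pair))         # similarity falls below 0.05
```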

Inputs

Input Dataset (in): Dataset to cluster

  • Required

  • Type: data_source

Advanced: Sort Input Dataset By Score

Use Score (use_rank): Use rank to sort hits in the directed sphere exclusion algorithm and include rank information in the clustering report. YOU MUST SELECT A SCORE FIELD IN THE SCORE FIELD PARAMETER, BELOW, IF THIS IS SET TO TRUE.

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Score Field (rank_field): Score field to be used for sorting during sphere exclusion and analyzed in floe report.

  • Type: field_parameter

Score Sort Order (sort_order): Sort order for scores. Descending means higher scores are more desirable. Ascending means lower scores are more desirable.

  • Type: string

  • Default: Descending

  • Choices: [‘Descending’, ‘Ascending’]

Advanced: Memory, Optimization, and Matrix File Output or Input

Output Distance Matrix (output_similarity_matrix): If True, writes the distance matrix used in clustering to a file.

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Distance Matrix Filename (similarity_matrix_filename): Filename of distance matrix.

  • Type: string

  • Default: clustering_distance_matrix.txt

Memory for Clustering Cubes (clustering_memory): For large datasets, try increasing the memory limit.

  • Type: decimal

  • Default: 8000

Similarity Tile Size (tile_size): Advanced parameter for tuning parallel speed.

  • Type: integer

  • Default: 100

Use Input Matrix File (input_sim_file): OPTIONAL: Set to True if you wish to input a precalculated distance or similarity matrix instead of having the floe calculate this for you. If set to True, the input dataset will be ignored. You must also input a row label file that has SMILES corresponding to each matrix row.

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Similarity or Distance Matrix Input File (matrix_input_file): OPTIONAL: Matrix input file, which must be accompanied by a row label file. The file must be a binary .npy file containing a valid 2D numpy ndarray. If a similarity matrix is input, it will be converted to a distance matrix for clustering. The matrix must be a valid square similarity or distance matrix with values normalized between 0 and 1.

  • Type: file_in

Row Label Input File (row_label_input_file): OPTIONAL: Row labels, which must be SMILES in a 1D numpy ndarray file with extension .npy, to go along with the matrix input file. The length of this array must match the dimension of the accompanying square matrix.

  • Type: file_in
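A minimal sketch of preparing the two input files with numpy follows. The SMILES, matrix values, and file names are made-up examples, and the validation helper is hypothetical. It only checks the constraints described above (square 2D matrix, one label per row, values between 0 and 1).

```python
import os
import tempfile

import numpy as np

# Hypothetical example data: three molecules and their pairwise distances.
row_labels = np.array(["CCO", "CCN", "c1ccccc1"])
distance_matrix = np.array([
    [0.0, 0.4, 0.9],
    [0.4, 0.0, 0.8],
    [0.9, 0.8, 0.0],
])


def validate(matrix: np.ndarray, labels: np.ndarray) -> None:
    """Raise ValueError if the inputs break the constraints described above."""
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        raise ValueError("matrix must be a square 2D array")
    if labels.ndim != 1 or labels.shape[0] != matrix.shape[0]:
        raise ValueError("need exactly one row label per matrix row")
    if matrix.min() < 0.0 or matrix.max() > 1.0:
        raise ValueError("matrix values must lie between 0 and 1")


validate(distance_matrix, row_labels)

# Write both arrays as binary .npy files, then read them back as a check.
with tempfile.TemporaryDirectory() as workdir:
    matrix_path = os.path.join(workdir, "distance_matrix.npy")
    labels_path = os.path.join(workdir, "row_labels.npy")
    np.save(matrix_path, distance_matrix)
    np.save(labels_path, row_labels)
    loaded_matrix = np.load(matrix_path)
    loaded_labels = np.load(labels_path)

print(loaded_matrix.shape, loaded_labels.shape)
```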

K-Medoids

Number of K-Medoids Clusters (n_clusters):

  • Required

  • Type: integer

  • Default: 10

Advanced: K-Medoids

Medoid Initialization Method (init_method): From scikit-learn-extra docs: Specify medoid initialization method. ‘random’ selects n_clusters elements from the dataset. ‘heuristic’ picks the n_clusters points with the smallest sum distance to every other point. ‘k-medoids++’ follows an approach based on k-means++ and, in general, gives initial medoids that are more separated than those generated by the other methods. ‘build’ is a greedy initialization of the medoids used in the original PAM algorithm. Often ‘build’ is more efficient but slower than other initializations on big datasets, and it is also very non-robust: if there are outliers in the dataset, use another initialization.

  • Required

  • Type: string

  • Default: heuristic

  • Choices: [‘random’, ‘heuristic’, ‘k-medoids++’, ‘build’]

Algorithm (method): ‘alternate’ is faster; ‘pam’ is more accurate.

  • Required

  • Type: string

  • Default: pam

  • Choices: [‘alternate’, ‘pam’]

Maximum K-Medoids Iterations (max_iter):

  • Required

  • Type: integer

  • Default: 100000
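To show how the number of clusters, the ‘heuristic’ initialization, and the iteration limit fit together, here is a small plain-Python sketch of an alternate-style K-Medoids loop on a precomputed distance matrix. It is an illustration under simplifying assumptions, not the scikit-learn-extra implementation that the floe uses, and the function name is hypothetical. The ‘pam’ method would instead evaluate swaps between medoids and non-medoids.

```python
def k_medoids_alternate(dist, n_clusters, max_iter=100):
    """Cluster points from a square distance matrix (list of lists).

    Returns (medoid indices, cluster label per point).
    """
    n = len(dist)

    def assign(medoids):
        # Label each point with the position of its nearest medoid.
        return [
            min(range(n_clusters), key=lambda k: dist[i][medoids[k]])
            for i in range(n)
        ]

    # 'heuristic' initialization: the points with the smallest summed
    # distance to every other point.
    medoids = sorted(range(n), key=lambda i: sum(dist[i]))[:n_clusters]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for k in range(n_clusters):
            cluster = [i for i in range(n) if labels[i] == k]
            if not cluster:
                # Keep the old medoid if ties left this cluster empty.
                new_medoids.append(medoids[k])
                continue
            # New medoid: the member closest in total to its own cluster.
            new_medoids.append(
                min(cluster, key=lambda c: sum(dist[c][j] for j in cluster))
            )
        if new_medoids == medoids:
            break  # converged: medoids no longer change
        medoids = new_medoids

    return medoids, assign(medoids)


# Toy data: six points on a line, forming two obvious groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = k_medoids_alternate(dist, n_clusters=2)
print(medoids)  # [1, 4]
print(labels)   # [0, 0, 0, 1, 1, 1]
```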