K-Medoids 3D Similarity Clustering
Category Paths
Follow one of these paths in the Orion user interface to find the Floe.
Role-based/Medicinal Chemist
Task-based/Data Science/Clustering
Solution-based/Virtual-screening/Analysis/Clustering
Description
This Floe clusters datasets using OEShape 3D similarity calculation and sklearn K-Medoids clustering.
One benefit of K-Medoids is that it allows the user to select the number of clusters. See the “Number of K-Medoids Clusters” parameter to control this.
This Floe uses OEShape for 3D similarity calculation. If (and only if) the primary molecule field of the input dataset lacks 3D coordinates, the Floe uses OMEGA to generate a single conformer for each record's primary molecule and performs the similarity calculations on that conformer.
This Floe must cache all records and build an NxN similarity matrix, so it is expensive. It is not recommended for input datasets with more than 20K records, and it fails automatically for input datasets larger than 30K records.
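To see why the record limits matter, the following minimal sketch estimates the size of a dense NxN matrix. It assumes 8-byte floating-point values and dense storage; the Floe's actual storage format is not documented here, and the function name `dense_matrix_gib` is hypothetical.

```python
def dense_matrix_gib(n_records: int, bytes_per_value: int = 8) -> float:
    """Approximate size in GiB of a dense n x n matrix.

    Assumption: every pairwise value is stored as one fixed-width number.
    """
    return n_records * n_records * bytes_per_value / 2**30


# Matrix size grows quadratically: doubling the records quadruples the memory.
for n in (5_000, 20_000, 30_000):
    print(f"{n:>6} records -> {dense_matrix_gib(n):.2f} GiB")
```

Under these assumptions, a 20K-record input already needs roughly 3 GiB for the matrix alone, before any record caching.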
The Floe can generate the following datasets:
A members dataset that will contain each molecule from the input dataset with its corresponding cluster ID.
A medoids dataset that contains one representative (the medoid) from each cluster.
A singletons dataset that contains any clusters with only one member.
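The relationship between the three datasets can be sketched from a list of cluster labels. This is an illustration only, not the Floe's code: `split_outputs`, `labels`, and `medoid_indices` are hypothetical names, and records are represented by their integer positions.

```python
from collections import defaultdict


def split_outputs(labels, medoid_indices):
    """Derive members, representatives, and singletons from cluster labels.

    labels[i] is the cluster ID of record i; medoid_indices holds the
    record index chosen as the representative of each cluster.
    """
    clusters = defaultdict(list)
    for record_idx, cluster_id in enumerate(labels):
        clusters[cluster_id].append(record_idx)

    # Every input record paired with its cluster ID.
    members = list(enumerate(labels))
    # One representative per cluster.
    representatives = sorted(medoid_indices)
    # Records whose cluster has exactly one member.
    singletons = sorted(m[0] for m in clusters.values() if len(m) == 1)
    return members, representatives, singletons


labels = [0, 0, 1, 2, 0]
medoid_indices = [1, 2, 3]
members, representatives, singletons = split_outputs(labels, medoid_indices)
print(singletons)
```

Here records 2 and 3 each form a cluster of size one, so they are the ones that would land in a singletons dataset.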
Promoted Parameters
Title in user interface (promoted name)
3D Similarity Calculation
3D Similarity Score Function (score_type):
Type: string
Default: Shape Tanimoto
Choices: [‘Shape Tanimoto’, ‘Color Tanimoto’, ‘Tanimoto Combo’]
Align Molecules (use_align): If set to true, molecules will be aligned before similarity calculation; otherwise they will retain input coordinates.
Type: boolean
Default: True
Choices: [True, False]
Similarity Score Cutoff (sim_cutoff): Similarity scores below this value are treated as 0.
Type: decimal
Default: 0.05
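The effect of the similarity cutoff can be sketched as follows. The cutoff step mirrors the parameter description; the conversion to a distance as `1 - similarity / max_score` is an assumption made for illustration, since the Floe's exact conversion is not stated here. `apply_cutoff` and `to_distance` are hypothetical helper names.

```python
def apply_cutoff(similarity, cutoff=0.05):
    """Replace every score below the cutoff with 0."""
    return [[0.0 if s < cutoff else s for s in row] for row in similarity]


def to_distance(similarity, max_score=1.0):
    """Assumed conversion: distance = 1 - similarity / max_score.

    max_score would be 1.0 for a single Tanimoto score; a combined score
    with a larger range would need a larger value.
    """
    return [[1.0 - s / max_score for s in row] for row in similarity]


sim = [
    [1.0, 0.03, 0.6],
    [0.03, 1.0, 0.2],
    [0.6, 0.2, 1.0],
]
cut = apply_cutoff(sim, cutoff=0.05)
dist = to_distance(cut)
for row in dist:
    print([round(d, 2) for d in row])
```

In this toy matrix the 0.03 score falls below the default cutoff, so that pair ends up at the maximum distance.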
Outputs
Cluster Medoids (medoids): Name of output medoids dataset
Type: dataset_out
Default: 3D_KMedoids_clustering_medoids
Failed Records (failed): Dataset with failed records.
Required
Type: dataset_out
Default: 3D_KMedoids_clustering_failed
Cluster Members (members): Name of output member dataset, containing all cluster members.
Required
Type: dataset_out
Default: 3D_KMedoids_clustering_members
Floe Report Name (cluster_report_name):
Type: string
Default: 3D_KMedoids_clustering_report
Singletons (singletons): Name of output singletons dataset, containing clusters with only one member.
Type: dataset_out
Default: 3D_KMedoids_clustering_singletons
Inputs
Input Dataset (in): Dataset to cluster
Required
Type: data_source
Advanced: Sort Input Dataset By Score
Use Score (use_rank): Use rank to sort hits in the directed sphere exclusion algorithm, and include rank information in the clustering report. YOU MUST SELECT A SCORE FIELD IN THE SCORE FIELD PARAMETER, BELOW, IF THIS IS SET TO TRUE.
Type: boolean
Default: False
Choices: [True, False]
Score Field (rank_field): Score field used for sorting during sphere exclusion and analyzed in the Floe report.
Type: field_parameter
Score Sort Order (sort_order): Sort order for scores. Descending means higher scores are more desirable. Ascending means lower scores are more desirable.
Type: string
Default: Descending
Choices: [‘Descending’, ‘Ascending’]
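The two sort orders can be illustrated with a small sketch. Records are modeled as plain dictionaries, and the field name `docking_score` and the helper `sort_by_score` are hypothetical; the Floe itself operates on Orion dataset records.

```python
def sort_by_score(records, score_key, sort_order="Descending"):
    """Return records ordered so the most desirable score comes first."""
    if sort_order not in ("Descending", "Ascending"):
        raise ValueError(f"Unknown sort order: {sort_order}")
    # Descending: higher scores first. Ascending: lower scores first.
    return sorted(
        records,
        key=lambda record: record[score_key],
        reverse=(sort_order == "Descending"),
    )


records = [
    {"name": "a", "docking_score": -7.2},
    {"name": "b", "docking_score": -9.1},
    {"name": "c", "docking_score": -8.0},
]
ascending = sort_by_score(records, "docking_score", "Ascending")
descending = sort_by_score(records, "docking_score", "Descending")
print([r["name"] for r in ascending])
print([r["name"] for r in descending])
```

For a score where lower is better, such as the energy-like values in this example, Ascending places the best record first.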
Advanced: Memory, Optimization, and Matrix File Output or Input
Output Distance Matrix (output_similarity_matrix): If True, writes distance matrix used in clustering to a file.
Type: boolean
Default: False
Choices: [True, False]
Distance Matrix Filename (similarity_matrix_filename): Filename of distance matrix.
Type: string
Default: clustering_distance_matrix.txt
Memory for Clustering Cubes (clustering_memory): For large datasets, try increasing the memory limit.
Type: decimal
Default: 8000
Similarity Tile Size (tile_size): Advanced parameter for tuning parallel speed.
Type: integer
Default: 5
K-Medoids
Number of K-Medoids Clusters (n_clusters):
Required
Type: integer
Default: 10
Advanced: K-Medoids
Medoid Initialization Method (init_method): From scikit-learn-extra docs: Specify medoid initialization method. ‘random’ selects n_clusters elements from the dataset. ‘heuristic’ picks the n_clusters points with the smallest sum distance to every other point. ‘k-medoids++’ follows an approach based on k-means++ and, in general, gives initial medoids that are more separated than those generated by the other methods. ‘build’ is a greedy initialization of the medoids used in the original PAM algorithm. Often ‘build’ is more efficient but slower than other initializations on big datasets, and it is also very non-robust; if there are outliers in the dataset, use another initialization.
Required
Type: string
Default: heuristic
Choices: [‘random’, ‘heuristic’, ‘k-medoids++’, ‘build’]
Algorithm (method): ‘alternate’ is faster; ‘pam’ is more accurate.
Required
Type: string
Default: pam
Choices: [‘alternate’, ‘pam’]
Maximum K-Medoids Iterations (max_iter):
Required
Type: integer
Default: 100000
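To make the K-Medoids parameters concrete, here is a minimal pure-Python sketch of the ‘alternate’ method with ‘heuristic’ initialization on a precomputed distance matrix. It is a simplified illustration, not the scikit-learn-extra implementation the Floe uses, and all function names are hypothetical. The toy distances are absolute differences between points on a line.

```python
def heuristic_init(dist, k):
    """Pick the k points with the smallest summed distance to all others."""
    totals = [sum(row) for row in dist]
    return sorted(range(len(dist)), key=lambda i: totals[i])[:k]


def assign(dist, medoids):
    """Label each point with the position of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids_alternate(dist, k, max_iter=100):
    """Alternate between assigning points and re-picking each medoid."""
    medoids = heuristic_init(dist, k)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the old medoid if its cluster became empty.
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member closest to all other members.
            best = min(members, key=lambda m: sum(dist[m][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break  # converged before reaching max_iter
        medoids = new_medoids
    return medoids, assign(dist, medoids)


points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = k_medoids_alternate(dist, k=2)
print(medoids, labels)
```

The loop stops as soon as the medoids no longer change, so a large iteration limit only matters for inputs that are slow to converge.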