Hierarchical 2D Similarity Clustering
Category Paths
Follow one of these paths in the Orion user interface to find the floe.
Role-based/Medicinal Chemist
Task-based/Data Science/Clustering
Solution-based/Virtual-screening/Analysis/Clustering
Description
This floe clusters datasets based on pre-generated fingerprints using OEGraphSim similarity calculation and sklearn agglomerative clustering, also known as hierarchical clustering.
For details, see sklearn docs at https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
This Floe uses fingerprints for 2D similarity calculations with OEGraphSim. If pregenerated fingerprints are not supplied through the Fingerprint Field parameter, the floe generates fingerprints using the settings in the Fingerprint Generation parameter section.
The fingerprint input for this floe can be pre-generated using the following two Floes in this package:
Fingerprint Generation
Fingerprint Generation (User Defined) Floe
This Floe must cache all records and generate an NxN similarity matrix, which is expensive. It is not recommended for input datasets with more than 20K records, and it fails automatically for input datasets larger than 30K records.
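To give a feel for why the NxN matrix limits dataset size, the sketch below estimates the memory a dense square matrix would need. It assumes 8-byte floating point values and dense storage, neither of which is stated for the floe, so treat the numbers as an order-of-magnitude guide only.

```python
def dense_matrix_bytes(n_records: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense n x n matrix (assumes 8-byte floats)."""
    return n_records * n_records * bytes_per_value


for n in (1_000, 20_000, 30_000):
    gib = dense_matrix_bytes(n) / 1024**3
    print(f"{n:>6} records -> {gib:.2f} GiB")
```

Because the cost grows with the square of the record count, moving from 20K to 30K records more than doubles the estimate.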
The Floe can generate the following datasets:
A members dataset that will contain each molecule from the input dataset with its corresponding cluster ID.
A cores dataset that contains one representative from each cluster.
A singletons dataset that contains any clusters with only one member.
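The relationship between the three outputs listed above can be sketched from a plain list of cluster IDs. The function name is hypothetical, and picking the first member as the cluster representative is an assumption made only for illustration; the floe's own choice of representative is not described here.

```python
from collections import defaultdict


def split_clusters(labels):
    """Derive members, cores, and singletons from per-record cluster IDs."""
    clusters = defaultdict(list)
    for index, cluster_id in enumerate(labels):
        clusters[cluster_id].append(index)

    # Members: every record index paired with its cluster ID.
    members = list(enumerate(labels))
    # Cores: one representative per cluster.
    # Assumption: the first member encountered stands in as the representative.
    cores = {cid: idxs[0] for cid, idxs in clusters.items()}
    # Singletons: records whose cluster has exactly one member.
    singletons = [idxs[0] for idxs in clusters.values() if len(idxs) == 1]
    return members, cores, singletons


members, cores, singletons = split_clusters([0, 1, 0, 2, 1])
print(members)
print(cores)
print(singletons)
```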
Promoted Parameters
Title in user interface (promoted name)
Outputs
Cluster Cores (cores): Representative members, one for each cluster
Type: dataset_out
Default: 2D_hierarchical_cluster_cores
Failed Records (failed): Dataset with failed records.
Required
Type: dataset_out
Default: 2D_hierarchical_cluster_failed
Cluster Members (members): Name of output member dataset, containing all cluster members.
Required
Type: dataset_out
Default: 2D_hierarchical_cluster_members
Floe Report Name (cluster_report_name):
Type: string
Default: 2D_hierarchical_cluster_report
Singletons (singletons): Name of output singletons dataset, containing clusters with only one member.
Type: dataset_out
Default: 2D_hierarchical_cluster_singletons
Fingerprint Generation
Use Pregenerated fingerprints (switch): If set to True, the floe will not generate fingerprints, and instead use the fingerprint field specified to provide pregenerated fingerprints for each molecule.
Required
Type: boolean
Default: False
Choices: [True, False]
Fingerprint Field (fingerprint_field): If fingerprints are generated within the Floe, this is the name of the fingerprint field that will contain the generated fingerprints. If fingerprints are pregenerated, this should be the field name containing the pregenerated fingerprints.
Required
Type: field_parameter
Default: Fingerprint
Fingerprint Type (fingerprint_type): If Use Pregenerated fingerprints is set to False, the type of fingerprint to be generated and used in the similarity calculation.
Type: string
Default: Circular
Choices: [‘Circular’, ‘Lingo’, ‘MACCS’, ‘Path’, ‘Tree’]
2D Similarity Calculation
2D Similarity Score Function (sim_type): The similarity measure used for the 2D similarity calculation.
Type: string
Default: OETanimoto
Choices: [‘OECosine’, ‘OEDice’, ‘OEEuclid’, ‘OEManhattan’, ‘OETanimoto’]
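For readers unfamiliar with these measures, here is a minimal sketch of the standard Tanimoto and Dice coefficients computed on sets of "on" bits. It does not call OEGraphSim; the function names and example bit positions are illustrative only, and the toolkit's own handling of edge cases may differ.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Shared on-bits divided by the number of distinct on-bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


def dice(bits_a: set, bits_b: set) -> float:
    """Twice the shared on-bits divided by the total on-bits of both."""
    total = len(bits_a) + len(bits_b)
    if total == 0:
        return 0.0  # same convention for empty fingerprints
    return 2 * len(bits_a & bits_b) / total


# Hypothetical fingerprints, represented by the positions of their on-bits.
fp_a = {1, 4, 7, 9}
fp_b = {4, 7, 12}
print(tanimoto(fp_a, fp_b), dice(fp_a, fp_b))
```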
Similarity Score Cutoff (sim_cutoff): Similarity scores below this value will be set to 0.
Type: decimal
Default: 0.05
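The effect of the cutoff can be sketched as a simple threshold applied before the similarity matrix is turned into distances. The conversion `distance = 1 - similarity` is an assumption used here for illustration; the exact conversion used by the floe is not stated.

```python
def apply_cutoff(similarity, cutoff=0.05):
    """Replace every score below the cutoff with 0.0."""
    return [[s if s >= cutoff else 0.0 for s in row] for row in similarity]


def to_distance(similarity):
    """Assumed conversion: distance = 1 - similarity."""
    return [[1.0 - s for s in row] for row in similarity]


# Hypothetical 2x2 similarity matrix with one very weak off-diagonal score.
sim = [
    [1.0, 0.03],
    [0.03, 1.0],
]
thresholded = apply_cutoff(sim, cutoff=0.05)
dist = to_distance(thresholded)
print(thresholded)
print(dist)
```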
Inputs
Input Dataset (in): Dataset to cluster
Required
Type: data_source
Advanced: Sort Input Dataset By Score
Use Score (use_rank): Use rank to sort hits in the directed sphere exclusion algorithm, and include rank information in the clustering report. YOU MUST SELECT A SCORE FIELD IN THE SCORE FIELD PARAMETER, BELOW, IF THIS IS SET TO TRUE.
Type: boolean
Default: False
Choices: [True, False]
Score Field (rank_field): Score field to be used for sorting during sphere exclusion, and analyzed in the floe report.
Type: field_parameter
Score Sort Order (sort_order): Sort order for scores. Descending means higher scores are more desirable. Ascending means lower scores are more desirable.
Type: string
Default: Descending
Choices: [‘Descending’, ‘Ascending’]
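The two sort orders can be sketched as follows, with records represented as plain dictionaries. The record layout, the `score` key, and the function name are hypothetical and stand in for whatever score field is selected.

```python
def sort_by_score(records, score_key, sort_order="Descending"):
    """Return records with the most desirable score first."""
    if sort_order not in ("Descending", "Ascending"):
        raise ValueError(f"unknown sort order: {sort_order!r}")
    # Descending: higher is better, so the largest scores come first.
    reverse = sort_order == "Descending"
    return sorted(records, key=lambda rec: rec[score_key], reverse=reverse)


# Hypothetical records with an illustrative score field.
records = [
    {"name": "mol_a", "score": 0.7},
    {"name": "mol_b", "score": 0.9},
    {"name": "mol_c", "score": 0.4},
]
best_high = [r["name"] for r in sort_by_score(records, "score", "Descending")]
best_low = [r["name"] for r in sort_by_score(records, "score", "Ascending")]
print(best_high, best_low)
```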
Advanced: Memory, Optimization, and Matrix File Output or Input
Output Distance Matrix (output_similarity_matrix): If True, writes distance matrix used in clustering to a file.
Type: boolean
Default: False
Choices: [True, False]
Distance Matrix Filename (similarity_matrix_filename): Filename of distance matrix.
Type: string
Default: clustering_distance_matrix.txt
Memory for Clustering Cubes (clustering_memory): For large datasets, try increasing the memory limit.
Type: decimal
Default: 8000
Similarity Tile Size (tile_size): Advanced parameter for tuning parallel speed.
Type: integer
Default: 100
Use Input Matrix File (input_sim_file): OPTIONAL: Set to True if you wish to input a precalculated distance or similarity matrix instead of having the floe calculate this for you. If set to True, the input dataset will be ignored. You must also input a row label file that has SMILES corresponding to each matrix row.
Type: boolean
Default: False
Choices: [True, False]
Similarity or Distance Matrix Input File (matrix_input_file): OPTIONAL: Matrix input file, which must be accompanied by a row labels file. The file must be a binary .npy file containing a valid 2D numpy ndarray. If a similarity matrix is input, it will be converted to a distance matrix for clustering. The matrix must be a valid square similarity or distance matrix with values normalized between 0 and 1.
Type: file_in
Row Label Input File (row_label_input_file): OPTIONAL: Row labels, which must be SMILES in a 1D numpy ndarray file with extension .npy, to accompany the matrix input file. The length of this array must match the dimension of the accompanying square matrix.
Type: file_in
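A matrix file and its row label file in the described .npy format could be prepared with NumPy roughly as follows. The filenames, SMILES, and similarity values are placeholders, and the checks only mirror the requirements stated above; the floe's own validation may be stricter or different.

```python
import os
import tempfile

import numpy as np

# Placeholder row labels (SMILES) and a matching 3x3 similarity matrix.
smiles = np.array(["CCO", "c1ccccc1", "CC(=O)O"])
similarity = np.array([
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.1],
    [0.5, 0.1, 1.0],
])


def check_inputs(matrix, labels):
    """Checks that mirror the documented requirements."""
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        raise ValueError("matrix must be a square 2D array")
    if labels.ndim != 1 or labels.shape[0] != matrix.shape[0]:
        raise ValueError("need exactly one row label per matrix row")
    if matrix.min() < 0.0 or matrix.max() > 1.0:
        raise ValueError("matrix values must lie between 0 and 1")


check_inputs(similarity, smiles)

# Hypothetical filenames, written to a temporary directory.
out_dir = tempfile.mkdtemp()
matrix_path = os.path.join(out_dir, "similarity_matrix.npy")
labels_path = os.path.join(out_dir, "row_labels.npy")
np.save(matrix_path, similarity)
np.save(labels_path, smiles)

# Reload to confirm the files round-trip as expected.
loaded = np.load(matrix_path)
loaded_labels = np.load(labels_path)
print(loaded.shape, loaded_labels.shape)
```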
Hierarchical Clustering
Number of Clusters (n_clusters): The number of clusters to find.
Required
Type: integer
Advanced: Hierarchical Clustering
Compute Full Tree (compute_full_tree): Controls whether construction of the tree stops early at n_clusters (False) or the full tree is computed (True). Stopping early is useful to decrease computation time if the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None. By default compute_full_tree is “auto”, which is equivalent to True when distance_threshold is not None or when n_clusters is less than the maximum of 100 and 0.02 * n_samples. Otherwise, “auto” is equivalent to False.
Required
Type: string
Default: auto
Choices: [‘auto’, ‘True’, ‘False’]
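The "auto" rule described above can be written out as a small helper. The function is hypothetical and only restates the documented rule using the three string choices of this parameter; it is not taken from the floe or from scikit-learn's source.

```python
def resolve_compute_full_tree(setting, n_clusters, n_samples,
                              distance_threshold=None):
    """Turn 'auto', 'True', or 'False' into a boolean per the stated rule."""
    if setting == "auto":
        return (distance_threshold is not None
                or n_clusters < max(100, 0.02 * n_samples))
    if setting not in ("True", "False"):
        raise ValueError(f"unknown setting: {setting!r}")
    value = setting == "True"
    if distance_threshold is not None and not value:
        raise ValueError("must be True when distance_threshold is set")
    return value


# 50 clusters is below max(100, 0.02 * 1000) = 100, so the full tree is built.
print(resolve_compute_full_tree("auto", n_clusters=50, n_samples=1000))
# 150 clusters is not below 100, so construction would stop early.
print(resolve_compute_full_tree("auto", n_clusters=150, n_samples=1000))
```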
Linkage (linkage): Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion. ‘average’ uses the average of the distances of each observation of the two sets. ‘complete’ or ‘maximum’ linkage uses the maximum distances between all observations of the two sets. ‘single’ uses the minimum of the distances between all observations of the two sets.
Required
Type: string
Default: average
Choices: [‘complete’, ‘average’, ‘single’]
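The three linkage criteria differ only in how they summarize the pairwise distances between two clusters. The sketch below computes each one for a small, made-up distance matrix; the function name and values are illustrative.

```python
def linkage_distance(dist, cluster_a, cluster_b, linkage="average"):
    """Distance between two clusters under the chosen linkage criterion."""
    pairs = [dist[i][j] for i in cluster_a for j in cluster_b]
    if linkage == "average":
        return sum(pairs) / len(pairs)
    if linkage == "complete":
        return max(pairs)
    if linkage == "single":
        return min(pairs)
    raise ValueError(f"unknown linkage: {linkage!r}")


# Hypothetical symmetric distance matrix for four records.
dist = [
    [0.0,   0.125, 0.5,   0.75],
    [0.125, 0.0,   0.25,  1.0],
    [0.5,   0.25,  0.0,   0.125],
    [0.75,  1.0,   0.125, 0.0],
]
for name in ("average", "complete", "single"):
    print(name, linkage_distance(dist, [0, 1], [2, 3], linkage=name))
```

With single linkage the two clusters look close because one pair of records is close, while complete linkage judges them by their most distant pair.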