Deciding on Clustering Method

The clustering floes in the cheminfo-floes package provide many options for clustering molecules based on their similarity score. This section provides basic guidance on which clustering floe to use based on your application and input data. For brief overviews of each clustering method, and links to specific tutorials, see the methods section.

Decision Tree

Size Constraints

Is your dataset larger than 10,000 molecules? If so, you are likely to get results much faster using the large scale clustering floes. If the dataset is larger than 30,000 molecules, the small scale clustering floes will fail.

For datasets smaller than 10,000 molecules, refer to the applications section to choose a clustering method based on a specific application, or to the small scale clustering section for other selection criteria. Please keep in mind that these are only general guidelines for choosing a clustering method. It is recommended to consult the academic literature to determine the best method for your use case.

Application-Based Clustering

Do you have a specific application in mind for clustering?

Small Scale Clustering

Do you want a specific number of clusters?

  • If yes, K-Medoids clustering produces exactly the number of clusters you request.

Does your dataset have a large number of outliers or lots of noise?

  • If yes, DBSCAN clustering is designed to handle noisy data and leaves outliers unclustered.

Do you want to have better control over the size of the largest cluster?

  • If yes, DBSCAN clustering allows you to choose a minimum or maximum size of the largest cluster.

Clustering Methods Overview

DBSCAN Clustering

The 2D and 3D DBSCAN clustering floes use scikit-learn's DBSCAN method. DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. Unlike K-Medoids or hierarchical clustering, DBSCAN views clusters as areas of high density separated by areas of low density. For more details, refer to the scikit-learn documentation.
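
For illustration, the following minimal sketch (not the floe itself) runs scikit-learn's DBSCAN on a precomputed Tanimoto distance matrix built from random placeholder fingerprints; the eps and min_samples values are arbitrary examples, not floe defaults.

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Placeholder fingerprints: 200 molecules x 1024 random bits (0/1).
    rng = np.random.default_rng(0)
    fps = rng.integers(0, 2, size=(200, 1024))

    # Pairwise Tanimoto distance: 1 - |A & B| / |A or B|.
    inter = fps @ fps.T
    counts = fps.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter
    dist = 1.0 - inter / union

    # Density-based clustering on the precomputed distance matrix;
    # points that fall in no dense region are labeled -1 (noise).
    labels = DBSCAN(eps=0.4, min_samples=5, metric="precomputed").fit_predict(dist)
    print("clusters:", len(set(labels) - {-1}), "noise points:", int((labels == -1).sum()))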

K-Medoids Clustering

The 2D and 3D K-Medoids clustering floes use scikit-learn-extra's K-Medoids method. K-Medoids tries to minimize the sum of distances between each point and the medoid of its cluster. Its principal advantages are that it can produce the exact number of requested clusters, and that it is less sensitive to outliers than other clustering methods such as k-means and hierarchical clustering. For more details, refer to the scikit-learn-extra documentation on K-Medoids.
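
A minimal sketch of the underlying call, using scikit-learn-extra's KMedoids on random placeholder descriptors rather than real molecular data; the cluster count is an arbitrary example.

    import numpy as np
    from sklearn_extra.cluster import KMedoids

    # Placeholder descriptors standing in for real molecular features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))

    # K-Medoids returns exactly the requested number of clusters, and each
    # cluster is represented by an actual dataset member (its medoid).
    km = KMedoids(n_clusters=5, metric="euclidean", random_state=0).fit(X)
    print("cluster sizes:", np.bincount(km.labels_))
    print("medoid indices:", km.medoid_indices_)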

Hierarchical Clustering

The 2D and 3D hierarchical clustering floes use scikit-learn's AgglomerativeClustering method to cluster with a bottom-up approach. Each member starts in its own cluster, and clusters are merged in a series of steps according to the selected linkage criterion. More details can be found in the scikit-learn documentation.
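
As a minimal illustration (again with placeholder data rather than the floe's own inputs), scikit-learn's AgglomerativeClustering can be called as follows; the linkage choice and cluster count are arbitrary examples.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # Placeholder descriptor vectors for 100 molecules.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 8))

    # Bottom-up merging: every point starts in its own cluster and pairs of
    # clusters are merged under the chosen linkage criterion until 5 remain.
    labels = AgglomerativeClustering(n_clusters=5, linkage="average").fit_predict(X)
    print("cluster sizes:", np.bincount(labels))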

Large Scale Clustering

With a large dataset, you have effectively three options for 2D or 3D clustering:

  1. Do you only need a diverse subset? Use the 2D or 3D Diverse Subset floe.

  2. Do you have a hitlist, or another type of dataset that has a score for each molecule? Use the 2D or 3D Hitlist Clustering floe.

  3. If the answer to both 1 and 2 is no, use the 2D or 3D Large Scale Clustering floe.

Hitlist Clustering

The 2D and 3D hitlist clustering floes take a large scale input dataset and require a float score field on each record. That score field is used to direct clustering, to rank clusters in the clustering report, and to provide information about the average or best rank in each cluster. For more details, see the hitlist clustering tutorial.
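
The sketch below is a hypothetical illustration of how a score field can direct sphere-exclusion clustering (better-scored molecules are visited first and become cluster centers); it is not the floe's actual implementation, and the distance matrix, scores, and radius are all placeholders.

    import numpy as np

    def directed_sphere_exclusion(dist, scores, radius):
        """Toy sketch: visit molecules from best to worst score; each still-
        unassigned molecule becomes a cluster center and claims every
        unassigned neighbor within `radius`."""
        order = np.argsort(scores)            # assumes lower score = better
        labels = np.full(len(scores), -1)
        for idx in order:
            if labels[idx] != -1:             # already claimed by a better-scored center
                continue
            members = np.where((labels == -1) & (dist[idx] <= radius))[0]
            labels[members] = idx             # cluster is labeled by its center index
        return labels

    # Placeholder data: 50 random points with random scores.
    rng = np.random.default_rng(0)
    pts = rng.random((50, 2))
    dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    scores = rng.random(50)
    labels = directed_sphere_exclusion(dist, scores, radius=0.2)
    print("number of clusters:", len(set(labels)))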

Diverse Subset Floes

The 2D and 3D diverse subset floes take an input dataset and select N members from it by clustering, using sphere exclusion for large datasets and K-Medoids for small datasets. For large datasets, the large scale diverse subset floe may be significantly faster than full clustering. For more details, see the diverse subset tutorial.
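
For the small scale case, a diverse subset can be sketched with K-Medoids: the medoid of each cluster is an actual dataset member, so the medoids together form the diverse subset. The data and subset size below are placeholders.

    import numpy as np
    from sklearn_extra.cluster import KMedoids

    # Placeholder descriptors for 500 molecules; pick 10 diverse representatives.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 6))

    # The 10 medoids are real dataset members and serve as the diverse subset.
    km = KMedoids(n_clusters=10, random_state=0).fit(X)
    print("diverse subset (record indices):", km.medoid_indices_)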

Large Scale Clustering

The large scale clustering floes use the same method (directed sphere exclusion) as the diverse subset floe and the hitlist floe. Unlike the diverse subset floe, they also output cluster members. Unlike the hitlist clustering floe, they don't require a score field for clustering or for sorting the output clusters. For more details, see the large scale clustering tutorial.