Large 2D Clustering for Hit-to-Lead
The Clustering: 2D Fingerprint, up to 100K compounds Floe clusters a dataset of up to 100,000 molecules by 2D fingerprint similarity, a scale suited to the Hit-to-Lead stage. It can then subsample each cluster to select the most promising molecules while keeping the selection structurally diverse.
The process begins by checking if a 2D fingerprint already exists for each molecule. If not, a 2D fingerprint is generated. A 2D fingerprint is a binary representation of a molecule’s structure, encoding features like substructures and atom types. For more information, see the Fingerprint Generation section of the GraphSim TK documentation.
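To make the idea of a binary fingerprint concrete, here is a minimal sketch of how a distance between two fingerprints could be computed. It assumes a Tanimoto-based distance (1 minus Tanimoto similarity), which is a common choice for 2D fingerprints but is not confirmed here as the metric the floe uses. The function name and the bit positions are hypothetical and purely illustrative.

```python
def tanimoto_distance(bits_a, bits_b):
    """Return 1 - Tanimoto similarity for two fingerprints.

    Each fingerprint is given as a set of "on" bit positions
    (an assumption made for this sketch).
    """
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical here
    return 1.0 - len(a & b) / union


# Hypothetical on-bit positions for three molecules (illustrative only).
fp_mol1 = {1, 4, 7, 9}
fp_mol2 = {1, 4, 7, 12}   # shares 3 of 5 distinct bits with fp_mol1
fp_mol3 = {2, 3, 20}      # shares no bits with fp_mol1

print(tanimoto_distance(fp_mol1, fp_mol2))
print(tanimoto_distance(fp_mol1, fp_mol3))
```

Molecules that set many of the same bits end up close together (distance near 0), while molecules with no bits in common are at the maximum distance of 1.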
Clustering
Once the fingerprints are available, the floe groups molecules with up to three clustering algorithms: K-medoids, DBSCAN, and Hierarchical clustering. Each takes a different approach to finding patterns in the data.
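K-medoids partitions the data into a predefined number of clusters, with each molecule assigned to the cluster whose medoid is nearest. A medoid is an actual molecule from the dataset chosen as the cluster's representative, unlike the computed mean used by K-means.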
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups closely packed points, marking as outliers points that lie alone in low-density regions.
Hierarchical clustering builds a tree of clusters, with the data points at the bottom and a single cluster at the top, allowing for analysis at various levels of granularity.
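As an illustration of how a medoid-based partition works on a precomputed distance matrix, the following is a minimal sketch of an alternating K-medoids procedure. It is not the floe's implementation: the function names, the naive initialization, and the toy distances (points on a line standing in for molecules) are all assumptions made for this example.

```python
def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, k, max_iter=100):
    """Alternating K-medoids on a square distance matrix (list of lists)."""
    medoids = list(range(k))  # naive start: the first k points (assumption)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # The new medoid is the member with the smallest total
            # distance to all members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy data: six points on a line, forming two obvious groups.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]

medoids, labels = k_medoids(dist, k=2)
print(medoids)  # indices of the representative points
print(labels)   # cluster index for each point
```

Because every medoid is a member of the dataset, each cluster comes with a real molecule that can serve as its representative.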
Sampling from Clusters
The next critical step is to sample from each cluster core. This is a strategic process where molecules are selected based on a “subsample criterion,” which could be a score from a previous screening step (e.g., docking score or experimental activity). This ensures the final, smaller sample represents the most promising candidates from each structural group, balancing chemical diversity with therapeutic relevance.
Step-by-Step Guide
1. Input
You can cluster from input molecules or a pregenerated distance matrix.
Input Dataset: Provide a dataset of molecules when clustering for the first time.
Rerun Clustering: Input Distance Matrix: For successive iterations, use the output collection from a previous run of this floe, which contains the calculated distance matrix between all molecule pairs.
2. Primary Clustering Options
Fingerprint Type: Selects the type of 2D fingerprint to be generated and used for similarity calculation. The options are: Circular, Lingo, MACCS, Path, and Tree.
Recommendation: Tree fingerprints are often preferred.
Note: This step generates the 2D fingerprint if it’s not already present.
Tile Size to break Distance Matrix to Chunk: Sets the size of the tiles into which the distance matrix is split. To handle large datasets and reduce cost, the floe computes the matrix tile by tile and keeps it in disk storage (np.memmap).
Recommendation: Leave this at the default value unless you are an advanced user managing very large datasets (up to 10^6 data points).
Clustering Variants: Select the clustering algorithms to run. By default, the floe supports all three variants: DBSCAN, Hierarchical, and K-medoid.
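The tiling idea behind the tile size parameter can be sketched as follows. This is a simplified, standard-library illustration and not the floe's code: it fills a flat in-memory array where the floe is described as using np.memmap on disk, and the function names and the Tanimoto-based distance are assumptions made for this example.

```python
from array import array


def iter_tiles(n, tile_size):
    """Yield (row_start, row_end, col_start, col_end) blocks covering an n x n matrix."""
    for r0 in range(0, n, tile_size):
        for c0 in range(0, n, tile_size):
            yield r0, min(r0 + tile_size, n), c0, min(c0 + tile_size, n)


def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for fingerprints given as sets of on-bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def fill_distance_matrix(fps, tile_size):
    """Compute all pairwise distances one tile at a time.

    The result is a flat, row-major array of n * n doubles; a disk-backed
    buffer could be substituted for `store` without changing the loop.
    """
    n = len(fps)
    store = array("d", [0.0]) * (n * n)
    for r0, r1, c0, c1 in iter_tiles(n, tile_size):
        for i in range(r0, r1):
            for j in range(c0, c1):
                store[i * n + j] = tanimoto_distance(fps[i], fps[j])
    return store


# Hypothetical fingerprints as sets of on-bit positions.
fps = [{1, 2}, {2, 3}, {1, 2, 3}, {7}, {1}]
matrix = fill_distance_matrix(fps, tile_size=2)

print(len(list(iter_tiles(len(fps), 2))))  # number of tiles for a 5 x 5 matrix
print(matrix[0 * len(fps) + 1])            # distance between molecules 0 and 1
```

Only one tile's worth of work is in flight at a time, so the tile size trades the number of passes against the memory needed per pass.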
3. Algorithm-Specific Parameters
Number of K-Medoids Clusters (K-Medoid Cluster 2D/3D): Specifies the desired number of clusters for the K-medoids algorithm (e.g., 3 was used in a sample analysis).
Distance Threshold (Hierarchical Cluster 2D/3D): Sets the maximum linkage distance for merging clusters in Hierarchical clustering. Clusters will not be merged above this threshold.
Epsilon (eps) (DBScan Cluster 2D/3D): Controls the density-based clustering for DBSCAN. It is the neighborhood radius: the maximum distance at which two molecules are considered neighbors when identifying dense regions. Increase eps to cluster more molecules together in fewer clusters and decrease it for more clusters.
Note: If not provided, an eps value will be estimated based on the input data. A sample analysis used an eps of 0.05.
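The effect of the two distance-based parameters above can be seen in a toy sketch. Both functions below are simplified illustrations, not the floe's implementation: the hierarchical example assumes single linkage (the linkage used by the floe is not stated here), the DBSCAN example assumes a hypothetical min_samples of 2, and the data are points on a line standing in for molecules.

```python
def single_linkage_cut(dist, threshold):
    """Single-linkage clusters cut at `threshold`.

    Any two points within `threshold` of each other end up in the same
    cluster; clusters farther apart than that are never merged.
    """
    n = len(dist)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(j)] = find(i)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


def dbscan(dist, eps, min_samples=2):
    """Minimal DBSCAN on a precomputed distance matrix; -1 marks noise."""
    n = len(dist)
    # A point's neighborhood includes the point itself.
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


positions = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(a - b) for b in positions] for a in positions]

print(single_linkage_cut(dist, threshold=1.5))
print(single_linkage_cut(dist, threshold=8.5))
print(dbscan(dist, eps=1.5))
print(dbscan(dist, eps=10))
```

In both cases a larger value merges the two nearby groups into one. The isolated point stays on its own: the hierarchical cut reports it as a one-member cluster, while DBSCAN labels it as noise.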
4. Subsampling and Output
The floe includes an optional Subsampling feature to select representative molecules from the generated clusters.
Field to Subsample: Select a hit list criterion, such as solubility (loguM).
How to subsample from each cluster: Choose how the n samples are picked from each cluster: those with the Highest or Lowest values of the chosen field, or a Random selection.
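The per-cluster selection can be sketched as below. This is an illustrative example only: the record layout, the function name, the "solubility" key, and the molecule names are hypothetical, and the sketch samples from all cluster members rather than making any assumption about how the floe treats core members.

```python
import random
from collections import defaultdict


def subsample(records, field, how="Highest", n=1, seed=0):
    """Pick up to n records per cluster using Highest, Lowest, or Random."""
    by_cluster = defaultdict(list)
    for rec in records:
        by_cluster[rec["ClusterID"]].append(rec)

    rng = random.Random(seed)  # seeded so the Random mode is repeatable
    picked = []
    for cluster_id in sorted(by_cluster):
        members = by_cluster[cluster_id]
        if how == "Random":
            chosen = rng.sample(members, min(n, len(members)))
        elif how in ("Highest", "Lowest"):
            ranked = sorted(members, key=lambda r: r[field],
                            reverse=(how == "Highest"))
            chosen = ranked[:n]
        else:
            raise ValueError(f"unknown subsample mode: {how!r}")
        picked.extend(chosen)
    return picked


# Hypothetical clustered records with a solubility value to rank by.
records = [
    {"name": "mol_a", "ClusterID": 0, "solubility": 2.1},
    {"name": "mol_b", "ClusterID": 0, "solubility": 3.4},
    {"name": "mol_c", "ClusterID": 1, "solubility": 1.2},
    {"name": "mol_d", "ClusterID": 1, "solubility": 0.7},
    {"name": "mol_e", "ClusterID": 1, "solubility": 2.9},
]

best = subsample(records, "solubility", how="Highest", n=1)
print([r["name"] for r in best])
```

Selecting within each cluster, rather than across the whole dataset, is what keeps every structural group represented in the final list.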
Floe Output
The floe outputs a new dataset that includes the cluster results, with four added fields for each input molecule: ClusterID, isCore, distToCore, and isSingleton.
Output Dataset (Hierarchical Subsample Writer): The selected molecules from the Hierarchical clustering.
Output Dataset (DBSCAN Subsample Writer): The selected molecules from the DBSCAN clustering.
Output Dataset (KMedoids Subsample Writer): The selected molecules from the K-medoid clustering.
Output Dataset (All Cluster Members): Contains all input data records with the clustering information.