Clustering: 2D Scaffold (~25M compounds)

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

Description

The floe first clusters compounds by their Bemis-Murcko core (the ring systems and the linkers that connect them, with side chains removed), so that molecules sharing a structural backbone fall into the same cluster. Cores can be compared in two ways: colored (atom types and bond orders are taken into account) or uncolored (only connectivity is considered).
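
To make the colored and uncolored distinction concrete, the following is a minimal sketch on a toy molecular graph, not the floe's implementation. The functions murcko_framework, framework_key and ring_with_methyl are hypothetical names introduced here. The sketch assumes a molecule is given as a dict of atom symbols and a dict of bond orders, and it strips side chains by repeatedly deleting atoms that have a single neighbour. The key it builds depends on atom numbering, so it only compares molecules that are numbered the same way; a real workflow would rely on a cheminformatics toolkit for canonical scaffolds.

```python
from collections import defaultdict


def murcko_framework(atoms, bonds):
    """Return the atom indices left after stripping acyclic side chains.

    atoms: dict mapping atom index -> element symbol
    bonds: dict mapping frozenset({i, j}) -> bond order
    """
    neighbours = defaultdict(set)
    for bond in bonds:
        i, j = tuple(bond)
        neighbours[i].add(j)
        neighbours[j].add(i)
    kept = set(atoms)
    # Atoms with at most one neighbour cannot be part of a ring or a linker.
    leaves = [a for a in kept if len(neighbours[a]) <= 1]
    while leaves:
        atom = leaves.pop()
        if atom not in kept:
            continue  # already removed via an earlier queue entry
        kept.discard(atom)
        for nb in neighbours.pop(atom, set()):
            neighbours[nb].discard(atom)
            if len(neighbours[nb]) <= 1:
                leaves.append(nb)
    return kept


def framework_key(atoms, bonds, kept, colored):
    """Describe the framework; uncolored keys hide elements and bond orders."""
    atom_part = sorted((i, atoms[i] if colored else "*") for i in kept)
    bond_part = sorted(
        (tuple(sorted(bond)), order if colored else 1)
        for bond, order in bonds.items()
        if bond <= kept
    )
    return atom_part, bond_part


def ring_with_methyl(ring_elements):
    """Six-membered ring (atoms 0-5) with a methyl carbon (atom 6) on atom 0."""
    atoms = dict(enumerate(ring_elements))
    atoms[6] = "C"
    bonds = {frozenset((i, (i + 1) % 6)): 2 if i % 2 == 0 else 1 for i in range(6)}
    bonds[frozenset((0, 6))] = 1
    return atoms, bonds


t_atoms, t_bonds = ring_with_methyl("CCCCCC")  # toluene
p_atoms, p_bonds = ring_with_methyl("CCCNCC")  # 4-methylpyridine
t_core = murcko_framework(t_atoms, t_bonds)
p_core = murcko_framework(p_atoms, p_bonds)

same_uncolored = (framework_key(t_atoms, t_bonds, t_core, colored=False)
                  == framework_key(p_atoms, p_bonds, p_core, colored=False))
same_colored = (framework_key(t_atoms, t_bonds, t_core, colored=True)
                == framework_key(p_atoms, p_bonds, p_core, colored=True))
print(same_uncolored, same_colored)
```

In this toy case the two molecules would share a cluster when cores are uncolored (both reduce to a six-membered ring) and fall into different clusters when cores are colored (one ring contains a nitrogen).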

The clustering algorithm runs in O(n) time and is parallelized, which lets it handle millions of compounds. The workflow has been stress-tested on a dataset of 5 million compounds at a cost of $18 and on a dataset of 25 million compounds at a cost of $25 (about $3.60 and $1.00 per million compounds, respectively).
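
One way linear-time, parallel-friendly clustering can work is hash-based grouping: each molecule is visited once and appended to the bucket for its scaffold key, and chunks processed separately are merged afterwards. The sketch below illustrates that idea under assumptions; it is not the floe's code. The function names, the molecule identifiers and the use of scaffold SMILES strings as keys are all hypothetical.

```python
from collections import defaultdict


def cluster_by_scaffold(records):
    """Group (molecule_id, scaffold_key) pairs in a single pass over the input."""
    clusters = defaultdict(list)
    for mol_id, scaffold_key in records:
        clusters[scaffold_key].append(mol_id)
    return dict(clusters)


def merge_clusters(partials):
    """Combine per-chunk results, as a parallel run could after its workers finish."""
    merged = defaultdict(list)
    for partial in partials:
        for key, members in partial.items():
            merged[key].extend(members)
    return dict(merged)


# Hypothetical input: molecule IDs paired with precomputed scaffold keys.
records = [
    ("mol1", "c1ccccc1"),
    ("mol2", "c1ccncc1"),
    ("mol3", "c1ccccc1"),
    ("mol4", "C1CCCCC1"),
]

whole = cluster_by_scaffold(records)

# Splitting the input into chunks and merging gives the same clusters.
chunks = [records[:2], records[2:]]
merged = merge_clusters(cluster_by_scaffold(chunk) for chunk in chunks)
print(merged == whole)
```

Because each record is touched once and dictionary insertion takes constant time on average, the total work grows linearly with the number of molecules, and the chunks can be handled by independent workers.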

After clustering, the next step is reservoir sampling, which picks a small sample of compounds from each cluster. The result is a much smaller subset of the dataset that still preserves the chemical diversity of the entire library: because the sample is drawn from every cluster, each structural motif is represented in it.
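
Reservoir sampling draws a fixed-size uniform sample from a stream without knowing its length in advance. The sketch below uses the classic Algorithm R and applies it per cluster. It is an illustration under assumptions, not the floe's implementation: the function names, the sample size k and the seed are hypothetical, and the source does not state how many compounds the floe keeps per cluster.

```python
import random


def reservoir_sample(stream, k, rng):
    """Algorithm R: uniform random sample of up to k items from a stream."""
    reservoir = []
    for seen, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / seen.
            slot = rng.randrange(seen)
            if slot < k:
                reservoir[slot] = item
    return reservoir


def sample_clusters(clusters, k, seed=0):
    """Draw up to k members from every cluster so each scaffold is represented."""
    rng = random.Random(seed)
    return {key: reservoir_sample(members, k, rng)
            for key, members in clusters.items()}


# Hypothetical clusters: one large, one containing a single molecule.
clusters = {
    "scaffold_A": [f"A{i}" for i in range(1000)],
    "scaffold_B": ["B0"],
}
subset = sample_clusters(clusters, k=3)
print({key: len(members) for key, members in subset.items()})
```

Clusters smaller than k are returned whole, so rare scaffolds are never dropped from the subset.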

Because this floe typically expects large inputs (over 100K), the recommended storage in Orion is collections and files rather than datasets. Other floes in the ETL package convert a dataset into one of these two formats. For output, molecules are sharded by scaffold cluster, so that molecules from the same cluster are stored together. This sharding strategy allows subsequent computational tasks (such as docking or molecular dynamics) to run independently on the sample for each cluster, further accelerating the screening process.

Promoted Parameters

Title in user interface (promoted name)

Inputs

Collection (c_in): Input collection of molecules to cluster. Provide either a collection or a file.

Because this floe handles large inputs (>100K), we recommend using collections and files rather than datasets.

  • Type: collection_source

Input File(s) (f_in): Input file(s) of molecules to cluster. Provide either a collection or a file.

  • Type: file_in

Outputs

Label Record Collection (c1): Output collection of clustered molecules, with molecules that share a scaffold stored in the same shards.

  • Required

  • Type: collection_sink

  • Default: Cluster Labelled Molecules

Cluster Core Collection (c2): Collection containing a single molecule picked from every scaffold.

  • Required

  • Type: collection_sink

  • Default: Cluster Core Molecules

Failed Molecule Output (f): Output dataset to which failed molecules are written.

  • Required

  • Type: dataset_out

  • Default: failure_core_data_out

Output Reservoir Sub Sample (s): Output dataset containing a diverse subset sampled from each scaffold cluster.

  • Required

  • Type: dataset_out

  • Default: output molecule sample