Prepare Collection for Fast Similarity or Substructure Search from Dataset

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

  • Role-based/Medicinal Chemist

  • Task-based/Library Prep & Design/Substructure & Similarity Search

  • Solution-based/Virtual-screening/DB Search/2D Similarity and SubSearch


Prepares a collection for fast similarity or substructure search from an input dataset. The floe screens molecules for filtering characteristics so that they can subsequently be searched more quickly in the Fast Substructure Search floe, and generates a collection with multiple types of fingerprints so that the molecules can be searched in the Fast Similarity Search floe.

Promoted Parameters

Title in user interface (promoted name)


Input Dataset (in): Input dataset, containing a molecule on each record.

  • Required

  • Type: data_source

Advanced Parallelism Settings

Maximum Parallel Cube Instances (max_parallel_central): Maximum number of cube instances running at any one time for the parallel part of the search preparation computation.

  • Type: integer

  • Default: 1000

Number of messages to distribute at a time (item_count_central): Maximum number of units of work sent to each parallel cube.

  • Type: integer

  • Default: 10000
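
The two parallelism settings interact, and a rough estimate can help when choosing values. The sketch below is illustrative arithmetic only, not Orion's scheduler: it assumes that work is split into batches of at most item_count_central units and that no more than max_parallel_central cube instances are busy at once. The function name and the notion of a "batch" are hypothetical.

```python
def estimate_batches(n_work_units, item_count=10000, max_parallel=1000):
    """Return (number of batches, upper bound on busy cube instances).

    Assumption: each batch holds at most `item_count` units of work and
    at most `max_parallel` cube instances run at the same time.
    """
    if n_work_units < 0 or item_count <= 0 or max_parallel <= 0:
        raise ValueError("counts must be non-negative and limits positive")
    # Integer ceiling division avoids floating-point rounding.
    batches = (n_work_units + item_count - 1) // item_count
    busy = min(batches, max_parallel)
    return batches, busy


if __name__ == "__main__":
    # 20 million units with the documented defaults.
    print(estimate_batches(20_000_000))
```

Under these assumptions, an input with fewer batches than max_parallel_central never reaches the cube limit, so raising the limit further would have no effect.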


Substructure Search Input Collection Name (coll_name): Name of fast substructure search prepared collection.

  • Required

  • Type: collection_sink

  • Default: Fast Substructure Search Input Collection

Similarity Search Input Collection Name (sim_coll_name): Name of fast similarity search prepared collection.

  • Required

  • Type: collection_sink

  • Default: Fast Similarity Search Input Collection

Advanced: Large Inputs

Number of Molecules per Shard (mols_per_shard): Number of molecules per shard. Because each shard is searched in parallel, this controls the granularity of the subsequent fast similarity or substructure search floe run.

  • Type: integer

  • Default: 250000
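
To see how mols_per_shard affects granularity, the following sketch computes how many shards a given input would produce. It assumes shards are filled in order up to the limit, with any remainder in a final smaller shard; how the floe actually packs shards is not described here, and the function name is hypothetical.

```python
def shard_sizes(n_molecules, mols_per_shard=250000):
    """Return the list of shard sizes for an input of n_molecules.

    Assumption: every shard except possibly the last is completely full.
    """
    if n_molecules < 0 or mols_per_shard <= 0:
        raise ValueError("n_molecules must be >= 0 and mols_per_shard > 0")
    full, remainder = divmod(n_molecules, mols_per_shard)
    sizes = [mols_per_shard] * full
    if remainder:
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    sizes = shard_sizes(600_000)
    print(len(sizes), sizes)
```

A smaller mols_per_shard yields more shards, and therefore more units that a later search can process in parallel, at the cost of more per-shard overhead.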

File Reader Disk Space (MiB) (disk_space): Disk space (in MiB) for the file reader. This value must be larger than the total size of the file(s) provided as input.

  • Type: decimal

  • Default: 5120
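
Before uploading large inputs, it can be useful to check the disk_space requirement locally. The sketch below sums the sizes of local files in MiB (1 MiB = 1024 × 1024 bytes) and compares the total against a disk_space value. The helper names are hypothetical, and the check only covers local file sizes; it does not query Orion.

```python
import os

MIB = 1024 * 1024  # bytes per mebibyte


def total_size_mib(paths):
    """Total size of the given local files, in MiB."""
    return sum(os.path.getsize(p) for p in paths) / MIB


def fits_disk_space(paths, disk_space_mib=5120.0):
    """True if disk_space_mib is strictly larger than the total input size."""
    return total_size_mib(paths) < disk_space_mib


if __name__ == "__main__":
    import sys

    files = sys.argv[1:]
    print(f"total: {total_size_mib(files):.1f} MiB, "
          f"fits default: {fits_disk_space(files)}")
```

If the check fails, increase disk_space to a value above the reported total.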