Prepare Collection for Fast Similarity or Substructure Search from Dataset
Category Paths
Follow one of these paths in the Orion user interface to find the floe.
Role-based/Medicinal Chemist
Task-based/Library Prep & Design/Substructure & Similarity Search
Solution-based/Virtual-screening/DB Search/2D Similarity and SubSearch
Description
Prepares a collection for fast similarity or substructure search from an input dataset. This floe screens molecules for filtering characteristics so that they can subsequently be searched more quickly in the Fast Substructure Search floe, and it generates a collection with multiple types of fingerprints so that the molecules can be searched in the Fast Similarity Search floe.
Promoted Parameters
Title in user interface (promoted name)
Inputs
Input Dataset (in): Input dataset, containing a molecule on each record.
Required
Type: data_source
Advanced Parallelism Settings
Maximum Parallel Cube Instances (max_parallel_central): Maximum number of cube instances running at any one time for the parallel part of the search preparation computation.
Type: integer
Default: 1000
Outputs
Make Fast Similarity Search Input Collection (make_sim_coll): Determines whether to make the fast similarity search prepared collection.
Required
Type: boolean
Default: True
Choices: [True, False]
Fast Substructure Search Input Collection Choice (ss_coll_choice): Determines the query type of the fast substructure search prepared collection: Both (default), MDL, SMARTS, or None. This floe will fail if None is chosen for this parameter and Make Fast Similarity Search Input Collection is turned off.
Required
Type: string
Default: Both
Choices: [‘Both’, ‘MDL’, ‘SMARTS’, ‘None’]
Substructure Search Input Collection Name (coll_name): Name of the fast substructure search prepared collection.
Required
Type: collection_sink
Default: Fast Substructure Search Input Collection
Similarity Search Input Collection Name (sim_coll_name): Name of the fast similarity search prepared collection.
Required
Type: collection_sink
Default: Fast Similarity Search Input Collection
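The interaction between make_sim_coll and ss_coll_choice described above can be checked before launching the floe. The following is a minimal sketch of such a check, not the floe's own validation code: the function name check_output_choices and the returned labels are hypothetical, and only the parameter names, the listed choices, and the failure rule are taken from this page.

```python
# Hypothetical pre-flight check; not part of the floe or of any Orion API.
# The allowed choices mirror the "Choices" list for ss_coll_choice.
VALID_SS_CHOICES = ("Both", "MDL", "SMARTS", "None")


def check_output_choices(make_sim_coll: bool, ss_coll_choice: str) -> list:
    """Return labels for the collections that would be requested.

    Raises ValueError when the combination requests no output at all,
    which is the case this page says makes the floe fail.
    """
    if ss_coll_choice not in VALID_SS_CHOICES:
        raise ValueError(f"ss_coll_choice must be one of {VALID_SS_CHOICES}")

    outputs = []
    if make_sim_coll:
        outputs.append("similarity")
    if ss_coll_choice != "None":
        outputs.append("substructure")

    if not outputs:
        # make_sim_coll is off and ss_coll_choice is "None".
        raise ValueError("No output collection requested; the floe would fail.")
    return outputs
```

With the defaults (make_sim_coll set to True and ss_coll_choice set to Both), this sketch reports both a similarity and a substructure collection.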
Advanced: Large Inputs
Number of Molecules per Shard (mols_per_shard): Number of molecules per shard. Since each shard is searched in parallel, this controls the granularity of the subsequent fast similarity or substructure search floe run.
Required
Type: integer
Default: 250000
File Reader Disk Space (MiB) (disk_space): This size (in MiB) must be larger than the total size of the file(s) provided as input.
Type: decimal
Default: 6120
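The two large-input settings above lend themselves to a quick estimate before a run. The sketch below is illustrative only and rests on stated assumptions: that shards are filled up to mols_per_shard molecules each (the floe's actual sharding strategy is not documented here), and that the input files are available on a local file system so their sizes can be read. The helper names shard_count, total_size_mib, and disk_space_is_sufficient are hypothetical.

```python
import math
import os

MIB = 1024 * 1024  # bytes per mebibyte


def shard_count(num_molecules: int, mols_per_shard: int = 250_000) -> int:
    """Estimate the number of shards, ASSUMING each shard is filled to capacity."""
    if num_molecules < 0:
        raise ValueError("num_molecules must be non-negative")
    if mols_per_shard <= 0:
        raise ValueError("mols_per_shard must be positive")
    return math.ceil(num_molecules / mols_per_shard)


def total_size_mib(paths) -> float:
    """Total size of the given local files, in MiB."""
    return sum(os.path.getsize(p) for p in paths) / MIB


def disk_space_is_sufficient(paths, disk_space_mib: float = 6120.0) -> bool:
    """True when disk_space is strictly larger than the total input size."""
    return total_size_mib(paths) < disk_space_mib
```

Under the even-filling assumption, an input of 1,000,000 molecules with the default mols_per_shard of 250000 gives 4 shards; a smaller mols_per_shard gives more, smaller shards and therefore finer-grained parallelism in the later search floe.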