Prepare Collection for Fast Similarity or Substructure Search from Dataset

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

Role-based/Medicinal Chemist

Task-based/Library Prep & Design/Substructure & Similarity Search

Solution-based/Virtual-screening/DB Search/2D Similarity and SubSearch

Description

Prepares a collection for fast similarity or substructure search from an input dataset. This floe screens molecules for filtering characteristics so that they can subsequently be searched more quickly in the Fast Substructure Search floe, and generates a collection with multiple types of fingerprints so that the molecules can be searched in the Fast Similarity Search floe.

Promoted Parameters

Title in user interface (promoted name)

Inputs

Input Dataset (in): Input dataset, containing a molecule on each record.

Required

Type: data_source

Advanced Parallelism Settings

Maximum Parallel Cube Instances (max_parallel_central): Maximum number of cube instances running at any one time for the parallel part of the search preparation computation.

Type: integer

Default: 1000

Outputs

Make Fast Similarity Search Input Collection (make_sim_coll): Determines whether to make the fast similarity search prepared collection.

Required

Type: boolean

Default: True

Choices: [True, False]

Fast Substructure Search Input Collection Choice (ss_coll_choice): Determines the query type of the fast substructure search prepared collection: Both (default), MDL, SMARTS, or None. This floe will fail if None is chosen for this parameter and Make Fast Similarity Search Input Collection is turned off.

Required

Type: string

Default: Both

Choices: [‘Both’, ‘MDL’, ‘SMARTS’, ‘None’]
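The failure condition described for ss_coll_choice can be checked before launching the floe. The following is a minimal sketch in plain Python; the helper name check_output_choices is hypothetical and is not part of the Orion API. It only mirrors the documented rule that choosing None while Make Fast Similarity Search Input Collection is turned off leaves the floe with nothing to produce.

```python
# Hypothetical pre-flight check; not part of the Orion API.
# It mirrors the documented rule for make_sim_coll and ss_coll_choice.
SS_COLL_CHOICES = ("Both", "MDL", "SMARTS", "None")


def check_output_choices(make_sim_coll: bool, ss_coll_choice: str = "Both") -> None:
    """Raise ValueError for parameter combinations the floe would reject."""
    if ss_coll_choice not in SS_COLL_CHOICES:
        raise ValueError(
            f"ss_coll_choice must be one of {SS_COLL_CHOICES}, got {ss_coll_choice!r}"
        )
    # With no substructure collection and no similarity collection,
    # there is no output to prepare, so the floe fails.
    if ss_coll_choice == "None" and not make_sim_coll:
        raise ValueError(
            "ss_coll_choice is 'None' and make_sim_coll is False: "
            "no collection would be produced"
        )


check_output_choices(make_sim_coll=True, ss_coll_choice="None")   # similarity only
check_output_choices(make_sim_coll=False, ss_coll_choice="MDL")   # substructure only
```

Note that the choice is the string 'None', not the Python value None, which matches the string type listed for this parameter.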

Substructure Search Input Collection Name (coll_name): Name of fast substructure search prepared collection.

Required

Type: collection_sink

Default: Fast Substructure Search Input Collection

Similarity Search Input Collection Name (sim_coll_name): Name of fast similarity search prepared collection.

Required

Type: collection_sink

Default: Fast Similarity Search Input Collection

Advanced: Large Inputs

Number of Molecules per Shard (mols_per_shard): Number of molecules per shard. Since each shard is searched in parallel, this controls the granularity of the subsequent fast similarity or substructure search floe run.

Type: integer

Default: 250000
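To get a feel for how mols_per_shard affects granularity, the sketch below estimates the number of shards for a given input size. It assumes that every shard except possibly the last is filled to exactly mols_per_shard molecules, which this page does not confirm; the function name shard_count is hypothetical.

```python
# Rough estimate only: assumes shards are filled to mols_per_shard,
# which the floe documentation does not state explicitly.
def shard_count(num_molecules: int, mols_per_shard: int = 250_000) -> int:
    """Return the estimated number of shards (ceiling division)."""
    if num_molecules < 0:
        raise ValueError("num_molecules must be non-negative")
    if mols_per_shard <= 0:
        raise ValueError("mols_per_shard must be positive")
    return (num_molecules + mols_per_shard - 1) // mols_per_shard


print(shard_count(1_000_000))           # 4
print(shard_count(1_000_001))           # 5
print(shard_count(1_000_000, 100_000))  # 10
```

Under this assumption, a smaller mols_per_shard yields more shards, and therefore more units that the subsequent search floe can process in parallel.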

File Reader Disk Space (MiB) (disk_space): This size (in MiB) must be larger than the total size of the file(s) provided as input.

Type: decimal

Default: 6120
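The disk_space requirement can be checked locally before uploading input files. This is a minimal sketch using only the Python standard library; the helper names total_size_mib and fits_disk_space are hypothetical, and the check simply compares the summed file sizes with the configured value in MiB (1 MiB = 1024 × 1024 bytes).

```python
import os

MIB = 1024 * 1024  # bytes per mebibyte


def total_size_mib(paths):
    """Return the combined size of the given files in MiB."""
    return sum(os.path.getsize(p) for p in paths) / MIB


def fits_disk_space(paths, disk_space_mib=6120.0):
    """True if disk_space_mib is strictly larger than the total input size."""
    return total_size_mib(paths) < disk_space_mib
```

If fits_disk_space returns False for the intended input files, disk_space should be raised above the value reported by total_size_mib.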