Prepare Giga Collections

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

Product-based/FastROCS

Product-based/Gigadock

Role-based/Computational Chemist

Solution-based/Virtual-screening/DB Preparation

Task-based/Library Prep & Design/Conformer Generation

Description

Constructs Giga-Docking and FastROCS collections from input file(s) or collection(s).

This floe filters the input molecules and generates conformers for the Gigadock and FastROCS collections, with a maximum of 200 and 10 conformers per molecule, respectively. The filtering is configurable (see the “Filtering: Basic Properties”, “Filtering: SMARTS” and “Filtering: OEFilter” parameter groups). Use the ‘Dry Run’ switch to examine, at negligible cost, the contents of the input files (see the Floe report) and to see how many molecules pass each filter (examine the port counts) without generating conformers or creating any collections.

The input molecules are expected to come from a collection output by one of the generative design floes or, more commonly, from an input file. Supported file formats are: can, can.gz, csv, csv.bz2, csv.gz, cxsmiles, cxsmiles.gz, ism, ism.gz, isosmi, isosmi.gz, mol2, mol2.gz, oeb, oeb.gz, oedb, oez, sd, sd.gz, sdf, sdf.gz, smi, smi.gz, syb, syb.gz, tar, tar.bz2, tar.gz, tgz, tsv, tsv.bz2, tsv.gz, usm, usm.gz, zip. Comma- and tab-separated files (i.e., csv and tsv) are expected to have SMILES representations of molecules in one of the columns. For archive formats (e.g., .tar and .zip), this floe will read molecules from files in the archive that are of a supported format.

IMPORTANT: For comma- and tab-separated files (i.e., CSV and TSV), this floe cannot automatically identify molecule titles from the file, and by default will create molecules with no titles, which will then be filtered out (i.e., the floe will read in no molecules). When working with CSV/TSV input files, you should run the floe once with ‘Options -> Dry Run’ and ‘Options -> Allow Empty Titles’ both on. A floe report will be created that you can examine to see the column/field titles in the CSV/TSV file. In subsequent runs you can then specify the column/field you want to use as the molecule title (it must be a string field) using the ‘Format Specific Settings -> Mol Title Field’ parameter. If the file doesn’t have any appropriate title columns/fields, you can leave ‘Options -> Allow Empty Titles’ on in subsequent runs.
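
Before the first dry run, it can help to preview the column headers locally so you know which name to supply to ‘Mol Title Field’. The following is a minimal sketch and not part of the floe: the helper name `column_names` is hypothetical, it handles only uncompressed .csv and .tsv files, and it assumes the first row of the file is a header row.

```python
import csv
from pathlib import Path


def column_names(path):
    """Return the header row of an uncompressed CSV or TSV file.

    Hypothetical helper for local inspection only; the floe itself lists
    the column/field titles in its Floe report.
    """
    path = Path(path)
    # Choose the delimiter from the extension: tab for .tsv, comma otherwise.
    delimiter = "\t" if path.suffix.lower() == ".tsv" else ","
    with path.open(newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        # next(..., []) yields an empty list when the file has no rows.
        return next(reader, [])
```

If the returned list contains a suitable string column (for instance an identifier column), that column name is the value to enter for ‘Format Specific Settings -> Mol Title Field’.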

Disk Space Considerations: With file input there is a limit on the size of the file. The disk space available for a file is specified by the ‘Hardware->Disk Space’ parameter. For non-archive files this is the maximum input file size. For archive files (e.g., tar and zip), each file is extracted from the archive, processed, and then cleaned up before the next file in the archive is extracted and processed. The disk space required is thus the size of the archive file plus the size of the largest single file in the archive once it has been extracted.
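
The disk space rule above can be checked locally before uploading. The following is a rough sketch under stated assumptions: the function names are hypothetical, only zip and tar archives are inspected, and 1 MB is taken to be 1024 * 1024 bytes, which this page does not confirm.

```python
import os
import tarfile
import zipfile


def estimate_required_bytes(path):
    """Estimate disk space needed to read one input file.

    For archives: archive size plus the largest extracted member.
    For any other file: the size of the file itself.
    """
    file_bytes = os.path.getsize(path)
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            sizes = [i.file_size for i in zf.infolist() if not i.is_dir()]
    elif tarfile.is_tarfile(path):
        with tarfile.open(path) as tf:
            sizes = [m.size for m in tf.getmembers() if m.isfile()]
    else:
        return file_bytes
    # Only one member is on disk at a time, so only the largest matters.
    return file_bytes + max(sizes, default=0)


def fits_disk_space(path, disk_space_mb=102400):
    """Compare the estimate with a 'Disk Space' setting given in MB.

    Assumption: 1 MB = 1024 * 1024 bytes.
    """
    return estimate_required_bytes(path) <= disk_space_mb * 1024 * 1024
```

If `fits_disk_space` returns False for an input, either increase ‘Hardware->Disk Space’ or split the largest file in the archive.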

Promoted Parameters

Title in user interface (promoted name)

Inputs

Input File(s) (input_file): One or more input files to convert into docking & FastROCS collections.

Type: file_in

Inputs : Alternate

Input Dataset(s) (input_datasets): One or more input datasets to convert into Gigadock and FastROCS collections.

Type: data_source

Input Collection(s) (input_collection): One or more input collections to convert into docking and FastROCS collections. The collection(s) are expected to be the output of the reaction enumeration floes.

Type: collection_source

Outputs

Giga Docking Collection Name (docked_collection_name): Name of the collection to create.

Required

Type: collection_sink

Default: GigaDock Collection

FastROCS Collection Name (rocs_collection_name): Name of the collection to create.

Required

Type: collection_sink

Default: FastROCS Collection

Temporary Collection Name (temp_collection_name): This collection is used to store temporary data during the run of the floe and is automatically deleted when the floe completes. If you cancel the floe you should delete the collection manually.

Required

Type: collection_sink

Default: Prepare Giga Collections Temporary Collection

Options

Dry Run (dry_run_switch): If ‘On’, molecules will be read and passed through the filtering cubes normally but will not be passed to the Omega conformer generation and collection creation cubes. This makes the floe very inexpensive to run; however, no output collections will be generated. With this switch on you can see how many molecules will be read from your input file(s)/collection(s) and how many will pass the filters that you have configured (examine the filtering Cube port counts), and make adjustments without incurring significant expense. Once the best filtering settings are determined, you can run the Floe with those settings and this option turned off to generate the collections.

Required

Type: boolean

Default: False

Choices: [True, False]

Allow Empty Titles (allow_no_title_switch): If ‘On’, molecules without titles will be allowed. If ‘Off’, molecules without titles will be filtered out. If your input is a comma- or tab-separated file, you can use the ‘Format Specific Settings -> Mol Title Field’ parameter to set the molecule title to a specific string column in the file.

Required

Type: boolean

Default: False

Choices: [True, False]

Keep this fraction (random_retain_probability): Creates collections that contain a random subset of the input molecules when set to less than the default value of 1. This parameter is the fraction of the input molecules that will be retained using a random selection criterion.

Required

Type: decimal

Default: 1.0
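
As an illustration of what a retained fraction means in practice, here is a minimal sketch. It assumes each molecule is kept independently with the given probability; the floe's actual selection mechanism is not described on this page, and the function name and `seed` argument are hypothetical.

```python
import random


def retain_random_subset(molecules, keep_fraction=1.0, seed=None):
    """Keep each molecule independently with probability keep_fraction.

    Assumption: independent per-molecule selection, so the retained count
    is only approximately keep_fraction * len(molecules).
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    rng = random.Random(seed)
    # random() returns a value in [0, 1), so 1.0 keeps every molecule
    # and 0.0 keeps none.
    return [mol for mol in molecules if rng.random() < keep_fraction]
```

Under this assumption, a setting of 0.1 on one million input molecules would yield roughly, but not exactly, 100,000 molecules in the output collections.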

pKa

Set Neutral pH on input molecules (set_neutral_ph_on_input_molecules): If turned ‘Off’, a neutral pH 7.4 pKa state will not be set on the input molecules. This option should only be turned off if you are certain your input molecules have a correct pKa state set.

Required

Type: boolean

Default: True

Choices: [True, False]

Filtering: Basic Properties

Max molecular weight (mw_max): Molecules with molecular weight greater than this value will be filtered out. If unspecified this cube will not filter out molecules with high molecular weight.

Type: decimal

Default: 500.0

Min molecular weight (mw_min): Molecules with molecular weight less than this value will be filtered out. If unspecified this cube will not filter out molecules with low molecular weight.

Type: decimal

Max rotatable bond count (rot_bond_max): Molecules with rotatable bond count greater than this value will be filtered out. If unspecified this cube will not filter out molecules with high rotatable bond count.

Type: integer

Default: 15

Min rotatable bond count (rot_bond_min): Molecules with rotatable bond count less than this value will be filtered out. If unspecified this cube will not filter out molecules with low rotatable bond count.

Type: integer

Max count undefined atom stereo (atom_stereo_max): Molecules with an undefined atom stereo count greater than this value will be filtered out. If unspecified this cube will not filter out molecules with a high undefined atom stereo count.

Type: integer

Default: 3

Max count undefined bond stereo (bond_stereo_max): Molecules with an undefined bond stereo count greater than this value will be filtered out. If unspecified this cube will not filter out molecules with a high undefined bond stereo count.

Type: integer

Default: 3

Max acceptor count (acc_max): Molecules with acceptor count greater than this value will be filtered out. If unspecified this cube will not filter out molecules with high acceptor count.

Type: integer

Min acceptor count (acc_min): Molecules with acceptor count less than this value will be filtered out. If unspecified this cube will not filter out molecules with low acceptor count.

Type: integer

Max donor count (don_max): Molecules with donor count greater than this value will be filtered out. If unspecified this cube will not filter out molecules with high donor count.

Type: integer

Min donor count (don_min): Molecules with donor count less than this value will be filtered out. If unspecified this cube will not filter out molecules with low donor count.

Type: integer

Max topological polar surface area (tpsa_max): Molecules with topological polar surface area greater than this value will be filtered out. If unspecified this cube will not filter out molecules with high topological polar surface area.

Type: decimal

Min topological polar surface area (tpsa_min): Molecules with topological polar surface area less than this value will be filtered out. If unspecified this cube will not filter out molecules with low topological polar surface area.

Type: decimal

Max xlogp (xlogp_max): Molecules with xlogp greater than this value will be filtered out. If unspecified this cube will not filter out molecules with high xlogp.

Type: decimal

Min xlogp (xlogp_min): Molecules with xlogp less than this value will be filtered out. If unspecified this cube will not filter out molecules with low xlogp.

Type: decimal

Max formal charge (charge_max): Molecules with formal charge greater than this value will be filtered out. If unspecified this cube will not filter out molecules with high formal charge.

Type: integer

Min formal charge (charge_min): Molecules with formal charge less than this value will be filtered out. If unspecified this cube will not filter out molecules with low formal charge.

Type: integer

Max aromatic ring count (aro_max): Molecules with aromatic ring count greater than this value will be filtered out. If unspecified this cube will not filter out molecules with high aromatic ring count.

Type: integer

Min aromatic ring count (aro_min): Molecules with aromatic ring count less than this value will be filtered out. If unspecified this cube will not filter out molecules with low aromatic ring count.

Type: integer
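
All of the basic property filters follow the same rule: a molecule is removed if a property is above its maximum or below its minimum, and an unspecified bound is not checked. The sketch below illustrates that rule on precomputed property values. It is not the floe's implementation; the dictionary keys are hypothetical names derived from the promoted parameter prefixes, and only the four defaults listed above are included.

```python
# (min, max) per property; None means that bound is unspecified.
# Only the defaults documented above are set.
DEFAULT_LIMITS = {
    "mw": (None, 500.0),
    "rot_bond": (None, 15),
    "atom_stereo": (None, 3),
    "bond_stereo": (None, 3),
}


def passes_basic_properties(props, limits=DEFAULT_LIMITS):
    """Return True if every limited property lies within its bounds.

    `props` maps property names to already-computed values.
    """
    for name, (low, high) in limits.items():
        value = props[name]
        if low is not None and value < low:
            return False
        if high is not None and value > high:
            return False
    return True
```

Note that the comparisons are strict, matching the wording "greater than" and "less than": a molecule with a molecular weight of exactly 500.0 passes the default filter.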

Filtering: SMARTS

Required SMARTS (required_smarts): If one or more SMARTS patterns are supplied to this parameter, then every molecule passed to this cube must match at least one of these SMARTS patterns or it will be filtered out. This check is skipped if no SMARTS patterns are supplied.

Type: string

Excluded SMARTS (excluded_smarts): Every molecule that matches any of the SMARTS patterns supplied to this parameter will be filtered out.

Type: string
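
The way the two SMARTS parameters combine can be summarized as a small decision function. This is a sketch of the logic only: `matches` is a caller-supplied, hypothetical predicate standing in for a real SMARTS substructure search, and nothing here reflects how the cube is implemented.

```python
def passes_smarts_filters(mol, required, excluded, matches):
    """Apply the required/excluded pattern rules to one molecule.

    `matches(mol, pattern)` is a hypothetical predicate supplied by the
    caller; in practice it would wrap a SMARTS substructure search.
    """
    # The required check is skipped when no required patterns are supplied.
    if required and not any(matches(mol, p) for p in required):
        return False
    # A single excluded match is enough to filter the molecule out.
    if any(matches(mol, p) for p in excluded):
        return False
    return True
```

A molecule that matches both a required pattern and an excluded pattern is filtered out, since the excluded check is applied regardless of the required check.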

Filtering: OEFilter

OEFilter Type (oefilter_type):

Required

Type: string

Default: BlockBuster

Choices: [‘BlockBuster’, ‘Lead’, ‘Drug’, ‘PAINS’, ‘None’]

Filter Rules (filter_in): Optional rules to create an OEFilter (see https://docs.eyesopen.com/toolkits/python/molproptk/filter_files.html).

Type: file_in

Filtering: 2D Similarity to Known Molecules

Known Molecules (known_molecules): If this parameter is specified, each molecule being prepared will be assigned a single 2D Tanimoto value equal to the highest 2D Tanimoto to any molecule in the supplied dataset(s). The prepared molecule will then be filtered by comparing this value to the setting of the ‘Filter Out Tanimotos Higher Than’ and/or ‘Filter Out Tanimotos Lower Than’ parameters. WARNING: A significant filtering compute cost can be incurred, even in ‘Dry Run’ mode, if a large number of molecules are passed to this parameter (for a 1 billion molecule collection and 10K known molecules, the filtering portion of the cost is typically about $20).

Type: data_source

Filter Out Tanimotos Higher Than (filter_out_tanimotos_higher_than): If specified, molecules with a 2D Tanimoto higher than this value will be filtered out. Use this parameter if you want to remove molecules that are similar in 2D space to any of the known molecules.

Type: decimal

Filter Out Tanimotos Lower Than (filter_out_tanimotos_lower_than): If specified, molecules with a 2D Tanimoto lower than this value will be filtered out. Use this parameter if you want to remove molecules that are dissimilar in 2D space to all of the known molecules.

Type: decimal
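
The known-molecule filter reduces to two steps: take the highest Tanimoto to any known molecule, then compare it with the optional thresholds. The sketch below illustrates this with fingerprints represented as Python sets of on-bit indices, which is an assumption made for readability; the floe uses Circular, Path or Tree fingerprints, and the function names here are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def keep_molecule(fp, known_fps, higher_than=None, lower_than=None):
    """Return (keep, value); value is the highest Tanimoto to any known.

    A threshold of None means that check is not applied.
    """
    value = max((tanimoto(fp, known) for known in known_fps), default=0.0)
    if higher_than is not None and value > higher_than:
        return False, value  # too similar to a known molecule
    if lower_than is not None and value < lower_than:
        return False, value  # too different from every known molecule
    return True, value
```

Setting only the higher threshold removes close analogs of the known molecules, while setting only the lower threshold keeps the prepared set near known chemistry; setting both keeps a similarity window.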

Known Molecules 2D Fingerprint Method (known_molecules_2d_fingerprint_method): The 2D Fingerprint method used to compute the Tanimotos for the known molecules filter.

Type: string

Default: Circular

Choices: [‘Circular’, ‘Path’, ‘Tree’]

Use Virtual Screening 2D Fingerprint Variant (use_virtual_screening_2d_fingerprint_variant): If ‘On’, the virtual screening variant of the selected 2D fingerprint will be used for the known molecules filter. The virtual screening variant treats certain functional groups identically regardless of their pKa state (e.g., protonated and unprotonated carboxylic acids).

Type: boolean

Default: True

Choices: [True, False]

Known Molecule Tanimoto Field (known_molecule_tanimoto_field): If this parameter is specified the 2D Tanimoto used for known molecule filtering for each processed molecule will be placed in the output collections in a field of this name. If unspecified the Tanimoto value will not be stored in the output collections.

Type: field_parameter::float

Maximum Number of Filtering Cubes (maximum_number_of_filtering_cubes): Maximum number of cubes to use for filtering. Increasing this value can improve the runtime in cases where a large number of known molecules are supplied. This value can only be set above the default value of 500 if the number of molecules passed to ‘Filtering: 2D Similarity To Known Molecules -> Known Molecules’ times the setting of ‘Options->Keep This Fraction’ is greater than 10000.

Type: integer

Default: 500

Stereo Enumeration

Enantiomer Title Field (enantiomer_title_field): This string field will be added to the created collection and contains the title of the input molecule with an appended index number that identifies the enantiomer. If unspecified, the field will not be created.

Type: field_parameter::string

Default: Enantiomer Title

Conformer Generation Settings

Use GPU Omega (omega_switch): If this value is set to True, this floe will use the GPU version of Omega. The GPU version of Omega (i.e., with this parameter set to True) is typically ~40% more cost efficient than the CPU version (i.e., with this parameter set to False). However, the AWS spot instances that GPU Omega requires have been in very high demand, and running this floe during a high-demand period with this flag set to True can result in very long run times and lower cost efficiency.

Required

Type: boolean

Default: False

Choices: [True, False]

Torsion Library Type (torsion_library_type): Type of torsion library Omega will use. If ‘Custom’ is selected, then a torsion library must be supplied to the ‘Conformer Generation Settings -> Torsion Library’ parameter.

Type: string

Default: Original

Choices: [‘Original’, ‘Guba’, ‘Custom’]

Torsion Library (torlib_in): Optional torsion library for Omega (see https://docs.eyesopen.com/toolkits/python/omegatk/OEConfGenClasses/OETorLib.html). These torsion rules will be added to the selected torsion library. Select ‘Custom’ for ‘Torsion Library Type’ if you want to replace rather than add rules.

Type: file_in

GPU Omega Input Shard Size (gpu_omega_input_shard_size): This parameter controls how many molecules must be processed within the Orion 12h limit when ‘Use GPU Omega’ is On. With lower values the conformer generation will not be as cost efficient. In general there is no need to adjust this parameter unless instructed to do so by OpenEye Support.

Required

Type: integer

Default: 10000

CPU Omega Input Shard Size (cpu_omega_input_shard_size): This parameter controls how many molecules must be processed within the Orion 12h limit when ‘Use GPU Omega’ is Off. With lower values the conformer generation will not be as cost efficient. In general there is no need to adjust this parameter unless instructed to do so by OpenEye Support.

Required

Type: integer

Default: 5000

Format Specific Settings

Additional CSV Extension (additional_csv_extension): Additional file extensions that will be recognized as comma separated values files. Do not include the ‘.’ in the extensions. Gzip and Bzip2 variants will also automatically be supported. If a built-in extension is specified, this format specification will override it. The primary purpose of this parameter is to enable specifying formats of files within archives (e.g., tar or zip) which cannot easily be renamed.

Type: string

Additional TSV Extension (additional_tsv_extension): Additional file extensions that will be recognized as tab separated values files. Do not include the ‘.’ in the extensions. Gzip and Bzip2 variants will also automatically be supported. If a built-in extension is specified, this format specification will override it. The primary purpose of this parameter is to enable specifying formats of files within archives (e.g., tar or zip) which cannot easily be renamed.

Type: string

Additional SMILES Extension (additional_smiles_extension): Additional file extensions that will be recognized as SMILES files. Do not include the ‘.’ in the extensions. Gzip and Bzip2 variants will also automatically be supported. If a built-in extension is specified, this format specification will override it. The primary purpose of this parameter is to enable specifying formats of files within archives (e.g., tar or zip) which cannot easily be renamed.

Type: string

Mol Title Field (sv_title_field): When reading tab- or comma-separated files, this parameter specifies the column that should be used as the molecule’s title. When reading such files, either this parameter must be set or ‘Options->Allow Empty Titles’ must be set to ‘On’ (not recommended, since the molecules will have no titles). This parameter is ignored if the file being read is not a tab- or comma-separated file.

Type: string

Molecule Field Name for TSV/CSV (molecule_field_name_for_tsvcsv): This parameter specifies the name of the field the molecule will be stored in. If no value is specified for this parameter, the name will be the column header name from the original CSV/TSV file.

Type: string

Default: Molecule

GPU Hardware

GPU Omega Instance Type (gpu_omega_instance_type): This parameter controls the AWS instance type the GPU Omega Cube will use.

Type: string

Default: !g6e,!cdns

Omega CPUs per GPU (omega_cpus_per_gpu): The minimum number of CPUs per GPU that an instance running GPU Omega must have.

Type: integer

Default: 4

GPU Omega Max Parallel (gpu_omega_max_parallel): Maximum number of instances the GPU Omega conformer generation will use.

Type: integer

Default: 2000

Hardware

Disk Space (read_disk_space): Disk space in MB for the Cube reading the input file(s). The default value should be sufficient, but if you have a single huge file you may need to increase this value.

Type: decimal

Default: 102400

Input Fields

Known Molecule Field (known_molecule_field): Field on the known molecules dataset holding the known molecules. If unspecified the default Primary molecule on the record will be used.

Type: field_parameter::mol

Input Dataset Molecule Field (input_dataset_molecule_field): Field on the input dataset(s) containing the molecules to prepare. If unspecified the primary molecule field will be used.

Type: field_parameter::mol

Developer

Enable Thompson Sampling (enable_thompson_sampling): Enable Thompson Sampling for CPU Omega.

Type: boolean

Default: False

Choices: [True, False]

Docking Max Confs (docking_max_confs): Maximum number of conformers for the docking collection. The number of conformers for the FastROCS collection is always 10.

Type: integer

Default: 200

Min Shard Download Timeout (min_shard_download_timeout): Minimum timeout for the smart shard to records cubes

Required

Type: integer

Default: 2

Max Shard Download Timeout (max_shard_download_timeout): Maximum timeout for the smart shard to records cubes

Required

Type: integer

Default: 21600.0

Session Retry Dict for Shard Download (session_retry_dict_for_shard_download_): Session retry dict for the smart shard to records cubes

Type: string

Default: [‘429:1000’, ‘460:1000’, ‘500:1000’, ‘502:1000’, ‘503:1000’, ‘504:1000’]

Shard Download Attempts (shard_download_attempts): Download attempts for the smart shard to records cubes

Type: integer

Default: 1

Min Shard Upload Timeout (min_shard_upload_timeout): Minimum timeout for the smart records to record shard cubes

Required

Type: integer

Default: 2

Max Shard Upload Timeout (max_shard_upload_timeout): Maximum timeout for the smart records to record shard cubes

Required

Type: integer

Default: 21600.0

Session Retry Dict for Shard Upload (session_retry_dict_for_shard_upload_): Session retry dict for the smart records to record shard cubes

Type: string

Default: [‘429:1000’, ‘460:1000’, ‘500:1000’, ‘502:1000’, ‘503:1000’, ‘504:1000’]

Shard Upload Attempts (shard_upload_attempts): Upload attempts for the smart records to record shard cubes

Type: integer

Default: 3

GPU Omega Download Session Retry Timeout (gpu_omega_download_session_retry_timeout): Sets the retry timeout (sec) on the cube_session OrionSession for this cube. If unspecified, parallel cubes will use a value of 600 and serial cubes will use a value of 7200.

Type: integer

Default: 2

GPU Omega Shard Download Attempts (gpu_omega_shard_download_attempts): Number of attempts to make when downloading a shard

Type: integer

Default: 1

GPU Omega Upload Session Retry Timeout (gpu_omega_upload_session_retry_timeout): Sets the retry timeout on the OrionSession used to create and upload the output shards. If unspecified the orionclient default will be used.

Type: integer

Default: 180

GPU Omega Shard Upload Attempts (gpu_omega_shard_upload_attempts): Number of attempts to make when uploading a shard

Type: integer

Default: 3

Serial Cube Retry Timeout (serial_cube_retry_timeout): Sets the retry timeout (sec) on the cube_session OrionSession for this cube. If unspecified, parallel cubes will use a value of 600 and serial cubes will use a value of 7200.

Type: integer

CPU Omega Instance Type (cpu_omega_instance_type): The type of instance that this cube needs to be run on

Type: string

CPU Omega Max Parallel (cpu_omega_max_parallel): Maximum number of instances the CPU Omega conformer generation will use.

Type: integer

Default: 50000

Sequential Deduplicate (sequential_deduplicate): If ‘On’, deduplicate checking will only be applied to molecules in sequence. If ‘Off’, deduplication will be performed for all molecules of the same item_count group.

Type: boolean

Default: True

Choices: [True, False]

Omega Enum Nitrogen (omega_enum_nitrogen): Specifies how to handle nitrogen stereo centers.

Type: string

Default: Unspecified

Choices: [‘All’, ‘Unspecified’, ‘Off’]

Flipper Enum Nitrogen (flipper_enum_nitrogen): Flag indicating if non-planar nitrogens should be enumerated as possible stereo centers.

Type: boolean

Default: True

Choices: [True, False]

Spot instance policy for GPU Omega Instance (spot_instance_policy_for_gpu_omega_instance): If this is not set to ‘Required’ or ‘Preferred’, the floe will cost more, typically 3-4x, but may run faster (wall time).

Type: string

Default: Required

Choices: [‘Allowed’, ‘Preferred’, ‘NotPreferred’, ‘Prohibited’, ‘Required’]

GPU Omega Shard ‘Item Count’ (gpu_omega_shard_item_count): Number of input messages to group into an output message.

Required

Type: integer

Default: 48

GPU Omega Processes (gpu_omega_processes): Number of processes to use with one GPU when running Omega. If unspecified, the number will be the number of CPUs the instance has.

Type: integer

Enable cube timing report (time_all_cubes): If true this cube will emit timing information to the timing_data port.

Type: boolean

Default: True

Choices: [True, False]

Catch exceptions (catch_exceptions): If ‘Off’, exception handling will be disabled for this cube.

Type: boolean

Default: True

Choices: [True, False]

Catch exceptions (parallel_catch_exception_methods): Specifies the methods of a parallel cube in which an exception will be caught and emitted to the exception port, if the port is connected. If the exception port is connected to an exception handler, this will stop the floe.

Type: string

Default: [‘begin’]

Choices: [‘begin’, ‘process’, ‘end’]