Collection to File

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

  • Role-based/Computational Chemist

  • Solution-based/Virtual-screening/DB Preparation

  • Task-based/Data Science/Conversion

Description

Concatenates all shards of a collection into a single file. All shards of the input collection(s) must have the same format, and the floe validates that this format can be concatenated. The output file format always matches that of the collection.

The purpose of this floe is to convert a FastROCS collection into a file that can be used with a FastROCS server (rather than with a floe).
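The validate-then-concatenate behavior described above can be illustrated with a minimal local sketch. It is not the floe's implementation: ordinary files stand in for collection shards, the function names are hypothetical, and the CONCATENABLE set is an assumption chosen only because those formats can be joined byte-for-byte.

```python
import shutil
from pathlib import Path

# Assumption: formats taken here to be safe to join byte-for-byte.
# The real floe's list of concatenable formats is not documented on this page.
CONCATENABLE = {"sdf", "sdf.gz", "smi", "smi.gz"}


def shard_format(path: Path) -> str:
    """Return the format extension, e.g. 'sdf.gz' for 'part0.sdf.gz'."""
    return "".join(path.suffixes).lstrip(".")


def concatenate_shards(shards: list[Path], out_base: str) -> Path:
    """Join shard files into '<out_base>.<format>' after validating formats."""
    if not shards:
        raise ValueError("no shards to concatenate")

    # Every shard must share one format.
    formats = {shard_format(p) for p in shards}
    if len(formats) != 1:
        raise ValueError(f"shards have mixed formats: {sorted(formats)}")

    # That format must be one that can be concatenated.
    fmt = formats.pop()
    if fmt not in CONCATENABLE:
        raise ValueError(f"format {fmt!r} is not concatenable")

    # The output keeps the shard format; the extension is appended to the base name.
    out_path = Path(f"{out_base}.{fmt}")
    with out_path.open("wb") as out:
        for shard in shards:
            with shard.open("rb") as src:
                shutil.copyfileobj(src, out)
    return out_path
```

In this sketch the checks run before any output is written, so a mixed-format input fails without leaving a partial file behind.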

Promoted Parameters

Title in user interface (promoted name)

Inputs

Input Collection (collection_in): An input collection to convert into a file.

  • Type: collection_source

Input Dataset (input_dataset): A dataset to convert into a file.

  • Type: data_source

Outputs

Output Base Filename (filename): Basename of the output file (without the format extension). The format extension will automatically be added to this name.

  • Type: string

  • Default: Collection Converted To File
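As an illustration of how the base filename and the format extension combine, the following sketch builds the final file name. The helper name output_file_name is hypothetical, and the floe's actual naming logic may differ; the defaults and the list of formats are copied from the filename and output_format parameters on this page.

```python
# Format choices as listed for the output_format parameter.
OUTPUT_FORMATS = (
    "can", "can.gz", "csv", "csv.bz2", "csv.gz", "cxsmiles", "cxsmiles.gz",
    "ism", "ism.gz", "isosmi", "isosmi.gz", "mol2", "mol2.gz", "oeb",
    "oeb.gz", "oedb", "oez", "sd", "sd.gz", "sdf", "sdf.gz", "smi", "smi.gz",
    "syb", "syb.gz", "tsv", "tsv.bz2", "tsv.gz", "usm", "usm.gz",
)


def output_file_name(filename: str = "Collection Converted To File",
                     output_format: str = "sdf.gz") -> str:
    """Append the format extension to the base filename (hypothetical helper)."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return f"{filename}.{output_format}"


print(output_file_name())                # Collection Converted To File.sdf.gz
print(output_file_name("hits", "smi"))   # hits.smi
```

Because the extension is added automatically, the base filename should be given without one; passing "hits.smi" here would yield "hits.smi.smi".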

Temporary Collection (temporary_collection): Name of a temporary collection the floe will create and automatically delete at the end of the floe run. Deleting this collection manually before the floe finishes can cause this floe to fail. There is generally no reason to ever adjust this parameter.

  • Required

  • Type: collection_sink

  • Default: Temporary Collection

Options

Output Format (output_format): The desired format of the output file. Note that, depending on the format of the input collection and the format of the output file, not all data in the input collection is guaranteed to be retained in the output file.

  • Required

  • Type: string

  • Default: sdf.gz

  • Choices: [‘can’, ‘can.gz’, ‘csv’, ‘csv.bz2’, ‘csv.gz’, ‘cxsmiles’, ‘cxsmiles.gz’, ‘ism’, ‘ism.gz’, ‘isosmi’, ‘isosmi.gz’, ‘mol2’, ‘mol2.gz’, ‘oeb’, ‘oeb.gz’, ‘oedb’, ‘oez’, ‘sd’, ‘sd.gz’, ‘sdf’, ‘sdf.gz’, ‘smi’, ‘smi.gz’, ‘syb’, ‘syb.gz’, ‘tsv’, ‘tsv.bz2’, ‘tsv.gz’, ‘usm’, ‘usm.gz’]

Clear Titles (clear_titles): If set to On, the titles of the molecules will be cleared before the file is written.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Development

Verbose (verbose): If ‘On’, this floe will write to the log file each time a shard is processed.

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Re-formated shard size (re_formated_shard_size): The target number of records in a shard.

A value of 0 means each shard is filled up to the max_shard_bytes limit.

  • Required

  • Type: integer

  • Default: 1000000

Dataset shard size (dataset_shard_size): The target number of records in a shard.

A value of 0 means each shard is filled up to the max_shard_bytes limit.

  • Required

  • Type: integer

  • Default: 100000
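The interaction between a target record count and the max_shard_bytes limit, which applies to both shard size parameters above, can be sketched as follows. This is an illustration under stated assumptions, not the floe's code: split_into_shards is a hypothetical helper, records are modeled as byte strings, and the byte limit is assumed to apply even when the record target is non-zero.

```python
from typing import Iterable, Iterator


def split_into_shards(records: Iterable[bytes],
                      target_records: int,
                      max_shard_bytes: int) -> Iterator[list[bytes]]:
    """Group records into shards (hypothetical helper).

    A shard is closed when it already holds target_records records
    (only if target_records is non-zero) or when adding the next record
    would push it past max_shard_bytes.
    """
    shard: list[bytes] = []
    size = 0
    for record in records:
        over_bytes = bool(shard) and size + len(record) > max_shard_bytes
        over_count = target_records > 0 and len(shard) >= target_records
        if over_bytes or over_count:
            yield shard
            shard, size = [], 0
        shard.append(record)
        size += len(record)
    if shard:
        yield shard  # emit the final, possibly short, shard


records = [b"aa"] * 5
# Record target of 2: shards of 2, 2, and 1 records.
print([len(s) for s in split_into_shards(records, 2, 100)])  # [2, 2, 1]
# Target of 0: only the 6-byte limit applies, giving 3 and 2 records.
print([len(s) for s in split_into_shards(records, 0, 6)])    # [3, 2]
```

A single record larger than max_shard_bytes still gets a shard of its own in this sketch, since an empty shard always accepts the next record.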

Number of Parallel Cubes (number_of_parallel_cubes): Maximum number of cubes in the parallel cube group that converts the format of the input collection.

  • Type: integer

  • Default: 25