Orion Platform File Cubes

Files read into a floe usually have to be converted to records before other cubes can interpret them. The following cubes provide utilities for converting files to records and records to various file formats.

Cubes

Binary File Reader

Import Statement

from orionplatform.cubes import BinaryFileReaderCube

Description

A cube that reads one or more files and emits their contents in a single stream. Generally used with a BinaryInputPort initializer.
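The "single stream" behavior described above can be pictured with a small standard-library sketch. This is not the cube's implementation: `stream_files`, `demo_stream`, and the `chunk_size` argument are hypothetical names chosen for illustration, and the chunking strategy is an assumption.

```python
import os
import tempfile
from typing import Iterable, Iterator


def stream_files(paths: Iterable[str], chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the bytes of each file, in order, as one continuous stream."""
    for path in paths:
        # Binary mode: no decoding and no newline translation.
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                yield chunk


def demo_stream() -> bytes:
    """Write two small files, stream them back, and return the joined bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "a.bin")
        second = os.path.join(tmp, "b.bin")
        with open(first, "wb") as handle:
            handle.write(b"\x00\x01\x02")
        with open(second, "wb") as handle:
            handle.write(b"\x03\x04")
        return b"".join(stream_files([first, second], chunk_size=2))
```

A downstream consumer sees only a sequence of byte chunks; file boundaries are not marked in this sketch.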

Output Ports

  • success: BinaryOutputPort

Ungrouped Parameters

  • File to use as input: FileInputParameter

    The file from which to read in binary mode.

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream
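As a rough illustration of buffering before sending downstream, the sketch below groups items into batches and flushes a batch once it reaches a threshold. It assumes the threshold is measured in bytes, which the parameter description does not state; `buffered` is a hypothetical helper, not part of the floe API.

```python
from typing import Iterable, Iterator, List


def buffered(items: Iterable[bytes], buffer_size: int) -> Iterator[List[bytes]]:
    """Group items into batches, flushing once a batch holds at least buffer_size bytes."""
    batch: List[bytes] = []
    pending = 0
    for item in items:
        batch.append(item)
        pending += len(item)
        if pending >= buffer_size:
            yield batch
            batch, pending = [], 0
    if batch:
        # Flush whatever is left when the input ends.
        yield batch
```

A larger threshold means fewer, bigger batches; a smaller one sends data downstream sooner.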

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: DecimalParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances
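The disk and memory descriptions above advise requesting a couple hundred MiB more than the cube strictly needs. A minimal sketch of that arithmetic follows; `mib_to_request` and the 200 MiB default margin are assumptions made for illustration, not values taken from Orion.

```python
MIB = 1024 * 1024  # 1 MiB = 1048576 bytes


def mib_to_request(required_bytes: int, overhead_mib: int = 200) -> int:
    """Convert a byte count to whole MiB, rounded up, plus a fixed overhead margin."""
    if required_bytes < 0:
        raise ValueError("required_bytes must be non-negative")
    whole_mib = (required_bytes + MIB - 1) // MIB  # integer ceiling division
    return whole_mib + overhead_mib
```

For example, a cube that needs 3 GiB of scratch space would request `mib_to_request(3 * 1024**3)`, which is 3272 MiB under these assumptions.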

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

File to Record Converter

Import Statement

from orionplatform.cubes import FileToRecordConverter

Description

Reads a molecule or CSV file and converts it to records.

Output Ports

  • success: RecordOutputPort

Ungrouped Parameters

  • File to use as input: FileInputParameter

    Molecular or CSV file to convert to records for use in a floe.

  • File extension to append to the input file name: StringParameter

    Override the file format derived from input file name.
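One way to read the two parameters above: the format normally comes from the input file name, and the extension parameter, when set, is appended to that name so that a different format is derived. The sketch below follows that reading. `effective_format` is a hypothetical helper, and adding a missing leading dot is an assumption.

```python
from typing import Optional


def effective_format(file_name: str, extension: Optional[str] = None) -> str:
    """Return the lowercase extension that would decide the file format."""
    name = file_name
    if extension:
        # Assumption: the override is appended to the name, adding a dot if missing.
        name += extension if extension.startswith(".") else "." + extension
    stem, dot, suffix = name.lower().rpartition(".")
    if not dot or not suffix:
        raise ValueError(f"cannot determine a format from {file_name!r}")
    return suffix
```

With this reading, a file uploaded as `export.dat` can still be treated as CSV by passing `csv` as the extension.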

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: DecimalParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Archive Reader

Import Statement

from orionplatform.cubes import ArchiveConverterCube

Description

Converts a TAR or ZIP file into records (if the output port is connected) or directly into datasets (if the output port isn’t connected).
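To make the archive handling concrete, the following standard-library sketch walks a TAR or ZIP archive and yields each regular file it contains. It only illustrates the idea of unpacking either format from one input; it does not reproduce how the cube builds records or datasets, and `iter_archive_members` is a hypothetical name.

```python
import io
import tarfile
import zipfile
from typing import Iterator, Tuple


def iter_archive_members(data: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, content) for every regular file in a TAR or ZIP archive."""
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        buffer.seek(0)
        with zipfile.ZipFile(buffer) as archive:
            for info in archive.infolist():
                if not info.is_dir():
                    yield info.filename, archive.read(info)
        return
    buffer.seek(0)
    # Mode "r:*" lets tarfile detect gzip, bz2, or xz compression transparently.
    with tarfile.open(fileobj=buffer, mode="r:*") as archive:
        for member in archive:
            if member.isfile():
                extracted = archive.extractfile(member)
                if extracted is not None:
                    yield member.name, extracted.read()
```

Directories and special members are skipped, so the caller only sees file payloads.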

Output Ports

  • success: RecordOutputPort

Ungrouped Parameters

  • Tar or zip file to use as input: FileInputParameter

    Archive file to convert to records.

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: DecimalParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Files and Datasets to Archive

Import Statement

from orionplatform.cubes import ArchiveFilesCube

Description

Adds selected files and datasets from Orion into a .tar.gz or .zip archive.

Ungrouped Parameters

  • Datasets to compress into an archive: DatasetInputParameter

    Select all datasets that you wish to be converted to .oedb and compressed into a single archive.

  • File format for converted datasets: StringParameter

    File extension for converting datasets

  • Archive file name: FileOutputParameter

    Name of the destination .tar.gz archive file to store the files. If the extension is not .tar.gz, .tar.gz will be appended to the filename.

  • Archive file format: StringParameter

  • Files to compress into an archive: FileInputParameter

    Select all files that you wish to be compressed into a single archive file.

  • Safe mode: BooleanParameter

    Before doing anything, estimate the required disk space from the sum of all file sizes and approximate dataset sizes. If the requested disk space is not enough, terminate the job and raise an error. If this value is set to False, the pre-run check is skipped and the job may run out of disk space while running.
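Two of the rules above, the archive file name suffix and the safe mode pre-run check, can be sketched in a few lines. Both functions are hypothetical and follow the parameter descriptions literally; the exception type and the way sizes are supplied are assumptions.

```python
from typing import Iterable


def normalize_archive_name(name: str, suffix: str = ".tar.gz") -> str:
    """Append the archive suffix unless the name already ends with it."""
    return name if name.endswith(suffix) else name + suffix


def check_disk_space(file_sizes: Iterable[int], dataset_sizes: Iterable[int],
                     requested_bytes: int, safe_mode: bool = True) -> int:
    """Return the estimated bytes needed; raise if safe mode is on and the request is too small."""
    estimate = sum(file_sizes) + sum(dataset_sizes)
    if safe_mode and estimate > requested_bytes:
        raise RuntimeError(
            f"estimated {estimate} bytes needed, but only {requested_bytes} requested"
        )
    return estimate
```

With `safe_mode=False` the estimate is still computed in this sketch but never enforced, which mirrors the warning that the job may then fail partway through.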

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: DecimalParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Record to File Converter

Import Statement

from orionplatform.cubes import RecordsToFileConverter

Description

A writer that converts a stream of records to an OE-recognized file. The format is determined from the file extension given to the “file_name” parameter. The cube will raise an exception if the content of the records cannot be converted into the requested file format.
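The pattern described above, choosing the output format from the file extension and failing when the records do not fit that format, is sketched below with plain dictionaries standing in for records. The supported formats (CSV and JSON), the `render_records` name, and the `ValueError` exception are illustrative assumptions; the actual cube works with Orion records and OE file formats.

```python
import csv
import io
import json
from typing import Dict, List


def render_records(records: List[Dict[str, object]], file_name: str) -> str:
    """Serialize records in the format implied by the extension of file_name."""
    extension = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    if extension == "csv":
        columns = list(records[0]) if records else []
        for record in records:
            # A CSV table needs the same columns on every row.
            if list(record) != columns:
                raise ValueError("records with differing fields cannot be written as CSV")
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return out.getvalue()
    if extension == "json":
        return json.dumps(records)
    raise ValueError(f"unsupported output format: {file_name!r}")
```

Failing early on incompatible content keeps a partially written file from being mistaken for a complete export.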

Input Ports

  • intake: RecordInputPort

Ungrouped Parameters

  • Input Dataset: DatasetInputParameter

    The dataset(s) from which to read records.

  • file_name: FileOutputParameter

    Name of the file to create from records. The file extension will determine the format.

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: DecimalParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Record File to Record Converter

Import Statement

from orionplatform.cubes import RecordFileToRecordConverter

Description

Reads a record file and converts it to records.

Output Ports

  • success: RecordOutputPort

Ungrouped Parameters

  • File to use as input: FileInputParameter

    Record file to use as input to a floe.

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: DecimalParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Record to Record File

Import Statement

from orionplatform.cubes import RecordsToRecordFileConverter

Description

A writer that writes a stream of records to a record file (that is, OEDB).

Input Ports

  • intake: RecordBytesInputPort

Ungrouped Parameters

  • file_name: FileOutputParameter

    Name of the file to create from records.

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: DecimalParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Url to File

Import Statement

from orionplatform.cubes import URLToFileCube

Description

Reads a file from a URL and uploads it to Orion.

Ungrouped Parameters

  • filename: FileOutputParameter

    New file name (defaults to URL path basename).

  • Logging interval: IntegerParameter

    Log progress every N seconds (0 to disable).

  • None: StringParameter

    URL of file to be uploaded to Orion.
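The defaults described above can be illustrated with the standard library: the file name falls back to the basename of the URL path, and progress is logged every N seconds unless N is 0. Both helpers are hypothetical, and details such as ignoring the query string and raising on an empty basename are assumptions.

```python
import posixpath
from urllib.parse import urlsplit


def default_filename(url: str) -> str:
    """Return the basename of the URL path, ignoring query string and fragment."""
    name = posixpath.basename(urlsplit(url).path)
    if not name:
        raise ValueError(f"URL has no path basename: {url!r}")
    return name


def should_log(last_logged: float, now: float, interval: int) -> bool:
    """Report whether a progress message is due; an interval of 0 disables logging."""
    return interval > 0 and now - last_logged >= interval
```

For a URL such as `https://example.com/data/ligands.sdf?token=abc`, `default_filename` returns `ligands.sdf`.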

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: DecimalParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds