Orion Platform Collection Cubes

Cubes

Collection Reader

Import Statement

from orionplatform.cubes import CollectionReaderCube

Description

Emits collections

Output Ports

  • success: CollectionOutputPort

Ungrouped Parameters

  • Collection: CollectionInputParameter

    Collections to use as input to a floe

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: IntegerParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Shard Reader

Import Statement

from orionplatform.cubes import ShardReaderCube

Description

Reads collections and emits the shards from each collection

Output Ports

  • success: ShardOutputPort

Ungrouped Parameters

  • Input Collections: CollectionInputParameter

    Collections to emit shards from

  • limit: IntegerParameter

    Maximum number of shards to read with this cube

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: IntegerParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Collection to Records

Import Statement

from orionplatform.cubes import CollectionToRecordsCube

Description

Decodes a collection into OERecords.

Input Ports

  • intake: ShardInputPort

  • intake_list: JsonInputPort

Output Ports

  • failure: ShardOutputPort

  • success: RecordOutputPort

Ungrouped Parameters

  • Shard Format: StringParameter

    The format of the data that shards contain

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: IntegerParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Parallel Version

ParallelCollectionToRecordsCube

Create Collection

Import Statement

from orionplatform.cubes import CreateCollectionCube

Description

Creates a collection to which shards can be added.

Output Ports

  • success: CollectionOutputPort

Ungrouped Parameters

  • Collection Metadata Copy: CollectionInputParameter

    Collection to copy the meta data from

  • Collection Name: CollectionOutputParameter

    Name of the collection to create

  • Collection Tags: StringParameter

    Tags to apply to the output dataset, comma delimited

  • v2: BooleanParameter

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: IntegerParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Records To Collection

Import Statement

from orionplatform.cubes import RecordsToCollectionCube

Description

Converts records sent to the intake port into shards in a collection.

This cube must be initialized with a ShardCollection sent to the initializer port.

Converting records to molecules for collections will only preserve the molecule and none of the other data on the record.

Input Ports

  • init: CollectionInputPortV2

  • intake: RecordInputPort

Output Ports

  • failure: RecordOutputPort

  • success: ShardOutputPort

Ungrouped Parameters

  • Output Shard Format: StringParameter

    The format of the data that shards will contain

  • Maximum size of shards: IntegerParameter

    How large of shards to emit in bytes

  • primary_mol: PrimaryMolFieldParameter

    Primary Molecule field to retrieve molecule from

  • records_per_shard: IntegerParameter

    The target number of records in a shard. 0 indicates to run up to the max_shard_bytes limit per shard

  • Shard Upload Attempts: IntegerParameter

    Number of attempts to make when uploading a shard

  • Write Attempts: IntegerParameter

    Number of attempts to write each record to a shard’s temporary file

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: IntegerParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Parallel Version

ParallelRecordsToCollectionCube

Accumulate A List of Shards

Import Statement

from orionplatform.cubes import AccumulateShardsCube

Description

Accumulate a list of shards until a specified size.

A downstream cube can then combine the shards into a larger shard.

Input Ports

  • intake: ShardInputPort

Output Ports

  • success: JsonOutputPort

Ungrouped Parameters

  • size: IntegerParameter

    Number of shards to accumulate

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: IntegerParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Close Collection

Import Statement

from orionplatform.cubes import CloseCollectionCube

Description

Finalizes shards passed in, and then the collection they belong to.

Warning: do not include this cube in a parallel cube group. Shards must be closed after being emitted from a parallel cube group.

Input Ports

  • collection_intake: CollectionInputPortV2

  • intake: ShardInputPort

Ungrouped Parameters

  • close_collections: BooleanParameter

    Whether to close collections to additional shards.

  • close_shards: BooleanParameter

    Whether to mark shards as ready.

  • delete_collections: BooleanParameter

    Whether to delete collections.

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: IntegerParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Close Shards

Import Statement

from orionplatform.cubes import CloseShardsCube

Description

Finalizes shards passed in.

If shards are further sent to CloseCollectionCube, then CloseCollectionCube’s close_shards parameter should be set to False.

Warning: do not include this cube in a cube group. Shards must be closed after being emitted from a group.

Input Ports

  • intake: ShardInputPort

Output Ports

  • failure: ShardOutputPort

  • success: ShardOutputPort

Ungrouped Parameters

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: IntegerParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Parallel Version

ParallelCloseShardsCube

Collection Resize

Import Statement

from orionplatform.cubes import CollectionResizeCube

Description

Takes lists of shards from AccumulateShardsCube and concatenates those shards together.

This cube must be initialized with a ShardCollection sent to the initializer port.

Input Ports

  • init: CollectionInputPortV2

  • intake: JsonInputPort

Output Ports

  • failure: JsonOutputPort

  • success: ShardOutputPort

Ungrouped Parameters

  • Shard Format: StringParameter

    The format of the data that shards contain. Used in validation.

  • validation: StringParameter

    What kind of validation should be performed?

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • CPUs: IntegerParameter

    The number of CPUs to run this cube with

  • Temporary Disk Space (MiB): DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • GPUs: IntegerParameter

    The number of GPUs to run this cube with

  • Instance Tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • Instance Type: StringParameter

    The type of instance that this cube needs to be run on

  • Max Backlog Wait: IntegerParameter

    The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated

  • Memory (MiB): DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • Thread limit per CPU: IntegerParameter

    The number of threads per CPU

  • Shared Memory (MiB): DecimalParameter

    The amount of shared memory to allow a container to address

  • Spot policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • Cube Metrics: StringParameter

    Set of metrics to be collected

  • Metric Period: DecimalParameter

    How often to sample metrics, in seconds

Parallel Version

ParallelCollectionResizeCube