Orion Platform Collection Cubes

Collection Reader

Import

from orionplatform.cubes import CollectionReaderCube

Description

Emits collections

Output Ports

  • success: CollectionOutputPort

Ungrouped Parameters

  • collection: CollectionInputParameter

    Collections to use as input to a floe

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • cpu_count: IntegerParameter

    The number of CPUs to run this cube with

  • disk_space: DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • gpu_count: IntegerParameter

    The number of GPUs to run this cube with

  • instance_tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • instance_type: StringParameter

    The type of instance that this cube needs to be run on

  • memory_mb: DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • spot_policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • cube_metrics: StringParameter

    Set of metrics to be collected

  • metric_period: DecimalParameter

    How often to sample metrics, in seconds

Shard Reader

Import

from orionplatform.cubes import ShardReaderCube

Description

Reads collections and emits the shards from each collection

Output Ports

  • success: ShardOutputPort

Ungrouped Parameters

  • collection: CollectionInputParameter

    Collections to emit shards from

  • limit: IntegerParameter

    Maximum number of shards to read with this cube

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • cpu_count: IntegerParameter

    The number of CPUs to run this cube with

  • disk_space: DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • gpu_count: IntegerParameter

    The number of GPUs to run this cube with

  • instance_tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • instance_type: StringParameter

    The type of instance that this cube needs to be run on

  • memory_mb: DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • spot_policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • cube_metrics: StringParameter

    Set of metrics to be collected

  • metric_period: DecimalParameter

    How often to sample metrics, in seconds

Collection to Records

Import

from orionplatform.cubes import CollectionToRecordsCube

Description

Decodes a collection into OERecords.

Input Ports

  • intake: ShardInputPort

  • intake_list: JsonInputPort

Output Ports

  • failure: ShardOutputPort

  • success: RecordOutputPort

Ungrouped Parameters

  • collection_type: StringParameter

    The format of the data that shards contain

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • cpu_count: IntegerParameter

    The number of CPUs to run this cube with

  • disk_space: DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • gpu_count: IntegerParameter

    The number of GPUs to run this cube with

  • instance_tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • instance_type: StringParameter

    The type of instance that this cube needs to be run on

  • memory_mb: DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • spot_policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • cube_metrics: StringParameter

    Set of metrics to be collected

  • metric_period: DecimalParameter

    How often to sample metrics, in seconds

Parallel Version

ParallelCollectionToRecordsCube

Create Collection

Import

from orionplatform.cubes import CreateCollectionCube

Description

Creates a collection to which shards can be added.

Output Ports

  • success: CollectionOutputPort

Ungrouped Parameters

  • collection_metadata_copy: CollectionInputParameter

    Collection to copy the meta data from

  • collection_name: CollectionOutputParameter

    Name of the collection to create

  • output_tags: StringParameter

    Tags to apply to the output dataset, comma delimited

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • cpu_count: IntegerParameter

    The number of CPUs to run this cube with

  • disk_space: DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • gpu_count: IntegerParameter

    The number of GPUs to run this cube with

  • instance_tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • instance_type: StringParameter

    The type of instance that this cube needs to be run on

  • memory_mb: DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • spot_policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • cube_metrics: StringParameter

    Set of metrics to be collected

  • metric_period: DecimalParameter

    How often to sample metrics, in seconds

Records To Collection

Import

from orionplatform.cubes import RecordsToCollectionCube

Description

Converts records sent to the intake port into shards in a collection.

This cube must be initialized with a ShardCollection sent to the initializer port.

Converting records to molecules for collections will only preserve the molecule and none of the other data on the record.

Input Ports

  • init: CollectionInputPortV2

  • intake: RecordInputPort

Output Ports

  • failure: RecordOutputPort

  • success: ShardOutputPort

Ungrouped Parameters

  • collection_type: StringParameter

    The format of the data that shards will contain

  • max_shard_bytes: IntegerParameter

    How large of shards to emit in bytes

  • primary_mol: PrimaryMolFieldParameter

    Primary Molecule field to retrieve molecule from

  • records_per_shard: IntegerParameter

    The target number of records in a shard. 0 indicates to run up to the max_shard_bytes limit per shard

  • upload_attempts: IntegerParameter

    Number of attempts to make when uploading a shard

  • write_attempts: IntegerParameter

    Number of attempts to write each record to a shard’s temporary file

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • cpu_count: IntegerParameter

    The number of CPUs to run this cube with

  • disk_space: DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • gpu_count: IntegerParameter

    The number of GPUs to run this cube with

  • instance_tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • instance_type: StringParameter

    The type of instance that this cube needs to be run on

  • memory_mb: DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • spot_policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • cube_metrics: StringParameter

    Set of metrics to be collected

  • metric_period: DecimalParameter

    How often to sample metrics, in seconds

Parallel Version

ParallelRecordsToCollectionCube

Accumulate A List of Shards

Import

from orionplatform.cubes import AccumulateShardsCube

Description

Accumulate a list of shards until a specified size.

A downstream cube can then combine the shards into a larger shard.

Input Ports

  • intake: ShardInputPort

Output Ports

  • success: JsonOutputPort

Ungrouped Parameters

  • size: IntegerParameter

    Number of shards to accumulate

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • cpu_count: IntegerParameter

    The number of CPUs to run this cube with

  • disk_space: DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • gpu_count: IntegerParameter

    The number of GPUs to run this cube with

  • instance_tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • instance_type: StringParameter

    The type of instance that this cube needs to be run on

  • memory_mb: DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • spot_policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • cube_metrics: StringParameter

    Set of metrics to be collected

  • metric_period: DecimalParameter

    How often to sample metrics, in seconds

Close Collection

Import

from orionplatform.cubes import CloseCollectionCube

Description

Finalizes shards passed in, and then the collection they belong to.

Warning: do not include this cube in a parallel cube group. Shards must be closed after being emitted from a parallel cube group.

Input Ports

  • intake: ShardInputPort

Ungrouped Parameters

  • close_collections: BooleanParameter

    Whether to close collections to additional shards.

  • close_shards: BooleanParameter

    Whether to mark shards as ready.

  • delete_collections: BooleanParameter

    Whether to delete collections.

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • cpu_count: IntegerParameter

    The number of CPUs to run this cube with

  • disk_space: DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • gpu_count: IntegerParameter

    The number of GPUs to run this cube with

  • instance_tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • instance_type: StringParameter

    The type of instance that this cube needs to be run on

  • memory_mb: DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • spot_policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • cube_metrics: StringParameter

    Set of metrics to be collected

  • metric_period: DecimalParameter

    How often to sample metrics, in seconds

Close Shards

Import

from orionplatform.cubes import CloseShardsCube

Description

Finalizes shards passed in.

If shards are further sent to CloseCollectionCube, then CloseCollectionCube’s close_shards parameter should be set to False.

Warning: do not include this cube in a cube group. Shards must be closed after being emitted from a group.

Input Ports

  • intake: ShardInputPort

Output Ports

  • failure: ShardOutputPort

  • success: ShardOutputPort

Ungrouped Parameters

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • cpu_count: IntegerParameter

    The number of CPUs to run this cube with

  • disk_space: DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • gpu_count: IntegerParameter

    The number of GPUs to run this cube with

  • instance_tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • instance_type: StringParameter

    The type of instance that this cube needs to be run on

  • memory_mb: DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • spot_policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • cube_metrics: StringParameter

    Set of metrics to be collected

  • metric_period: DecimalParameter

    How often to sample metrics, in seconds

Parallel Version

ParallelCloseShardsCube

Collection Resize

Import

from orionplatform.cubes import CollectionResizeCube

Description

Takes lists of shards from AccumulateShardsCube and concatenates those shards together.

This cube must be initialized with a ShardCollection sent to the initializer port.

Input Ports

  • init: CollectionInputPortV2

  • intake: JsonInputPort

Output Ports

  • failure: JsonOutputPort

  • success: ShardOutputPort

Ungrouped Parameters

  • collection_type: StringParameter

    The format of the data that shards contain. Used in validation.

  • validation: StringParameter

    What kind of validation should be performed?

Parameter Groups

Floe Internals

  • buffer_size: IntegerParameter

    The amount of data buffered before sending downstream

Hardware

  • cpu_count: IntegerParameter

    The number of CPUs to run this cube with

  • disk_space: DecimalParameter

    The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • gpu_count: IntegerParameter

    The number of GPUs to run this cube with

  • instance_tags: StringParameter

    Only run on machines with matching tags (comma separated)

  • instance_type: StringParameter

    The type of instance that this cube needs to be run on

  • memory_mb: DecimalParameter

    The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

  • spot_policy: StringParameter

    Control cube placement on spot market instances

Metrics

  • cube_metrics: StringParameter

    Set of metrics to be collected

  • metric_period: DecimalParameter

    How often to sample metrics, in seconds

Parallel Version

ParallelCollectionResizeCube