Orion Platform Collection Cubes¶
Cubes¶
Collection Reader¶
Import
from orionplatform.cubes import CollectionReaderCube
Description
Emits collections
Output Ports
success: CollectionOutputPort
Ungrouped Parameters¶
collection: CollectionInputParameter
Collections to use as input to a floe
Parameter Groups¶
Floe Internals
buffer_size: IntegerParameter
The amount of data buffered before sending downstream
Hardware
cpu_count: IntegerParameter
The number of CPUs to run this cube with
disk_space: DecimalParameter
The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
gpu_count: IntegerParameter
The number of GPUs to run this cube with
instance_tags: StringParameter
Only run on machines with matching tags (comma separated)
instance_type: StringParameter
The type of instance that this cube needs to be run on
memory_mb: DecimalParameter
The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
spot_policy: StringParameter
Control cube placement on spot market instances
Metrics
cube_metrics: StringParameter
Set of metrics to be collected
metric_period: DecimalParameter
How often to sample metrics, in seconds
Shard Reader¶
Import
from orionplatform.cubes import ShardReaderCube
Description
Reads collections and emits the shards from each collection
Output Ports
success: ShardOutputPort
Ungrouped Parameters¶
collection: CollectionInputParameter
Collections to emit shards from
limit: IntegerParameter
Maximum number of shards to read with this cube
Parameter Groups¶
Floe Internals
buffer_size: IntegerParameter
The amount of data buffered before sending downstream
Hardware
cpu_count: IntegerParameter
The number of CPUs to run this cube with
disk_space: DecimalParameter
The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
gpu_count: IntegerParameter
The number of GPUs to run this cube with
instance_tags: StringParameter
Only run on machines with matching tags (comma separated)
instance_type: StringParameter
The type of instance that this cube needs to be run on
memory_mb: DecimalParameter
The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
spot_policy: StringParameter
Control cube placement on spot market instances
Metrics
cube_metrics: StringParameter
Set of metrics to be collected
metric_period: DecimalParameter
How often to sample metrics, in seconds
Collection to Records¶
Import
from orionplatform.cubes import CollectionToRecordsCube
Description
Decodes a collection into OERecords.
Input Ports
intake: ShardInputPort
intake_list: JsonInputPort
Output Ports
failure: ShardOutputPort
success: RecordOutputPort
Ungrouped Parameters¶
collection_type: StringParameter
The format of the data that shards contain
Parameter Groups¶
Floe Internals
buffer_size: IntegerParameter
The amount of data buffered before sending downstream
Hardware
cpu_count: IntegerParameter
The number of CPUs to run this cube with
disk_space: DecimalParameter
The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
gpu_count: IntegerParameter
The number of GPUs to run this cube with
instance_tags: StringParameter
Only run on machines with matching tags (comma separated)
instance_type: StringParameter
The type of instance that this cube needs to be run on
memory_mb: DecimalParameter
The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
spot_policy: StringParameter
Control cube placement on spot market instances
Metrics
cube_metrics: StringParameter
Set of metrics to be collected
metric_period: DecimalParameter
How often to sample metrics, in seconds
Parallel Version¶
ParallelCollectionToRecordsCube
Create Collection¶
Import
from orionplatform.cubes import CreateCollectionCube
Description
Creates a collection to which shards can be added.
Output Ports
success: CollectionOutputPort
Ungrouped Parameters¶
collection_metadata_copy: CollectionInputParameter
Collection to copy the meta data from
collection_name: CollectionOutputParameter
Name of the collection to create
output_tags: StringParameter
Tags to apply to the output dataset, comma delimited
Parameter Groups¶
Floe Internals
buffer_size: IntegerParameter
The amount of data buffered before sending downstream
Hardware
cpu_count: IntegerParameter
The number of CPUs to run this cube with
disk_space: DecimalParameter
The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
gpu_count: IntegerParameter
The number of GPUs to run this cube with
instance_tags: StringParameter
Only run on machines with matching tags (comma separated)
instance_type: StringParameter
The type of instance that this cube needs to be run on
memory_mb: DecimalParameter
The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
spot_policy: StringParameter
Control cube placement on spot market instances
Metrics
cube_metrics: StringParameter
Set of metrics to be collected
metric_period: DecimalParameter
How often to sample metrics, in seconds
Records To Collection¶
Import
from orionplatform.cubes import RecordsToCollectionCube
Description
- Converts records sent to the intake port into shards in a collection.
This cube must be initialized with a ShardCollection sent to the initializer port.
Converting records to molecules for collections will only preserve the molecule and none of the other data on the record.
Input Ports
init: CollectionInputPortV2
intake: RecordInputPort
Output Ports
failure: RecordOutputPort
success: ShardOutputPort
Ungrouped Parameters¶
collection_type: StringParameter
The format of the data that shards will contain
max_shard_bytes: IntegerParameter
How large of shards to emit in bytes
primary_mol: PrimaryMolFieldParameter
Primary Molecule field to retrieve molecule from
records_per_shard: IntegerParameter
The target number of records in a shard. 0 indicates to run up to the max_shard_bytes limit per shard
upload_attempts: IntegerParameter
Number of attempts to make when uploading a shard
write_attempts: IntegerParameter
Number of attempts to write each record to a shard’s temporary file
Parameter Groups¶
Floe Internals
buffer_size: IntegerParameter
The amount of data buffered before sending downstream
Hardware
cpu_count: IntegerParameter
The number of CPUs to run this cube with
disk_space: DecimalParameter
The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
gpu_count: IntegerParameter
The number of GPUs to run this cube with
instance_tags: StringParameter
Only run on machines with matching tags (comma separated)
instance_type: StringParameter
The type of instance that this cube needs to be run on
memory_mb: DecimalParameter
The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
spot_policy: StringParameter
Control cube placement on spot market instances
Metrics
cube_metrics: StringParameter
Set of metrics to be collected
metric_period: DecimalParameter
How often to sample metrics, in seconds
Parallel Version¶
ParallelRecordsToCollectionCube
Accumulate A List of Shards¶
Import
from orionplatform.cubes import AccumulateShardsCube
Description
Accumulate a list of shards until a specified size.
A downstream cube can then combine the shards into a larger shard.
Input Ports
intake: ShardInputPort
Output Ports
success: JsonOutputPort
Ungrouped Parameters¶
size: IntegerParameter
Number of shards to accumulate
Parameter Groups¶
Floe Internals
buffer_size: IntegerParameter
The amount of data buffered before sending downstream
Hardware
cpu_count: IntegerParameter
The number of CPUs to run this cube with
disk_space: DecimalParameter
The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
gpu_count: IntegerParameter
The number of GPUs to run this cube with
instance_tags: StringParameter
Only run on machines with matching tags (comma separated)
instance_type: StringParameter
The type of instance that this cube needs to be run on
memory_mb: DecimalParameter
The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
spot_policy: StringParameter
Control cube placement on spot market instances
Metrics
cube_metrics: StringParameter
Set of metrics to be collected
metric_period: DecimalParameter
How often to sample metrics, in seconds
Close Collection¶
Import
from orionplatform.cubes import CloseCollectionCube
Description
Finalizes shards passed in, and then the collection they belong to.
Warning: do not include this cube in a parallel cube group. Shards must be closed after being emitted from a parallel cube group.
Input Ports
intake: ShardInputPort
Ungrouped Parameters¶
close_collections: BooleanParameter
Whether to close collections to additional shards.
close_shards: BooleanParameter
Whether to mark shards as ready.
delete_collections: BooleanParameter
Whether to delete collections.
Parameter Groups¶
Floe Internals
buffer_size: IntegerParameter
The amount of data buffered before sending downstream
Hardware
cpu_count: IntegerParameter
The number of CPUs to run this cube with
disk_space: DecimalParameter
The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
gpu_count: IntegerParameter
The number of GPUs to run this cube with
instance_tags: StringParameter
Only run on machines with matching tags (comma separated)
instance_type: StringParameter
The type of instance that this cube needs to be run on
memory_mb: DecimalParameter
The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
spot_policy: StringParameter
Control cube placement on spot market instances
Metrics
cube_metrics: StringParameter
Set of metrics to be collected
metric_period: DecimalParameter
How often to sample metrics, in seconds
Close Shards¶
Import
from orionplatform.cubes import CloseShardsCube
Description
Finalizes shards passed in.
If shards are further sent to CloseCollectionCube, then CloseCollectionCube’s close_shards parameter should be set to False.
Warning: do not include this cube in a cube group. Shards must be closed after being emitted from a group.
Input Ports
intake: ShardInputPort
Output Ports
failure: ShardOutputPort
success: ShardOutputPort
Ungrouped Parameters¶
Parameter Groups¶
Floe Internals
buffer_size: IntegerParameter
The amount of data buffered before sending downstream
Hardware
cpu_count: IntegerParameter
The number of CPUs to run this cube with
disk_space: DecimalParameter
The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
gpu_count: IntegerParameter
The number of GPUs to run this cube with
instance_tags: StringParameter
Only run on machines with matching tags (comma separated)
instance_type: StringParameter
The type of instance that this cube needs to be run on
memory_mb: DecimalParameter
The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
spot_policy: StringParameter
Control cube placement on spot market instances
Metrics
cube_metrics: StringParameter
Set of metrics to be collected
metric_period: DecimalParameter
How often to sample metrics, in seconds
Parallel Version¶
ParallelCloseShardsCube
Collection Resize¶
Import
from orionplatform.cubes import CollectionResizeCube
Description
- Takes lists of shards from AccumulateShardsCube and concatenates those shards together.
This cube must be initialized with a ShardCollection sent to the initializer port.
Input Ports
init: CollectionInputPortV2
intake: JsonInputPort
Output Ports
failure: JsonOutputPort
success: ShardOutputPort
Ungrouped Parameters¶
collection_type: StringParameter
The format of the data that shards contain. Used in validation.
validation: StringParameter
What kind of validation should be performed?
Parameter Groups¶
Floe Internals
buffer_size: IntegerParameter
The amount of data buffered before sending downstream
Hardware
cpu_count: IntegerParameter
The number of CPUs to run this cube with
disk_space: DecimalParameter
The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
gpu_count: IntegerParameter
The number of GPUs to run this cube with
instance_tags: StringParameter
Only run on machines with matching tags (comma separated)
instance_type: StringParameter
The type of instance that this cube needs to be run on
memory_mb: DecimalParameter
The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
spot_policy: StringParameter
Control cube placement on spot market instances
Metrics
cube_metrics: StringParameter
Set of metrics to be collected
metric_period: DecimalParameter
How often to sample metrics, in seconds
Parallel Version¶
ParallelCollectionResizeCube