Sample Collection By Shards

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

Description

This floe can be used to downsample a large collection by selecting a subsample of its shards. This is only appropriate when all shards in the collection are exchangeable and share the same data format (e.g. FastROCS Collections). The sampling technique can be chosen to specify either a sampling percentage or a specific sample size. Optionally, the unselected shards can also be saved into a separate collection.

Promoted Parameters

Title in user interface (promoted name)

Sampling Parameters

Sampling Method (sampling_method): Sampling technique to employ. Reservoir sampling guarantees a sample of the specified size. Random sampling selects each shard in an independent trial with the provided probability of selection, so the number selected is drawn from a binomial distribution.

  • Type: string

  • Default: Reservoir

  • Choices: [‘Reservoir’, ‘Random’]
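The difference between the two techniques can be illustrated with a minimal Python sketch. This is not the floe's implementation; the function names, the use of plain shard identifiers, and the seeded random generator are assumptions made for illustration only.

```python
import random


def reservoir_sample(shard_ids, k, rng):
    """Reservoir sampling (Algorithm R): one pass, fixed-size result.

    Returns exactly k items when the input has at least k items,
    otherwise returns every item.
    """
    reservoir = []
    for i, shard in enumerate(shard_ids):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(shard)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = shard
    return reservoir


def random_sample(shard_ids, percentage, rng):
    """Independent trials: each item is kept with probability percentage / 100.

    The number of items kept varies from run to run (binomially distributed).
    """
    p = percentage / 100.0
    return [shard for shard in shard_ids if rng.random() < p]


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical seed, for repeatability only
    shards = list(range(1000))  # stand-ins for shard identifiers
    print(len(reservoir_sample(shards, 100, rng)))  # always 100
    print(len(random_sample(shards, 50.0, rng)))    # close to 500, but varies
```

With reservoir sampling the output size is fixed in advance, which suits a fixed downstream budget. With random sampling only the expected size is fixed (the number of shards times the percentage, divided by 100).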

Reservoir Sample Size (reservoir_sample_size): Number of shards to select with reservoir sampling. Only used if the Reservoir technique is selected.

  • Type: integer

  • Default: 100

Random Sample Percentage (random_sample_percentage): Percentage of shards (from 0 to 100) to select at random. Only used if the Random technique is selected.

  • Type: decimal

  • Default: 50.0

Save Unsampled Collection (save_unsampled_collection): If On, the non-sampled shards are also saved to the separate collection provided.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]
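As a rough illustration of what saving the unsampled shards means, the sketch below partitions a list of shard identifiers into a selected part and a remainder. It is a hypothetical stand-alone example, not the floe's code: the function name `split_shards`, the `seed` argument, and the use of plain lists in place of Orion collections are all assumptions.

```python
import random


def split_shards(shard_ids, sample_size, seed=None):
    """Partition shard_ids into (sampled, unsampled), preserving input order.

    Every input shard ends up in exactly one of the two lists.
    """
    rng = random.Random(seed)
    shard_ids = list(shard_ids)
    k = min(sample_size, len(shard_ids))
    # Choose k distinct positions uniformly at random.
    chosen = set(rng.sample(range(len(shard_ids)), k))
    sampled = [s for i, s in enumerate(shard_ids) if i in chosen]
    unsampled = [s for i, s in enumerate(shard_ids) if i not in chosen]
    return sampled, unsampled


if __name__ == "__main__":
    sampled, unsampled = split_shards(range(10), 3, seed=1)
    print(len(sampled), len(unsampled))  # 3 7
```

Keeping the remainder makes the split reusable, for example as a held-out set that does not overlap with the sampled collection.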

Inputs

Input Collection(s) (collections_in): Input Collection(s) to emit

  • Type: collection_source

Outputs

Sampled Collection (out_coll): Name of the collection to create for the sampled shards.

  • Required

  • Type: collection_sink

Optional Unselected Collection (out_unsampled_coll): Name of the collection to create for the unselected shards.

  • Required

  • Type: collection_sink

  • Default: Unselected Collection