Dataset Subsetting – Random Splitting Or Selection

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

Task-based/Data Science/Sampling and Subsetting

Description

Randomly selects a user-defined percentage or number of records from the input dataset and emits them as the “selected” dataset. The remaining records can, upon request, be emitted as the “unselected” dataset.

Records are chosen stochastically, so the specified splitting percentage may not be precisely achieved. By default, the cube randomly splits a set of records into two sets of approximately equal size. This Floe is suitable for splitting larger datasets.
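The effect of stochastic selection can be illustrated with a minimal Python sketch. It assumes each record is kept independently with probability percentage/100, which is one common way such a split behaves; the floe's actual implementation is not documented here, and the function name `split_by_percentage` and its `seed` argument are hypothetical.

```python
import random


def split_by_percentage(records, percentage=50.0, seed=None):
    """Send each record to 'selected' with probability percentage/100.

    Hypothetical helper for illustration only; not the floe's own code.
    """
    rng = random.Random(seed)
    threshold = percentage / 100.0
    selected, unselected = [], []
    for record in records:
        # Each record is decided independently, so the final split
        # is only approximately equal to the requested percentage.
        if rng.random() < threshold:
            selected.append(record)
        else:
            unselected.append(record)
    return selected, unselected


selected, unselected = split_by_percentage(range(10_000), percentage=50, seed=0)
print(len(selected), len(unselected))
```

Under this assumption the two counts always add up to the input size, but the "selected" count will typically differ slightly from exactly half. Because records are handled one at a time, the same approach works on a stream without holding the whole dataset in memory.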

Promoted Parameters

Title in user interface (promoted name)

Inputs

Input Dataset (in): Dataset to split randomly

Required

Type: data_source

Outputs

Output selected dataset (selected): Name of output selected dataset

Required

Type: dataset_out

Default: selected

Name of unselected dataset (unselected): Name of output unselected dataset

Required

Type: dataset_out

Default: unselected

Write Unselected Dataset (switch): If set to True, writes dataset of unselected records

Required

Type: boolean

Default: False

Choices: [True, False]

Splitting Parameters

Use Percentage (use_percentage): If set to True, the Percentage parameter determines how many records are randomly selected and emitted to the dataset called ‘selected’. If set to False, Number To Select is used instead.

Required

Type: boolean

Default: True

Choices: [True, False]

Percentage (percentage): The percentage of records, integer 1 to 99, randomly selected and emitted to the dataset called ‘selected’.

Type: decimal

Default: 50

Number To Select (selected_number): The number of records randomly selected and emitted to the dataset called ‘selected’.

Type: integer

Default: 0
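Selecting a fixed number of records from a stream can be sketched with reservoir sampling (Algorithm R). This is only an illustration of one technique that yields exactly `selected_number` records in a single pass; whether the floe uses it is an assumption, and the function name `select_fixed_number` and its `seed` argument are hypothetical.

```python
import random


def select_fixed_number(records, selected_number, seed=None):
    """Pick exactly selected_number records uniformly at random in one pass.

    Reservoir sampling (Algorithm R); a hypothetical stand-in,
    not the floe's documented behavior.
    """
    rng = random.Random(seed)
    reservoir = []   # records currently selected
    unselected = []  # records that were never chosen or were evicted
    for index, record in enumerate(records):
        if index < selected_number:
            # Fill the reservoir with the first selected_number records.
            reservoir.append(record)
        else:
            # Replace a reservoir slot with probability selected_number/(index+1).
            slot = rng.randint(0, index)
            if slot < selected_number:
                unselected.append(reservoir[slot])
                reservoir[slot] = record
            else:
                unselected.append(record)
    return reservoir, unselected


picked, rest = select_fixed_number(range(1000), selected_number=25, seed=1)
print(len(picked), len(rest))
```

Unlike the percentage-based split, this approach returns an exact count whenever the input holds at least `selected_number` records. With `selected_number` set to 0, every record ends up in the unselected list.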