Dataset Subsetting – Random Splitting Or Selection
Category Paths
Follow one of these paths in the Orion user interface to find the floe.
Task-based/Data Science/Sampling and Subsetting
Description
Randomly selects a user-defined percentage or number of records from the input dataset and emits them as the “selected” dataset. The remaining records can, upon request, be emitted as the “unselected” dataset.
Records are chosen stochastically, so the specified splitting percentage may not be achieved exactly. By default, the records are split randomly into two sets of approximately equal size. This Floe is suitable for splitting large datasets.
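The reason a stochastic split only approximates the requested percentage can be illustrated with a minimal Python sketch. This is not the Floe's implementation; it assumes each record is selected independently with probability percentage/100, and the function name split_records and the seed argument are hypothetical, introduced only for illustration.

```python
import random


def split_records(records, percentage=50.0, seed=None):
    """Illustrative only: send each record to 'selected' with
    probability percentage/100, otherwise to 'unselected'.
    (Assumed behaviour, not the Floe's actual code.)"""
    rng = random.Random(seed)
    p = percentage / 100.0
    selected, unselected = [], []
    for record in records:
        # Independent draw per record, so the final counts vary from run to run.
        if rng.random() < p:
            selected.append(record)
        else:
            unselected.append(record)
    return selected, unselected


selected, unselected = split_records(range(1000), percentage=50, seed=0)
# Every record lands in exactly one output; the selected count is close to,
# but usually not exactly, 500.
print(len(selected), len(unselected))
```

Under this assumption the relative deviation from the requested percentage shrinks as the dataset grows, which is consistent with the Floe being recommended for large datasets.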
Promoted Parameters
Title in user interface (promoted name)
Inputs
Input Dataset (in): Dataset to split randomly
Required
Type: data_source
Outputs
Output selected dataset (selected): Name of output selected dataset
Required
Type: dataset_out
Default: selected
Name of unselected dataset (unselected): Name of output unselected dataset
Required
Type: dataset_out
Default: unselected
Write Unselected Dataset (switch): If set to True, writes a dataset of the unselected records
Required
Type: boolean
Default: False
Choices: [True, False]
Splitting Parameters
Use Percentage (use_percentage): If set to True, a percentage of the records is randomly selected and emitted to the ‘selected’ dataset.
Required
Type: boolean
Default: True
Choices: [True, False]
Percentage (percentage): The percentage of records (an integer from 1 to 99) to randomly select and emit to the ‘selected’ dataset.
Type: decimal
Default: 50
Number To Select (selected_number): The number of records to randomly select and emit to the ‘selected’ dataset.
Type: integer
Default: 0
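How the three splitting parameters might interact can be sketched as follows. This is a hedged illustration rather than the Floe's code: it assumes that use_percentage=True selects by percentage and that use_percentage=False selects selected_number records instead, and it assumes the count mode yields exactly that many records (capped at the dataset size). The function select_by_mode and its seed argument are hypothetical.

```python
import random


def select_by_mode(records, use_percentage=True, percentage=50.0,
                   selected_number=0, seed=None):
    """Illustrative only: choose between percentage-based and
    count-based random selection (assumed semantics)."""
    rng = random.Random(seed)
    records = list(records)
    if use_percentage:
        # Assumed: independent per-record draw, so the count is approximate.
        p = percentage / 100.0
        flags = [rng.random() < p for _ in records]
    else:
        # Assumed: exactly selected_number records, capped at the dataset size.
        k = min(selected_number, len(records))
        chosen = set(rng.sample(range(len(records)), k))
        flags = [i in chosen for i in range(len(records))]
    selected = [r for r, keep in zip(records, flags) if keep]
    unselected = [r for r, keep in zip(records, flags) if not keep]
    return selected, unselected


by_pct, rest_pct = select_by_mode(range(200), use_percentage=True,
                                  percentage=25, seed=1)
by_count, rest_count = select_by_mode(range(200), use_percentage=False,
                                      selected_number=10, seed=1)
print(len(by_pct), len(by_count))
```

In this sketch the defaults mirror the parameter table: with use_percentage left at True and percentage at 50, selected_number is ignored.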