Dataset Subsetting – Random Splitting Or Selection

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

  • Task-based/Data Science/Sampling and Subsetting

Description

Randomly select a user-defined percentage or number of records from the input dataset and emit them as the “selected” dataset. The remaining records can, upon request, be emitted as the “unselected” dataset.

Records are chosen stochastically, so the specified splitting percentage may not be precisely achieved. By default, the cube randomly splits a set of records into two sets of approximately equal size. This Floe is suitable for splitting larger datasets.
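The following minimal Python sketch illustrates why a stochastic split only approximates the requested percentage. It assumes each record is assigned independently with probability percentage/100; the function name `split_by_percentage` and the `seed` argument are hypothetical and are not part of the floe, whose internal implementation is not described here.

```python
import random


def split_by_percentage(records, percentage=50, seed=None):
    """Assign each record to 'selected' with probability percentage/100.

    Hypothetical illustration only; not the floe's actual code.
    """
    rng = random.Random(seed)
    selected, unselected = [], []
    for record in records:
        # Each record is decided independently, so the final sizes
        # fluctuate around the requested percentage.
        if rng.random() * 100 < percentage:
            selected.append(record)
        else:
            unselected.append(record)
    return selected, unselected


if __name__ == "__main__":
    sel, unsel = split_by_percentage(range(1000), percentage=50, seed=0)
    print(len(sel), len(unsel))
```

Under this assumption, the relative deviation from the requested percentage shrinks as the number of records grows, which is consistent with the note that the floe suits larger datasets.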

Promoted Parameters

Title in user interface (promoted name)

Inputs

Input Dataset (in): Dataset to split randomly

  • Required

  • Type: data_source

Outputs

Output selected dataset (selected): Name of output selected dataset

  • Required

  • Type: dataset_out

  • Default: selected

Name of unselected dataset (unselected): Name of output unselected dataset

  • Required

  • Type: dataset_out

  • Default: unselected

Write Unselected Dataset (switch): If set to True, writes the dataset of unselected records

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Splitting Parameters

Use Percentage (use_percentage): If set to True, a percentage of the records is randomly selected and emitted to the ‘selected’ dataset.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Percentage (percentage): The percentage of records (an integer from 1 to 99) that are randomly selected and emitted to the ‘selected’ dataset.

  • Type: decimal

  • Default: 50

Number To Select (selected_number): The number of records that are randomly selected and emitted to the ‘selected’ dataset.

  • Type: integer

  • Default: 0
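As a rough illustration of selecting a fixed number of records rather than a percentage, the sketch below uses reservoir sampling, a standard technique for drawing a fixed-size uniform random sample from a stream in a single pass. Whether the floe uses this technique is an assumption; the function name `select_fixed_number` and the `seed` argument are hypothetical.

```python
import random


def select_fixed_number(records, selected_number, seed=None):
    """Pick exactly selected_number records uniformly at random (reservoir sampling).

    Hypothetical illustration only; not the floe's actual code.
    If the input has fewer records than selected_number, all are selected.
    """
    rng = random.Random(seed)
    selected, unselected = [], []
    for i, record in enumerate(records):
        if i < selected_number:
            # Fill the reservoir with the first selected_number records.
            selected.append(record)
        else:
            # Replace a reservoir slot with probability selected_number / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < selected_number:
                unselected.append(selected[j])
                selected[j] = record
            else:
                unselected.append(record)
    return selected, unselected


if __name__ == "__main__":
    sel, unsel = select_fixed_number(range(100), selected_number=10, seed=1)
    print(len(sel), len(unsel))
```

Unlike the per-record percentage split, this approach returns exactly the requested count whenever the input contains at least that many records.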