Dataset Subsetting Based on String Keys

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

  • Task-based/Data Science/Filtering

Description

Subset a dataset using string keys. This floe takes a dataset and two input parameters: the name of a string field in that dataset and a string of keys. It splits the key string by line to create keys, then emits the records from the input dataset whose value in the specified string field matches any of these keys.
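As a rough illustration of the matching step described above, the following sketch uses plain Python dictionaries in place of dataset records. It is not the floe's implementation: the function name subset_by_keys, the sample records, and the choice to strip whitespace and skip blank lines are all assumptions made for the example.

```python
def subset_by_keys(records, field, key_string):
    """Return the records whose `field` value matches any key in `key_string`.

    Assumptions: keys are separated by line breaks, surrounding whitespace
    is stripped, blank lines are ignored, and plain dicts stand in for
    dataset records.
    """
    keys = {line.strip() for line in key_string.splitlines() if line.strip()}
    return [rec for rec in records if rec.get(field) in keys]


# Hypothetical records; "Subset Field" is the documented default field name.
records = [
    {"Subset Field": "mol-1"},
    {"Subset Field": "mol-2"},
    {"Subset Field": "mol-3"},
]
matched = subset_by_keys(records, "Subset Field", "mol-1\nmol-3")
```

Building the keys into a set keeps each membership check constant-time, which matters when the dataset is much larger than the key list.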

Promoted Parameters

Title in user interface (promoted name)

Inputs

Input Dataset (in): Dataset to subset.

  • Required

  • Type: data_source

Reference Input Dataset (ref_in): Reference dataset for file-based subset.

  • Type: data_source

Reference Identifier Field (String Reference Field): The name of the string data field in the reference records read from the ‘init’ port.

  • Type: field_parameter::string

  • Default: String Reference Field

Input String Field To Use For Subsetting (Field to Subset): The name of the string data field in the input dataset to use for subsetting.

  • Required

  • Type: field_parameter::string

  • Default: Subset Field

Use String For Input (Use String For Input): If true, use the input string as the reference input. If false (default), use the reference input dataset.

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Number of messages to distribute at a time (item_count): Number of records to process in each instance of the subset cube.

  • Required

  • Type: integer

  • Default: 5000

CPUs (cpu_count): Number of CPUs to use in each instance of the subset cube.

  • Required

  • Type: integer

  • Default: 4

Inputs If Using String For Reference Input

Input String (Input String): String to convert into records separated by line breaks.

  • Type: string

Separator (Input String Separator): The string used to separate the input string into records.

  • Type: string

  • Default: ,
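To make the role of the Separator parameter concrete, here is a minimal sketch of turning an input string into a list of keys. The helper name keys_from_string is hypothetical, and the handling of whitespace and empty entries is an assumption; the floe may treat those cases differently.

```python
def keys_from_string(input_string, separator=","):
    """Split `input_string` on `separator` and return the non-empty keys.

    Assumptions: surrounding whitespace is stripped and empty entries are
    dropped. The default separator mirrors the documented default (",").
    """
    return [part.strip() for part in input_string.split(separator) if part.strip()]


keys = keys_from_string("mol-1, mol-2,,mol-3")
line_keys = keys_from_string("mol-1\nmol-2", separator="\n")
```

Passing a line break as the separator gives one key per line, which corresponds to the line-based splitting mentioned in the description.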

Outputs

Output matched dataset (matched): Name of the output matched dataset.

  • Required

  • Type: dataset_out

  • Default: matched

Output unmatched dataset (unmatched): Name of the output unmatched dataset.

  • Type: dataset_out

  • Default: unmatched

Write unmatched dataset (switch_unmatched): If off, then the ‘unmatched’ dataset is not generated.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]
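The relationship between the two outputs and the switch_unmatched parameter can be sketched as follows. This is an illustration only, with plain dictionaries standing in for records; the function name split_matched and its write_unmatched argument are hypothetical and are not part of the floe.

```python
def split_matched(records, field, keys, write_unmatched=False):
    """Partition records into (matched, unmatched) lists.

    `unmatched` is None when `write_unmatched` is False, mirroring the
    documented behavior that the unmatched dataset is then not generated.
    """
    keys = set(keys)
    matched = []
    unmatched = [] if write_unmatched else None
    for rec in records:
        if rec.get(field) in keys:
            matched.append(rec)
        elif unmatched is not None:
            unmatched.append(rec)
    return matched, unmatched


# Hypothetical records and keys for illustration.
records = [{"Subset Field": "a"}, {"Subset Field": "b"}, {"Subset Field": "c"}]
only_matched = split_matched(records, "Subset Field", ["a", "c"])
both = split_matched(records, "Subset Field", ["a", "c"], write_unmatched=True)
```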