Dataset Subsetting Based on Dataset, Numerical, String, or Regex Field

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

  • Task-based/Data Science/Filtering

Description

Subset an input dataset either against a reference dataset or by a specified numerical, string, or regex field. The type of subsetting must be chosen in the Subset Type parameter. The floe performs only one type of subsetting at a time, according to the subset type chosen.

  • If the string subset type is chosen, records whose value in the specified string_field matches the string_constant_value parameter will be emitted to the matched dataset.

  • If the regex subset type is chosen, records whose value in the specified regex_field matches the regular expression given in the regex parameter will be emitted to the matched dataset.

  • If the numerical subset type is chosen, records whose value in the specified numerical_field satisfies the chosen numerical relationship with the numerical_constant_value parameter will be emitted to the matched dataset. Integer or float values can be used for the constant and field values.

  • If the dataset subset type is chosen, this floe subsets the ‘input’ dataset based on whether its molecules exist in the ‘reference’ dataset. The molecule equivalence is determined by canonical SMILES matching. If this option is chosen, this floe has to generate and cache all SMILES representations of the molecules in the ‘reference’ dataset; therefore, this option is not suitable for handling a very large dataset.
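
The dataset subset type described above can be sketched as a set-membership test on canonical SMILES. This is only an illustration of the caching idea, not the floe's implementation: `subset_by_reference` and `toy_canonicalize` are hypothetical names, and the lookup-table canonicalizer stands in for a real cheminformatics toolkit, which the sketch assumes you would supply.

```python
from typing import Callable, Iterable, List, Tuple

def subset_by_reference(
    input_smiles: Iterable[str],
    reference_smiles: Iterable[str],
    canonicalize: Callable[[str], str],
) -> Tuple[List[str], List[str]]:
    """Split input molecules by whether they also occur in the reference set."""
    # Every reference molecule is canonicalized and held in memory up front,
    # which is why memory use grows with the size of the reference dataset.
    reference_cache = {canonicalize(s) for s in reference_smiles}

    matched, unmatched = [], []
    for smiles in input_smiles:
        if canonicalize(smiles) in reference_cache:
            matched.append(smiles)
        else:
            unmatched.append(smiles)
    return matched, unmatched

# Hypothetical stand-in canonicalizer: maps a few alternative spellings of
# ethanol to one form. A real toolkit would compute this for any molecule.
_TOY_TABLE = {"OCC": "CCO", "C(C)O": "CCO"}

def toy_canonicalize(smiles: str) -> str:
    return _TOY_TABLE.get(smiles, smiles)

matched, unmatched = subset_by_reference(["OCC", "CCN"], ["CCO"], toy_canonicalize)
```

Because the cache is built from the reference dataset only, the input dataset can be streamed one record at a time in this sketch.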

Output ports:

  • Records that satisfy the given subset criteria will be sent to the matched dataset.

  • Records that do not match will be sent to the unmatched dataset if Write unmatched dataset (switch_unmatched) is turned on.

  • Records missing the field for subsetting will be sent to the missing dataset if Write missing dataset (switch_missing) is turned on.
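
The routing to the three output ports can be sketched as follows for the string subset type. This is a minimal illustration, assuming records behave like dictionaries and that a string match means exact equality; neither assumption is confirmed by this page. `route_records` and its argument names are hypothetical, with `write_unmatched` and `write_missing` mirroring the switch_unmatched and switch_missing parameters.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]

def route_records(
    records: Iterable[Record],
    field: str,
    constant: str,
    write_unmatched: bool = False,
    write_missing: bool = False,
) -> Dict[str, List[Record]]:
    """Sort records into matched / unmatched / missing outputs."""
    outputs: Dict[str, List[Record]] = {"matched": []}
    # Optional outputs exist only when their switch is on.
    if write_unmatched:
        outputs["unmatched"] = []
    if write_missing:
        outputs["missing"] = []

    for record in records:
        if field not in record:
            # The record lacks the comparison field entirely.
            if write_missing:
                outputs["missing"].append(record)
        elif record[field] == constant:  # assumption: exact equality
            outputs["matched"].append(record)
        elif write_unmatched:
            outputs["unmatched"].append(record)
    return outputs

records = [{"name": "aspirin"}, {"name": "caffeine"}, {"id": 3}]
result = route_records(records, "name", "aspirin",
                       write_unmatched=True, write_missing=True)
```

With both switches left at their default of False, only the matched output is produced and the other records are dropped.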

Promoted Parameters

Title in user interface (promoted name)

Inputs

Input Dataset (in): Dataset to subset.

  • Required

  • Type: data_source

Reference Dataset (ref): Reference dataset; only needed when subsetting by dataset.

  • Type: data_source

Subset Type (Subset Type): This parameter controls what type of subsetting is performed on the input data.

  • Required

  • Type: string

  • Default: string

  • Choices: [‘string’, ‘regex’, ‘dataset’, ‘numerical’]

Outputs

Output matched dataset (matched): Name of matched output dataset.

  • Required

  • Type: dataset_out

  • Default: matched

Output unmatched dataset (unmatched):

  • Type: dataset_out

  • Default: unmatched

Output missing dataset (missing):

  • Type: dataset_out

  • Default: missing

Write unmatched dataset (switch_unmatched): If off, then the ‘unmatched’ dataset is not generated.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Write missing dataset (switch_missing): If off, then the ‘missing’ dataset is not generated.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Regex Subset Parameters

Regular Expression (regex): The regular expression to which the field value will be compared.

  • Type: string

  • Default:

Comparison Field (regex_field): The field on the incoming record that will be matched to the regular expression.

  • Type: field_parameter::string

  • Default: regex_field
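
A regex comparison of this kind can be sketched with Python's `re` module. This page does not say whether the floe requires the pattern to match the whole field value or only part of it, so the hypothetical helper `regex_matches` exposes both behaviors; anchoring the pattern with `^` and `$` sidesteps the question.

```python
import re

def regex_matches(value: str, pattern: str, full: bool = False) -> bool:
    """Return True if the field value matches the regular expression.

    full=False: the pattern may match anywhere in the value (re.search).
    full=True:  the pattern must match the entire value (re.fullmatch).
    Which of the two the floe uses is an assumption left open here.
    """
    compiled = re.compile(pattern)
    if full:
        return compiled.fullmatch(value) is not None
    return compiled.search(value) is not None

# An anchored pattern behaves the same way under both interpretations.
anchored_hit = regex_matches("CHEMBL25", r"^CHEMBL\d+$")
# An unanchored pattern matches a substring unless a full match is required.
partial_hit = regex_matches("xCHEMBL25", r"CHEMBL\d+")
partial_full = regex_matches("xCHEMBL25", r"CHEMBL\d+", full=True)
```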

String Subset Parameters

Constant Value (string_constant_value): The value to which the field value will be compared.

  • Type: string

  • Default:

Comparison Field (string_field): The field on the incoming record that will be compared to the constant value.

  • Type: field_parameter::string

  • Default: string_field

Numerical Subset Parameters

Comparison Field (numerical_field): Numerical field (integer or float) to use for comparison.

  • Type: field_parameter::any

Compare Options (operator): The relationship used to compare the field value to the constant value.

  • Type: string

  • Choices: [‘near’, ‘not near’, ‘less than’, ‘less than or equal to’, ‘greater than’, ‘greater than or equal to’]

Comparison Value (numerical_constant_value): Numerical constant (integer or float) to use for comparison.

  • Type: decimal

Tolerance (tolerance): The tolerance used when performing ‘near’ comparisons.

  • Type: decimal

  • Default: 0.01
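
The numerical comparison options can be sketched as below. This is an illustration under stated assumptions, not the floe's code: the field value is taken as the left-hand operand, ‘near’ is read as an absolute difference no larger than the tolerance, and ‘not near’ as its negation. The function name `compare` is hypothetical.

```python
import operator

# Ordering comparisons map directly onto Python's operator functions.
_ORDERING = {
    "less than": operator.lt,
    "less than or equal to": operator.le,
    "greater than": operator.gt,
    "greater than or equal to": operator.ge,
}

def compare(value: float, op: str, constant: float, tolerance: float = 0.01) -> bool:
    """Evaluate `value <op> constant` for one of the listed compare options."""
    if op == "near":
        # Assumption: 'near' means within an absolute tolerance.
        return abs(value - constant) <= tolerance
    if op == "not near":
        return abs(value - constant) > tolerance
    if op not in _ORDERING:
        raise ValueError(f"unknown compare option: {op!r}")
    return _ORDERING[op](value, constant)

is_near = compare(1.005, "near", 1.0)            # within the default tolerance
is_far = compare(1.5, "not near", 1.0)           # well outside the tolerance
at_most = compare(3, "less than or equal to", 3)
```

A tolerance-based comparison takes the place of strict equality here, which avoids spurious mismatches from floating-point rounding in the field values.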