Dataset Subsetting Based on Dataset, Numerical, String, or Regex Field

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

Task-based/Data Science/Filtering

Description

Subset an input dataset either against a reference dataset or by a specified numerical, string, or regex field. The kind of subsetting must be chosen in the Subset Type parameter. The floe performs only one type of subsetting per run, determined by the subset type chosen.

If the string subset type is chosen, records whose value in the specified string_field matches the string_constant_value parameter will be emitted to the matched dataset.
If the regex subset type is chosen, records whose value in the specified regex_field matches the regular expression given in the regex parameter will be emitted to the matched dataset.
If the numerical subset type is chosen, records whose value in the specified numerical_field satisfies the chosen numerical relationship with the numerical_constant_value parameter will be emitted to the matched dataset. Integer or float values can be used for the constant and field values.
If the dataset subset type is chosen, this floe subsets the ‘input’ dataset based on whether its molecules exist in the ‘reference’ dataset. The molecule equivalence is determined by canonical SMILES matching. If this option is chosen, this floe has to generate and cache all SMILES representations of the molecules in the ‘reference’ dataset; therefore, this option is not suitable for handling a very large dataset.
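
The dataset subset type can be pictured with the following minimal sketch. It is not the floe's implementation: it assumes records are plain dictionaries that already carry a canonical SMILES string under a hypothetical "smiles" key, and the function name subset_by_reference is invented for illustration. It shows why the whole reference dataset must be held in memory before any input record can be classified.

```python
def subset_by_reference(input_records, reference_records, smiles_key="smiles"):
    """Split input records by whether their SMILES occurs in the reference set.

    Assumption: each record is a dict and the value under `smiles_key`
    is already a canonical SMILES string.
    """
    # Cache every reference SMILES once; the size of this set is what
    # makes a very large reference dataset impractical.
    reference_smiles = {
        rec[smiles_key] for rec in reference_records if smiles_key in rec
    }

    matched, unmatched = [], []
    for rec in input_records:
        if rec.get(smiles_key) in reference_smiles:
            matched.append(rec)
        else:
            unmatched.append(rec)
    return matched, unmatched


reference = [{"smiles": "CCO"}, {"smiles": "c1ccccc1"}]
inputs = [
    {"smiles": "CCO", "name": "ethanol"},
    {"smiles": "CC", "name": "ethane"},
]
matched, unmatched = subset_by_reference(inputs, reference)
```

Using a set keeps each membership test at roughly constant time, at the cost of storing one string per reference molecule.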

Output ports:

Records that satisfy the given subset criteria will be sent to the matched dataset.
Records that do not match will be sent to the unmatched dataset if output of this data is chosen.
Records missing the field for subsetting will be sent to the missing dataset if output of this data is chosen.
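
The routing between the three output ports can be sketched as follows for the string and regex subset types. This is an illustration under stated assumptions, not the floe's code: records are plain dictionaries, the function name route_records is hypothetical, string matching is taken to be exact equality, and regex matching is taken to mean that the pattern is found anywhere in the value (re.search). The actual floe may anchor or compare differently.

```python
import re


def route_records(records, field, subset_type, pattern,
                  write_unmatched=False, write_missing=False):
    """Route dict records to 'matched', 'unmatched', or 'missing' lists."""
    if subset_type == "string":
        # Assumption: exact, case-sensitive equality.
        def is_match(value):
            return value == pattern
    elif subset_type == "regex":
        compiled = re.compile(pattern)

        # Assumption: a match anywhere in the value counts.
        def is_match(value):
            return compiled.search(value) is not None
    else:
        raise ValueError(f"unsupported subset type: {subset_type!r}")

    # Optional outputs exist only when their switch is on, mirroring
    # switch_unmatched and switch_missing.
    outputs = {"matched": []}
    if write_unmatched:
        outputs["unmatched"] = []
    if write_missing:
        outputs["missing"] = []

    for rec in records:
        if field not in rec:
            port = "missing"
        elif is_match(rec[field]):
            port = "matched"
        else:
            port = "unmatched"
        if port in outputs:  # records for disabled outputs are dropped
            outputs[port].append(rec)
    return outputs


records = [{"name": "aspirin"}, {"name": "caffeine"}, {"id": 3}]
result = route_records(records, "name", "regex", r"^asp",
                       write_unmatched=True, write_missing=True)
```

With both switches left at their default of False, only the matched list is produced and all other records are discarded.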

Promoted Parameters

Title in user interface (promoted name)

Inputs

Input Dataset (in): Dataset to subset.

Required

Type: data_source

Reference Dataset (ref): Reference dataset; only needed when subsetting by dataset.

Type: data_source

Subset Type (Subset Type): This parameter controls what type of subsetting is performed on the input data.

Required

Type: string

Default: string

Choices: [‘string’, ‘regex’, ‘dataset’, ‘numerical’]

Outputs

Output matched dataset (matched): Name of matched output dataset.

Required

Type: dataset_out

Default: matched

Output unmatched dataset (unmatched):

Type: dataset_out

Default: unmatched

Output missing dataset (missing):

Type: dataset_out

Default: missing

Write unmatched dataset (switch_unmatched): If off, then the ‘unmatched’ dataset is not generated.

Required

Type: boolean

Default: False

Choices: [True, False]

Write missing dataset (switch_missing): If off, then the ‘missing’ dataset is not generated.

Required

Type: boolean

Default: False

Choices: [True, False]

Regex Subset Parameters

Regular Expression (regex): The regular expression to which the field value will be compared.

Type: string

Default:

Comparison Field (regex_field): The field on the incoming record that will be matched to the regular expression.

Type: field_parameter::string

Default: regex_field

String Subset Parameters

Constant Value (string_constant_value): The value to which the field value will be compared.

Type: string

Default:

Comparison Field (string_field): The field on the incoming record that will be compared to the constant value.

Type: field_parameter::string

Default: string_field

Numerical Subset Parameters

Comparison Field (numerical_field): Numerical field (integer or float) to use for comparison.

Type: field_parameter::any

Compare Options (operator): Option to compare the constant value to the field.

Type: string

Choices: [‘near’, ‘not near’, ‘less than’, ‘less than or equal to’, ‘greater than’, ‘greater than or equal to’]

Comparison Value (numerical_constant_value): Numerical constant (integer or float) to use for comparison.

Type: decimal

Tolerance (tolerance): The tolerance used when performing ‘near’ comparisons.

Type: decimal

Default: 0.01
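
The numerical comparison options can be sketched as below. This is a hedged illustration rather than the floe's implementation: the function name make_comparator is hypothetical, the field value is assumed to be the left-hand operand (for example, "less than" means field value < constant), and "near" is assumed to mean that the absolute difference is within the tolerance, boundary included.

```python
import operator


def make_comparator(op_name, constant, tolerance=0.01):
    """Return a predicate that tests a field value against `constant`."""
    ordered = {
        "less than": operator.lt,
        "less than or equal to": operator.le,
        "greater than": operator.gt,
        "greater than or equal to": operator.ge,
    }
    if op_name == "near":
        # Assumption: within tolerance, inclusive of the boundary.
        return lambda value: abs(value - constant) <= tolerance
    if op_name == "not near":
        return lambda value: abs(value - constant) > tolerance
    if op_name in ordered:
        compare = ordered[op_name]
        # Assumption: the field value is the left-hand operand.
        return lambda value: compare(value, constant)
    raise ValueError(f"unknown operator: {op_name!r}")


is_near_five = make_comparator("near", 5.0)  # default tolerance 0.01
below_ten = make_comparator("less than", 10)
```

Because there is no plain "equal" choice, an exact comparison of floating-point values would be expressed as "near" with a small tolerance.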