Dataset Subsetting Based on Dataset, Numerical, String, or Regex Field
Category Paths
Follow one of these paths in the Orion user interface to find the floe.
Task-based/Data Science/Filtering
Description
Subset an input dataset either against a reference dataset or by a specified numerical, string, or regex field. The kind of subsetting must be chosen in the Subset Type parameter. The floe performs only one type of subsetting at a time, determined by the subset type chosen.
If the string subset type is chosen, records whose value in the specified string_field matches the string_constant_value parameter are emitted to the matched dataset.
If the regex subset type is chosen, records whose value in the specified regex_field matches the regular expression given in the regex parameter are emitted to the matched dataset.
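The string and regex subset types can be pictured with the minimal sketch below. It is not the floe's implementation: records are modeled as plain dictionaries, the helper names are hypothetical, and two behaviors are assumed rather than documented here, namely that string matching is exact, case-sensitive equality and that the regular expression may match anywhere in the field value.

```python
import re


def string_matches(record, field, constant):
    """Return True or False for a match, or None when the field is missing."""
    if field not in record:
        return None
    # Assumption: exact, case-sensitive equality.
    return record[field] == constant


def regex_matches(record, field, pattern):
    """Return True or False for a match, or None when the field is missing."""
    if field not in record:
        return None
    # Assumption: the pattern may match anywhere in the value (re.search).
    return re.search(pattern, record[field]) is not None


# Illustrative records only; the field names mirror the parameter defaults.
records = [
    {"string_field": "active", "regex_field": "CHEMBL25"},
    {"string_field": "inactive", "regex_field": "ZINC000001"},
    {"other_field": "x"},
]

string_results = [string_matches(r, "string_field", "active") for r in records]
regex_results = [regex_matches(r, "regex_field", r"^CHEMBL\d+$") for r in records]
```

Anchoring the pattern with ^ and $, as in the example, removes any dependence on whether the match is partial or must cover the whole value.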
If the numerical subset type is chosen, records whose value in the specified numerical_field satisfies the chosen numerical relationship (the operator parameter) to the numerical_constant_value parameter are emitted to the matched dataset. Both the constant and the field values may be integers or floats.
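The numerical comparison can be sketched as follows, using the operator choices and the tolerance default listed in the parameter section. The function name is hypothetical, and the treatment of ‘near’ as an absolute difference no larger than the tolerance (boundary included) is an assumption, not documented behavior.

```python
import operator

# Map the documented operator choices to Python comparison functions.
COMPARATORS = {
    "less than": operator.lt,
    "less than or equal to": operator.le,
    "greater than": operator.gt,
    "greater than or equal to": operator.ge,
}


def numerical_matches(value, op_name, constant, tolerance=0.01):
    """Return True when the field value satisfies the chosen relationship."""
    if op_name == "near":
        # Assumption: 'near' means |value - constant| <= tolerance.
        return abs(value - constant) <= tolerance
    if op_name == "not near":
        return abs(value - constant) > tolerance
    return COMPARATORS[op_name](value, constant)


# Integer and float inputs can be mixed freely.
heavy = numerical_matches(512, "greater than", 500.0)
close = numerical_matches(1.005, "near", 1.0)
```

A tolerance-based comparison is used for ‘near’ because testing floats for exact equality is unreliable after arithmetic or file round-trips.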
If the dataset subset type is chosen, this floe subsets the ‘input’ dataset based on whether its molecules exist in the ‘reference’ dataset. Molecule equivalence is determined by canonical SMILES matching. With this option, the floe must generate and cache the SMILES representations of all molecules in the ‘reference’ dataset; therefore, this option is not suitable for a very large reference dataset.
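The caching step explains the size limitation: every reference SMILES must be held in memory before the first input molecule can be checked. The sketch below illustrates the idea under stated assumptions. The function name is hypothetical, molecules are represented as SMILES strings, and the canonicalize argument is a stand-in for a real cheminformatics toolkit call; here it is an identity function, so the example strings are treated as already canonical.

```python
def subset_by_reference(input_mols, reference_mols, canonicalize):
    """Split input molecules by membership in the reference set."""
    # Cache every reference SMILES up front; memory use grows with the
    # size of the reference dataset.
    reference_smiles = {canonicalize(mol) for mol in reference_mols}

    matched, unmatched = [], []
    for mol in input_mols:
        if canonicalize(mol) in reference_smiles:
            matched.append(mol)
        else:
            unmatched.append(mol)
    return matched, unmatched


# Stand-in canonicalizer (assumption): inputs are already canonical SMILES.
def identity(smiles):
    return smiles


matched, unmatched = subset_by_reference(
    ["CCO", "c1ccccc1", "CC(=O)O"],  # input
    ["c1ccccc1", "CCO"],             # reference
    identity,
)
```

Only the reference side is cached; the input side is streamed one molecule at a time, which is why the input dataset can be much larger than the reference.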
Output ports:
Records that satisfy the given subset criteria will be sent to the matched dataset.
Records that do not match will be sent to the unmatched dataset if Write unmatched dataset (switch_unmatched) is on.
Records missing the field for subsetting will be sent to the missing dataset if Write missing dataset (switch_missing) is on.
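The routing of records to the three output ports can be sketched as below. This is an illustration, not the floe's code: the route function and the dictionary-based records are hypothetical, while the port names and the two switches follow the parameters documented on this page. Records bound for a port whose switch is off are assumed to be dropped.

```python
def route(records, field, predicate, switch_unmatched=False, switch_missing=False):
    """Distribute records over the matched, unmatched, and missing ports."""
    outputs = {"matched": []}
    if switch_unmatched:
        outputs["unmatched"] = []
    if switch_missing:
        outputs["missing"] = []

    for record in records:
        if field not in record:
            port = "missing"
        elif predicate(record[field]):
            port = "matched"
        else:
            port = "unmatched"
        # Assumption: a record is discarded when its port is switched off.
        if port in outputs:
            outputs[port].append(record)
    return outputs


# Illustrative records; "mw" is a made-up field name.
records = [{"mw": 100}, {"mw": 600}, {"name": "no weight"}]
result = route(records, "mw", lambda v: v < 500, switch_unmatched=True)
```

With both switches at their default of False, only the matched dataset is produced.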
Promoted Parameters
Title in user interface (promoted name)
Inputs
Input Dataset (in): Dataset to subset.
Required
Type: data_source
Reference Dataset (ref): Reference dataset; only needed when subsetting by dataset.
Type: data_source
Subset Type (Subset Type): This parameter controls what type of subsetting is performed on the input data.
Required
Type: string
Default: string
Choices: [‘string’, ‘regex’, ‘dataset’, ‘numerical’]
Outputs
Output matched dataset (matched): Name of matched output dataset.
Required
Type: dataset_out
Default: matched
Output unmatched dataset (unmatched):
Type: dataset_out
Default: unmatched
Output missing dataset (missing):
Type: dataset_out
Default: missing
Write unmatched dataset (switch_unmatched): If off, then the ‘unmatched’ dataset is not generated.
Required
Type: boolean
Default: False
Choices: [True, False]
Write missing dataset (switch_missing): If off, then the ‘missing’ dataset is not generated.
Required
Type: boolean
Default: False
Choices: [True, False]
Regex Subset Parameters
Regular Expression (regex): The regular expression to which the field value will be compared.
Type: string
Default:
Comparison Field (regex_field): The field on the incoming record that will be matched to the regular expression.
Type: field_parameter::string
Default: regex_field
String Subset Parameters
Constant Value (string_constant_value): The value to which the field value will be compared.
Type: string
Default:
Comparison Field (string_field): The field on the incoming record that will be compared to the constant value.
Type: field_parameter::string
Default: string_field
Numerical Subset Parameters
Comparison Field (numerical_field): Numerical field (integer or float) to use for comparison.
Type: field_parameter::any
Compare Options (operator): Option to compare the constant value to the field.
Type: string
Choices: [‘near’, ‘not near’, ‘less than’, ‘less than or equal to’, ‘greater than’, ‘greater than or equal to’]
Comparison Value (numerical_constant_value): Numerical constant (integer or float) to use for comparison.
Type: decimal
Tolerance (tolerance): The tolerance used when performing ‘near’ comparisons.
Type: decimal
Default: 0.01