Dataset Subsetting Based on Dataset, Numerical, String, or Regex Field

This Floe subsets an input dataset based on the field type chosen in the Subset Type parameter. It performs only one type of subsetting at a time, determined by the subset type chosen.

  • If the string subset type is chosen, records whose value in the specified string_field matches the string_constant_value parameter will be emitted to the matched dataset.

  • If the regex subset type is chosen, records whose value in the specified regex_field matches the regular expression given in the regex parameter will be emitted to the matched dataset.

  • If the numerical subset type is chosen, records whose value in the specified numerical_field satisfies the chosen numerical relationship with the numerical_constant_value parameter will be emitted to the matched dataset. Integer or float values can be used for the constant and field values.

  • If the dataset subset type is chosen, this Floe subsets the ‘input’ dataset based on whether its molecules exist in the ‘reference’ dataset. Molecule equivalence is determined by canonical SMILES matching. With this option, the Floe must generate and cache the SMILES representations of all molecules in the ‘reference’ dataset, so it is not suitable for a very large ‘reference’ dataset.
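
The four subset types can be illustrated with a minimal Python sketch. This is not the Floe's implementation: all function names and the RELATIONS table are hypothetical, string matching is assumed to be exact equality, regex matching is assumed to search anywhere in the value, and the reference SMILES are assumed to be canonical already (generating them requires a cheminformatics toolkit and is not shown).

```python
import operator
import re

# Hypothetical mapping of relationship names to comparison functions;
# the Floe's actual choices for the numerical relationship are not listed here.
RELATIONS = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
}


def string_match(value, constant):
    """String mode: assumed here to be exact equality."""
    return value == constant


def regex_match(value, pattern):
    """Regex mode: assumed here to match anywhere in the value (re.search)."""
    return re.search(pattern, value) is not None


def numerical_match(value, relation, constant):
    """Numerical mode: compare an int or float field value to a constant."""
    return RELATIONS[relation](value, constant)


def build_reference_cache(reference_smiles):
    """Dataset mode: hold every reference SMILES in a set for fast lookup.

    The whole reference set lives in memory, which is why this mode
    scales poorly with very large reference datasets.
    """
    return set(reference_smiles)


def dataset_match(smiles, cache):
    """Dataset mode: a molecule matches if its SMILES string is cached."""
    return smiles in cache


if __name__ == "__main__":
    cache = build_reference_cache(["CCO", "c1ccccc1"])
    print(string_match("aspirin", "aspirin"))        # True
    print(regex_match("CHEMBL25", r"^CHEMBL\d+$"))   # True
    print(numerical_match(3, "<=", 3.5))             # True
    print(dataset_match("CCN", cache))               # False
```

Because the dataset mode compares strings, both datasets would need SMILES generated with the same SMILES Type for the comparison to be meaningful.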

Output ports:

  • Records that satisfy the given subset criteria will be sent to the matched dataset.

  • Records that do not match will be sent to the unmatched dataset if output of this data is chosen.

  • Records missing the field for subsetting will be sent to the missing dataset if output of this data is chosen.
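
The routing described above can be sketched as follows. This is only an illustration under stated assumptions: records are modeled as plain dictionaries, `route_records` and its argument names are hypothetical, and the two boolean flags stand in for the Write unmatched dataset and Write missing dataset parameters.

```python
def route_records(records, field, predicate,
                  write_unmatched=False, write_missing=False):
    """Split records into matched, unmatched, and missing lists.

    Records lacking `field` go to 'missing' and non-matching records go to
    'unmatched', but only when the corresponding flag is switched on.
    """
    out = {"matched": [], "unmatched": [], "missing": []}
    for record in records:
        if field not in record:
            if write_missing:
                out["missing"].append(record)
        elif predicate(record[field]):
            out["matched"].append(record)
        elif write_unmatched:
            out["unmatched"].append(record)
    return out


if __name__ == "__main__":
    records = [
        {"name": "a", "MW": 180.2},
        {"name": "b", "MW": 512.6},
        {"name": "c"},  # no MW field
    ]
    result = route_records(records, "MW", lambda mw: mw < 500,
                           write_unmatched=True, write_missing=True)
    print([r["name"] for r in result["matched"]])    # ['a']
    print([r["name"] for r in result["unmatched"]])  # ['b']
    print([r["name"] for r in result["missing"]])    # ['c']
```

With both flags left at their default of False, only the matched list is populated, mirroring the defaults of the two write parameters.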

Extra Required Parameters

  • SMILES Field (Field Type: String) : The name for the SMILES field.
    Default: SMILES
  • SMILES Type (string) : The type of the SMILES generated.
    Default: isomeric-canonical
    Choices: isomeric-canonical, non-isomeric-canonical, non-canonical
  • Subset Type (string) : Controls which type of subsetting is performed on the input data.
    Default: string
    Choices: string, regex, dataset, numerical
  • Output matched dataset (dataset_out) : Output dataset of successful calculations
    Default: matched
  • Input Identifier Field (Field Type: String) : The name for the string data field.
  • Reference Identifier Field (Field Type: String) : The name for the string data field in the reference records read from the ‘init’ port.
  • Write missing dataset (boolean) : If off, then the ‘missing’ dataset is not generated.
    Default: False
  • Write unmatched dataset (boolean) : If off, then the ‘unmatched’ dataset is not generated.
    Default: False
  • Failed Dataset (dataset_out) : Output dataset of failed calculations
    Default: Failed Output for Dataset Subsetting Based on Dataset, Numerical, String, or Regex Field
  • Input Dataset (data_source) : Dataset to subset