Dataset Subsetting Based on Dataset, Numerical, String, or Regex Field

This Floe subsets an input dataset according to the type selected in the Subset Type parameter. It performs only one type of subsetting at a time, determined by the subset type chosen.

If the string subset type is chosen, records whose value in the specified string_field matches the string_constant_value parameter are emitted to the matched dataset.
If the regex subset type is chosen, records whose value in the specified regex_field matches the regular expression given in the regex parameter are emitted to the matched dataset.
If the numerical subset type is chosen, records whose value in the specified numerical_field satisfies the chosen numerical relationship with the numerical_constant_value parameter are emitted to the matched dataset. Integer or float values can be used for both the constant and the field values.
If the dataset subset type is chosen, this Floe subsets the ‘input’ dataset based on whether its molecules exist in the ‘reference’ dataset. Molecule equivalence is determined by canonical SMILES matching. With this option, the Floe has to generate and cache the SMILES representations of all molecules in the ‘reference’ dataset, so this option is not suitable for a very large ‘reference’ dataset.
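
The four subset types can be pictured as four different tests applied to a record's field value. The following is a minimal Python sketch of that idea, not the Floe's implementation. All function names, the RELATIONSHIPS table, and the canonicalize argument are hypothetical. It also assumes that string matching means exact, case-sensitive equality and that regex matching may occur anywhere in the value; the Floe's actual behavior on both points is not stated here.

```python
import operator
import re

# Hypothetical relationship names; the choices the Floe actually offers
# for the numerical relationship are not listed in this document.
RELATIONSHIPS = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
}


def matches_string(value, constant):
    """String type: assumed to be exact, case-sensitive equality."""
    return value == constant


def matches_regex(value, pattern):
    """Regex type: assumed to match anywhere in the value (re.search)."""
    return re.search(pattern, value) is not None


def matches_numerical(value, relationship, constant):
    """Numerical type: compare an int or float field value to a constant."""
    return RELATIONSHIPS[relationship](value, constant)


def build_reference_cache(reference_smiles, canonicalize):
    """Dataset type, step 1: cache one canonical SMILES per reference molecule.

    The whole set is held in memory, which is why a very large reference
    dataset is a poor fit for this subset type.
    """
    return {canonicalize(smiles) for smiles in reference_smiles}


def matches_dataset(smiles, cache, canonicalize):
    """Dataset type, step 2: membership test against the cached set."""
    return canonicalize(smiles) in cache
```

In this sketch canonicalize stands in for whatever produces canonical SMILES; a real canonicalizer requires a cheminformatics toolkit, so any simple string function passed in its place is only a placeholder for experimentation.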

Output ports:

Records that satisfy the given subset criteria will be sent to the matched dataset.
Records that do not match are sent to the unmatched dataset if the Write unmatched dataset parameter is on.
Records missing the field used for subsetting are sent to the missing dataset if the Write missing dataset parameter is on.
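
The routing between the three output ports can be sketched as follows. This is an illustration under stated assumptions, not the Floe's code: records are modeled as plain Python dictionaries, and route_records, predicate, write_unmatched, and write_missing are hypothetical names that mirror the Write unmatched dataset and Write missing dataset parameters.

```python
def route_records(records, field, predicate,
                  write_unmatched=False, write_missing=False):
    """Split records into matched, unmatched, and missing outputs.

    Optional outputs are only created when their switch is on, mirroring
    the default of False for both "write" parameters.
    """
    outputs = {"matched": []}
    if write_unmatched:
        outputs["unmatched"] = []
    if write_missing:
        outputs["missing"] = []

    for record in records:
        if field not in record:
            # The field used for subsetting is absent from this record.
            if write_missing:
                outputs["missing"].append(record)
        elif predicate(record[field]):
            outputs["matched"].append(record)
        elif write_unmatched:
            outputs["unmatched"].append(record)
    return outputs
```

With both switches left off, records that fail the test or lack the field are simply not written anywhere in this sketch; only the matched output is produced.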

Extra Required Parameters

SMILES Field (Field Type: String) : The name for the SMILES field.

Default: SMILES

SMILES Type (string) : The type of the SMILES generated.

Default: isomeric-canonical

Choices: isomeric-canonical, non-isomeric-canonical, non-canonical

Subset Type (string) : This parameter controls what type of subsetting is performed on the input data

Default: string

Choices: string, regex, dataset, numerical

Output matched dataset (dataset_out) : Output dataset of successful calculations

Default: matched

Input Identifier Field (Field Type: String) : The name for the string data field.

Reference Identifier Field (Field Type: String) : The name for the string data field in the reference records read from the ‘init’ port.

Write missing dataset (boolean) : If off, then the ‘missing’ dataset is not generated.

Default: False

Write unmatched dataset (boolean) : If off, then the ‘unmatched’ dataset is not generated.

Default: False

Failed Dataset (dataset_out) : Output dataset of failed calculations

Default: Failed Output for Dataset Subsetting Based on Dataset, Numerical, String, or Regex Field

Input Dataset (data_source) : Dataset to subset