Dataset Deduplication – Based on Molecule, String, Integer, or Float Field

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

  • Task-based/Data Science/Filtering

  • Solution-based/Virtual-screening/Analysis/Filtering

  • Role-based/Medicinal Chemist

Description

This floe deduplicates a dataset based on a user-defined molecule, string, float, or integer field. Only one type of deduplication can be carried out at a time, and it must be selected with the Deduplication Type parameter. If multiple datasets are used, each must contain the deduplication field under the same name and with the same type.
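The three-way split implied by the outputs below (unique, duplicate, and missing records) can be illustrated with a minimal plain-Python sketch. This is not the floe's implementation: the function name `split_by_field`, the use of dictionaries as records, and the choice to keep the first occurrence of each value are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def split_by_field(
    records: Iterable[Record], field: str
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Partition records into (unique, duplicate, missing) by the value of `field`.

    Assumption: the first record seen with a given value is treated as the
    unique one; later records with the same value are treated as duplicates.
    """
    seen = set()
    unique: List[Record] = []
    duplicate: List[Record] = []
    missing: List[Record] = []
    for record in records:
        value = record.get(field)
        if value is None:
            # The record does not carry the deduplication field at all.
            missing.append(record)
        elif value in seen:
            duplicate.append(record)
        else:
            seen.add(value)
            unique.append(record)
    return unique, duplicate, missing


# Hypothetical records with a string field called "name".
records = [
    {"name": "aspirin", "id": 1},
    {"name": "caffeine", "id": 2},
    {"name": "aspirin", "id": 3},
    {"id": 4},
]
unique, duplicate, missing = split_by_field(records, "name")
```

Here records 1 and 2 land in `unique`, record 3 in `duplicate`, and record 4 in `missing` because it has no `name` field.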

Promoted Parameters

Title in user interface (promoted name)

Outputs

Output Dataset for Unique Records (unique): Output dataset to write to

  • Required

  • Type: dataset_out

  • Default: unique

Output Dataset for Duplicate Records (duplicate): Output dataset to write to

  • Required

  • Type: dataset_out

  • Default: duplicate_SMILES

Output Dataset for Records Missing SMILES (missing): Output dataset to write to

  • Required

  • Type: dataset_out

  • Default: missing_SMILES

Overall Deduplication

Deduplication Type (Deduplication Type): The type of field on which to deduplicate.

  • Required

  • Type: string

  • Default: molecule

  • Choices: ['string', 'molecule', 'integer', 'float']

Molecule deduplication

Molecule Deduplication Field (Molecule Deduplication Field): The molecule field on the incoming record on which deduplication is based, if specified.

  • Type: field_parameter::mol

Use Tautomer Normalization for Mol Deduplication (Tautomer Normalization): If set to True, molecules will be tautomer normalized before deduplication.

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Use pKa Normalization for Mol Deduplication (Pka Normalization): If set to True, molecules will be pKa normalized before deduplication.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Numerical field deduplication

Int Deduplication Field (Int Deduplication Field): The integer field on the incoming record on which deduplication is based, if specified.

  • Type: field_parameter::int

Float Deduplication Field (Float Deduplication Field): The float field on the incoming record on which deduplication is based, if specified.

  • Type: field_parameter::float

Max Absolute Difference between Duplicate Float or Int Values (Max Absolute Difference): Maximum absolute difference allowed for numeric values to qualify as duplicates.

  • Type: decimal

  • Default: 0.0
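As a rough illustration of how a maximum absolute difference might decide whether two numeric values count as duplicates, the sketch below compares each value against the values already kept. The function name `split_by_tolerance` and the rule of comparing only against previously kept values are assumptions; the floe's exact comparison strategy is not described here.

```python
from typing import List, Sequence, Tuple


def split_by_tolerance(
    values: Sequence[float], max_abs_diff: float = 0.0
) -> Tuple[List[float], List[float]]:
    """Split values into (kept, duplicates) using an absolute tolerance.

    Assumption: a value is a duplicate if it lies within `max_abs_diff`
    of any value that has already been kept.
    """
    kept: List[float] = []
    duplicates: List[float] = []
    for value in values:
        if any(abs(value - k) <= max_abs_diff for k in kept):
            duplicates.append(value)
        else:
            kept.append(value)
    return kept, duplicates


# With a tolerance of 0.05, 1.04 is treated as a duplicate of 1.00.
kept, duplicates = split_by_tolerance([1.00, 1.04, 1.20, 1.11], max_abs_diff=0.05)

# With the default of 0.0, only exactly equal values are duplicates.
exact_kept, exact_duplicates = split_by_tolerance([1, 1, 2])
```

With the default of 0.0, only exactly equal values are flagged, which matches the intuitive meaning of deduplication on an integer field.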

String field deduplication

String Deduplication Field (String Deduplication Field): The string field on the incoming record on which deduplication is based, if specified.

  • Type: field_parameter::string