Dataset Deduplication – Based on Molecule, String, Integer, or Float Field
Category Paths
Follow one of these paths in the Orion user interface to find the floe.
Task-based/Data Science/Filtering
Solution-based/Virtual-screening/Analysis/Filtering
Role-based/Medicinal Chemist
Description
This floe deduplicates a dataset based on a user-defined molecule, string, float, or integer field. Only one type of deduplication can be carried out at a time; select it with the Deduplication Type parameter. If multiple input datasets are used, each must contain the deduplication field with the same name and type.
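The three-way split into unique, duplicate, and missing records can be illustrated with a minimal sketch. This is not the floe's implementation: records are modeled as plain Python dictionaries rather than Orion records, the function name split_by_field is hypothetical, and keeping the first occurrence of each value as the unique record is an assumption.

```python
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def split_by_field(
    records: Iterable[Record], field: str
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Split records into (unique, duplicate, missing) by one field.

    Assumption: the first record seen with a given value is kept as
    unique; later records with the same value count as duplicates.
    Records without the field (or with None) are reported as missing.
    """
    seen = set()
    unique: List[Record] = []
    duplicate: List[Record] = []
    missing: List[Record] = []
    for record in records:
        value = record.get(field)
        if value is None:
            missing.append(record)
        elif value in seen:
            duplicate.append(record)
        else:
            seen.add(value)
            unique.append(record)
    return unique, duplicate, missing


if __name__ == "__main__":
    # Hypothetical example data; the "smiles" field name is illustrative.
    data = [
        {"id": 1, "smiles": "CCO"},
        {"id": 2, "smiles": "CCO"},
        {"id": 3},
        {"id": 4, "smiles": "c1ccccc1"},
    ]
    unique, duplicate, missing = split_by_field(data, "smiles")
    print(len(unique), len(duplicate), len(missing))
```

The sketch compares field values exactly, which corresponds most closely to string or integer deduplication with no tolerance. Molecule deduplication would additionally require a canonical representation of each molecule before comparison.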
Promoted Parameters
Title in user interface (promoted name)
Outputs
Output Dataset for Unique Records (unique): Output dataset to write to
Required
Type: dataset_out
Default: unique
Output Dataset for Duplicate Records (duplicate): Output dataset to write to
Required
Type: dataset_out
Default: duplicate_SMILES
Output Dataset for Records Missing SMILES (missing): Output dataset to write to
Required
Type: dataset_out
Default: missing_SMILES
Overall Deduplication
Deduplication Type (Deduplication Type): The type of field on which to deduplicate.
Required
Type: string
Default: molecule
Choices: [‘string’, ‘molecule’, ‘integer’, ‘float’]
Molecule deduplication
Molecule Deduplication Field (Molecule Deduplication Field): The molecule field on the incoming record on which deduplication is based, if specified.
Type: field_parameter::mol
Use Tautomer Normalization for Mol Deduplication (Tautomer Normalization): If set to True, molecules will be tautomer normalized before deduplication.
Type: boolean
Default: False
Choices: [True, False]
Use pKa Normalization for Mol Deduplication (Pka Normalization): If set to True, molecules will be pKa normalized before deduplication.
Required
Type: boolean
Default: False
Choices: [True, False]
Numerical field deduplication
Int Deduplication Field (Int Deduplication Field): The integer field on the incoming record on which deduplication is based, if specified.
Type: field_parameter::int
Float Deduplication Field (Float Deduplication Field): The float field on the incoming record on which deduplication is based, if specified.
Type: field_parameter::float
Max Absolute Difference between Duplicate Float or Int Values (Max Absolute Difference): Maximum absolute difference allowed for numeric values to qualify as duplicates.
Type: decimal
Default: 0.0
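One possible reading of the Max Absolute Difference parameter is sketched below. It assumes that a value counts as a duplicate when it lies within the tolerance of any value already kept as unique; the floe's actual comparison rule is not documented here and may differ. The function name split_numeric is hypothetical.

```python
from typing import Iterable, List, Tuple


def split_numeric(
    values: Iterable[float], max_abs_diff: float = 0.0
) -> Tuple[List[float], List[float]]:
    """Split numeric values into (kept, duplicates) using a tolerance.

    Assumption: a value is a duplicate if its absolute difference from
    any previously kept value is at most max_abs_diff. With the default
    of 0.0, only exactly equal values are treated as duplicates.
    """
    kept: List[float] = []
    duplicates: List[float] = []
    for value in values:
        if any(abs(value - k) <= max_abs_diff for k in kept):
            duplicates.append(value)
        else:
            kept.append(value)
    return kept, duplicates


if __name__ == "__main__":
    # With a tolerance of 1, 11 is treated as a duplicate of 10.
    print(split_numeric([10, 11, 20, 10], max_abs_diff=1))
    # With the default tolerance, only the repeated 10 is a duplicate.
    print(split_numeric([10, 11, 20, 10]))
```

Under this reading the result depends on record order, because each value is compared only against values kept so far. The linear scan is quadratic in the worst case and is meant for illustration only.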
String field deduplication
String Deduplication Field (String Deduplication Field): The string field on the incoming record on which deduplication is based, if specified.
Type: field_parameter::string