Deduplicate Records Based on Integer, Float, String, or Mol Field
This cube deduplicates records based on a string field, a numerical field, and/or a mol field. If multiple fields are specified, then deduplication is based on all of the fields specified.
The reference field for deduplication is specified by either the primary molecule field or exactly one of the following parameters: the Float Deduplication Field parameter, the Integer Deduplication Field parameter, or the String Deduplication Field parameter.
If more than one deduplication parameter is specified, deduplication will be based on only the first deduplication field specified, in the following order: molecule field, float field, integer field, string field. To deduplicate based on multiple fields, independently, use multiple deduplication cubes in series. Note that if you have multiple input datasets, you must select the same deduplication field in each.
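The precedence order described above can be sketched in a few lines of Python. This is only an illustration, not the cube's implementation: `select_dedup_field` is a hypothetical helper, and passing the parameters as a plain dictionary is an assumption made for the example. The parameter names are the ones listed under the field parameters.

```python
from typing import Dict, Optional, Tuple

# Precedence as described in the text: molecule, float, integer, string.
PRECEDENCE = (
    "mol_dedup_field",
    "float_dedup_field",
    "int_dedup_field",
    "string_dedup_field",
)


def select_dedup_field(params: Dict[str, Optional[str]]) -> Optional[Tuple[str, str]]:
    """Return (parameter name, field name) for the first parameter that is set.

    Hypothetical helper: `params` maps parameter names to the field name
    chosen by the user, or to None/"" when the parameter is left unset.
    """
    for name in PRECEDENCE:
        value = params.get(name)
        if value:
            return name, value
    return None  # no deduplication field was specified


# Both a string and a float field are set; the float field takes precedence.
chosen = select_dedup_field(
    {"string_dedup_field": "Name", "float_dedup_field": "Score"}
)
```

In this sketch the string field is silently ignored once a float field is set, which is why chaining several deduplication cubes is the suggested way to deduplicate on more than one field.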
If a float or integer deduplication field is specified, then the Maximum Absolute Difference parameter can be used to control the maximum absolute difference allowed between numerical values for them to still be treated as duplicates.
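As a rough illustration of how a maximum absolute difference can work, the sketch below treats a value as a duplicate when it lies within the tolerance of any value already accepted as unique. That comparison rule is an assumption made for this example; the cube's actual comparison strategy is not documented here, and `split_numeric_duplicates` is a hypothetical helper.

```python
from typing import Iterable, List, Tuple


def split_numeric_duplicates(
    values: Iterable[float], max_absolute_difference: float = 0.0
) -> Tuple[List[float], List[float]]:
    """Split values into (unique, duplicate) lists.

    Assumption: a value is a duplicate if it is within
    max_absolute_difference of any value already accepted as unique.
    """
    unique: List[float] = []
    duplicate: List[float] = []
    for value in values:
        if any(abs(value - seen) <= max_absolute_difference for seen in unique):
            duplicate.append(value)
        else:
            unique.append(value)
    return unique, duplicate


values = [1.0, 1.05, 2.0, 1.0]

# With the default tolerance of 0.0, only exact repeats are duplicates.
exact_unique, exact_duplicate = split_numeric_duplicates(values)

# With a tolerance of 0.1, 1.05 is also treated as a duplicate of 1.0.
tol_unique, tol_duplicate = split_numeric_duplicates(values, 0.1)
```

With the default of 0.0 the check reduces to exact equality, which matches the documented default for the Max absolute difference parameter.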
If a mol field is specified, the following parameters can also be specified:
Use PKA Normalization: Default False. If set to True, normalize the pKa of molecules in the primary mol field before deduplication of that field.
Use Tautomer Normalization: Default False. If set to True, normalize tautomers in the primary mol field before deduplication of that field.
The output ports are:
unique: records with reference field(s) that have not been seen before
duplicate: records with reference field(s) that have been seen before
missing: any record with no reference field
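The routing to the three output ports can be pictured with the following minimal sketch. It assumes records are plain dictionaries and that a missing reference field shows up as an absent key or a `None` value; both are simplifications for illustration, and `route_records` is a hypothetical helper rather than part of the cube.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def route_records(records: Iterable[Record], field: str) -> Dict[str, List[Record]]:
    """Sort records into 'unique', 'duplicate', and 'missing' lists.

    Assumption: the reference value is hashable (for example a string or
    an integer) and is compared by exact equality.
    """
    ports: Dict[str, List[Record]] = {"unique": [], "duplicate": [], "missing": []}
    seen = set()
    for record in records:
        value = record.get(field)
        if value is None:
            ports["missing"].append(record)    # no reference field
        elif value in seen:
            ports["duplicate"].append(record)  # value seen before
        else:
            seen.add(value)
            ports["unique"].append(record)     # first occurrence
    return ports


records = [
    {"id": 1, "name": "aspirin"},
    {"id": 2, "name": "caffeine"},
    {"id": 3, "name": "aspirin"},
    {"id": 4},
]
ports = route_records(records, "name")
```

Here records 1 and 2 go to unique, record 3 to duplicate, and record 4 to missing. The first record carrying a given value is the one that counts as unique in this sketch.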
Calculation Parameters
- CPUs (cpu_count) type: integer: The number of CPUs to run this cube with. Default: 1, Min: 1, Max: 128
- Cube Metrics (cube_metrics) type: string: Set of metrics to be collected. Choices: cpu, disk, memory, network
- Deduplication Type (dedup_type) type: string: The type of field on which the cube should carry out deduplication. Choices: string, molecule, integer, float
- Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 5120.0, Min: 128.0, Max: 8589934592
- GPUs (gpu_count) type: integer: The number of GPUs to run this cube with. Default: 0, Max: 16
- Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated). Default: “”
- Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
- Max absolute difference (max_absolute_difference) type: decimal: Maximum absolute difference allowed for numeric values to qualify as duplicates. Default: 0.0
- Max Backlog Wait (max_backlog_wait) type: integer: The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated. Default: 600, Min: 300
- Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 1800, Min: 256.0, Max: 8589934592
- Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds. Default: 60. Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300. Min: 1, Max: 300
- Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU. Default: 32
- Use PKA Normalization (pka_normalization) type: boolean: If set to True, molecules will be pKa normalized before deduplication. Default: False
- Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address. Default: 64
- Spot policy (spot_policy) type: string: Control cube placement on spot market instances. Default: Prohibited. Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
- Use Tautomer Normalization (tautomer_normalization) type: boolean: If set to True, molecules will be tautomer normalized before deduplication. Default: False
Field Parameters
- Float Deduplication Field (float_dedup_field) type: Field Type: Float: The float field on the incoming record on which deduplication is based, if specified.
- Integer Deduplication Field (int_dedup_field) type: Field Type: Int: The integer field on the incoming record on which deduplication is based, if specified.
- Molecule Deduplication Field (mol_dedup_field) type: Field Type: Chem.Mol: The molecule field on the incoming record on which deduplication is based, if specified.
- String Deduplication Field (string_dedup_field) type: Field Type: String: The string field on the incoming record on which deduplication is based, if specified.
Molecule Deduplication Parameters
- These parameters control deduplication using a molecule field.
- Use PKA Normalization (pka_normalization) type: boolean: If set to True, molecules will be pKa normalized before deduplication. Default: False
- Use Tautomer Normalization (tautomer_normalization) type: boolean: If set to True, molecules will be tautomer normalized before deduplication. Default: False
Numeric Deduplication Parameters
- These parameters control deduplication using a numeric field.
- Max absolute difference (max_absolute_difference) type: decimal: Maximum absolute difference allowed for numeric values to qualify as duplicates. Default: 0.0
Hardware Parameters
- Machine hardware requirements
- Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 1800, Min: 256.0, Max: 8589934592
- Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address. Default: 64
- Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU. Default: 32
- Max Backlog Wait (max_backlog_wait) type: integer: The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated. Default: 600, Min: 300
- Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 5120.0, Min: 128.0, Max: 8589934592
- GPUs (gpu_count) type: integer: The number of GPUs to run this cube with. Default: 0, Max: 16
- CPUs (cpu_count) type: integer: The number of CPUs to run this cube with. Default: 1, Min: 1, Max: 128
- Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
- Spot policy (spot_policy) type: string: Control cube placement on spot market instances. Default: Prohibited. Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
- Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated). Default: “”
Metrics Parameters
- Cube Metric Parameters
- Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds. Default: 60. Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300. Min: 1, Max: 300
- Cube Metrics (cube_metrics) type: string: Set of metrics to be collected. Choices: cpu, disk, memory, network