Deduplicate Records Based on Integer, Float, String, or Mol Field

This cube deduplicates records based on a string field, a numerical field, and/or a mol field. If multiple fields are specified, then deduplication is based on all of the fields specified.

The reference field for deduplication is specified by either the primary molecule field or exactly one of the following parameters: Float Deduplication Field, Integer Deduplication Field, or String Deduplication Field.

If more than one deduplication parameter is specified, deduplication is based only on the first specified field, in the following order: molecule field, float field, integer field, string field. To deduplicate on multiple fields independently, use multiple deduplication cubes in series. Note that if you have multiple input datasets, you must select the same deduplication field in each.
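The precedence rule above can be sketched in a few lines of Python. This is only an illustration of the documented order, not the cube's implementation: the function `select_dedup_field` and the use of a plain dictionary for parameter values are assumptions made for the example, while the parameter names are the ones listed under Field parameters.

```python
from typing import Optional

# Documented precedence: molecule, then float, then integer, then string.
PRECEDENCE = (
    "mol_dedup_field",
    "float_dedup_field",
    "int_dedup_field",
    "string_dedup_field",
)


def select_dedup_field(params: dict) -> Optional[str]:
    """Return the first specified deduplication parameter, or None (hypothetical helper)."""
    for name in PRECEDENCE:
        if params.get(name) is not None:
            return name
    return None


# Both a string and a float field are set; the float field takes precedence.
chosen = select_dedup_field({"string_dedup_field": "Title", "float_dedup_field": "Score"})
print(chosen)
```

Under this reading, setting a second field has no effect on the result, which is why chaining several cubes is the suggested way to deduplicate on more than one field.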

If a float or integer deduplication field is specified, then the Maximum Absolute Difference parameter can be used to control the maximum absolute difference allowed between numerical values for them to still be treated as duplicates.
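As a rough illustration of how a maximum absolute difference might behave, the sketch below treats a value as a duplicate when it lies within the tolerance of any value already accepted as unique. That comparison strategy and the function name `split_numeric_duplicates` are assumptions; the cube's actual comparison logic is not described here.

```python
def split_numeric_duplicates(values, max_absolute_difference=0.0):
    """Split values into (unique, duplicates) using an absolute tolerance.

    Assumption: each value is compared against the values kept as unique so far.
    """
    unique, duplicates = [], []
    for value in values:
        if any(abs(value - kept) <= max_absolute_difference for kept in unique):
            duplicates.append(value)
        else:
            unique.append(value)
    return unique, duplicates


# With a tolerance of 2, the value 12 is treated as a duplicate of 10.
unique, duplicates = split_numeric_duplicates([10, 12, 20, 10], max_absolute_difference=2)
print(unique, duplicates)
```

With the default of 0.0, only exactly equal values would be treated as duplicates in this sketch.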

If a mol field is specified, the following parameters can also be specified:

  • Use PKA Normalization: Default False. If set to True, normalize the pKa of molecules in the primary mol field before deduplication of that field.

  • Use Tautomer Normalization: Default False. If set to True, normalize tautomers in the primary mol field before deduplication of that field.

The output ports are:

  • unique: records with reference field(s) that have not been seen before

  • duplicate: records with reference field(s) that have been seen before

  • missing: any record with no reference field
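The routing to the three ports can be pictured with the following sketch. It assumes records are plain dictionaries and that the reference value is hashable; real records and ports are handled by the platform, so `route_records` is a hypothetical stand-in that only mirrors the port descriptions above.

```python
def route_records(records, field):
    """Assign each record to 'unique', 'duplicate', or 'missing' (illustrative only)."""
    seen = set()
    ports = {"unique": [], "duplicate": [], "missing": []}
    for record in records:
        value = record.get(field)
        if value is None:
            # No reference field on this record.
            ports["missing"].append(record)
        elif value in seen:
            # Reference value has been seen before.
            ports["duplicate"].append(record)
        else:
            seen.add(value)
            ports["unique"].append(record)
    return ports


records = [{"Title": "a"}, {"Title": "b"}, {"Title": "a"}, {"Other": 1}]
ports = route_records(records, "Title")
print({name: len(items) for name, items in ports.items()})
```

In this sketch the first record carrying a given value goes to unique, and every later record with the same value goes to duplicate.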

Calculation Parameters

  • CPUs (cpu_count) type: integer: The number of CPUs to run this cube with
    Default: 1, Min: 1, Max: 128
  • Cube Metrics (cube_metrics) type: string: Set of metrics to be collected
    Choices: cpu, disk, memory, network
  • Deduplication Type (dedup_type) type: string: The type of field on which the cube should carry out deduplication.
    Choices: string, molecule, integer, float
  • Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 5120.0, Min: 128.0, Max: 8589934592
  • GPUs (gpu_count) type: integer: The number of GPUs to run this cube with
    Default: 0, Max: 16
  • Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)
    Default: “”
  • Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
  • Max absolute difference (max_absolute_difference) type: decimal: Maximum absolute difference allowed for numeric values to qualify as duplicates
    Default: 0.0
  • Max Backlog Wait (max_backlog_wait) type: integer: The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated
    Default: 600, Min: 300
  • Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 1800, Min: 256.0, Max: 8589934592
  • Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds
    Default: 60, Min: 1, Max: 300
    Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300
  • Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU
    Default: 32
  • Use PKA Normalization (pka_normalization) type: boolean: If set to True, molecules will be pKa normalized before deduplication.
    Default: False
  • Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address
    Default: 64
  • Spot policy (spot_policy) type: string: Control cube placement on spot market instances
    Default: Prohibited
    Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
  • Use Tautomer Normalization (tautomer_normalization) type: boolean: If set to True, molecules will be tautomer normalized before deduplication.
    Default: False

Field parameters

  • Float Deduplication Field (float_dedup_field) type: Field Type: Float: The float field on the incoming record on which deduplication is based, if specified.
  • Integer Deduplication Field (int_dedup_field) type: Field Type: Int: The integer field on the incoming record on which deduplication is based, if specified.
  • Molecule Deduplication Field (mol_dedup_field) type: Field Type: Chem.Mol: The molecule field on the incoming record on which deduplication is based, if specified.
  • String Deduplication Field (string_dedup_field) type: Field Type: String: The string field on the incoming record on which deduplication is based, if specified.

Molecule Deduplication Parameters

These parameters control deduplication using a molecule field.
  • Use PKA Normalization (pka_normalization) type: boolean: If set to True, molecules will be pKa normalized before deduplication.
    Default: False
  • Use Tautomer Normalization (tautomer_normalization) type: boolean: If set to True, molecules will be tautomer normalized before deduplication.
    Default: False

Numeric Deduplication Parameters

These parameters control deduplication using a numeric field.
  • Max absolute difference (max_absolute_difference) type: decimal: Maximum absolute difference allowed for numeric values to qualify as duplicates
    Default: 0.0

Hardware Parameters

Machine hardware requirements
  • Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 1800, Min: 256.0, Max: 8589934592
  • Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address
    Default: 64
  • Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU
    Default: 32
  • Max Backlog Wait (max_backlog_wait) type: integer: The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated
    Default: 600, Min: 300
  • Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 5120.0, Min: 128.0, Max: 8589934592
  • GPUs (gpu_count) type: integer: The number of GPUs to run this cube with
    Default: 0, Max: 16
  • CPUs (cpu_count) type: integer: The number of CPUs to run this cube with
    Default: 1, Min: 1, Max: 128
  • Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
  • Spot policy (spot_policy) type: string: Control cube placement on spot market instances
    Default: Prohibited
    Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
  • Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)
    Default: “”
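
Before submitting a job, it can be handy to check requested hardware values against the documented limits. The sketch below is a hypothetical helper, not part of any platform API: the `LIMITS` table simply copies the Min and Max values listed above, and `out_of_range` is a name chosen for this example.

```python
# Limits copied from the Hardware Parameters list; None means no documented bound.
LIMITS = {
    "cpu_count": (1, 128),
    "gpu_count": (None, 16),
    "memory_mb": (256.0, 8589934592),
    "disk_space": (128.0, 8589934592),
    "max_backlog_wait": (300, None),
}


def out_of_range(settings):
    """Return a message for each setting outside its documented range."""
    problems = []
    for name, value in settings.items():
        low, high = LIMITS.get(name, (None, None))
        if low is not None and value < low:
            problems.append(f"{name}={value} is below the minimum {low}")
        if high is not None and value > high:
            problems.append(f"{name}={value} is above the maximum {high}")
    return problems


print(out_of_range({"cpu_count": 1, "memory_mb": 1800, "gpu_count": 32}))
```

Parameters without an entry in the table, such as instance_type, are passed through unchecked in this sketch.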

Metrics Parameters

Cube Metric Parameters
  • Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds
    Default: 60, Min: 1, Max: 300
    Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300
  • Cube Metrics (cube_metrics) type: string: Set of metrics to be collected
    Choices: cpu, disk, memory, network