Deduplicate Records Based on Integer, Float, String, or Mol Field
Calculation Parameters
CPUs (cpu_count) type: integer: The number of CPUs to run this cube with. Default: 1, Min: 1, Max: 128
Cube Metrics (cube_metrics) type: string: Set of metrics to be collected. Choices: cpu, disk, memory, network
Deduplication Type (dedup_type) type: string: The type of field on which the cube should carry out deduplication. Choices: string, molecule, integer, float
Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 5120.0, Min: 128.0, Max: 8589934592
GPUs (gpu_count) type: integer: The number of GPUs to run this cube with. Default: 0, Max: 16
Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated). Default: “”
Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
Max absolute difference (max_absolute_difference) type: decimal: Maximum absolute difference allowed for numeric values to qualify as duplicates. Default: 0.0
Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 1800, Min: 256.0, Max: 8589934592
Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds. Default: 60. Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300. Min: 1, Max: 300
Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU. Default: 32
Use PKA Normalization (pka_normalization) type: boolean: If set to True, molecules will be pKa normalized before deduplication. Default: False
Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address. Default: 64
Spot policy (spot_policy) type: string: Control cube placement on spot market instances. Default: Prohibited. Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
Use Tautomer Normalization (tautomer_normalization) type: boolean: If set to True, molecules will be tautomer normalized before deduplication. Default: False
Field parameters
Molecule Deduplication Field (None) type: Field Type: Chem.Mol: The molecule field on the incoming record on which deduplication is based, if specified.
Float Deduplication Field (None) type: Field Type: Float: The float field on the incoming record on which deduplication is based, if specified.
Integer Deduplication Field (None) type: Field Type: Int: The integer field on the incoming record on which deduplication is based, if specified.
String Deduplication Field (None) type: Field Type: String: The string field on the incoming record on which deduplication is based, if specified.
Float Deduplication Field (float_dedup_field) type: Field Type: Float: The float field on the incoming record on which deduplication is based, if specified.
Integer Deduplication Field (int_dedup_field) type: Field Type: Int: The integer field on the incoming record on which deduplication is based, if specified.
Molecule Deduplication Field (mol_dedup_field) type: Field Type: Chem.Mol: The molecule field on the incoming record on which deduplication is based, if specified.
String Deduplication Field (string_dedup_field) type: Field Type: String: The string field on the incoming record on which deduplication is based, if specified.
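As an illustration of what deduplicating on a single field means, here is a minimal sketch in plain Python. It is not the cube's implementation: records are modeled as dictionaries, `dedup_by_field` is a hypothetical helper, and the choices to keep the first occurrence and to pass through records that lack the field are assumptions made for the example.

```python
from typing import Any, Dict, Iterable, Iterator

# Stand-in for an incoming record (assumption: a plain dict keyed by field name).
Record = Dict[str, Any]


def dedup_by_field(records: Iterable[Record], field: str) -> Iterator[Record]:
    """Yield the first record seen for each distinct value of `field`."""
    seen = set()
    for record in records:
        if field not in record:
            # Assumption: records without the field are passed through unchanged.
            yield record
            continue
        key = record[field]
        if key in seen:
            # A record with this value was already emitted, so skip it.
            continue
        seen.add(key)
        yield record


records = [
    {"name": "aspirin", "count": 1},
    {"name": "caffeine", "count": 2},
    {"name": "aspirin", "count": 3},
]
unique = list(dedup_by_field(records, "name"))
```

In this sketch the third record is dropped because its `name` value repeats the first one.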
Molecule Deduplication Parameters
These parameters control deduplication using a molecule field.
Use PKA Normalization (None) type: boolean: If set to True, molecules will be pKa normalized before deduplication. Default: False
Use Tautomer Normalization (None) type: boolean: If set to True, molecules will be tautomer normalized before deduplication. Default: False
Numeric Deduplication Parameters
These parameters control deduplication using a numeric field.
Max absolute difference (None) type: decimal: Maximum absolute difference allowed for numeric values to qualify as duplicates. Default: 0.0
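The effect of a maximum absolute difference can be sketched as follows. This is only one plausible reading, not the cube's documented algorithm: the hypothetical `dedup_numeric` helper compares each value against the values already kept, and treats a difference less than or equal to the threshold as a duplicate. Both of those choices are assumptions.

```python
from typing import Iterable, List


def dedup_numeric(values: Iterable[float],
                  max_absolute_difference: float = 0.0) -> List[float]:
    """Keep a value only if it is not within the tolerance of a kept value."""
    kept: List[float] = []
    for value in values:
        # Assumption: comparison is against previously kept values, inclusive.
        if any(abs(value - k) <= max_absolute_difference for k in kept):
            continue
        kept.append(value)
    return kept


result = dedup_numeric([1.0, 1.05, 1.2, 2.0, 1.95], max_absolute_difference=0.1)
# result == [1.0, 1.2, 2.0]
```

With the default of 0.0, only exactly equal values count as duplicates in this sketch. Because each value is compared with what was kept earlier, the outcome can depend on the order in which values arrive.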
Hardware Parameters
Machine hardware requirements
Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 1800, Min: 256.0, Max: 8589934592
Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address. Default: 64
Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU. Default: 32
Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 5120.0, Min: 128.0, Max: 8589934592
GPUs (gpu_count) type: integer: The number of GPUs to run this cube with. Default: 0, Max: 16
CPUs (cpu_count) type: integer: The number of CPUs to run this cube with. Default: 1, Min: 1, Max: 128
Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
Spot policy (spot_policy) type: string: Control cube placement on spot market instances. Default: Prohibited. Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated). Default: “”
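When preparing hardware overrides, it can help to check them against the documented ranges before submitting a job. The sketch below is a standalone illustration, not part of any platform API: `LIMITS` and `check_hardware` are hypothetical names, and the bounds are copied from the Min and Max values listed above.

```python
# Documented (minimum, maximum) bounds; None means no bound is listed.
LIMITS = {
    "cpu_count": (1, 128),
    "gpu_count": (None, 16),
    "memory_mb": (256.0, 8589934592),
    "disk_space": (128.0, 8589934592),
}


def check_hardware(overrides):
    """Return a list of messages for values outside the documented ranges."""
    problems = []
    for name, value in overrides.items():
        if name not in LIMITS:
            # Parameters without listed bounds are not checked in this sketch.
            continue
        low, high = LIMITS[name]
        if low is not None and value < low:
            problems.append(f"{name}={value} is below the minimum {low}")
        if high is not None and value > high:
            problems.append(f"{name}={value} is above the maximum {high}")
    return problems


problems = check_hardware({"cpu_count": 4, "memory_mb": 128, "gpu_count": 32})
```

Here `memory_mb` falls below its minimum and `gpu_count` exceeds its maximum, so two messages are returned, while `cpu_count` passes.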
Metrics Parameters
Cube Metric Parameters
Metric Period (None) type: decimal: How often to sample metrics, in seconds. Default: 60. Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300. Min: 1, Max: 300
Cube Metrics (None) type: string: Set of metrics to be collected. Choices: cpu, disk, memory, network