Merge Records
This serial cube merges records from two ports, using a specified field as the merge key.
The input ports are labeled A and B. Records with matching identifiers are combined into a single record and output on the success port.
Note
This cube is identical to the Join Records cube; only the nomenclature differs.
The ID A field parameter specifies which field in “A” records contains the unique identifier used to merge the records. The ID B field parameter specifies which field in the “B” records contains the identifier. ID B is null by default, meaning the ID A field will be used for both “A” and “B” records. Successfully merged records are emitted to the success port.
The Merge Type parameter determines what happens to records that cannot be merged with a matching record:
- “A” means that unmatched records from the A input port are sent to the success port, while unmatched records from the B input port are sent to the failure port.
- “B” means that unmatched records from the B input port are sent to the success port, while unmatched records from the A input port are sent to the failure port.
- “A∩B” means that only successfully merged records are emitted from the success port, with all unmatched records going to the failure port.
- “A∪B” means that all merged and unmatched records are emitted from the success port, and no records are emitted on the failure port.
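The routing rules for the four merge types can be sketched in plain Python. This is only an illustration of the behavior described above, not the cube's implementation: records are modeled as dictionaries, `merge_records` is a hypothetical helper, and on a field-name clash the sketch simply keeps the “A” value (the Duplicate Fields parameter is not modeled here).

```python
def merge_records(a_records, b_records, id_a, id_b=None, merge_type="A∪B"):
    """Return (success, failure) lists following the documented Merge Type rules.

    Hypothetical helper: records are plain dicts, not Orion record objects.
    """
    id_b = id_b or id_a  # ID B is null by default, so ID A is used for both ports
    # Index the B records by identifier (identifiers are assumed to be unique).
    b_by_key = {rec[id_b]: rec for rec in b_records}
    matched_keys = set()
    success, failure = [], []

    for a in a_records:
        key = a[id_a]
        b = b_by_key.get(key)
        if b is not None:
            matched_keys.add(key)
            merged = dict(b)
            merged.update(a)  # assumption: the A value wins on a name clash
            success.append(merged)
        elif merge_type in ("A", "A∪B"):
            success.append(a)  # unmatched A record is kept
        else:
            failure.append(a)  # "B" or "A∩B": unmatched A record fails

    for key, b in b_by_key.items():
        if key in matched_keys:
            continue
        if merge_type in ("B", "A∪B"):
            success.append(b)  # unmatched B record is kept
        else:
            failure.append(b)  # "A" or "A∩B": unmatched B record fails

    return success, failure


a_port = [{"id": 1, "x": "a1"}, {"id": 2, "x": "a2"}]
b_port = [{"id": 2, "y": "b2"}, {"id": 3, "y": "b3"}]
ok, failed = merge_records(a_port, b_port, "id", merge_type="A∩B")
```

With these inputs and “A∩B”, only the record with `id` 2 is merged; the records with `id` 1 and `id` 3 go to the failure list. Switching to “A∪B” moves both of them to the success list.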
If the cube encounters identically named fields on two records being merged, the fields on the “B” record will be ignored, renamed, or appended, depending on the value of the Duplicate Fields parameter.
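As a rough illustration of the three Duplicate Fields choices, the sketch below merges two dictionary records. The documentation does not state how a renamed field is named or how appended values are stored, so the `_B` suffix, the list used for Append, and the `merge_fields` helper itself are assumptions made for this example only.

```python
def merge_fields(a, b, dup_action="Rename", key=None, suffix="_B"):
    """Combine record b into record a, handling identically named fields.

    Hypothetical helper; the suffix and the list-based Append are assumptions.
    """
    merged = dict(a)
    for name, value in b.items():
        if name == key:
            continue  # the shared merge key is not treated as a duplicate
        if name not in merged:
            merged[name] = value  # no clash: copy the B field as-is
        elif dup_action == "Ignore":
            continue  # keep the A value, drop the B value
        elif dup_action == "Rename":
            merged[name + suffix] = value  # assumed naming scheme
        elif dup_action == "Append":
            existing = merged[name]
            values = existing if isinstance(existing, list) else [existing]
            merged[name] = values + [value]  # assumed representation
        else:
            raise ValueError(f"unknown dup_action: {dup_action!r}")
    return merged


rec_a = {"id": 1, "score": 0.5}
rec_b = {"id": 1, "score": 0.9, "note": "x"}
renamed = merge_fields(rec_a, rec_b, dup_action="Rename", key="id")
```

Here `renamed` keeps the “A” score under `score` and stores the “B” score under the assumed name `score_B`; with `dup_action="Ignore"` the “B” score is dropped instead.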
Calculation Parameters
- CPUs (cpu_count) type: integer: The number of CPUs to run this cube with. Default: 1, Min: 1, Max: 128
- Cube Metrics (cube_metrics) type: string: Set of metrics to be collected. Choices: cpu, disk, memory, network
- Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 5120.0, Min: 128.0, Max: 8589934592
- Duplicate Fields (dup_action) type: string: This parameter specifies how to handle duplicate fields when merging records. Default: Rename; Choices: Ignore, Rename, Append
- GPUs (gpu_count) type: integer: The number of GPUs to run this cube with. Default: 0, Max: 16
- Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated). Default: “”
- Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
- Merge Type (join_type) type: string: This parameter determines what happens to unmatched records. See the cube description for explanation. Default: A∪B; Choices: A, B, A∩B, A∪B
- ID A (left_join_field) type: string: Field containing a unique identifier for records on the ‘A’ port.
- Max Backlog Wait (max_backlog_wait) type: integer: The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated. Default: 600, Min: 300
- Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 1800, Min: 256.0, Max: 8589934592
- Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds. Default: 60; Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300; Min: 1, Max: 300
- Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU. Default: 32
- ID B (right_join_field) type: string: Field containing a unique identifier for records on the ‘B’ port.
- Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address. Default: 64
- Spot policy (spot_policy) type: string: Control cube placement on spot market instances. Default: Prohibited; Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
- Match molecules by title (use_mol_title) type: boolean: If set, and a molecule field is chosen for merging, the molecule titles will be used as unique identifiers. Otherwise, the molecule SMILES will be used. Default: False
Hardware Parameters
- Machine hardware requirements
- Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 1800, Min: 256.0, Max: 8589934592
- Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address. Default: 64
- Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU. Default: 32
- Max Backlog Wait (max_backlog_wait) type: integer: The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated. Default: 600, Min: 300
- Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 5120.0, Min: 128.0, Max: 8589934592
- GPUs (gpu_count) type: integer: The number of GPUs to run this cube with. Default: 0, Max: 16
- CPUs (cpu_count) type: integer: The number of CPUs to run this cube with. Default: 1, Min: 1, Max: 128
- Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
- Spot policy (spot_policy) type: string: Control cube placement on spot market instances. Default: Prohibited; Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
- Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated). Default: “”
Metrics Parameters
- Cube Metric Parameters
- Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds. Default: 60; Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300; Min: 1, Max: 300
- Cube Metrics (cube_metrics) type: string: Set of metrics to be collected. Choices: cpu, disk, memory, network