Join Records


This serial cube joins records from two ports, using a specified field as the join key.

The input ports are labeled left and right. Records with matching identifiers are combined into a single record and output on the success port.



Note

This cube is identical to the Merge Records cube; only the naming differs.

The ID Left field parameter specifies which field in the “left” records contains the unique identifier used to join the records. The ID Right field parameter specifies which field in the “right” records contains the identifier. ID Right is null by default, in which case the ID Left field is used for both “left” and “right” records. Successfully joined records are emitted to the success port.
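The key-field fallback described above can be sketched in plain Python. This is an illustration only, not the cube's implementation: `resolve_join_fields` is a hypothetical helper, and the field names `compound_id` and `cid` are made up for the example.

```python
from typing import Optional, Tuple


def resolve_join_fields(id_left: str, id_right: Optional[str] = None) -> Tuple[str, str]:
    """Return the (left, right) key field names.

    Mirrors the documented default: when ID Right is unset (None),
    the ID Left field name is used for both ports.
    """
    return (id_left, id_right if id_right is not None else id_left)


# ID Right left at its default: both ports are keyed on the same field.
same_key = resolve_join_fields("compound_id")

# ID Right set explicitly: the right port is keyed on a different field.
different_keys = resolve_join_fields("compound_id", "cid")
```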

The Join Type parameter determines what happens to records that cannot be joined with a matching record:

  • “left” means that unmatched records from the left input port are sent to the success port, while unmatched records from the right input port are sent to the failure port.

  • “right” means that unmatched records from the right input port are sent to the success port, while unmatched records from the left input port are sent to the failure port.

  • “Inner” means that only successfully joined records are sent to the success port, while all unmatched records are sent to the failure port.

  • “Outer” means that all joined and unmatched records are sent to the success port, and no records are sent to the failure port.
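The routing rules for the four join types can be sketched as follows. This is a simplified model rather than the cube's actual code: records are plain dictionaries, `join_records` is a hypothetical function, key values are assumed to be unique within each input, and on a field-name clash the left value simply wins (in the cube, that case is governed by the Duplicate Fields parameter).

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def join_records(
    left: List[Record],
    right: List[Record],
    key: str,
    join_type: str = "Outer",
) -> Tuple[List[Record], List[Record]]:
    """Return (success, failure) lists following the documented routing rules."""
    # Index the right-hand records by key (assumes keys are unique).
    right_by_key = {rec[key]: rec for rec in right}
    success: List[Record] = []
    failure: List[Record] = []
    matched = set()

    for rec in left:
        other = right_by_key.get(rec[key])
        if other is not None:
            # Matched pair: combine into one record (left values win on clashes).
            matched.add(rec[key])
            success.append({**other, **rec})
        elif join_type in ("left", "Outer"):
            success.append(rec)  # unmatched left record is kept
        else:
            failure.append(rec)  # "right" and "Inner" reject it

    for rec in right:
        if rec[key] in matched:
            continue  # already emitted as part of a joined record
        if join_type in ("right", "Outer"):
            success.append(rec)  # unmatched right record is kept
        else:
            failure.append(rec)  # "left" and "Inner" reject it

    return success, failure


left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}]

left_success, left_failure = join_records(left, right, "id", "left")
inner_success, inner_failure = join_records(left, right, "id", "Inner")
outer_success, outer_failure = join_records(left, right, "id", "Outer")
```

With these inputs, only the record with `id` 2 is joined. The “left” join keeps the unmatched `id` 1 record on success and sends `id` 3 to failure, “Inner” sends both unmatched records to failure, and “Outer” sends nothing to failure.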

If the cube encounters identically named fields on two records being joined, the fields on the “right” record will be ignored, renamed, or appended, depending on the value of the Duplicate Fields parameter.
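One way to picture the three Duplicate Fields choices is the sketch below. The exact behavior of each choice is not spelled out here, so the semantics in the code are assumptions made for illustration: Ignore drops the right value, Rename stores it under a hypothetical `_right` suffix, and Append keeps both values in a list. `merge_fields` is a hypothetical helper, not part of the cube.

```python
from typing import Dict

Record = Dict[str, object]


def merge_fields(left: Record, right: Record, dup_action: str = "Rename") -> Record:
    """Combine two records, resolving identically named fields.

    ASSUMED semantics (illustrative only, not taken from the cube):
      Ignore -> the right value is dropped
      Rename -> the right value is kept under "<name>_right"
      Append -> both values are kept together in a list
    """
    merged = dict(left)
    for name, value in right.items():
        if name not in merged:
            merged[name] = value  # no clash: copy the field as-is
        elif dup_action == "Ignore":
            continue
        elif dup_action == "Rename":
            merged[f"{name}_right"] = value
        elif dup_action == "Append":
            merged[name] = [merged[name], value]
        else:
            raise ValueError(f"unknown dup_action: {dup_action!r}")
    return merged


left = {"id": 1, "score": 0.5}
right = {"score": 0.9, "note": "n"}

ignored = merge_fields(left, right, "Ignore")
renamed = merge_fields(left, right, "Rename")
appended = merge_fields(left, right, "Append")
```

In every case the left record's own `score` value is preserved; the choices differ only in what happens to the right record's clashing value.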

Calculation Parameters

  • CPUs (cpu_count) type: integer: The number of CPUs to run this cube with
    Default: 1, Min: 1, Max: 128
  • Cube Metrics (cube_metrics) type: string: Set of metrics to be collected
    Choices: cpu, disk, memory, network
  • Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 5120.0, Min: 128.0, Max: 8589934592
  • Duplicate Fields (dup_action) type: string: This parameter specifies how to handle duplicate fields when joining records.
    Default: Rename
    Choices: Ignore, Rename, Append
  • GPUs (gpu_count) type: integer: The number of GPUs to run this cube with
    Default: 0, Max: 16
  • Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)
    Default: “”
  • Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
  • Join Type (join_type) type: string: This parameter determines what happens to unmatched records. See the cube description for explanation.
    Default: Outer
    Choices: left, right, Inner, Outer
  • ID Left (left_join_field) type: string: Field containing a unique identifier for records on the ‘left’ port.
  • Max Backlog Wait (max_backlog_wait) type: integer: The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated
    Default: 600, Min: 300
  • Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 1800, Min: 256.0, Max: 8589934592
  • Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds
    Default: 60
    Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300
    Min: 1, Max: 300
  • Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU
    Default: 32
  • ID Right (right_join_field) type: string: Field containing a unique identifier for records on the ‘right’ port.
  • Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address
    Default: 64
  • Spot policy (spot_policy) type: string: Control cube placement on spot market instances
    Default: Prohibited
    Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
  • Match molecules by title (use_mol_title) type: boolean: If set, and a molecule field is chosen for merging, the molecule titles will be used as unique identifiers. Otherwise, the molecule SMILES will be used.
    Default: False

Hardware Parameters

Machine hardware requirements
  • Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 1800, Min: 256.0, Max: 8589934592
  • Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address
    Default: 64
  • Thread limit per CPU (pids_per_cpu_limit) type: integer: The number of threads per CPU
    Default: 32
  • Max Backlog Wait (max_backlog_wait) type: integer: The max time (in seconds) that a cube will be backlogged on a group before being re-evaluated
    Default: 600, Min: 300
  • Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 5120.0, Min: 128.0, Max: 8589934592
  • GPUs (gpu_count) type: integer: The number of GPUs to run this cube with
    Default: 0, Max: 16
  • CPUs (cpu_count) type: integer: The number of CPUs to run this cube with
    Default: 1, Min: 1, Max: 128
  • Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
  • Spot policy (spot_policy) type: string: Control cube placement on spot market instances
    Default: Prohibited
    Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
  • Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)
    Default: “”

Metrics Parameters

Cube Metric Parameters
  • Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds
    Default: 60
    Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300
    Min: 1, Max: 300
  • Cube Metrics (cube_metrics) type: string: Set of metrics to be collected
    Choices: cpu, disk, memory, network