Dataset Deduplication – Merge

This Floe merges two datasets based on a user-defined ID field. Records with matching unique identifiers are combined into a single record and emitted to the merged dataset. The Floe supports the following join types, which determine how to handle a unique identifier that is present in one dataset but missing from the other.

Dataset A (ID, COLOR) (A, red) (C, blue)

Dataset B (ID, PRICE) (A, 1.0) (B, 2.0)

A∪B – full outer join

Joins records with matching unique identifiers by merging their data fields. For unmatched records, empty data field(s) are added. All records are sent to the merged output dataset.

Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (C, blue, ) (B, , 2.0)
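The full outer join above can be sketched in plain Python. This is only an illustration of the join semantics, not the Floe's implementation: the function name `full_outer_join`, the list-of-dicts representation, and the use of `None` for an empty field are all assumptions.

```python
# Hypothetical in-memory stand-ins for the two input datasets.
dataset_a = [{"ID": "A", "COLOR": "red"}, {"ID": "C", "COLOR": "blue"}]
dataset_b = [{"ID": "A", "PRICE": 1.0}, {"ID": "B", "PRICE": 2.0}]


def full_outer_join(a_records, b_records, key="ID"):
    """Merge records on `key`; fields missing on one side are set to None."""
    a_by_id = {rec[key]: rec for rec in a_records}
    b_by_id = {rec[key]: rec for rec in b_records}
    # Collect all field names so every merged record has the same columns.
    fields = []
    for rec in a_records + b_records:
        for name in rec:
            if name not in fields:
                fields.append(name)
    # Keep A's order first, then IDs that appear only in B.
    ids = list(a_by_id) + [i for i in b_by_id if i not in a_by_id]
    merged = []
    for rec_id in ids:
        combined = {**a_by_id.get(rec_id, {}), **b_by_id.get(rec_id, {})}
        merged.append({name: combined.get(name) for name in fields})
    return merged


merged = full_outer_join(dataset_a, dataset_b)
for record in merged:
    print(record)
```

Every ID from either input appears exactly once in `merged`, and the record for C has no price while the record for B has no color.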

A∩B – inner join

Joins only records whose unique identifiers appear in both datasets. If ‘Write unmatched dataset’ is enabled, all other (unmatched) records are sent to the unmatched output dataset.

Dataset merged (ID, COLOR, PRICE) (A, red, 1.0)

Dataset unmatched (ID, COLOR, PRICE) (C, blue) (B, 2.0)
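As a rough sketch of the inner join and its optional unmatched output, assuming records are held as Python dicts: `inner_join` and its `write_unmatched` flag are hypothetical names that mirror the ‘Write unmatched dataset’ parameter, not the Floe's actual code.

```python
# Hypothetical in-memory stand-ins for the two input datasets.
dataset_a = [{"ID": "A", "COLOR": "red"}, {"ID": "C", "COLOR": "blue"}]
dataset_b = [{"ID": "A", "PRICE": 1.0}, {"ID": "B", "PRICE": 2.0}]


def inner_join(a_records, b_records, key="ID", write_unmatched=False):
    """Return (merged, unmatched); unmatched stays empty unless requested."""
    a_by_id = {rec[key]: rec for rec in a_records}
    b_by_id = {rec[key]: rec for rec in b_records}
    # Only IDs present in both inputs produce a merged record.
    merged = [{**a_by_id[i], **b_by_id[i]} for i in a_by_id if i in b_by_id]
    unmatched = []
    if write_unmatched:
        unmatched += [rec for rec in a_records if rec[key] not in b_by_id]
        unmatched += [rec for rec in b_records if rec[key] not in a_by_id]
    return merged, unmatched


merged, unmatched = inner_join(dataset_a, dataset_b, write_unmatched=True)
print(merged)
print(unmatched)
```

With `write_unmatched=False`, the records for C and B are simply dropped in this sketch.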

A – left join

Keeps all records from dataset A and joins them with the records from dataset B that have matching unique identifiers. Empty data field(s) are added for unmatched records of dataset A. If ‘Write unmatched dataset’ is enabled, unmatched records from input dataset B are sent to the unmatched output dataset.

Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (C, blue, )

Dataset unmatched (ID, COLOR, PRICE) (B, 2.0)
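A minimal sketch of the left join, under the same assumptions as the description above (dict records, `None` standing in for an empty field). The helper name `left_join` and its `write_unmatched` flag are hypothetical.

```python
# Hypothetical in-memory stand-ins for the two input datasets.
dataset_a = [{"ID": "A", "COLOR": "red"}, {"ID": "C", "COLOR": "blue"}]
dataset_b = [{"ID": "A", "PRICE": 1.0}, {"ID": "B", "PRICE": 2.0}]


def left_join(a_records, b_records, key="ID", write_unmatched=False):
    """Keep every A record; pad with None where B has no matching ID."""
    b_by_id = {rec[key]: rec for rec in b_records}
    a_ids = {rec[key] for rec in a_records}
    # Fields contributed by B, used to pad A records that have no partner.
    b_fields = []
    for rec in b_records:
        for name in rec:
            if name != key and name not in b_fields:
                b_fields.append(name)
    merged = []
    for rec in a_records:
        padding = {name: None for name in b_fields}
        merged.append({**rec, **padding, **b_by_id.get(rec[key], {})})
    unmatched = []
    if write_unmatched:
        unmatched = [rec for rec in b_records if rec[key] not in a_ids]
    return merged, unmatched


merged, unmatched = left_join(dataset_a, dataset_b, write_unmatched=True)
print(merged)
print(unmatched)
```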

B – right join

Keeps all records from dataset B and joins them with the records from dataset A that have matching unique identifiers. Empty data field(s) are added for unmatched records of dataset B. If ‘Write unmatched dataset’ is enabled, unmatched records from input dataset A are sent to the unmatched output dataset.

Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (B, , 2.0)

Dataset unmatched (ID, COLOR, PRICE) (C, blue)
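The right join mirrors the left join with the roles of A and B swapped. The sketch below illustrates that symmetry with one generic helper; `join_keeping_all` is a hypothetical name, and the dict-based representation is an assumption rather than the Floe's internals.

```python
# Hypothetical in-memory stand-ins for the two input datasets.
dataset_a = [{"ID": "A", "COLOR": "red"}, {"ID": "C", "COLOR": "blue"}]
dataset_b = [{"ID": "A", "PRICE": 1.0}, {"ID": "B", "PRICE": 2.0}]


def join_keeping_all(primary, secondary, key="ID"):
    """Keep every `primary` record; attach the matching `secondary` record if any."""
    secondary_by_id = {rec[key]: rec for rec in secondary}
    primary_ids = {rec[key] for rec in primary}
    # Non-key fields that only the secondary side can supply.
    extra_fields = sorted({name for rec in secondary for name in rec} - {key})
    merged = [
        {**dict.fromkeys(extra_fields), **secondary_by_id.get(rec[key], {}), **rec}
        for rec in primary
    ]
    unmatched = [rec for rec in secondary if rec[key] not in primary_ids]
    return merged, unmatched


# Right join: B is kept in full, A supplies the extra fields.
merged, unmatched = join_keeping_all(dataset_b, dataset_a)
print(merged)
print(unmatched)
```

Calling `join_keeping_all(dataset_a, dataset_b)` instead gives the left-join result.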

Extra Required Parameters

  • Merged dataset (dataset_out) : Output dataset of successful calculations
    Default: merged
  • Write unmatched dataset (boolean) : If off, then the ‘unmatched’ dataset is not generated.
    Default: False
  • Input Dataset ‘A’ (data_source) : Dataset to merge
  • Failed Dataset (dataset_out) : Output dataset of failed calculations
    Default: Failed Output for Dataset Deduplication – Merge
  • Input Dataset ‘B’ (data_source) : Dataset to merge
  • Duplicate handling (string) : This parameter specifies how to handle duplicate fields when merging records.
    Default: Rename
    Choices: Ignore, Rename
  • Join type (string) : This parameter determines how the records are combined and what happens to unmatched records. See the floe description for explanation.
    Default: A∪B
    Choices: A, B, A∩B, A∪B
  • The field to merge on (string) : Field containing a unique identifier for records from dataset ‘A’.
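
The ‘Duplicate handling’ parameter applies when both records carry a field with the same name. The description does not spell out the exact behaviour, so the following sketch is one plausible interpretation only: it assumes ‘Rename’ keeps both values by storing B's value under a suffixed name, and ‘Ignore’ keeps A's value and skips B's. The function `combine_fields`, the `_B` suffix, and the sample records are all hypothetical.

```python
def combine_fields(rec_a, rec_b, key="ID", duplicate_handling="Rename"):
    """Combine two records that share the same `key` value."""
    combined = dict(rec_a)
    for name, value in rec_b.items():
        if name == key:
            continue
        if name not in combined:
            combined[name] = value
        elif duplicate_handling == "Rename":
            # Assumed naming scheme; the real suffix is not documented here.
            combined[f"{name}_B"] = value
        elif duplicate_handling == "Ignore":
            # Assumed behaviour: keep A's value and skip B's duplicate field.
            continue
        else:
            raise ValueError(f"unknown duplicate handling: {duplicate_handling}")
    return combined


# Hypothetical records in which both sides carry a PRICE field.
rec_a = {"ID": "A", "COLOR": "red", "PRICE": 0.5}
rec_b = {"ID": "A", "PRICE": 1.0}

renamed = combine_fields(rec_a, rec_b, duplicate_handling="Rename")
ignored = combine_fields(rec_a, rec_b, duplicate_handling="Ignore")
print(renamed)
print(ignored)
```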