Dataset Deduplication – Merge

This Floe merges two datasets based on a user-defined ID field. Records with matching unique identifiers are combined into a single record and emitted to the merged dataset. The Floe supports the following join types, which determine how to handle a unique identifier that is present in one dataset but missing from the other.

Dataset A (ID, COLOR) (A, red) (C, blue)

Dataset B (ID, PRICE) (A, 1.0) (B, 2.0)

A∪B – full outer join

Joins records with matching unique identifiers by merging their data fields. Unmatched records are padded with empty data field(s). All records are sent to the merged output dataset.

Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (C, blue, ) (B, , 2.0)

A∩B – inner join

Joins only records whose unique identifiers appear in both datasets. If ‘Write unmatched dataset’ is on, all other records are sent to the unmatched output dataset.

Dataset merged (ID, COLOR, PRICE) (A, red, 1.0)

Dataset unmatched (ID, COLOR, PRICE) (C, blue) (B, 2.0)

A – left join

Joins all records from dataset A with matching unique identifiers from dataset B. Empty data field(s) are added for unmatched records of dataset A. If ‘Write unmatched dataset’ is on, unmatched records from input dataset B are sent to the unmatched output dataset.

Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (C, blue, )

Dataset unmatched (ID, COLOR, PRICE) (B, 2.0)

B – right join

Joins all records from dataset B with matching unique identifiers from dataset A. Empty data field(s) are added for unmatched records of dataset B. If ‘Write unmatched dataset’ is on, unmatched records from input dataset A are sent to the unmatched output dataset.

Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (B, , 2.0)

Dataset unmatched (ID, COLOR, PRICE) (C, blue)
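
The four join types can be illustrated with a minimal Python sketch that reproduces the example tables above. This is not the Floe's implementation: the function names `merge` and `field_names`, the list-of-dicts record layout, and the use of `None` for an empty data field are all assumptions made for illustration. The sketch also assumes the two datasets share no field name other than the ID, so duplicate handling is out of scope here.

```python
# Hypothetical stand-ins for example datasets A and B (not the Floe's data model).
dataset_a = [{"ID": "A", "COLOR": "red"}, {"ID": "C", "COLOR": "blue"}]
dataset_b = [{"ID": "A", "PRICE": 1.0}, {"ID": "B", "PRICE": 2.0}]


def field_names(records, key):
    """Collect non-ID field names in first-seen order."""
    names = []
    for rec in records:
        for name in rec:
            if name != key and name not in names:
                names.append(name)
    return names


def merge(a, b, key, join_type):
    """Return (merged, unmatched) for join_type in 'A∪B', 'A∩B', 'A', 'B'."""
    a_by_id = {rec[key]: rec for rec in a}
    b_by_id = {rec[key]: rec for rec in b}
    # Assumes A and B have no field names in common besides the key.
    fields = field_names(a, key) + field_names(b, key)

    keep_a_only = join_type in ("A∪B", "A")  # A-only records go to merged
    keep_b_only = join_type in ("A∪B", "B")  # B-only records go to merged

    merged, unmatched = [], []
    # Visit IDs from A first, then IDs that occur only in B.
    ids = list(a_by_id) + [i for i in b_by_id if i not in a_by_id]
    for rec_id in ids:
        in_a, in_b = rec_id in a_by_id, rec_id in b_by_id
        combined = {**a_by_id.get(rec_id, {}), **b_by_id.get(rec_id, {})}
        if (in_a and in_b) or (in_a and keep_a_only) or (in_b and keep_b_only):
            # Pad missing fields with None, standing in for an empty field.
            row = {key: rec_id}
            for name in fields:
                row[name] = combined.get(name)
            merged.append(row)
        else:
            # Unmatched records keep only the fields they arrived with.
            unmatched.append(combined)
    return merged, unmatched


for join_type in ("A∪B", "A∩B", "A", "B"):
    merged, unmatched = merge(dataset_a, dataset_b, "ID", join_type)
    print(join_type, merged, unmatched)
```

In this sketch the unmatched list is always computed; in the Floe, the unmatched dataset is only written when ‘Write unmatched dataset’ is on.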

Extra Required Parameters

Merged dataset (dataset_out) : Output dataset of successful calculations

Default: merged

Write unmatched dataset (boolean) : If off, then the ‘unmatched’ dataset is not generated.

Default: False

Input Dataset ‘A’ (data_source) : Dataset to merge

Failed Dataset (dataset_out) : Output dataset of failed calculations

Default: Failed Output for Dataset Deduplication – Merge

Input Dataset ‘B’ (data_source) : Dataset to merge

Duplicate handling (string) : This parameter specifies how to handle duplicate fields when merging records.

Default: Rename

Choices: Ignore, Rename

Join type (string) : This parameter determines how the records are combined and what happens to unmatched records. See the Floe description for an explanation.

Default: A∪B

Choices: A, B, A∩B, A∪B

The field to merge on (string) : Field containing a unique identifier for records from dataset ‘A’.