Dataset Deduplication – Merge
This Floe merges two datasets based on a user-defined ID field. Records with matching unique identifiers are combined into a single record and emitted to the merged dataset. The Floe supports the following join types, which determine what happens when a unique identifier is present in one dataset but missing from the other. The examples use these two input datasets:
Dataset A (ID, COLOR) (A, red) (C, blue)
Dataset B (ID, PRICE) (A, 1.0) (B, 2.0)
A∪B – full outer join
Joins records with matching unique identifiers by merging their data fields. For unmatched records, empty data field(s) are added in place of the missing values. All records are sent to the merged output dataset.
Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (C, blue, ) (B, , 2.0)
A∩B – inner join
Joins only records whose unique identifiers appear in both datasets. Upon request, all other (unmatched) records will be sent to the unmatched output dataset.
Dataset merged (ID, COLOR, PRICE) (A, red, 1.0)
Dataset unmatched (ID, COLOR, PRICE) (C, blue) (B, 2.0)
A – left join
Joins all records from dataset A with the records from dataset B that have matching unique identifiers. Empty data field(s) will be added for unmatched records of dataset A. Upon request, unmatched records from input dataset B will be sent to the unmatched output dataset.
Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (C, blue, )
Dataset unmatched (ID, COLOR, PRICE) (B, 2.0)
B – right join
Joins all records from dataset B with the records from dataset A that have matching unique identifiers. Empty data field(s) will be added for unmatched records of dataset B. Upon request, unmatched records from input dataset A will be sent to the unmatched output dataset.
Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (B, , 2.0)
Dataset unmatched (ID, COLOR, PRICE) (C, blue)
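The four join types can be reproduced on the example datasets with a few lines of plain Python. This is only an illustrative sketch, not the Floe's implementation: the `merge_datasets` and `field_names` helpers are hypothetical, records are modeled as dictionaries, empty fields are represented as `None`, and it is assumed that IDs are unique within each dataset and that the datasets share no field names other than the ID. The sketch always returns the unmatched records, whereas the Floe only writes them when requested.

```python
# Example input datasets from this page, one dict per record.
dataset_a = [{"ID": "A", "COLOR": "red"}, {"ID": "C", "COLOR": "blue"}]
dataset_b = [{"ID": "A", "PRICE": 1.0}, {"ID": "B", "PRICE": 2.0}]


def field_names(records, key):
    """Return the non-key field names of a dataset, in first-seen order."""
    names = []
    for rec in records:
        for name in rec:
            if name != key and name not in names:
                names.append(name)
    return names


def merge_datasets(a, b, key="ID", join_type="A∪B"):
    """Return (merged, unmatched) record lists for one of the four join types.

    Hypothetical helper: assumes unique IDs per dataset and no shared
    field names besides the key.
    """
    if join_type not in ("A", "B", "A∩B", "A∪B"):
        raise ValueError(f"unknown join type: {join_type!r}")

    a_by_id = {rec[key]: rec for rec in a}
    b_by_id = {rec[key]: rec for rec in b}
    a_fields = field_names(a, key)
    b_fields = field_names(b, key)

    # Which unmatched records stay in the merged output (padded with None).
    keep_a_only = join_type in ("A", "A∪B")
    keep_b_only = join_type in ("B", "A∪B")

    merged, unmatched = [], []
    for rec_id, rec_a in a_by_id.items():
        if rec_id in b_by_id:
            # Matching identifiers: combine the data fields of both records.
            merged.append({**rec_a, **b_by_id[rec_id]})
        elif keep_a_only:
            merged.append({**rec_a, **{f: None for f in b_fields}})
        else:
            unmatched.append(dict(rec_a))

    for rec_id, rec_b in b_by_id.items():
        if rec_id in a_by_id:
            continue  # already merged in the first loop
        if keep_b_only:
            merged.append({key: rec_id, **{f: None for f in a_fields}, **rec_b})
        else:
            unmatched.append(dict(rec_b))

    return merged, unmatched


# Inner join: only ID "A" occurs in both datasets.
merged, unmatched = merge_datasets(dataset_a, dataset_b, join_type="A∩B")
```

Calling the helper with `join_type="A"` or `join_type="B"` moves the unmatched record of the other dataset from the merged list to the unmatched list, mirroring the left and right join examples.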
Extra Required Parameters
Merged dataset (dataset_out) : Output dataset of successful calculations
    Default: merged
Write unmatched dataset (boolean) : If off, then the ‘unmatched’ dataset is not generated.
    Default: False
Input Dataset ‘A’ (data_source) : Dataset to merge
Failed Dataset (dataset_out) : Output dataset of failed calculations
    Default: Failed Output for Dataset Deduplication – Merge
Input Dataset ‘B’ (data_source) : Dataset to merge
Duplicate handling (string) : This parameter specifies how to handle duplicate fields when merging records.
    Default: Rename
    Choices: Ignore, Rename
Join type (string) : This parameter determines how the records are combined and what happens to unmatched records. See the floe description for explanation.
    Default: A∪B
    Choices: A, B, A∩B, A∪B
The field to merge on (string) : Field containing a unique identifier for records from dataset ‘A’.
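The Duplicate handling choices are not described in detail here, so the following sketch shows only one plausible reading and should not be taken as the Floe's actual behavior. It assumes that Ignore keeps the value from dataset A and drops the duplicate field from dataset B, and that Rename keeps both values by storing B's under a new name. The `combine_records` helper and the `_B` suffix are hypothetical.

```python
def combine_records(rec_a, rec_b, key="ID", duplicate_handling="Rename"):
    """Combine two records that share the same ID.

    Assumed semantics (not confirmed by the Floe documentation):
      "Ignore" - keep dataset A's value, drop dataset B's duplicate field.
      "Rename" - keep both, storing dataset B's value under a suffixed name.
    """
    if duplicate_handling not in ("Ignore", "Rename"):
        raise ValueError(f"unknown duplicate handling: {duplicate_handling!r}")

    combined = dict(rec_a)
    for name, value in rec_b.items():
        if name == key:
            continue  # the ID field is shared by definition
        if name not in combined:
            combined[name] = value
        elif duplicate_handling == "Rename":
            # Hypothetical naming scheme; the real one is not documented here.
            combined[name + "_B"] = value
        # "Ignore": nothing to do, dataset A's value is already in place.
    return combined


# Both records carry a COLOR field, so it counts as a duplicate field.
rec_a = {"ID": "A", "COLOR": "red"}
rec_b = {"ID": "A", "COLOR": "crimson", "PRICE": 1.0}
renamed = combine_records(rec_a, rec_b, duplicate_handling="Rename")
ignored = combine_records(rec_a, rec_b, duplicate_handling="Ignore")
```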