Dataset Deduplication – Merge
Category Paths
Follow one of these paths in the Orion user interface, to find the floe.
Task-based/Data Science/Manipulation
Description
Merge two datasets based on a user-defined ID field. Records with matching unique identifiers are combined into a single record and will be emitted to the merged dataset. The Floe supports the following join types that determine how to handle cases when a specific unique identifier from one dataset is missing from the other dataset.
Dataset A (ID, COLOR) (A, red) (C, blue)
Dataset B (ID, PRICE) (A, 1.0) (B, 2.0)
A∪B – full outer join
Joins records with matching unique identifiers by merging their data fields. For unmatched records empty data field(s) will be added. All records will be sent to the merged output dataset.
Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (C, blue, ) (B, , 2.0)
A∩B – inner join
Joins only records with matching unique identifiers for the two datasets. Upon request, all other unmatched records will be sent to the unmatched output dataset.
Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) Dataset unmatched (ID, COLOR, PRICE) (C, blue) (B, 2.0)
A – left join
Joins all records from dataset A with matching unique identifiers from dataset B. Empty data field(s) will be added for unmatched records of dataset A. Upon request, unmatched records from input dataset B will be sent to the unmatched output dataset.
Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (C, blue, )
Dataset unmatched (ID, COLOR, PRICE) (B, 2.0)
B – right join
Joins all records from dataset B with matching unique identifiers from dataset A. Empty data field(s) will be added for unmatched records of dataset B. Upon request, unmatched records from input dataset A will be sent to the unmatched output dataset.
Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (B, , 2.0)
Dataset unmatched (ID, COLOR, PRICE) (C, blue)
Promoted Parameters
Title in user interface (promoted name)
Inputs
Input Dataset ‘A’ (inA): Dataset to merge
Required
Type: data_source
Input Dataset ‘B’ (inB): Dataset to merge
Required
Type: data_source
Outputs
Merged dataset (merged): Name of output merged dataset
Required
Type: dataset_out
Default: merged
Write unmatched dataset (switch_unmatched): If off, then the ‘unmatched’ dataset is not generated.
Required
Type: boolean
Default: False
Choices: [True, False]
Output unmatched dataset (unmatched):
Type: dataset_out
Default: unmatched
Merge
The field to merge on (idA): Field containing a unique identifier for records from dataset ‘A’.
Required
Type: string
The field to merge on (idB): Field containing a unique identifier for records from dataset ‘B’.
Type: string
Join type (join_type): This parameter determines how the records are combined and what happens to unmatched records. See the floe description for explanation.
Required
Type: string
Default: A∪B
Choices: [‘A’, ‘B’, ‘A∩B’, ‘A∪B’]
Duplicate handling (dup_action): This parameter specifies how to handle duplicate fields when merging records.
Required
Type: string
Default: Rename
Choices: [‘Ignore’, ‘Rename’]