Dataset Deduplication – Merge

Category Paths

Follow one of these paths in the Orion user interface, to find the floe.

  • Task-based/Data Science/Manipulation

Description

Merge two datasets based on a user-defined ID field. Records with matching unique identifiers are combined into a single record and will be emitted to the merged dataset. The Floe supports the following join types that determine how to handle cases when a specific unique identifier from one dataset is missing from the other dataset.

Dataset A (ID, COLOR) (A, red) (C, blue)

Dataset B (ID, PRICE) (A, 1.0) (B, 2.0)

A∪B – full outer join

Joins records with matching unique identifiers by merging their data fields. For unmatched records empty data field(s) will be added. All records will be sent to the merged output dataset.

Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (C, blue, ) (B, , 2.0)

A∩B – inner join

Joins only records with matching unique identifiers for the two datasets. Upon request, all other unmatched records will be sent to the unmatched output dataset.

Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) Dataset unmatched (ID, COLOR, PRICE) (C, blue) (B, 2.0)

A – left join

Joins all records from dataset A with matching unique identifiers from dataset B. Empty data field(s) will be added for unmatched records of dataset A. Upon request, unmatched records from input dataset B will be sent to the unmatched output dataset.

Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (C, blue, )

Dataset unmatched (ID, COLOR, PRICE) (B, 2.0)

B – right join

Joins all records from dataset B with matching unique identifiers from dataset A. Empty data field(s) will be added for unmatched records of dataset B. Upon request, unmatched records from input dataset A will be sent to the unmatched output dataset.

Dataset merged (ID, COLOR, PRICE) (A, red, 1.0) (B, , 2.0)

Dataset unmatched (ID, COLOR, PRICE) (C, blue)

Promoted Parameters

Title in user interface (promoted name)

Inputs

Input Dataset ‘A’ (inA): Dataset to merge

  • Required

  • Type: data_source

Input Dataset ‘B’ (inB): Dataset to merge

  • Required

  • Type: data_source

Outputs

Merged dataset (merged): Name of output merged dataset

  • Required

  • Type: dataset_out

  • Default: merged

Write unmatched dataset (switch_unmatched): If off, then the ‘unmatched’ dataset is not generated.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Output unmatched dataset (unmatched):

  • Type: dataset_out

  • Default: unmatched

Merge

The field to merge on (idA): Field containing a unique identifier for records from dataset ‘A’.

  • Required

  • Type: string

The field to merge on (idB): Field containing a unique identifier for records from dataset ‘B’.

  • Type: string

Join type (join_type): This parameter determines how the records are combined and what happens to unmatched records. See the floe description for explanation.

  • Required

  • Type: string

  • Default: A∪B

  • Choices: [‘A’, ‘B’, ‘A∩B’, ‘A∪B’]

Duplicate handling (dup_action): This parameter specifies how to handle duplicate fields when merging records.

  • Required

  • Type: string

  • Default: Rename

  • Choices: [‘Ignore’, ‘Rename’]