fix column lineage when multiple jobs write to same dataset #2289

Merged
pawel-big-lebowski merged 1 commit into main from fix/column-lineage-multiple-jobs-write-to-same-columns on Dec 7, 2022

@pawel-big-lebowski pawel-big-lebowski commented Dec 6, 2022

Collaborator

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

Problem

The current column-lineage API model does not handle the scenario in which multiple different jobs write to the same column of an output dataset. Specifically, a response of the form:

         "transformationDescription": "identical",
         "transformationType": "IDENTITY",
         "inputFields": [
            { "namespace": "DBA", "name": "tableA", "field": "columnA"},
            { "namespace": "DBB", "name": "tableB", "field": "columnB"},
            { "namespace": "DBC", "name": "tableC", "field": "columnC"}
         ]

should be converted into:

 "inputFields": [
            { 
                  "namespace": "DBA", 
                  "name": "tableA", 
                  "field": "columnA",   
                  "transformationDescription": "identical", 
                  "transformationType": "IDENTITY"
            },
            ....
         ]

with transformationDescription and transformationType contained per input field.
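
To make the proposed shape concrete, here is a minimal Python sketch of the conversion described above. It is not the Marquez implementation: the function name `to_per_field_model` is hypothetical, and keeping the top-level keys in the result is an assumption based on the plan to keep returning the deprecated fields for now.

```python
from copy import deepcopy


def to_per_field_model(column_lineage: dict) -> dict:
    """Copy the top-level transformation fields onto every input field.

    Hypothetical helper for illustration only; the top-level keys are kept
    in the result (assumption: they stay available while deprecated).
    """
    result = deepcopy(column_lineage)  # leave the caller's dict untouched
    description = result.get("transformationDescription")
    kind = result.get("transformationType")
    for input_field in result.get("inputFields", []):
        # setdefault keeps any value that is already set per input field.
        input_field.setdefault("transformationDescription", description)
        input_field.setdefault("transformationType", kind)
    return result


old = {
    "transformationDescription": "identical",
    "transformationType": "IDENTITY",
    "inputFields": [
        {"namespace": "DBA", "name": "tableA", "field": "columnA"},
        {"namespace": "DBB", "name": "tableB", "field": "columnB"},
        {"namespace": "DBC", "name": "tableC", "field": "columnC"},
    ],
}
new = to_per_field_model(old)
```

After the call, each entry of `new["inputFields"]` carries its own `transformationDescription` and `transformationType`, while `old` is left unchanged.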

Solution

  • Update the API model while still returning the deprecated top-level transformationDescription and transformationType,
  • Write a test in which multiple different jobs write to the same column,
  • Additionally, display the column lineage of a dataset in the Marquez UI (helpful for debugging).
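
The test scenario in the list above can be sketched roughly as follows. This is an illustration in Python, not the actual Marquez test: `ColumnWrite`, `input_fields_for`, and `fits_single_top_level_pair` are hypothetical names, and the job names and the `MASKED` transformation type are made-up example values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnWrite:
    """One job writing to the shared output column (hypothetical model)."""
    job: str
    namespace: str
    name: str
    field: str
    transformation_description: str
    transformation_type: str


def input_fields_for(writes):
    """Build inputFields with the transformation carried per input field."""
    return [
        {
            "namespace": w.namespace,
            "name": w.name,
            "field": w.field,
            "transformationDescription": w.transformation_description,
            "transformationType": w.transformation_type,
        }
        for w in writes
    ]


def fits_single_top_level_pair(writes):
    """True only if one top-level description/type pair could describe all writes."""
    pairs = {(w.transformation_description, w.transformation_type) for w in writes}
    return len(pairs) <= 1


# Two different jobs write to the same output column with different transformations.
writes = [
    ColumnWrite("job_a", "DBA", "tableA", "columnA", "identical", "IDENTITY"),
    ColumnWrite("job_b", "DBB", "tableB", "columnB", "hypothetical masking", "MASKED"),
]
input_fields = input_fields_for(writes)
```

Here `fits_single_top_level_pair(writes)` is false, which is the case the old top-level fields cannot represent, whereas `input_fields` keeps both transformations side by side.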

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg Bot added api API layer changes client/java labels Dec 6, 2022
@pawel-big-lebowski pawel-big-lebowski force-pushed the fix/column-lineage-multiple-jobs-write-to-same-columns branch from 7f7a754 to 8a2bd5a Compare December 6, 2022 09:14
@codecov

codecov Bot commented Dec 6, 2022


Codecov Report

Merging #2289 (a7ecf04) into main (c8a38a1) will increase coverage by 0.13%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main    #2289      +/-   ##
============================================
+ Coverage     76.84%   76.97%   +0.13%     
- Complexity     1154     1163       +9     
============================================
  Files           220      222       +2     
  Lines          5268     5298      +30     
  Branches        423      423              
============================================
+ Hits           4048     4078      +30     
  Misses          747      747              
  Partials        473      473              
Impacted Files                                              Coverage Δ
api/src/main/java/marquez/db/ColumnLineageDao.java          100.00% <ø> (ø)
...a/marquez/client/models/ColumnLineageNodeData.java       0.00% <ø> (ø)
...arquez/db/mappers/ColumnLineageNodeDataMapper.java       90.47% <100.00%> (ø)
.../java/marquez/db/models/ColumnLineageNodeData.java       100.00% <100.00%> (ø)
...ain/java/marquez/service/ColumnLineageService.java       97.24% <100.00%> (+0.03%) ⬆️
...ain/java/marquez/service/models/ColumnLineage.java       100.00% <100.00%> (ø)

@pawel-big-lebowski pawel-big-lebowski force-pushed the fix/column-lineage-multiple-jobs-write-to-same-columns branch from 8a2bd5a to 5e3ce84 Compare December 6, 2022 11:45
@wslulciuc

Member

@pawel-big-lebowski, per our discussion offline, we'll want to:

  1. First, add the fields transformationDescription and transformationType under the inputFields object
  2. Then, delete the top-level fields transformationDescription and transformationType

We'll also want to make the breaking changes above in separate (minor) releases, and announce the deprecation and the later removal in our changelog / release notes.
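
The two-step rollout described above could look roughly like this Python sketch. `render_node` and its flag are hypothetical, and filling the deprecated top-level fields from the first input field is an assumption made only for the example, not something stated in this discussion.

```python
def render_node(input_fields, include_deprecated_top_level):
    """Render a column-lineage node body for either rollout step.

    Hypothetical helper: step 1 returns the per-field values plus the
    deprecated top-level fields; step 2 drops the top-level fields.
    """
    body = {"inputFields": [dict(f) for f in input_fields]}
    if include_deprecated_top_level and input_fields:
        # Assumption for illustration: mirror the first input field's values.
        first = input_fields[0]
        body["transformationDescription"] = first.get("transformationDescription")
        body["transformationType"] = first.get("transformationType")
    return body


fields = [
    {
        "namespace": "DBA",
        "name": "tableA",
        "field": "columnA",
        "transformationDescription": "identical",
        "transformationType": "IDENTITY",
    }
]
step_one = render_node(fields, include_deprecated_top_level=True)   # deprecation release
step_two = render_node(fields, include_deprecated_top_level=False)  # removal release
```

Clients that read only `inputFields` see the same data in both steps, which is why the removal can ship in a later release than the addition.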

@pawel-big-lebowski pawel-big-lebowski force-pushed the fix/column-lineage-multiple-jobs-write-to-same-columns branch 2 times, most recently from 1b768e4 to ace6fe0 Compare December 6, 2022 13:38
@boring-cyborg boring-cyborg Bot added the docs label Dec 6, 2022
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review December 6, 2022 13:40
@pawel-big-lebowski pawel-big-lebowski force-pushed the fix/column-lineage-multiple-jobs-write-to-same-columns branch from ace6fe0 to e988542 Compare December 6, 2022 13:42
@pawel-big-lebowski pawel-big-lebowski force-pushed the fix/column-lineage-multiple-jobs-write-to-same-columns branch from e988542 to 17fb0f7 Compare December 7, 2022 07:27
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@pawel-big-lebowski pawel-big-lebowski force-pushed the fix/column-lineage-multiple-jobs-write-to-same-columns branch from 17fb0f7 to a7ecf04 Compare December 7, 2022 10:46
@pawel-big-lebowski pawel-big-lebowski merged commit 11f6cec into main Dec 7, 2022
@pawel-big-lebowski pawel-big-lebowski deleted the fix/column-lineage-multiple-jobs-write-to-same-columns branch December 7, 2022 11:19
jonathanpmoraes referenced this pull request in nubank/NuMarquez Feb 6, 2025
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

Labels

api API layer changes client/java docs

Projects

Status: Done

3 participants