Skip to content

fix symlink display on marquez#2736

Merged
wslulciuc merged 12 commits into
MarquezProject:mainfrom
sophiely:fix/display-symlinks-datasets-and-lineage
Mar 5, 2024
Merged

fix symlink display on marquez#2736
wslulciuc merged 12 commits into
MarquezProject:mainfrom
sophiely:fix/display-symlinks-datasets-and-lineage

Conversation

@sophiely

@sophiely sophiely commented Jan 23, 2024

Copy link
Copy Markdown
Contributor

Problem

Sending an event with a dataset symlink create an empty namespace with 0 dataset in it.

For example, this event:

{
  ...
  },
  "inputs": [
    {
      "namespace": "test_namespace"
      "name": "dataset_0",
      "facets": {
        "symlinks": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.1.0/client/python",
          "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json",
          "identifiers": [
            {
              "name": "symlink_prefix",
              "type": "DB_TABLE",
              "namespace": "symlink_test"
            }
          ]
        },
      },
    }
  ],
  ...
}

create an empty namespace called symlink_test

Closes: #2645

Solution

  • Display the dataset create by a symlink facet in their own namespace (here in the example the namespace symlink_test will contain a dataset called symlink_prefix)
  • The lineage of this dataset will be redirect to the lineage of the "main" dataset (the dataset test_namespace.dataset_0 in our example)

Example:

If we run these 2 runs:

new_dataset_a {facet: new_dataset_sym_a} ------- new_symlink_job_a ----------> new_dataset_b

then:
new_dataset_sym_a ------- new_symlink_job_b --------------> new_dataset_sym_b

On marquez we'll have

image image

Please find more detailed explanation on the comments below.

This fix include another issue on the front though, the version/facets of the selected dataset are not directly displayed.

Since the selected dataset is the symlink and not the primary dataset, the front doesn't recognize the selected dataset as part of the lineage as a result the version endpoint is not run.

image

But if we click on the dataset new_dataset_a additional, the dataset version query is run and information are displayed

image

One-line summary:

Checklist

@boring-cyborg boring-cyborg Bot added the api API layer changes label Jan 23, 2024
@netlify

netlify Bot commented Jan 23, 2024

Copy link
Copy Markdown

Deploy Preview for peppy-sprite-186812 canceled.

Name Link
🔨 Latest commit e832a6f
🔍 Latest deploy log https://app.netlify.com/sites/peppy-sprite-186812/deploys/65e75d29fb98ab0008f9e98c

@codecov

codecov Bot commented Jan 23, 2024

Copy link
Copy Markdown

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.46%. Comparing base (cc9c2c0) to head (e832a6f).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2736   +/-   ##
=========================================
  Coverage     84.45%   84.46%           
+ Complexity     1416     1415    -1     
=========================================
  Files           251      251           
  Lines          6447     6450    +3     
  Branches        291      292    +1     
=========================================
+ Hits           5445     5448    +3     
  Misses          850      850           
  Partials        152      152           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@boring-cyborg boring-cyborg Bot added the docs label Jan 23, 2024
Comment on lines +30 to +32
FROM datasets d
JOIN dataset_symlinks symlinks ON d.uuid = symlinks.dataset_uuid
JOIN namespaces ON symlinks.namespace_uuid = namespaces.uuid

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query is not much different from before, but now the dataset uuid is not the primary key anymore since a dataset and his symlink has the same dataset uuid (that's why the group by is not here anymore).
We join this view with dataset_symlinks to identify if it's a primary dataset or not. if it is a primary dataset, the values in the row remain the same as before. If not, the namespace, name and namespace uuid are replaced by the one from the symlinks (value from the join table).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we modify view definition in R__3_Datasets_view.sql?
Flywaydb scripts starting with R are run with each migration Repeatable migration

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid, since we recreate the dataset_view on every marquez deploy, we can make your changes in R__3_Datasets_view.sql as @pawel-big-lebowski suggested.

Comment on lines +105 to +108
LEFT JOIN dataset_versions dv ON dv.uuid = ds.current_version_uuid
LEFT JOIN dataset_symlinks dsym ON dsym.namespace_uuid = ds.namespace_uuid and dsym.name = ds.name
WHERE dsym.is_primary = true
AND ds.uuid IN (<dsUuids>)""")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here since the view datasets_views can have several rows with the same uuid we choose the one flagged as primary.

Comment on lines +116 to +125
LEFT JOIN dataset_symlinks dsym ON dsym.namespace_uuid = ds.namespace_uuid and dsym.name = ds.name
INNER JOIN (
SELECT uuid
FROM datasets_view as u
WHERE
u.name = :datasetName
AND u.namespace_name = :namespaceName
) as u
on u.uuid = ds.uuid
WHERE dsym.is_primary is true""")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here since the view datasets_views can have several rows with the same uuid we choose the one flagged as primary.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we use dataset_views for symlink filtering like

 INNER JOIN datasets_view AS d ON d.uuid = df.dataset_uuid
 WHERE CAST((:namespaceName, :datasetName) AS DATASET_NAME) = ANY(d.dataset_symlinks)

Comment on lines +118 to +130
if (nodeId.isDatasetType()) {
DatasetId datasetId = nodeId.asDatasetId();
DatasetData datasetData =
this.getDatasetData(datasetId.getNamespace().getValue(), datasetId.getName().getValue());

if (!datasetIds.contains(datasetData.getUuid())) {
log.warn(
"Found jobs {} which no longer share lineage with dataset '{}' - discarding",
jobData.stream().map(JobData::getId).toList(),
nodeId.getValue());
return toLineageWithOrphanDataset(nodeId.asDatasetId());
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we check if the uuid of the node and not the namespace+name

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for adding the warn log 💯

@pawel-big-lebowski pawel-big-lebowski left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be OK to add some test to LineageService which is failing prior to the code change introduced and is passing afterwards?

Comment on lines +30 to +32
FROM datasets d
JOIN dataset_symlinks symlinks ON d.uuid = symlinks.dataset_uuid
JOIN namespaces ON symlinks.namespace_uuid = namespaces.uuid

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we modify view definition in R__3_Datasets_view.sql?
Flywaydb scripts starting with R are run with each migration Repeatable migration

Comment on lines +116 to +125
LEFT JOIN dataset_symlinks dsym ON dsym.namespace_uuid = ds.namespace_uuid and dsym.name = ds.name
INNER JOIN (
SELECT uuid
FROM datasets_view as u
WHERE
u.name = :datasetName
AND u.namespace_name = :namespaceName
) as u
on u.uuid = ds.uuid
WHERE dsym.is_primary is true""")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we use dataset_views for symlink filtering like

 INNER JOIN datasets_view AS d ON d.uuid = df.dataset_uuid
 WHERE CAST((:namespaceName, :datasetName) AS DATASET_NAME) = ANY(d.dataset_symlinks)

Comment thread CHANGELOG.md

## [Unreleased](https://github.com/MarquezProject/marquez/compare/0.44.0...HEAD)

### Fixed

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the changelog 💯

@dkt-sophie-ly

dkt-sophie-ly commented Feb 7, 2024

Copy link
Copy Markdown

Hi @pawel-big-lebowski @wslulciuc

Just updated the code accordingly yo your comments

  • Modify R__3_Datasets_view.sql
  • Use INNER JOIN datasets_view AS d ON d.uuid = df.dataset_uuid WHERE CAST((:namespaceName, :datasetName) AS DATASET_NAME) = ANY(d.dataset_symlinks) as a filter
  • Add a test in LineageService
    Concerning this test I feel like the way I add a symlinks facets is not the best way (the dataset is not considered as a symlink). I may need a little help on that if that's ok with you :)

Thanks for your review !

@sophiely

Copy link
Copy Markdown
Contributor Author

Hi @wslulciuc @pawel-big-lebowski :)
Did you have the time to check this PR ?

Thanks for your review

Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>

@wslulciuc wslulciuc left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work and test, @sophiely 💯

@wslulciuc wslulciuc enabled auto-merge (squash) March 5, 2024 18:01
@wslulciuc wslulciuc disabled auto-merge March 5, 2024 18:24
@wslulciuc wslulciuc merged commit b0683ad into MarquezProject:main Mar 5, 2024
@sophiely sophiely deleted the fix/display-symlinks-datasets-and-lineage branch July 25, 2024 07:45
jonathanpmoraes referenced this pull request in nubank/NuMarquez Feb 6, 2025
* fix symlink display on marquez

Signed-off-by: sophiely <ly.sophie200@gmail.com>

* fix code formatting

Signed-off-by: sophiely <ly.sophie200@gmail.com>

* update changelog

Signed-off-by: sophiely <ly.sophie200@gmail.com>

* change dataset_views query

Signed-off-by: sophiely <ly.sophie200@gmail.com>

* update changelog

Signed-off-by: sophiely <ly.sophie200@gmail.com>

* rename migration file

Signed-off-by: sophiely <ly.sophie200@gmail.com>

* rename migration file

Signed-off-by: sophiely <ly.sophie200@gmail.com>

* fix formatting and add migration file

Signed-off-by: sophiely <ly.sophie200@gmail.com>

* fix formatting

Signed-off-by: sophiely <ly.sophie200@gmail.com>

* resolve comments

Signed-off-by: sophiely <ly.sophie200@gmail.com>

* resolve tests

Signed-off-by: sophiely <ly.sophie200@gmail.com>

---------

Signed-off-by: sophiely <ly.sophie200@gmail.com>
Co-authored-by: Willy Lulciuc <willy@datakin.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api API layer changes docs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Symlink facet create an empty namespace with 0 dataset in it

4 participants