fix symlink display on marquez #2736
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #2736 +/- ##
=========================================
  Coverage     84.45%   84.46%
+ Complexity   1416     1415     -1
=========================================
  Files        251      251
  Lines        6447     6450     +3
  Branches     291      292      +1
=========================================
+ Hits         5445     5448     +3
  Misses       850      850
  Partials     152      152
| FROM datasets d | ||
| JOIN dataset_symlinks symlinks ON d.uuid = symlinks.dataset_uuid | ||
| JOIN namespaces ON symlinks.namespace_uuid = namespaces.uuid |
The query is not much different from before, but the dataset uuid no longer uniquely identifies a row, since a dataset and its symlinks share the same dataset uuid (that's why the GROUP BY is gone).
We join this view with dataset_symlinks to identify whether a row is the primary dataset or not. If it is the primary dataset, the values in the row remain the same as before. If not, the namespace, name and namespace uuid are replaced by those of the symlink (values from the joined table).
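To make the row expansion described above concrete, here is a minimal in-memory sketch in Java (16+ for records). It is not the view definition from this PR: the record types, the `expand` helper and the `example_ns` namespace are hypothetical, and the namespace uuid column is left out for brevity. It only illustrates that each symlink yields one row, and that non-primary rows take their namespace and name from the symlink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** In-memory sketch of the per-symlink row expansion; not the actual view definition. */
final class DatasetsViewSketch {
  // Hypothetical stand-ins for rows of datasets, dataset_symlinks and the resulting view.
  record Dataset(UUID uuid, String namespace, String name) {}
  record Symlink(UUID datasetUuid, String namespace, String name, boolean isPrimary) {}
  record ViewRow(UUID uuid, String namespace, String name, boolean isPrimary) {}

  /** Emits one row per (dataset, symlink) pair, mirroring the join on dataset_uuid. */
  static List<ViewRow> expand(List<Dataset> datasets, List<Symlink> symlinks) {
    List<ViewRow> rows = new ArrayList<>();
    for (Dataset d : datasets) {
      for (Symlink s : symlinks) {
        if (!s.datasetUuid().equals(d.uuid())) {
          continue; // not a symlink of this dataset
        }
        if (s.isPrimary()) {
          // Primary row: keep the dataset's own namespace and name.
          rows.add(new ViewRow(d.uuid(), d.namespace(), d.name(), true));
        } else {
          // Symlink row: same uuid, but namespace and name come from the symlink.
          rows.add(new ViewRow(d.uuid(), s.namespace(), s.name(), false));
        }
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    UUID uuid = UUID.randomUUID();
    List<ViewRow> rows = expand(
        List.of(new Dataset(uuid, "example_ns", "new_dataset_a")),
        List.of(
            new Symlink(uuid, "example_ns", "new_dataset_a", true),
            new Symlink(uuid, "symlink_test", "new_dataset_sym_a", false)));
    rows.forEach(System.out::println); // two rows sharing the same uuid
  }
}
```

Because both rows carry the same uuid, any consumer that needs exactly one row per dataset has to filter on the primary flag.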
Couldn't we modify the view definition in R__3_Datasets_view.sql?
Flyway scripts starting with R are repeatable migrations: they are re-applied during migration whenever their checksum changes.
Valid point. Since we recreate datasets_view on every Marquez deploy, you can make your changes in R__3_Datasets_view.sql as @pawel-big-lebowski suggested.
| LEFT JOIN dataset_versions dv ON dv.uuid = ds.current_version_uuid | ||
| LEFT JOIN dataset_symlinks dsym ON dsym.namespace_uuid = ds.namespace_uuid and dsym.name = ds.name | ||
| WHERE dsym.is_primary = true | ||
| AND ds.uuid IN (<dsUuids>)""") |
Here, since datasets_view can have several rows with the same uuid, we choose the one flagged as primary.
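As an illustration of that selection, the following Java sketch picks the primary row among rows that share a uuid. The `ViewRow` record and `primaryRow` helper are hypothetical names used only for this example; in the PR itself the filtering is done in SQL through the `is_primary` condition.

```java
import java.util.List;
import java.util.Optional;
import java.util.UUID;

/** Sketch: among view rows sharing a uuid, keep only the one flagged as primary. */
final class PrimaryRowSketch {
  // Hypothetical stand-in for a row of the view.
  record ViewRow(UUID uuid, String namespace, String name, boolean isPrimary) {}

  static Optional<ViewRow> primaryRow(List<ViewRow> rows, UUID uuid) {
    return rows.stream()
        .filter(r -> r.uuid().equals(uuid)) // AND ds.uuid IN (...)
        .filter(ViewRow::isPrimary)         // WHERE dsym.is_primary = true
        .findFirst();
  }

  public static void main(String[] args) {
    UUID uuid = UUID.randomUUID();
    List<ViewRow> rows = List.of(
        new ViewRow(uuid, "symlink_test", "new_dataset_sym_a", false),
        new ViewRow(uuid, "example_ns", "new_dataset_a", true));
    System.out.println(primaryRow(rows, uuid)); // the new_dataset_a row
  }
}
```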
| LEFT JOIN dataset_symlinks dsym ON dsym.namespace_uuid = ds.namespace_uuid and dsym.name = ds.name | ||
| INNER JOIN ( | ||
| SELECT uuid | ||
| FROM datasets_view as u | ||
| WHERE | ||
| u.name = :datasetName | ||
| AND u.namespace_name = :namespaceName | ||
| ) as u | ||
| on u.uuid = ds.uuid | ||
| WHERE dsym.is_primary is true""") |
Here, since datasets_view can have several rows with the same uuid, we choose the one flagged as primary.
Could we use datasets_view for symlink filtering, like:
INNER JOIN datasets_view AS d ON d.uuid = df.dataset_uuid
WHERE CAST((:namespaceName, :datasetName) AS DATASET_NAME) = ANY(d.dataset_symlinks)
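The suggested filter can be read as "keep the datasets whose symlink list contains the requested (namespace, name) pair". Below is a rough Java equivalent of that reading, assuming each view row exposes its symlinks as a list of name pairs. The `DatasetName` and `ViewRow` records and the `matchingUuids` helper are hypothetical, introduced only for this sketch.

```java
import java.util.List;
import java.util.UUID;

/** Sketch of "(namespace, name) = ANY(dataset_symlinks)" as an in-memory filter. */
final class SymlinkFilterSketch {
  // Hypothetical stand-ins: a (namespace, name) pair and a view row carrying its symlinks.
  record DatasetName(String namespace, String name) {}
  record ViewRow(UUID uuid, List<DatasetName> datasetSymlinks) {}

  static List<UUID> matchingUuids(List<ViewRow> view, String namespaceName, String datasetName) {
    DatasetName wanted = new DatasetName(namespaceName, datasetName);
    return view.stream()
        .filter(r -> r.datasetSymlinks().contains(wanted)) // records compare by value
        .map(ViewRow::uuid)
        .toList();
  }

  public static void main(String[] args) {
    UUID uuid = UUID.randomUUID();
    List<ViewRow> view = List.of(new ViewRow(uuid, List.of(
        new DatasetName("example_ns", "new_dataset_a"),
        new DatasetName("symlink_test", "new_dataset_sym_a"))));
    // The symlink name resolves to the same dataset uuid as the primary name.
    System.out.println(matchingUuids(view, "symlink_test", "new_dataset_sym_a"));
  }
}
```

With this shape, a lookup by either the primary name or a symlink name resolves to the same uuid without a separate join on dataset_symlinks.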
| if (nodeId.isDatasetType()) { | ||
| DatasetId datasetId = nodeId.asDatasetId(); | ||
| DatasetData datasetData = | ||
| this.getDatasetData(datasetId.getNamespace().getValue(), datasetId.getName().getValue()); | ||
|
|
||
| if (!datasetIds.contains(datasetData.getUuid())) { | ||
| log.warn( | ||
| "Found jobs {} which no longer share lineage with dataset '{}' - discarding", | ||
| jobData.stream().map(JobData::getId).toList(), | ||
| nodeId.getValue()); | ||
| return toLineageWithOrphanDataset(nodeId.asDatasetId()); | ||
| } | ||
| } |
Now we check the uuid of the node rather than its namespace + name.
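A small Java sketch of why the uuid comparison matters when the selected node is a symlink: a name-based membership test misses it, while a uuid-based test does not. The types and helper names here are hypothetical and are not the `LineageService` code from this PR.

```java
import java.util.Set;
import java.util.UUID;

/** Sketch: membership in a lineage checked by name versus by uuid. */
final class LineageMembershipSketch {
  // Hypothetical stand-in for a (namespace, name) pair.
  record DatasetName(String namespace, String name) {}

  static boolean inLineageByName(Set<DatasetName> lineageNames, DatasetName selected) {
    return lineageNames.contains(selected);
  }

  static boolean inLineageByUuid(Set<UUID> lineageUuids, UUID selectedUuid) {
    return lineageUuids.contains(selectedUuid);
  }

  public static void main(String[] args) {
    UUID uuid = UUID.randomUUID();
    // The lineage is recorded under the primary name, but the symlink shares its uuid.
    Set<DatasetName> names = Set.of(new DatasetName("example_ns", "new_dataset_a"));
    Set<UUID> uuids = Set.of(uuid);
    DatasetName symlink = new DatasetName("symlink_test", "new_dataset_sym_a");

    System.out.println(inLineageByName(names, symlink)); // false: the names differ
    System.out.println(inLineageByUuid(uuids, uuid));    // true: the uuid is shared
  }
}
```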
Nice! Thanks for adding the warn log 💯
pawel-big-lebowski left a comment
Would it be OK to add a test to LineageService that fails prior to this code change and passes afterwards?
| ## [Unreleased](https://github.com/MarquezProject/marquez/compare/0.44.0...HEAD) | ||
| ### Fixed |
Thanks for updating the changelog 💯
Hi @pawel-big-lebowski @wslulciuc, I just updated the code according to your comments.
Thanks for your review!
Hi @wslulciuc @pawel-big-lebowski :) Thanks for your review
Signed-off-by: sophiely <ly.sophie200@gmail.com>
* fix symlink display on marquez
* fix code formatting
* update changelog
* change dataset_views query
* update changelog
* rename migration file
* rename migration file
* fix formatting and add migration file
* fix formatting
* resolve comments
* resolve tests
---------
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Co-authored-by: Willy Lulciuc <willy@datakin.com>
Problem
Sending an event with a dataset symlink creates an empty namespace with 0 datasets in it.
For example, this event:
creates an empty namespace called symlink_test.
Closes: #2645
Solution
Example:
If we execute these two runs:
new_dataset_a {facet: new_dataset_sym_a} ------- new_symlink_job_a ----------> new_dataset_b
then:
new_dataset_sym_a ------- new_symlink_job_b --------------> new_dataset_sym_b
In Marquez we'll have:
Please find a more detailed explanation in the comments below.
This fix introduces another issue on the front end, though: the versions/facets of the selected dataset are not displayed directly.
Since the selected dataset is the symlink and not the primary dataset, the front end doesn't recognize the selected dataset as part of the lineage; as a result, the version endpoint is not called.
But if we additionally click on the dataset new_dataset_a, the dataset version query is run and the information is displayed.
One-line summary:
Checklist
CHANGELOG.md (https://github.com/MarquezProject/marquez/blob/main/CHANGELOG.md#unreleased) (Depending on the change, this may not be necessary).
.sql database schema migration according to Flyway's naming convention (if relevant)