Update lineage query to only look at jobs with inputs or outputs #2068
Merged
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Codecov Report
@@ Coverage Diff @@
## main #2068 +/- ##
=========================================
Coverage 78.79% 78.79%
Complexity 1011 1011
=========================================
Files 200 200
Lines 5574 5574
Branches 422 422
=========================================
Hits 4392 4392
Misses 730 730
Partials 452 452
wslulciuc
approved these changes
Aug 10, 2022
wslulciuc
left a comment
Member
@collado-mike, though you consider this a quick "hack", I do view it as a reasonable optimization to improve lineage query performance. Now, as you pointed out, it is a short-term optimization. That is, as OpenLineage coverage improves, more jobs will report inputs and outputs, which, in turn, will negatively impact the lineage query. But we have spoken briefly about ways we can continue to improve lineage query performance, such as introducing materialized views, caching, etc. Anyway, great work 💯 🥇
jonathanpmoraes
referenced
this pull request
in nubank/NuMarquez
Feb 6, 2025
Problem
In many environments a large number of jobs reporting events have no inputs or outputs - e.g., PythonOperators in an Airflow deployment. If a Marquez installation has a lot of these, the lineage query spends a lot of its time searching for overlaps with jobs that have no inputs or outputs. In one installation, we have > 200K jobs, but only ~7000 jobs that have any inputs or outputs at all.
Solution
This changes the lineage query to start from the `job_versions_io_mapping` table and INNER JOIN it with `jobs_view`, so that only jobs that have inputs or outputs are present in the `jobs_io` CTE. As a result, that CTE becomes very small and the recursive join in the `lineage` CTE is very fast.

It's worth noting that the missing inputs/outputs are largely due to insufficient coverage by the OpenLineage integrations - e.g., those PythonOperators are likely reading data from somewhere. This is, at best, a short-term fix until OL coverage increases, at which point the query will have to be revisited.
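To illustrate the idea, here is a minimal, runnable sketch using Python's built-in `sqlite3` module. It is not the query from this PR: the real query targets PostgreSQL and the real Marquez schema, while the table columns, the `build_demo_db` and `lineage` helpers, and the sample data below are simplified assumptions made for the example. It only shows the shape of the change: `jobs_io` is built from the mapping table with an INNER JOIN, so jobs without inputs or outputs never reach the recursive step.

```python
import sqlite3

# Hypothetical, heavily simplified schema. The real Marquez tables have more
# columns, and `jobs` stands in for `jobs_view` here.
SCHEMA = """
CREATE TABLE jobs (uuid TEXT PRIMARY KEY, name TEXT, current_version_uuid TEXT);
CREATE TABLE job_versions_io_mapping (
    job_version_uuid TEXT,
    dataset_uuid TEXT,
    io_type TEXT
);
"""

LINEAGE_SQL = """
WITH RECURSIVE
jobs_io AS (
    -- Start from the mapping table and INNER JOIN to jobs, so jobs without
    -- any inputs or outputs never enter this CTE.
    SELECT DISTINCT j.uuid AS job_uuid, io.dataset_uuid
    FROM job_versions_io_mapping io
    INNER JOIN jobs j ON j.current_version_uuid = io.job_version_uuid
),
lineage(job_uuid) AS (
    SELECT :start
    UNION
    -- Two jobs are adjacent when they share at least one dataset.
    -- UNION (not UNION ALL) drops repeats, so the recursion terminates.
    SELECT b.job_uuid
    FROM lineage l
    JOIN jobs_io a ON a.job_uuid = l.job_uuid
    JOIN jobs_io b ON b.dataset_uuid = a.dataset_uuid
)
SELECT job_uuid FROM lineage ORDER BY job_uuid
"""


def build_demo_db():
    """Create an in-memory database with a few made-up jobs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?)",
        [
            ("j1", "extract", "v1"),
            ("j2", "transform", "v2"),
            ("j3", "load", "v3"),
            ("j4", "unrelated", "v4"),
            ("p1", "python_operator_without_io", "pv1"),
        ],
    )
    conn.executemany(
        "INSERT INTO job_versions_io_mapping VALUES (?, ?, ?)",
        [
            ("v1", "d1", "OUTPUT"),
            ("v2", "d1", "INPUT"),
            ("v2", "d2", "OUTPUT"),
            ("v3", "d2", "INPUT"),
            ("v4", "d9", "INPUT"),
        ],
    )
    return conn


def lineage(conn, start_job_uuid):
    """Return the sorted UUIDs of all jobs connected to the starting job."""
    rows = conn.execute(LINEAGE_SQL, {"start": start_job_uuid}).fetchall()
    return [row[0] for row in rows]
```

Calling `lineage(build_demo_db(), "j1")` walks from `j1` through the shared datasets to `j2` and `j3`, while a job with no mapping rows such as `p1` only ever returns itself. With many such jobs, the recursive join scans only the small `jobs_io` set rather than every job.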
Checklist
- `CHANGELOG.md` with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
- `.sql` database schema migration according to Flyway's naming convention (if relevant)