Update insert job function to avoid joining on symlinks for jobs that have no symlinks #2144
Merged
wslulciuc
approved these changes
Sep 27, 2022
wslulciuc
left a comment
Member
Left some minor comments, otherwise thanks for the amazing write up and analysis accompanying the fix @collado-mike 💯 🥇
… have no symlinks Signed-off-by: Michael Collado <collado.mike@gmail.com>
281da00 to
1fdd728
Compare
Codecov Report
@@ Coverage Diff @@
## main #2144 +/- ##
=========================================
Coverage 75.30% 75.30%
Complexity 1038 1038
=========================================
Files 203 203
Lines 4883 4883
Branches 399 399
=========================================
Hits 3677 3677
Misses 763 763
Partials 443 443
jonathanpmoraes
referenced
this pull request
in nubank/NuMarquez
Feb 6, 2025
… have no symlinks (#2144) Signed-off-by: Michael Collado <collado.mike@gmail.com> Signed-off-by: Michael Collado <collado.mike@gmail.com>
Problem
Typical Marquez installations don't create large numbers of new jobs on a regular basis. However, in a small number of installations, many new jobs are created, and each one executes the `rewrite_jobs_fqn_table` function, putting stress on the backing database. Most of the compute cost of this function is in computing the symlinks and aliases for jobs, even when the inserted job has no symlink.
Closes: #ISSUE-NUMBER
Solution
Adding a check for the symlink field and offering a lower cost query in cases when no symlink is present (the norm) radically reduces the database compute load in Marquez installations that frequently create a large number of new jobs.
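To illustrate the idea (not the actual Marquez code), here is a minimal Python sketch using SQLite. The `jobs` table, its `name` and `symlink_target` columns, and the `resolve_fqn` helper are all hypothetical stand-ins; the real change lives in the PostgreSQL function named above. The point is only the branch: when the inserted job has no symlink, the recursive join is skipped entirely.

```python
import sqlite3


def resolve_fqn(conn, job_name, symlink_target):
    """Return the name a newly inserted job should resolve to.

    Hypothetical schema: jobs(name TEXT PRIMARY KEY, symlink_target TEXT).
    Assumes symlink chains are acyclic.
    """
    if symlink_target is None:
        # Cheap path (the common case): no symlink, so no join is needed.
        return job_name

    # Expensive path: follow the symlink chain with a recursive CTE
    # and return the last job in the chain.
    row = conn.execute(
        """
        WITH RECURSIVE chain(name, symlink_target, depth) AS (
            SELECT name, symlink_target, 0
            FROM jobs
            WHERE name = ?
            UNION ALL
            SELECT j.name, j.symlink_target, c.depth + 1
            FROM jobs j
            JOIN chain c ON j.name = c.symlink_target
        )
        SELECT name FROM chain ORDER BY depth DESC LIMIT 1
        """,
        (symlink_target,),
    ).fetchone()

    # If the target is not in the table yet, fall back to the target itself.
    return row[0] if row else symlink_target
```

With a table in which `a` points to `b` and `b` points to `c`, a call with `symlink_target=None` returns immediately without touching the database, while a call with `symlink_target="a"` walks the chain and resolves to `c`.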
The following graph shows query count, query latency, and database CPU utilization under a test load of many new jobs being created. The test load was several days of real production OpenLineage events replayed on a dev instance. To verify the results, I ran the same test twice each for the old query and the new one. Under heavy load, the old job creation query drives database CPU utilization to 100% and query latency climbs as high as 2 seconds. Under the same load, the new query drives CPU utilization to around 30% and query latency is around 300 microseconds. (Between runs I renamed all of the jobs in the database, so the replayed events show up as new jobs that invoke the job creation query.)
Note that the query latency in this graph is shown at log scale (right axis). Otherwise, the latency for the new query would be indistinguishable from 0.

Checklist
Updated `CHANGELOG.md` with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
Versioned `.sql` database schema migration according to Flyway's naming convention (if relevant)