Job parent hierarchy api changes#1992
Conversation
500105b to
1f0e2fe
Compare
Codecov Report
@@ Coverage Diff @@
## main #1992 +/- ##
=========================================
Coverage 78.62% 78.62%
Complexity 1003 1003
=========================================
Files 197 197
Lines 5459 5459
Branches 424 424
=========================================
Hits 4292 4292
Misses 723 723
Partials 444 444 📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
| UUID symlinkTargetId, | ||
| PGobject inputs) { | ||
| UUID jobUuid = | ||
| upsertJobNoParent( |
There was a problem hiding this comment.
Wouldn't we want the upsertJobNoParent() call to the JobRow object similar to other upsert calls? This would keep contracts the same across DAOs but also avoid the subsequent findJobByUuidAsRow() call.
There was a problem hiding this comment.
Sorry, do you mean that the upsertJobNoParent query would go back to RETURNING * instead of RETURNING uuid? If that's what you mean, I made this change so that the subsequent findJobByUuidAsRow call queries the jobs_view - returning the FQN rather than the simple name.
| p -> { | ||
| if (event.getJob().getName().startsWith(p.getName() + '.')) { | ||
| return event.getJob().getName().substring(p.getName().length() + 1); | ||
| } else { |
There was a problem hiding this comment.
Minor: We may want to move this in a DbUtils class to handle parsing the simple name:
DbUtils.simpleJobNameFor()
wslulciuc
left a comment
There was a problem hiding this comment.
@collado-mike left some minor comments, but otherwise great work 💯 💯 🥇
As for keeping the jobs_view, here are my thoughts (also in response to our offline discussion):
Upside:
Queries remain simple, meaning when querying the jobs table, the name column is still the simple name of the job, not the FQN. Also, the web UI should display the simple name of the job and depending on how jobs are named, parsing the FQN for displaying may result in the wrong name being used (not ideal). I think given the scope of the change, jobs_view allows us to avoid any unknown cases around job naming. The view can also be seen as a migration step to eventually having the name column in the jobs table be the FQN. Meaning, we can add a simple_name column to the jobs table ensuring the simple name and FQN are clearly defined and possibly dropping the view all together (or keep it arounds as there are clear benefits).
Downside:
I think having the name and simple_name column in the jobs table would ensure the FQN or the simple name would always be referenced correctly (outside just the view). But, a deeper discussion on how much benefits this provides can be had as the REST API is how metadata should be queried for in the first place.
| } | ||
|
|
||
| @Test | ||
| public void testOpenLineageJobHierarchy() |
There was a problem hiding this comment.
Minor: Since this test is Airflow specific, I would name the test testOpenLineageJobHierarchyForAirflow()
a53fe74 to
55e0c6e
Compare
Signed-off-by: Michael Collado <collado.mike@gmail.com>
…s with parents Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
b388088 to
b64b7f9
Compare
* Update API to write parent jobs and parent runs for OpenLineage events Signed-off-by: Michael Collado <collado.mike@gmail.com> * Added integration tests to verify behavior when recording Airflow jobs with parents Signed-off-by: Michael Collado <collado.mike@gmail.com> * Added integration test for airflow -> spark task recording Signed-off-by: Michael Collado <collado.mike@gmail.com> * Added simpleName field to Job model and updated integration tests Signed-off-by: Michael Collado <collado.mike@gmail.com> * Added parent job name to job models Signed-off-by: Michael Collado <collado.mike@gmail.com>
Problem
Final PR for #1928, continued from #1980. This updates the behavior of the write APIs to update the job parent field for new events and updates the read APIs to return the
simpleNamefield of the job as well as the FQN. Notably, parent jobs and parent runs are created if present in the OpenLineage event but not present in the Marquez database. This handles events from Airflow DAGs where the DAG is a parent job for all tasks even though no event is ever sent for the DAG itself. A few integration tests added to validate the behavior for receiving messages from Airflow and Spark.Closes: #1928
Solution
Please describe your change as it relates to the problem, or bug fix, as well as any dependencies. If your change requires a database schema migration, please describe the schema modification(s) and whether it's a backwards-incompatible or backwards-compatible change.
Checklist
CHANGELOG.mdwith details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary).sqldatabase schema migration according to Flyway's naming convention (if relevant)