Job parent hierarchy api changes #1992

Merged
collado-mike merged 5 commits into main from job_parent_hierarchy_api_changes on May 20, 2022

Conversation

@collado-mike

Collaborator

Problem

Final PR for #1928, continued from #1980. This PR changes the write APIs to set the job parent field for new events, and changes the read APIs to return the job's simpleName field as well as its FQN. Notably, parent jobs and parent runs are created if they are present in the OpenLineage event but not in the Marquez database. This handles events from Airflow DAGs, where the DAG is the parent job for all of its tasks even though no event is ever sent for the DAG itself. A few integration tests were added to validate the behavior when receiving messages from Airflow and Spark.
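
As an illustration of the "create the parent if it is missing" behavior described above, here is a minimal in-memory sketch. It is not the Marquez implementation: the class `ParentJobSketch`, its methods, and the `namespace/name` key format are all hypothetical, and a plain map stands in for the jobs table.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical in-memory stand-in for the jobs table; not Marquez code. */
final class ParentJobSketch {
  // Keyed by "namespace/name" (an assumption made only for this sketch).
  private final Map<String, UUID> jobs = new LinkedHashMap<>();

  /** Returns the existing job UUID, or creates one if the job is unknown. */
  UUID upsertJob(String namespace, String name) {
    return jobs.computeIfAbsent(namespace + "/" + name, key -> UUID.randomUUID());
  }

  /**
   * Records an event for a job. If the event names a parent job that has
   * never been seen (e.g. an Airflow DAG that emits no events of its own),
   * the parent is created first. Returns the UUID of the event's job.
   */
  UUID recordEvent(String namespace, String jobName, String parentJobName) {
    if (parentJobName != null) {
      upsertJob(namespace, parentJobName);
    }
    return upsertJob(namespace, jobName);
  }

  boolean hasJob(String namespace, String name) {
    return jobs.containsKey(namespace + "/" + name);
  }

  int jobCount() {
    return jobs.size();
  }
}
```

With this sketch, recording events for two tasks that name the same DAG as their parent yields three jobs: the two tasks plus one implicitly created parent.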

Closes: #1928

Solution

Please describe your change as it relates to the problem, or bug fix, as well as any dependencies. If your change requires a database schema migration, please describe the schema modification(s) and whether it's a backwards-incompatible or backwards-compatible change.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)

@collado-mike collado-mike requested a review from wslulciuc May 16, 2022 19:14
@collado-mike collado-mike force-pushed the job_parent_hierarchy_api_changes branch 3 times, most recently from 500105b to 1f0e2fe Compare May 16, 2022 21:41
@codecov

codecov Bot commented May 16, 2022


Codecov Report

Merging #1992 (9d97708) into main (9d97708) will not change coverage.
The diff coverage is n/a.

❗ Current head 9d97708 differs from pull request most recent head b64b7f9. Consider uploading reports for the commit b64b7f9 to get more accurate results

@@            Coverage Diff            @@
##               main    #1992   +/-   ##
=========================================
  Coverage     78.62%   78.62%           
  Complexity     1003     1003           
=========================================
  Files           197      197           
  Lines          5459     5459           
  Branches        424      424           
=========================================
  Hits           4292     4292           
  Misses          723      723           
  Partials        444      444           

UUID symlinkTargetId,
PGobject inputs) {
UUID jobUuid =
upsertJobNoParent(

@wslulciuc wslulciuc May 18, 2022

Member

Wouldn't we want the upsertJobNoParent() call to return the JobRow object, similar to other upsert calls? This would keep contracts the same across DAOs and also avoid the subsequent findJobByUuidAsRow() call.

Collaborator Author

Sorry, do you mean that the upsertJobNoParent query would go back to RETURNING * instead of RETURNING uuid? If that's what you mean, I made this change so that the subsequent findJobByUuidAsRow call queries the jobs_view - returning the FQN rather than the simple name.

Comment thread api/src/main/java/marquez/db/OpenLineageDao.java
p -> {
if (event.getJob().getName().startsWith(p.getName() + '.')) {
return event.getJob().getName().substring(p.getName().length() + 1);
} else {

Member

Minor: We may want to move this into a DbUtils class to handle parsing the simple name:

DbUtils.simpleJobNameFor()
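
A minimal sketch of what such a helper might look like, based on the prefix-stripping logic in the snippet under review. The class name `DbUtilsSketch`, the signature, and the nullable `parentJobName` parameter are assumptions; the `else` branch is not visible in the diff context, so returning the job name unchanged is also an assumption.

```java
/** Hypothetical helper following the reviewer's suggested name; not Marquez code. */
final class DbUtilsSketch {
  private DbUtilsSketch() {}

  /**
   * Strips the "parent." prefix from a job name when the job name starts
   * with its parent's name followed by a dot. Otherwise returns the job
   * name unchanged (assumed fallback; the original else branch is elided).
   */
  static String simpleJobNameFor(String jobName, String parentJobName) {
    if (parentJobName != null && jobName.startsWith(parentJobName + ".")) {
      // Skip the parent name and the separating dot.
      return jobName.substring(parentJobName.length() + 1);
    }
    return jobName;
  }
}
```

For example, a job named `etl_dag.load_task` with parent `etl_dag` would resolve to the simple name `load_task`, while a job with no parent keeps its full name.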

Comment thread api/src/main/java/marquez/db/OpenLineageDao.java
Comment thread api/src/main/java/marquez/db/RunDao.java

@wslulciuc wslulciuc left a comment

Member

@collado-mike left some minor comments, but otherwise great work 💯 💯 🥇

As for keeping the jobs_view, here are my thoughts (also in response to our offline discussion):

Upside:

Queries remain simple, meaning when querying the jobs table, the name column is still the simple name of the job, not the FQN. Also, the web UI should display the simple name of the job, and depending on how jobs are named, parsing the FQN for display may result in the wrong name being used (not ideal). I think, given the scope of the change, jobs_view allows us to avoid any unknown cases around job naming. The view can also be seen as a migration step toward eventually having the name column in the jobs table be the FQN. Meaning, we can add a simple_name column to the jobs table, ensuring the simple name and FQN are clearly defined, and possibly drop the view altogether (or keep it around, as there are clear benefits).

Downside:

I think having both the name and simple_name columns in the jobs table would ensure the FQN or the simple name is always referenced correctly (outside just the view). But a deeper discussion on how much benefit this provides can be had, as the REST API is how metadata should be queried in the first place.

Comment thread api/src/main/java/marquez/db/RunDao.java
Comment thread api/src/test/java/marquez/OpenLineageIntegrationTest.java
}

@Test
public void testOpenLineageJobHierarchy()

Member

Minor: Since this test is Airflow specific, I would name the test testOpenLineageJobHierarchyForAirflow()

Comment thread api/src/main/java/marquez/service/models/Job.java
@collado-mike collado-mike force-pushed the job_parent_hierarchy_backfills branch from a53fe74 to 55e0c6e Compare May 20, 2022 22:34
Base automatically changed from job_parent_hierarchy_backfills to main May 20, 2022 22:39
Signed-off-by: Michael Collado <collado.mike@gmail.com>
…s with parents

Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
@collado-mike collado-mike force-pushed the job_parent_hierarchy_api_changes branch from b388088 to b64b7f9 Compare May 20, 2022 22:41
@collado-mike collado-mike enabled auto-merge (squash) May 20, 2022 22:41
@collado-mike collado-mike merged commit dd5f53f into main May 20, 2022
@collado-mike collado-mike deleted the job_parent_hierarchy_api_changes branch May 20, 2022 22:46
jonathanpmoraes referenced this pull request in nubank/NuMarquez Feb 6, 2025
* Update API to write parent jobs and parent runs for OpenLineage events

Signed-off-by: Michael Collado <collado.mike@gmail.com>

* Added integration tests to verify behavior when recording Airflow jobs with parents

Signed-off-by: Michael Collado <collado.mike@gmail.com>

* Added integration test for airflow -> spark task recording

Signed-off-by: Michael Collado <collado.mike@gmail.com>

* Added simpleName field to Job model and updated integration tests

Signed-off-by: Michael Collado <collado.mike@gmail.com>

* Added parent job name to job models

Signed-off-by: Michael Collado <collado.mike@gmail.com>

Development

Successfully merging this pull request may close these issues.

Supporting Job grouping and hierarchy in Marquez

2 participants