Skip to content

Add java migrations for backfilling runs with job uuids and parents#1980

Merged
collado-mike merged 2 commits into
mainfrom
job_parent_hierarchy_backfills
May 20, 2022
Merged

Add java migrations for backfilling runs with job uuids and parents#1980
collado-mike merged 2 commits into
mainfrom
job_parent_hierarchy_backfills

Conversation

@collado-mike

Copy link
Copy Markdown
Collaborator

Problem

Continuation of https://github.com/MarquezProject/marquez/pull/1935/files , this adds the migration scripts for populating job parents for Airflow runs and other runs that have explicit parent run ids. Special handling is given to the Airflow tasks to match the behavior in OpenLineage. Unit tests are provided to verify the behavior of the backfill scripts

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)

@collado-mike collado-mike requested a review from wslulciuc May 7, 2022 00:00
@codecov

codecov Bot commented May 7, 2022

Copy link
Copy Markdown

Codecov Report

Merging #1980 (55e0c6e) into main (0021a2b) will increase coverage by 0.42%.
The diff coverage is 87.26%.

@@             Coverage Diff              @@
##               main    #1980      +/-   ##
============================================
+ Coverage     78.20%   78.62%   +0.42%     
- Complexity      955     1003      +48     
============================================
  Files           194      197       +3     
  Lines          5303     5459     +156     
  Branches        420      424       +4     
============================================
+ Hits           4147     4292     +145     
- Misses          713      723      +10     
- Partials        443      444       +1     
Impacted Files Coverage Δ
api/src/main/java/marquez/common/Utils.java 85.33% <80.00%> (-0.83%) ⬇️
...b/migrations/V44_1__BackfillAirflowParentRuns.java 87.27% <87.27%> (ø)
...z/db/migrations/V44_2_BackfillJobsWithParents.java 88.46% <88.46%> (ø)
...ez/db/migrations/V43_1__UpdateRunsWithJobUUID.java 89.28% <89.28%> (ø)
api/src/main/java/marquez/db/FlywayFactory.java 96.96% <100.00%> (+0.09%) ⬆️
api/src/main/java/marquez/service/models/Job.java 74.07% <0.00%> (+3.70%) ⬆️
...main/java/marquez/service/models/LineageEvent.java 83.05% <0.00%> (+6.77%) ⬆️
...pi/src/main/java/marquez/db/mappers/JobMapper.java 77.41% <0.00%> (+12.90%) ⬆️

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@collado-mike collado-mike force-pushed the job_parent_hierarchy_backfills branch 2 times, most recently from 08dc5c3 to d7082b0 Compare May 10, 2022 18:57

@wslulciuc wslulciuc left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, but great to learn about writing migrations as java classes for flyway! I think backfills in this format make a lot of sense 💯 💯 🥇

private static final String FIND_AIRFLOW_PARENT_RUNS_SQL =
"""
SELECT DISTINCT(run_uuid) AS run_uuid,
e.parent_run_id,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming you branched off PR #1935 to continue your work before the PR was merge, but just wanted to point out you'll want to use parent_run_uuid and parent_job_uuid 😅 (see my earlier comment)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, this is the OpenLineage event!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, this is the OpenLineage event!

public static final String INSERT_PARENT_RUN_QUERY =
"""
INSERT INTO runs (uuid, created_at, updated_at, current_run_state, external_id, namespace_name, job_name, job_uuid, location, transitioned_at, started_at, ended_at)
SELECT :parentRunUuid, created_at, updated_at, current_run_state, :externalRunid, :namespace, :jobName, :parentJobUuid, location, transitioned_at, started_at, ended_at

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: externalRunid -> externalRunId (though looks like it still works, so maybe more of a naming convention thing)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same typo is present here and in the binding on line 147. If they didn't match, it would throw an exception

import org.jdbi.v3.core.result.ResultProducers;

@Slf4j
public class V44_2_BackfillJobsWithParents implements JavaMigration {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I'd make it explicit that it's for parent runs BackfillJobsWithParentRuns

import org.jdbi.v3.core.result.ResultProducers;

@Slf4j
public class V44_2_BackfillJobsWithParents implements JavaMigration {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I'd make it explicit that it's for parent runs BackfillJobsWithParentRuns

WHERE job_name=j.name AND namespace_name=j.namespace_name
ORDER BY transitioned_at DESC
LIMIT 1
) r ON true

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fancy. Not sure how ON true works though?

WHERE job_name=j.name AND namespace_name=j.namespace_name
ORDER BY transitioned_at DESC
LIMIT 1
) r ON true

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fancy. Not sure how ON true works though?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://medium.com/kkempin/postgresqls-lateral-join-bfd6bd0199df . Basically, the lateral join subquery is executed for each record returned from the jobs table - which is why we can reference j.name and j.namespace_name here on line 28. Since each returned run is computed from the jobs record it's supposed to be joined to, there's no need to do anything more to join, like compare job_uuid. So ON true

assertThat(jobByName)
.isPresent()
.get()
.hasFieldOrPropertyWithValue("name", new JobName(parentName + "." + task1Name));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Base automatically changed from job_parent_hierarchy to main May 20, 2022 21:33
@collado-mike collado-mike enabled auto-merge (squash) May 20, 2022 21:39
…Airflow runs

Signed-off-by: Michael Collado <collado.mike@gmail.com>

Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
@collado-mike collado-mike force-pushed the job_parent_hierarchy_backfills branch from a53fe74 to 55e0c6e Compare May 20, 2022 22:34
@collado-mike collado-mike merged commit 9d97708 into main May 20, 2022
@collado-mike collado-mike deleted the job_parent_hierarchy_backfills branch May 20, 2022 22:39
jonathanpmoraes referenced this pull request in nubank/NuMarquez Feb 6, 2025
…#1980)

* Add migrations to support job parent relationship storage

Signed-off-by: Michael Collado <collado.mike@gmail.com>

* Update all job and run queries to reference jobs_view and runs_view

Signed-off-by: Michael Collado <collado.mike@gmail.com>

* Remove references to simple_name as job redirects handle redirecting simple name to fqn
added unit test to verify

Signed-off-by: Michael Collado <collado.mike@gmail.com>

* Fix runs migration script

Signed-off-by: Michael Collado <collado.mike@gmail.com>

* Add java migrations for backfilling runs with job uuids and backfill Airflow runs
Signed-off-by: Michael Collado <collado.mike@gmail.com>

Signed-off-by: Michael Collado <collado.mike@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants