Add metadata cmd #2091

Merged: wslulciuc merged 11 commits into main from feature/metadata-cmd on Aug 26, 2022
@wslulciuc wslulciuc commented Aug 26, 2022

Problem

There's currently no good way to performance test the data model of Marquez with significantly large OL events (see #2076).

Solution

Add cmd metadata to generate OpenLineage events; generated events will be saved to a file called metadata.json that can be used to seed Marquez via the seed cmd (sweet, right!?):

$ java -jar marquez-api.jar metadata --help
usage: java -jar marquez-api.jar
       metadata [--runs RUNS] [--bytes-per-event BYTES-PER-EVENT] [-o OUTPUT] [-h]

generate random metadata using the OpenLineage standard

named arguments:
  --runs RUNS            limits OL runs up to N (default: 25)
  --bytes-per-event BYTES-PER-EVENT
                         size (in bytes) per OL event (default: 33404)
  -o OUTPUT, --output OUTPUT
                         the output metadata file (default: metadata.json)
  -h, --help             show this help message and exit
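
To make the --bytes-per-event option concrete, here is a minimal sketch of one way an event could be sized to a fixed number of bytes. It is not the code in MetadataCommand.java: the class PaddedEventSketch, the helper newEvent(), the producer URL, and the idea of padding the job's documentation description are all assumptions made for illustration. The JSON is also a simplified OpenLineage-style event (facet metadata such as _producer and _schemaURL is left out), and the namespace and job name are assumed to need no JSON escaping.

```java
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.UUID;

public class PaddedEventSketch {
  /**
   * Builds a simplified OpenLineage-style START event and pads the job's
   * documentation description so the UTF-8 encoded event is targetBytes long.
   * Hypothetical helper; if targetBytes is smaller than the unpadded event,
   * the unpadded event is returned as-is.
   */
  static String newEvent(String namespace, String jobName, int targetBytes) {
    String head =
        "{\"eventType\":\"START\""
            + ",\"eventTime\":\"" + Instant.now() + "\""
            + ",\"producer\":\"https://example.com/sketch\"" // placeholder URL
            + ",\"run\":{\"runId\":\"" + UUID.randomUUID() + "\"}"
            + ",\"inputs\":[],\"outputs\":[]"
            + ",\"job\":{\"namespace\":\"" + namespace + "\",\"name\":\"" + jobName + "\""
            + ",\"facets\":{\"documentation\":{\"description\":\"";
    String tail = "\"}}}}";

    // Size of the event before padding, measured in encoded bytes (not chars).
    int base = (head + tail).getBytes(StandardCharsets.UTF_8).length;
    // ASCII padding, so one char adds exactly one byte.
    int padding = Math.max(0, targetBytes - base);
    return head + "x".repeat(padding) + tail;
  }

  public static void main(String[] args) {
    // 33404 is the default for --bytes-per-event shown in the help output.
    String event = newEvent("default", "job_0", 33404);
    System.out.println(event.getBytes(StandardCharsets.UTF_8).length); // prints 33404
  }
}
```

Measuring the encoded byte length rather than the string length keeps the target accurate even when the namespace or job name contains non-ASCII characters.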

When seeding Marquez with generated events, we can now observe query performance via pghero. Running:

$ ./docker/up.sh --build

starts the marquez-api, marquez-web, and marquez-db containers, and now pghero as well. Query stats aren't enabled by default; you'll need to manually enable query profiling via the UI by browsing to http://localhost:8080:

[Screenshot: Screen Shot 2022-08-26 at 12 08 59 AM]

Limitations of metadata cmd

As follow-up work, we'll want to:

  • Expose an option to set the number of inputs / outputs per event with --inputs-per-event / --outputs-per-event
  • Expose an option for input / output schemas to have very large field names and descriptions (or just randomize the field name length and description length given some range 5...N)
  • Link upstream and downstream jobs (randomly). Currently, all jobs have unique I/O datasets; therefore, a lineage graph consists only of a single job node and its I/O datasets (not all that interesting!)
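
As a rough illustration of the last item, the sketch below shows one possible way to link jobs randomly: every job after the first reads the output dataset of a randomly chosen earlier job, so the jobs form one connected lineage graph instead of isolated nodes. This is not part of the PR; the class LinkedJobsSketch, the Job model (one input, one output), and the dataset naming are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class LinkedJobsSketch {
  /** A job with a single input and a single output dataset (simplified, hypothetical model). */
  static final class Job {
    final String name;
    final String input;
    final String output;

    Job(String name, String input, String output) {
      this.name = name;
      this.input = input;
      this.output = output;
    }
  }

  /** Generates n jobs where each job (except the first) consumes an earlier job's output. */
  static List<Job> newLinkedJobs(int n, Random random) {
    List<Job> jobs = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      // The first job reads a standalone source dataset; every later job reads
      // the output of a randomly chosen earlier job, linking it downstream.
      String input = (i == 0) ? "dataset_source" : jobs.get(random.nextInt(i)).output;
      jobs.add(new Job("job_" + i, input, "dataset_" + i));
    }
    return jobs;
  }

  public static void main(String[] args) {
    for (Job job : newLinkedJobs(5, new Random())) {
      System.out.println(job.input + " -> " + job.name + " -> " + job.output);
    }
  }
}
```

Because each job only picks from jobs created before it, the resulting graph is acyclic by construction, which matches how lineage graphs are normally shaped.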

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

Signed-off-by: wslulciuc <willy@datakin.com>
@wslulciuc wslulciuc added the review Ready for review label Aug 26, 2022
codecov Bot commented Aug 26, 2022

Codecov Report

Merging #2091 (ddf8630) into main (07ba426) will decrease coverage by 2.09%.
The diff coverage is 8.21%.

❗ Current head ddf8630 differs from pull request most recent head f6d0536. Consider uploading reports for the commit f6d0536 to get more accurate results

@@             Coverage Diff              @@
##               main    #2091      +/-   ##
============================================
- Coverage     77.04%   74.94%   -2.10%     
- Complexity     1013     1017       +4     
============================================
  Files           201      202       +1     
  Lines          4643     4789     +146     
  Branches        389      393       +4     
============================================
+ Hits           3577     3589      +12     
- Misses          628      762     +134     
  Partials        438      438              
Impacted Files                                        Coverage Δ
api/src/main/java/marquez/cli/MetadataCommand.java    7.58% <7.58%> (ø)
api/src/main/java/marquez/MarquezApp.java             65.33% <100.00%> (+0.46%) ⬆️

@wslulciuc wslulciuc enabled auto-merge (squash) August 26, 2022 08:07
@wslulciuc wslulciuc disabled auto-merge August 26, 2022 08:23
Comment thread api/src/main/java/marquez/cli/MetadataCommand.java Outdated
@wslulciuc wslulciuc removed the review Ready for review label Aug 26, 2022
@wslulciuc wslulciuc merged commit cf44452 into main Aug 26, 2022
@wslulciuc wslulciuc deleted the feature/metadata-cmd branch August 26, 2022 22:15
jonathanpmoraes referenced this pull request in nubank/NuMarquez Feb 6, 2025
* Add metadata.json to .gitignore

Signed-off-by: wslulciuc <willy@datakin.com>

* Add psql conf for pghero

Signed-off-by: wslulciuc <willy@datakin.com>

* Add pghero

Signed-off-by: wslulciuc <willy@datakin.com>

* Add metadata cmd

Signed-off-by: wslulciuc <willy@datakin.com>

* Update javadocs

Signed-off-by: wslulciuc <willy@datakin.com>

* Add steps to enable query stats with pghero

Signed-off-by: wslulciuc <willy@datakin.com>

* Give pghero superuser access

Signed-off-by: wslulciuc <willy@datakin.com>

* Update cmd arg constant for --bytes-per-event

Signed-off-by: wslulciuc <willy@datakin.com>

* Simplify newOlEvents()

Signed-off-by: wslulciuc <willy@datakin.com>

Signed-off-by: wslulciuc <willy@datakin.com>