Skip to content

OL facets - PR1 - create & write new events to new tables while not reading them#2350

Merged
wslulciuc merged 2 commits into
mainfrom
ol-facets/PR1-feed-new-tables
Jan 17, 2023
Merged

OL facets - PR1 - create & write new events to new tables while not reading them#2350
wslulciuc merged 2 commits into
mainfrom
ol-facets/PR1-feed-new-tables

Conversation

@pawel-big-lebowski

Copy link
Copy Markdown
Collaborator

Signed-off-by: Pawel Leszczynski leszczynski.pawel@gmail.com

Problem

Facets are stored in lineage_events table in JSONs containing whole events. This PR is the first among three PRs to create and make use of OpenLineage facet tables.

This PR was created as a part of #2152. Content provided within this PR is mostly authored by @wslulciuc.

Solution

  • Create job_facets, run_facets and dataset_facets tables.
  • Insert data to newly created tables for the new events.
  • Reading from those tables should be avoided until data is migrated. Migration procedure will be provided soon.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg Bot added api API layer changes docker labels Jan 9, 2023
@pawel-big-lebowski pawel-big-lebowski force-pushed the ol-facets/PR1-feed-new-tables branch from 9cfdd83 to 19e093b Compare January 9, 2023 09:36
@codecov

codecov Bot commented Jan 9, 2023

Copy link
Copy Markdown

Codecov Report

Merging #2350 (be0f179) into main (656b2e6) will increase coverage by 0.08%.
The diff coverage is 81.14%.

❗ Current head be0f179 differs from pull request most recent head d3f4829. Consider uploading reports for the commit d3f4829 to get more accurate results

@@             Coverage Diff              @@
##               main    #2350      +/-   ##
============================================
+ Coverage     76.72%   76.81%   +0.08%     
- Complexity     1177     1195      +18     
============================================
  Files           222      226       +4     
  Lines          5354     5473     +119     
  Branches        429      443      +14     
============================================
+ Hits           4108     4204      +96     
- Misses          768      772       +4     
- Partials        478      497      +19     
Impacted Files Coverage Δ
...ain/java/marquez/db/mappers/DatasetDataMapper.java 85.18% <ø> (ø)
...rc/main/java/marquez/db/mappers/JobDataMapper.java 87.50% <ø> (ø)
.../java/marquez/db/models/ColumnLineageNodeData.java 100.00% <ø> (ø)
.../src/main/java/marquez/service/LineageService.java 80.30% <ø> (ø)
.../main/java/marquez/service/models/DatasetData.java 100.00% <ø> (ø)
.../src/main/java/marquez/service/models/JobData.java 66.66% <ø> (ø)
api/src/main/java/marquez/service/models/Node.java 61.53% <ø> (ø)
api/src/main/java/marquez/service/models/Run.java 82.50% <ø> (ø)
api/src/main/java/marquez/db/Columns.java 79.74% <60.00%> (-2.87%) ⬇️
api/src/main/java/marquez/db/FacetUtils.java 66.66% <66.66%> (ø)
... and 4 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@pawel-big-lebowski pawel-big-lebowski force-pushed the ol-facets/PR1-feed-new-tables branch 2 times, most recently from a026783 to 34311f3 Compare January 10, 2023 08:19
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review January 10, 2023 08:25
@pawel-big-lebowski pawel-big-lebowski self-assigned this Jan 10, 2023
@pawel-big-lebowski pawel-big-lebowski changed the title OL facets - PR1 - create & feed new tables while not reading them OL facets - PR1 - create & write new events to new tables while not reading them Jan 10, 2023
@pawel-big-lebowski pawel-big-lebowski mentioned this pull request Jan 10, 2023
7 tasks
Comment on lines +34 to +44
enum DatasetFacet implements Facet {
DOCUMENTATION(Type.DATASET, "documentation") {
Optional<PGobject> toPgObject(LineageEvent.DatasetFacets facets) {
return Optional.ofNullable(facets.getDocumentation()).map(this::toPgObject);
}
},
SCHEMA(Type.DATASET, "schema") {
Optional<PGobject> toPgObject(LineageEvent.DatasetFacets facets) {
return Optional.ofNullable(facets.getSchema()).map(this::toPgObject);
}
},
DATASOURCE(Type.DATASET, "dataSource") {
Optional<PGobject> toPgObject(LineageEvent.DatasetFacets facets) {
return Optional.ofNullable(facets.getDataSource()).map(this::toPgObject);
}
},
DESCRIPTION(Type.DATASET, "description") {
Optional<PGobject> toPgObject(LineageEvent.DatasetFacets facets) {
return Optional.ofNullable(facets.getDescription()).map(this::toPgObject);
}
},
LIFECYCLE_STATE_CHANGE(Type.DATASET, "lifecycleStateChange") {
Optional<PGobject> toPgObject(LineageEvent.DatasetFacets facets) {
return Optional.ofNullable(facets.getLifecycleStateChange()).map(this::toPgObject);
}
},
VERSION(Type.DATASET, "version"),
COLUMN_LINEAGE(Type.DATASET, "columnLineage") {
Optional<PGobject> toPgObject(LineageEvent.DatasetFacets facets) {
return Optional.ofNullable(facets.getColumnLineage()).map(this::toPgObject);
}
},
OWNERSHIP(Type.DATASET, "ownership"),
DATA_QUALITY_METRICS(Type.INPUT, "dataQualityMetrics"),
DATA_QUALITY_ASSERTIONS(Type.INPUT, "dataQualityAssertions"),
OUTPUT_STATISTICS(Type.OUTPUT, "outputStatistics");

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed briefly the other day, but these enums should be unnecessary, especially the facets that aren't mapped directly to Java fields - e.g., outputStatistics, dataQualityMetrics, etc.). Your call to datasetFacets.getAdditionalFacets() below should fetch all facets that haven't been mapped to a field in the DatasetFacets class. But since you have to convert everything to JSON anyway, why not convert the whole DatasetFacets class to a generic JsonNode object? You could do something like

    JsonNode jsonNode = Utils.getMapper().valueToTree(facets);
    StreamSupport.stream(Spliterators.spliteratorUnknownSize(jsonNode.fieldNames(), Spliterator.DISTINCT),
    false)
        .forEach(fieldName -> {
          storeFacet(fieldName, toPgObject(jsonNode.get(fieldName)));
        });

You'll be able to support all currently known facets as well as new facets and custom facets without ever having to update the server code.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got rid of enums in RunFacetsDao and JobFacetsDao.
We still need it in a reduced form in DatasetFacetsDao to extract facet type (DATASET, INPUT, OUTPUT or UNKNOWN).

@pawel-big-lebowski pawel-big-lebowski force-pushed the ol-facets/PR1-feed-new-tables branch 2 times, most recently from 3b63ae1 to c89504d Compare January 12, 2023 08:49
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@pawel-big-lebowski pawel-big-lebowski force-pushed the ol-facets/PR1-feed-new-tables branch from c89504d to be0f179 Compare January 13, 2023 13:28

@wslulciuc wslulciuc left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @pawel-big-lebowski 💯 🥇

@wslulciuc wslulciuc merged commit 6436ddc into main Jan 17, 2023
@wslulciuc wslulciuc deleted the ol-facets/PR1-feed-new-tables branch January 17, 2023 10:01
jonathanpmoraes referenced this pull request in nubank/NuMarquez Feb 6, 2025
…350)

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
Co-authored-by: Willy Lulciuc <willy@datakin.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api API layer changes docker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants