add lifecycleStateChange support #1847
Codecov Report
@@ Coverage Diff @@
## main #1847 +/- ##
============================================
+ Coverage 77.90% 78.09% +0.18%
- Complexity 937 944 +7
============================================
Files 193 193
Lines 5218 5249 +31
Branches 418 418
============================================
+ Hits 4065 4099 +34
+ Misses 706 705 -1
+ Partials 447 445 -2
@pawel-big-lebowski: though I agree we should handle the After looking over the initial issue on adding table operation support (and the facet definition), Marquez should perform certain actions based on the table state change. That is, for a given state change below the following actions are defined:
If we refer back to issue #1800, @collado-mike mentions:
The actions defined above would reflect the new state of the table, which is the intended goal. As for displaying or recording the state change as proposed in this PR?
Thanks @wslulciuc for the extensive comment. I think the issue is that we started with table state changes in Spark and wanted to implement them in Marquez without being aware of the whole dataset context within Marquez. Like you said, the approach may not fit other dataset types, such as files or streams, well. I agree with the column names. Based on the knowledge collected, I think we should start with: OpenLineage/OpenLineage#518
One question remains open: what do we want to do with this information on the frontend? There are at least two options:
@julienledem @wslulciuc What do you think?
We decided to proceed in the following way:
Great work, @pawel-big-lebowski! I have a few minor suggestions before we can merge this sweet feature:
- Mind using just `latest_lifecycle_state` and `lifecycle_state` for the column names to represent states? In OpenLineage, the facet does capture the state change (so the naming is appropriate), but in Marquez the column will represent the current state at a given time in the dataset lifecycle. You can think about how the run states are stored in the `runs` table.
- I think we should also add the column `latest_lifecycle_state` to the `datasets` table?
- With #1925 merged, you'll need to update your SQL migration file to use `V41` and above.
- Mind also updating the changelog?
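To illustrate the naming suggestion above, here is a minimal sketch of a column that holds the current state per dataset version, with the latest state read back from the newest version. It uses an in-memory SQLite database purely for illustration. The table layout (`dataset_versions`, `dataset_name`), the helper functions, and the example state values are assumptions, not the actual Marquez schema or migration, and whether the latest state should also be stored on the `datasets` table is left open here.

```python
import sqlite3

# Hypothetical, heavily simplified schema for illustration only;
# the real Marquez tables contain many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE datasets (
        name TEXT PRIMARY KEY
    );
    CREATE TABLE dataset_versions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        dataset_name TEXT NOT NULL REFERENCES datasets(name),
        lifecycle_state TEXT  -- state of the dataset when this version was recorded
    );
    """
)


def record_version(dataset_name, lifecycle_state):
    """Insert a new dataset version carrying the state reported for it."""
    conn.execute(
        "INSERT OR IGNORE INTO datasets (name) VALUES (?)", (dataset_name,)
    )
    conn.execute(
        "INSERT INTO dataset_versions (dataset_name, lifecycle_state) VALUES (?, ?)",
        (dataset_name, lifecycle_state),
    )


def latest_lifecycle_state(dataset_name):
    """Return the state of the most recent version, or None if there is none."""
    row = conn.execute(
        "SELECT lifecycle_state FROM dataset_versions "
        "WHERE dataset_name = ? ORDER BY id DESC LIMIT 1",
        (dataset_name,),
    ).fetchone()
    return row[0] if row else None
```

Deriving the latest state with a query avoids keeping two columns in sync; a denormalized `latest_lifecycle_state` column would trade that for a cheaper read.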
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
Great comments, @wslulciuc. I've made changes for I am not sure whether adding column Let me know if you have another argument for doing that.
Totally agree, thanks for pointing this out! The
wslulciuc left a comment
Great work, @pawel-big-lebowski 💯 💯 💯

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
Problem
Marquez lacks support for table operations reported by OpenLineage.
Closes: #1800
Solution
Store the `stateChange` field in the backend database and expose the property over the API.
Checklist
- `CHANGELOG.md` with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
- `.sql` database schema migration according to Flyway's naming convention (if relevant)
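As a rough illustration of the Solution described above, the sketch below pulls the state out of a dataset entry in an incoming event and shapes it for an API response. The payload layout (a `lifecycleStateChange` facet carrying a `lifecycleStateChange` value) is assumed from the OpenLineage facet named in the PR title. The function names and the `lifecycleState` response key are hypothetical and are not taken from the Marquez codebase.

```python
from typing import Any, Dict, Optional

# Assumed facet key, taken from the PR title; the nested value key is
# assumed to share the same name.
FACET_KEY = "lifecycleStateChange"


def extract_lifecycle_state(dataset: Dict[str, Any]) -> Optional[str]:
    """Return the reported state (for example "DROP"), or None if absent."""
    facets = dataset.get("facets") or {}
    facet = facets.get(FACET_KEY) or {}
    return facet.get(FACET_KEY)


def to_api_view(dataset: Dict[str, Any]) -> Dict[str, Any]:
    """Build a hypothetical API representation that exposes the state."""
    return {
        "namespace": dataset.get("namespace"),
        "name": dataset.get("name"),
        "lifecycleState": extract_lifecycle_state(dataset),  # hypothetical key
    }
```

Datasets that do not report the facet simply yield `None`, so file or stream datasets that have no notion of table operations are unaffected.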