Skip to content

Commit 57b00d0

Browse files
extend existing proposal with comment suggestion
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
1 parent 258f38c commit 57b00d0

1 file changed

Lines changed: 92 additions & 46 deletions

File tree

Lines changed: 92 additions & 46 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Proposal: Column lineage endpoint proposal
22

3-
Author(s): @julienledem
3+
Author(s): @julienledem, @pawel-big-lebowski
44

55
Created: 20022-08-18
66

@@ -15,19 +15,27 @@ Dicussion: [column lineage endpoint issue #2045](https://github.com/MarquezProje
1515

1616
### Existing elements
1717

18-
- OpenLineage defines a [column-level lineage facet]- (https://github.com/OpenLineage/OpenLineage/blob/ff0d87d30ed6c9fe39472788948266a6d3190585/spec/facets/ColumnLineageDatasetFacet.md).
18+
- OpenLineage defines a [column-level lineage facet]- (https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/ColumnLineageDatasetFacet.json).
1919
- Marquez has a lineage endpoint `GET /api/v1/lineage` that returns the current lineage graph connected to a job or a dataset
2020

21+
### Column lineage characteristics and general assumptions
22+
23+
Column level lineage is a different lineage graph due to a different node granularity - kind of zoomed-in view of existing lineage. Instead of datasets and jobs being lineage graph nodes, each dataset field becomes a node. Additionally, there are edges between dataset fields, instead of datasets itself. Thus, enriching existing lineage with column lineage information would not be sufficient. That’s why we propose another API endpoint with column lineage graph.
24+
25+
Upstream and downstream edges do have different characteristics. An output dataset is always produced by a single version of input dataset (one upstream), while a single input datset version can have multiple output dataset versions. Lineage graph can be then easily flooded by downstream subgraph which blurs the overall view. That's why we consider an upstream column lineage as a default one. Downstream lineage will be returned only when requested explicitly.
26+
2127
### New Elements
28+
2229
We propose to add the following:
23-
- Add column lineage to the lineage endpoint
24-
- A new column-lineage endpoint leveraging the column lineage facet to retrieve lineage for a given column.
25-
- Point-in-time upstream (dataset or column level) lineage given a version of a dataset.
30+
31+
- Add column lineage to the lineage endpoint as a part of dataset.
32+
- A new column-lineage endpoint leveraging the column lineage facet to retrieve lineage for a given column.
33+
- Point-in-time upstream (dataset or column level) lineage given a version of a dataset.
2634

2735
## Proposal
2836

29-
### add column lineage to existing endpoint
30-
In the GET /lineage api, add column lineage to DATASET nodes' data
37+
### Add column lineage to existing endpoint
38+
In the `GET /lineage` api, add column lineage to `DATASET` nodes' data:
3139
```diff
3240
{
3341
"id": "dataset:food_delivery:public.categories",
@@ -66,78 +74,116 @@ In the GET /lineage api, add column lineage to DATASET nodes' data
6674
}
6775
```
6876

69-
### add a column-level-lineage endpoint:
77+
The implementation here can reuse `columnLineage` facet classes.
78+
79+
80+
### Add a column-level-lineage endpoint:
7081

82+
New endpoints to retrieve a column lineage of a single field or a whole dataset will be added:
7183
```
72-
GET /column-lineage?nodeId=dataset:food_delivery:public.delivery_7_days&column=a
84+
GET /column-lineage?nodeId=dataset:{namespace}:{dataset}
85+
GET /column-lineage?nodeId=datasetField:{namespace}:{dataset}:{field}
86+
```
87+
For example:
88+
```
89+
GET /column-lineage?nodeId=dataset:food_delivery:public.delivery_7_days
90+
GET /column-lineage?nodeId=datasetField:food_delivery:public.delivery_7_days:a
7391
```
74-
`column` is a ne parameter that must be a column in the schema of the provided dataset `nodeId`.
7592

76-
The logic is layered on the existing lineage endpoint, filtering down to the datasets that contribute to that column.
77-
It only returns dataset nodes.
93+
Although creating a new endpoint, we would like to reuse existing data structures with a new `NodeType.FIELD` introduced.
7894

79-
```diff
95+
The logic returns dataset field node:
96+
97+
```
98+
GET /column-lineage?nodeId=datasetField:db1:table1:a
99+
...
80100
{
81101
graph: [
82102
{
83-
"id": "dataset:db1:table2",
84-
"type": "DATASET",
85-
data: {
86-
namespace: "DB1",
87-
name: "table2",
88-
> columnLineage: {
89-
> "a": {
90-
> inputFields: [
91-
> {namespace: "DB1", name: "table1, "field": "a"}
92-
> ],
93-
> transformationDescription: "identical",
94-
> transformationType: "IDENTITY"
95-
> },
96-
> "b": ... other output fields
97-
> }
103+
"id": "datasetField:db1:table1:a",
104+
"type": "DATASET_FIELD",
105+
"data": {
106+
"namespace": "DB1",
107+
"name": "table2",
108+
"field": "a",
109+
"type": "integer",
110+
"transformationDescription": "identical",
111+
"transformationType": "IDENTITY",
112+
"inputFields": [
113+
{ "namespace": "DBA", "name": "tableA", "field": "columnA"},
114+
{ "namespace": "DBB", "name": "tableB", "field": "columnB"},
115+
{ "namespace": "DBC", "name": "tableC", "field": "columnC"}
116+
]
117+
"inEdges": [
118+
{
119+
"origin": "datasetField:db1:table1:a",
120+
"destination": "datasetField:DBA:tableA:columnA"
121+
},
122+
{
123+
"origin": "datasetField:db1:table1:a",
124+
"destination": "datasetField:DBB:tableB:columnB"
125+
},
126+
{
127+
"origin": "datasetField:db1:table1:a",
128+
"destination": "datasetField:DBB:tableB:columnC"
129+
}
130+
],
98131
},
99132
...
133+
# Input fields, present within "inEdges", can be also returned within a graph due to a `depth` parameter greate than 0.
100134
}
101135
]
102136
}
103137
```
104138

139+
The `depth` parameter controls how many edges, from a given dataset field, shall be returned. The default is set to `0`. In case of default equal `1`, each `inputField` will be returned as a separate node within a response graph with `inputFields` used to produce it. Please note that extending depth may increase the graph size and affect request performance.
140+
141+
The endpoints above fetches upstream column-lineage for given dataset field or all fields within a dataset. Downstream column lineage is turned off by default. However, this can be turned on with an extra `withDownstream` parameter like:
142+
143+
```
144+
GET /column-lineage?nodeId=datasetField:food_delivery:public.delivery_7_days:a&withDownstream=true
145+
146+
```
147+
This will include `outEdges` within the returned node of the graph.
148+
149+
105150
### Point in time upstream lineage
106-
return historical upstream lineage from a given Dataset version.
107-
This adds the version element to the nodeId in both the existing `/api/v1/lineage` and newly proposed `/api/v1/column-lineage` endpoint
151+
152+
Changes related to `/api/v1/lineage` are out of scope of this proposal which will only include point in time lineage for newly proposed `/api/v1/column-lineage` endpoint:
108153
```
109-
GET /lineage?nodeId=dataset:food_delivery:public.delivery_7_days:{version}
110-
GET /column-lineage?nodeId=dataset:food_delivery:public.delivery_7_days:{version}&column=a
154+
GET /column-lineage?nodeId=dataset_field:food_delivery:public.delivery_7_days:a&datasetVersion=123e4567-e89b-12d3-a456-426614174000
155+
GET /column-lineage?nodeId=dataset_field:food_delivery:public.delivery_7_days:a&lineageAt=1661846242
111156
```
112-
This returns only upstream lineage in this current proposal. This is because upstream lineage is well defined to a specific version while downstream lineage is not. The data payload would add a version field.
157+
158+
Point in time can be controlled by:
159+
* **datasetVersion** - uuid of a specific dataset version,
160+
* **lineageAt** - which contains a unix timestamp.
161+
162+
When **lineageAt** specified, the latest dataset version before timestamp will be found. Regardles **datasetVersion** or **lineageAt** parameters applied, responses will be the same as below:
163+
113164
```diff
114165
{
115166
graph: [
116167
{
117-
< "id": "dataset:db1:table2",
118-
> "id": "dataset:db1:table2#{VERSION UUID}",
119-
"type": "DATASET",
120-
data: {
121-
namespace: "DB1",
122-
name: "table2",
123-
> version: "{VERSION UUID}"
124-
...
125-
}
126-
}
127-
]
168+
< "id": "datasetField:db1:table1:a",
169+
> "id": "datasetField:db1:table1:a#{VERSION UUID}",
170+
"type": "DATASET_FIELD",
171+
"data": {
172+
....
128173
}
129174
```
130175

131176
## Implementation
132177

133178
### columne lineage facet in lineage
134-
Adding the columnLineage facet requires a formatting of existing facet data.
179+
Adding the columnLineage facet requires a formatting of existing facet data (work in progress).
135180
### column lineage endpoint
136181
The `/column-lineage` endpoint leverages the `/lineage` endpoint and then filters down the payload to return the expected result.
137182
### point-in-time upstream lineage
183+
138184
The point-in-time upstream lineage leverages the run to dataset version relation to track back the lineage of a given dataset of job version.
139185
Dataset version -> run that produced it -> consumed Dataset Versions.
140186

141187
## Next Steps
142188

143-
Review of this proposal and production of detailed design for the implementation, in particular for the point in time lineage which might affect the dabtabase schema.
189+
Review of this proposal and production of detailed design for the implementation.:

0 commit comments

Comments
 (0)