Problem:
We need a way to store alternative dataset names. For example, a Hive dataset can be identified either by the location of its data files or by its metastore URI together with the database and table.
Solution in Spec:
SymlinksDatasetFacet -> OpenLineage/OpenLineage#936
Implementation in Marquez:
Model changes:
- Create extra dataset_symlink table in Marquez with columns: (symlinkUid, name, namespaceUid, symlinkType)
- Replace the name field in the datasets table with symlinkUid
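A minimal sketch of the proposed model, using Python's sqlite3 purely for illustration (Marquez itself runs on PostgreSQL). It assumes that all alternative names of one dataset share the same symlinkUid, which the proposal does not state explicitly. The namespaces table, the resolve_dataset helper, the snake_case column names, and the sample Hive names and symlink types are hypothetical.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE namespaces (
    uid  TEXT PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- Assumption: every alternative name of one dataset shares the same symlink_uid.
CREATE TABLE dataset_symlink (
    symlink_uid   TEXT NOT NULL,
    name          TEXT NOT NULL,
    namespace_uid TEXT NOT NULL REFERENCES namespaces(uid),
    symlink_type  TEXT,
    PRIMARY KEY (namespace_uid, name)
);
-- The former name column is replaced by a pointer to the symlink group.
CREATE TABLE datasets (
    uid         TEXT PRIMARY KEY,
    symlink_uid TEXT NOT NULL UNIQUE
);
"""


def resolve_dataset(conn, namespace, name):
    """Return the dataset uid registered under (namespace, name), or None."""
    row = conn.execute(
        """
        SELECT d.uid
        FROM datasets d
        JOIN dataset_symlink s ON s.symlink_uid = d.symlink_uid
        JOIN namespaces n ON n.uid = s.namespace_uid
        WHERE n.name = ? AND s.name = ?
        """,
        (namespace, name),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

dataset_uid, symlink_uid = str(uuid.uuid4()), str(uuid.uuid4())
conn.execute("INSERT INTO datasets VALUES (?, ?)", (dataset_uid, symlink_uid))

# Illustrative names only: one Hive table known by location and by metastore URI.
aliases = [
    ("hdfs://namenode:8020", "/warehouse/sales.db/orders", "LOCATION"),
    ("hive://metastore:9083", "sales.orders", "TABLE"),
]
for namespace, name, symlink_type in aliases:
    namespace_uid = str(uuid.uuid4())
    conn.execute("INSERT INTO namespaces VALUES (?, ?)", (namespace_uid, namespace))
    conn.execute(
        "INSERT INTO dataset_symlink VALUES (?, ?, ?, ?)",
        (symlink_uid, name, namespace_uid, symlink_type),
    )
```

With this layout, looking a dataset up by either of its names goes through dataset_symlink and lands on the same datasets row.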
Implementation follows the proposed DB changes:
First PR -> reflect current behaviour in modified schema
- provide migration SQL for existing instances
- create a dataset_symlink row whenever a dataset is created
- modify SQLs in dataset_version_dao, etc.
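The migration step of the first PR could look roughly like the sketch below. It again uses sqlite3 only to stay self-contained; an actual migration would be plain SQL against Marquez's PostgreSQL schema. The starting datasets layout (uid, namespace_uid, name), the function name migrate_to_symlinks, and the choice to leave the old name column in place rather than drop it are assumptions made for illustration.

```python
import sqlite3
import uuid


def migrate_to_symlinks(conn):
    """Give every existing dataset exactly one dataset_symlink row."""
    conn.execute(
        """
        CREATE TABLE dataset_symlink (
            symlink_uid   TEXT NOT NULL,
            name          TEXT NOT NULL,
            namespace_uid TEXT NOT NULL,
            symlink_type  TEXT,
            PRIMARY KEY (namespace_uid, name)
        )
        """
    )
    conn.execute("ALTER TABLE datasets ADD COLUMN symlink_uid TEXT")

    # Read all rows first so the updates below do not interfere with iteration.
    existing = conn.execute(
        "SELECT uid, namespace_uid, name FROM datasets"
    ).fetchall()
    for dataset_uid, namespace_uid, name in existing:
        symlink_uid = str(uuid.uuid4())
        conn.execute(
            "INSERT INTO dataset_symlink "
            "(symlink_uid, name, namespace_uid, symlink_type) "
            "VALUES (?, ?, ?, NULL)",
            (symlink_uid, name, namespace_uid),
        )
        conn.execute(
            "UPDATE datasets SET symlink_uid = ? WHERE uid = ?",
            (symlink_uid, dataset_uid),
        )
    conn.commit()


# Hypothetical pre-migration state: names still live in the datasets table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets ("
    "uid TEXT PRIMARY KEY, namespace_uid TEXT NOT NULL, name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO datasets VALUES (?, ?, ?)",
    [("d1", "ns1", "orders"), ("d2", "ns1", "customers")],
)
migrate_to_symlinks(conn)
```

After this step each dataset has a single symlink carrying its current name, so behaviour is unchanged while the queries in dataset_version_dao and elsewhere are switched to read names through dataset_symlink.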
Second PR
- Extract symlink facet when posting new OpenLineage event
- fill dataset_symlink with multiple entries per OL event.
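For the second PR, the extraction step might be sketched as follows. This assumes the facet appears under the key symlinks with an identifiers list of namespace, name, and type entries, as in the published SymlinksDatasetFacet; the shape in the referenced spec PR may differ. The function collect_dataset_names and the sample event are hypothetical.

```python
def collect_dataset_names(event):
    """Map each dataset's primary (namespace, name) to all of its known names.

    Each entry is a (namespace, name, type) tuple; the primary name comes
    first with type None, followed by any symlink identifiers.
    """
    result = {}
    for side in ("inputs", "outputs"):
        for dataset in event.get(side, []):
            primary = (dataset["namespace"], dataset["name"])
            entries = [(dataset["namespace"], dataset["name"], None)]
            symlinks = dataset.get("facets", {}).get("symlinks", {})
            for ident in symlinks.get("identifiers", []):
                entries.append(
                    (ident["namespace"], ident["name"], ident.get("type"))
                )
            result[primary] = entries
    return result


# Illustrative event fragment; only the fields used above are included.
event = {
    "inputs": [
        {
            "namespace": "hdfs://namenode:8020",
            "name": "/warehouse/sales.db/orders",
            "facets": {
                "symlinks": {
                    "identifiers": [
                        {
                            "namespace": "hive://metastore:9083",
                            "name": "sales.orders",
                            "type": "TABLE",
                        }
                    ]
                }
            },
        }
    ],
    "outputs": [{"namespace": "hdfs://namenode:8020", "name": "/tmp/report"}],
}
names = collect_dataset_names(event)
```

Each tuple in a dataset's list would then become one dataset_symlink row, so a dataset without the facet still gets exactly one entry.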