Implement symlinks in Marquez #2066

@pawel-big-lebowski

Description

Problem:

We need the ability to store alternative dataset names. For example, a Hive dataset can be identified either by the location of its data files or by the metastore URI together with the database and table name.

Solution in Spec:

SymlinksDatasetFacet -> OpenLineage/OpenLineage#936

Implementation in Marquez:

Model changes:

  • Create extra dataset_symlink table in Marquez with columns: (symlinkUid, name, namespaceUid, symlinkType)
  • Replace name field in datasets table with symlinkUid
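
To make the proposed model concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for the real Marquez database. The snake_case column names, the `(namespace_uid, name)` primary key, and the helper functions `create_schema`, `create_dataset` and `find_dataset` are assumptions for illustration only; the issue specifies just the four columns and the `symlinkUid` reference from `datasets`.

```python
import sqlite3
import uuid


def create_schema(conn):
    """Create the two tables from the proposal (column naming is assumed)."""
    conn.executescript(
        """
        CREATE TABLE dataset_symlink (
            symlink_uid   TEXT NOT NULL,  -- shared by all names of one dataset
            name          TEXT NOT NULL,
            namespace_uid TEXT NOT NULL,
            symlink_type  TEXT,
            PRIMARY KEY (namespace_uid, name)  -- assumption: a name is unique per namespace
        );
        CREATE TABLE datasets (
            uuid        TEXT PRIMARY KEY,
            symlink_uid TEXT NOT NULL  -- replaces the former name column
        );
        """
    )


def create_dataset(conn, namespace_uid, name, symlink_type=None):
    """Insert a dataset together with its first dataset_symlink row."""
    dataset_uuid, symlink_uid = str(uuid.uuid4()), str(uuid.uuid4())
    conn.execute(
        "INSERT INTO datasets (uuid, symlink_uid) VALUES (?, ?)",
        (dataset_uuid, symlink_uid),
    )
    conn.execute(
        "INSERT INTO dataset_symlink (symlink_uid, name, namespace_uid, symlink_type)"
        " VALUES (?, ?, ?, ?)",
        (symlink_uid, name, namespace_uid, symlink_type),
    )
    return dataset_uuid, symlink_uid


def find_dataset(conn, namespace_uid, name):
    """Resolve any known name of a dataset to the dataset uuid, or None."""
    row = conn.execute(
        "SELECT d.uuid FROM datasets d"
        " JOIN dataset_symlink s ON s.symlink_uid = d.symlink_uid"
        " WHERE s.namespace_uid = ? AND s.name = ?",
        (namespace_uid, name),
    ).fetchone()
    return row[0] if row else None
```

With this layout, every lookup by name goes through `dataset_symlink`, so an alternative name added later under the same `symlink_uid` resolves to the same dataset without touching the `datasets` row.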

Implementation follows the proposed DB changes:

First PR -> reflect current behaviour in modified schema

  • provide migration SQL for existing instances
  • create a dataset_symlink row whenever dataset is created
  • modify SQLs in dataset_version_dao, etc.
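
The migration step for existing instances could look roughly like the sketch below. It again uses sqlite3 as a stand-in, and it assumes the pre-migration `datasets` table has `uuid`, `namespace_uuid` and `name` columns; those column names, the `migrate` function, and leaving `symlink_type` empty for migrated rows are illustrative assumptions, not part of the proposal. Dropping the old `name` column is left out of the sketch.

```python
import sqlite3
import uuid


def migrate(conn):
    """Give every existing dataset exactly one dataset_symlink row.

    Not idempotent: the ALTER TABLE fails if the column already exists.
    """
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS dataset_symlink (
            symlink_uid   TEXT NOT NULL,
            name          TEXT NOT NULL,
            namespace_uid TEXT NOT NULL,
            symlink_type  TEXT,
            PRIMARY KEY (namespace_uid, name)
        );
        ALTER TABLE datasets ADD COLUMN symlink_uid TEXT;
        """
    )
    # Read all rows first so the loop below does not update a table
    # that is still being iterated.
    rows = conn.execute(
        "SELECT uuid, namespace_uuid, name FROM datasets WHERE symlink_uid IS NULL"
    ).fetchall()
    for dataset_uuid, namespace_uuid, name in rows:
        symlink_uid = str(uuid.uuid4())
        # The current name becomes the first (and so far only) symlink.
        conn.execute(
            "INSERT INTO dataset_symlink (symlink_uid, name, namespace_uid, symlink_type)"
            " VALUES (?, ?, ?, NULL)",
            (symlink_uid, name, namespace_uuid),
        )
        conn.execute(
            "UPDATE datasets SET symlink_uid = ? WHERE uuid = ?",
            (symlink_uid, dataset_uuid),
        )
    conn.commit()
    return len(rows)
```

After this runs, each dataset has one symlink carrying its old name, which matches the goal of the first PR: the new schema is in place but behaviour is unchanged.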

Second PR

  • Extract symlink facet when posting new OpenLineage event
  • fill dataset_symlink with multiple entries per OL event.
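
For the second PR, the extraction step might be sketched as follows. It assumes the facet appears on a dataset under `facets.symlinks` with an `identifiers` list whose entries carry `namespace`, `name` and `type`, which is my reading of the SymlinksDatasetFacet from OpenLineage/OpenLineage#936. The function name `extract_symlinks` and the sample values are hypothetical.

```python
def extract_symlinks(dataset):
    """Return (namespace, name, type) tuples for every name of one dataset.

    The dataset's own namespace/name comes first with type None, followed by
    the identifiers from the symlinks facet, without duplicates.
    """
    facet = (dataset.get("facets") or {}).get("symlinks") or {}
    names = [(dataset["namespace"], dataset["name"], None)]
    seen = {(dataset["namespace"], dataset["name"])}
    for ident in facet.get("identifiers", []):
        key = (ident["namespace"], ident["name"])
        if key in seen:
            continue  # skip repeated names within one event
        seen.add(key)
        names.append((ident["namespace"], ident["name"], ident.get("type")))
    return names


# Hypothetical dataset entry from an OpenLineage event for a Hive table.
example = {
    "namespace": "s3://bucket",
    "name": "warehouse/db/table",
    "facets": {
        "symlinks": {
            "identifiers": [
                {"namespace": "hive://metastore:9083", "name": "db.table", "type": "TABLE"}
            ]
        }
    },
}
```

Each returned tuple would then become one `dataset_symlink` entry sharing the dataset's symlinkUid, so a single OL event can produce several rows.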
