Skip to content

(feat) implement explicit joins with local and foreign datasets #3747

@rudolfix

Description

@rudolfix

followup: #3403
requires: #3746 (multischema datasets) and PR #3590 (magic joins)

Introduction

PR #3590 introduced Relation.join() that auto-discovers join conditions from dlt schema references (parent/child relationships). This ticket extends it with an explicit on parameter, enabling:

  • Explicit join conditions between same-dataset tables (no schema reference chain needed)
  • Cross-dataset joins where other is a Relation from a different Dataset

A cross-dataset join introduces a foreign schema into the query. We use the local/foreign schema mechanism from #3746 to register it, making the foreign dataset's tables visible to the qualify and bind steps.

Signature

TJoinType = Literal["left", "right", "inner", "full"]


# Overload 1: Magic join — auto-discovers ON from schema references
@overload
def join(
    self,
    other: Union[str, Self],
    *,
    kind: TJoinType = "inner",
    alias: Optional[str] = None,
) -> Self: ...

# Overload 2: Explicit ON condition
@overload
def join(
    self,
    other: Union[str, Self],
    on: Union[str, sge.Expression],
    *,
    kind: TJoinType = "inner",
    alias: Optional[str] = None,
) -> Self: ...

# Implementation
def join(
    self,
    other: Union[str, Self],
    on: Union[str, sge.Expression, None] = None,
    *,
    kind: TJoinType = "inner",
    alias: Optional[str] = None,
) -> Self:
    """Join this relation with another table or relation.

    When ``on`` is omitted, join conditions are discovered automatically
    from the dlt schema's reference chain (parent/child relationships
    created during loading). Both sides must be base-table relations
    from the same dataset.

    When ``on`` is provided, any two relations can be joined — including
    relations from different datasets on the same physical destination
    (cross-dataset join). The ``on`` condition uses logical column names
    from the dlt schema.

    Args:
        other: Table name (str) or Relation to join with. When a Relation
            from a different Dataset is passed, ``on`` is required.
        on: Join condition as SQL string or SQLGlot expression. Disables
            auto-discovery from schema references when provided.
            Examples: ``"users.id = orders.user_id"``,
            ``sge.EQ(this=sge.column("id", "users"),
            expression=sge.column("user_id", "orders"))``
        kind: SQL join type — ``"inner"``, ``"left"``, ``"right"``,
            or ``"full"``. Defaults to ``"inner"``.
        alias: Column projection prefix for the joined table's columns.
            Columns from ``other`` appear as ``{alias}__{column}``.
            Defaults to the target table name.

    Returns:
        New Relation with the join applied and the joined table's
        columns appended to the projection.

    Raises:
        ValueError: If ``on`` is omitted and schema references between
            the two tables cannot be resolved, or if ``other`` is from
            a different dataset and ``on`` is not provided.

    Example:
        >>> # Magic join via schema references (same dataset)
        >>> users = dataset.table("users")
        >>> joined = users.join("users__orders", kind="left")

        >>> # Explicit join (same or cross-dataset)
        >>> customers = dataset1.table("customers")
        >>> orders = dataset2.table("orders")
        >>> joined = customers.join(
        ...     orders,
        ...     on="customers.customer_id = orders.customer_id",
        ...     kind="left",
        ... )
    """

Requirements

Explicit join behavior

  • When on is provided: parse the string into a SQLGlot expression (if needed), build the JOIN node with the explicit ON condition. No schema reference lookup.
  • When on is omitted: existing magic join behavior from PR feat: add dlt.Relation.join(...) based on normalizer and references #3590 (unchanged).
  • Column projection uses the same {alias}__{column} convention as magic joins.

Cross-dataset join via Relation

  • When other is a Relation from a different Dataset, on is required.
  • Automatically register the foreign schema: self._dataset._add_foreign_schema(other._dataset.dataset_name, other._dataset.schema). This makes foreign tables visible to the qualify step.

Dataset-qualified string in other

  • Accept "warehouse_data.users" as other — parse into dataset_name + table_name.
  • Validate that the dataset is known (local or already registered as foreign).
  • If not known, raise ValueError with guidance to pass a Relation instead (so the schema can be auto-registered).
  • We recommend users pass a Relation rather than a qualified string — it carries the schema needed for automatic registration.

Query binder limitations for cross-dataset joins

Cross-dataset joins expose three limitations in the current qualify/bind pipeline that must be addressed:

1. create_sqlglot_schema is single-dataset

File: dlt/dataset/lineage.pycreate_sqlglot_schema() (line 28)

Currently wraps all tables under a single {dataset_name: {table: columns}}. Tables from foreign datasets are invisible to qualify(), so column references to foreign tables cannot be resolved.

2. bind_query uses a single expand_table_name callback with no dataset context

File: dlt/common/libs/sqlglot.pybind_query() (line 947)

The expand_table_name callback has signature Callable[[str], List[str]] — it receives only the table name. It is bound (via functools.partial) to the sql_client's own dataset_name. When bind_query encounters a foreign table node whose node.db differs from the local dataset, it still expands using the local dataset name — or skips expansion entirely if the table isn't in the SQLGlot schema.

3. make_qualified_table_name_path is hardcoded to self.dataset_name

File: dlt/destinations/sql_client.pymake_qualified_table_name_path() (line 226)

Always uses self.dataset_name (line 235). There is no way to override the dataset for a foreign table. Similarly, catalog_name() is per-sql_client, so if two datasets live in different catalogs the path cannot be constructed correctly.

Tests

Unit tests (tests/dataset/)

  • Explicit on join between two tables in the same dataset
  • Explicit on join with other as a Relation from a different Dataset (cross-dataset)
  • "dataset.table" string parsing in other
  • Foreign schema auto-registered when cross-dataset Relation is joined
  • Column projection with explicit alias
  • Error cases: on omitted with cross-dataset Relation, unknown dataset string

Integration tests with duckdb

  • Two pipelines loading to two datasets on same duckdb
  • Cross-dataset join produces correct SQL and returns correct data

Smoke tests on all destinations (CI matrix)

  • Same-dataset explicit join works on each destination

Cross-catalog tests (ducklake or similar)

These are expected to initially fail — the goal is to document binder limitations:

  • Join across catalogs where dataset name on each catalog is different
  • Join across catalogs where dataset name is the same on both catalogs (tests that the binder distinguishes by catalog, not just dataset_name)

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions