You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
PR #3590 introduced Relation.join() that auto-discovers join conditions from dlt schema references (parent/child relationships). This ticket extends it with an explicit on parameter, enabling:
Explicit join conditions between same-dataset tables (no schema reference chain needed)
Cross-dataset joins where other is a Relation from a different Dataset
A cross-dataset join introduces a foreign schema into the query. We use the local/foreign schema mechanism from #3746 to register it, making the foreign dataset's tables visible to the qualify and bind steps.
Signature
TJoinType=Literal["left", "right", "inner", "full"]
# Overload 1: Magic join — auto-discovers ON from schema references@overloaddefjoin(
self,
other: Union[str, Self],
*,
kind: TJoinType="inner",
alias: Optional[str] =None,
) ->Self: ...
# Overload 2: Explicit ON condition@overloaddefjoin(
self,
other: Union[str, Self],
on: Union[str, sge.Expression],
*,
kind: TJoinType="inner",
alias: Optional[str] =None,
) ->Self: ...
# Implementationdefjoin(
self,
other: Union[str, Self],
on: Union[str, sge.Expression, None] =None,
*,
kind: TJoinType="inner",
alias: Optional[str] =None,
) ->Self:
"""Join this relation with another table or relation. When ``on`` is omitted, join conditions are discovered automatically from the dlt schema's reference chain (parent/child relationships created during loading). Both sides must be base-table relations from the same dataset. When ``on`` is provided, any two relations can be joined — including relations from different datasets on the same physical destination (cross-dataset join). The ``on`` condition uses logical column names from the dlt schema. Args: other: Table name (str) or Relation to join with. When a Relation from a different Dataset is passed, ``on`` is required. on: Join condition as SQL string or SQLGlot expression. Disables auto-discovery from schema references when provided. Examples: ``"users.id = orders.user_id"``, ``sge.EQ(this=sge.column("id", "users"), expression=sge.column("user_id", "orders"))`` kind: SQL join type — ``"inner"``, ``"left"``, ``"right"``, or ``"full"``. Defaults to ``"inner"``. alias: Column projection prefix for the joined table's columns. Columns from ``other`` appear as ``{alias}__{column}``. Defaults to the target table name. Returns: New Relation with the join applied and the joined table's columns appended to the projection. Raises: ValueError: If ``on`` is omitted and schema references between the two tables cannot be resolved, or if ``other`` is from a different dataset and ``on`` is not provided. Example: >>> # Magic join via schema references (same dataset) >>> users = dataset.table("users") >>> joined = users.join("users__orders", kind="left") >>> # Explicit join (same or cross-dataset) >>> customers = dataset1.table("customers") >>> orders = dataset2.table("orders") >>> joined = customers.join( ... orders, ... on="customers.customer_id = orders.customer_id", ... kind="left", ... ) """
Requirements
Explicit join behavior
When on is provided: parse the string into a SQLGlot expression (if needed), build the JOIN node with the explicit ON condition. No schema reference lookup.
Column projection uses the same {alias}__{column} convention as magic joins.
Cross-dataset join via Relation
When other is a Relation from a different Dataset, on is required.
Automatically register the foreign schema: self._dataset._add_foreign_schema(other._dataset.dataset_name, other._dataset.schema). This makes foreign tables visible to the qualify step.
Dataset-qualified string in other
Accept "warehouse_data.users" as other — parse into dataset_name + table_name.
Validate that the dataset is known (local or already registered as foreign).
If not known, raise ValueError with guidance to pass a Relation instead (so the schema can be auto-registered).
We recommend users pass a Relation rather than a qualified string — it carries the schema needed for automatic registration.
Query binder limitations for cross-dataset joins
Cross-dataset joins expose three limitations in the current qualify/bind pipeline that must be addressed:
Currently wraps all tables under a single {dataset_name: {table: columns}}. Tables from foreign datasets are invisible to qualify(), so column references to foreign tables cannot be resolved.
2. bind_query uses a single expand_table_name callback with no dataset context
The expand_table_name callback has signature Callable[[str], List[str]] — it receives only the table name. It is bound (via functools.partial) to the sql_client's own dataset_name. When bind_query encounters a foreign table node whose node.db differs from the local dataset, it still expands using the local dataset name — or skips expansion entirely if the table isn't in the SQLGlot schema.
3. make_qualified_table_name_path is hardcoded to self.dataset_name
Always uses self.dataset_name (line 235). There is no way to override the dataset for a foreign table. Similarly, catalog_name() is per-sql_client, so if two datasets live in different catalogs the path cannot be constructed correctly.
Tests
Unit tests (tests/dataset/)
Explicit on join between two tables in the same dataset
Explicit on join with other as a Relation from a different Dataset (cross-dataset)
"dataset.table" string parsing in other
Foreign schema auto-registered when cross-dataset Relation is joined
Column projection with explicit alias
Error cases: on omitted with cross-dataset Relation, unknown dataset string
Integration tests with duckdb
Two pipelines loading to two datasets on same duckdb
Cross-dataset join produces correct SQL and returns correct data
Smoke tests on all destinations (CI matrix)
Same-dataset explicit join works on each destination
Cross-catalog tests (ducklake or similar)
These are expected to initially fail — the goal is to document binder limitations:
Join across catalogs where dataset name on each catalog is different
Join across catalogs where dataset name is the same on both catalogs (tests that the binder distinguishes by catalog, not just dataset_name)
followup: #3403
requires: #3746 (multischema datasets) and PR #3590 (magic joins)
Introduction
PR #3590 introduced
Relation.join()that auto-discovers join conditions from dlt schema references (parent/child relationships). This ticket extends it with an explicitonparameter, enabling:otheris a Relation from a different DatasetA cross-dataset join introduces a foreign schema into the query. We use the local/foreign schema mechanism from #3746 to register it, making the foreign dataset's tables visible to the qualify and bind steps.
Signature
Requirements
Explicit join behavior
onis provided: parse the string into a SQLGlot expression (if needed), build the JOIN node with the explicit ON condition. No schema reference lookup.onis omitted: existing magic join behavior from PR feat: adddlt.Relation.join(...)based on normalizer and references #3590 (unchanged).{alias}__{column}convention as magic joins.Cross-dataset join via Relation
otheris aRelationfrom a differentDataset,onis required.self._dataset._add_foreign_schema(other._dataset.dataset_name, other._dataset.schema). This makes foreign tables visible to the qualify step.Dataset-qualified string in
other"warehouse_data.users"asother— parse into dataset_name + table_name.ValueErrorwith guidance to pass aRelationinstead (so the schema can be auto-registered).Relationrather than a qualified string — it carries the schema needed for automatic registration.Query binder limitations for cross-dataset joins
Cross-dataset joins expose three limitations in the current qualify/bind pipeline that must be addressed:
1.
create_sqlglot_schemais single-datasetFile:
dlt/dataset/lineage.py—create_sqlglot_schema()(line 28)Currently wraps all tables under a single
{dataset_name: {table: columns}}. Tables from foreign datasets are invisible toqualify(), so column references to foreign tables cannot be resolved.2.
bind_queryuses a singleexpand_table_namecallback with no dataset contextFile:
dlt/common/libs/sqlglot.py—bind_query()(line 947)The
expand_table_namecallback has signatureCallable[[str], List[str]]— it receives only the table name. It is bound (viafunctools.partial) to the sql_client's owndataset_name. Whenbind_queryencounters a foreign table node whosenode.dbdiffers from the local dataset, it still expands using the local dataset name — or skips expansion entirely if the table isn't in the SQLGlot schema.3.
make_qualified_table_name_pathis hardcoded toself.dataset_nameFile:
dlt/destinations/sql_client.py—make_qualified_table_name_path()(line 226)Always uses
self.dataset_name(line 235). There is no way to override the dataset for a foreign table. Similarly,catalog_name()is per-sql_client, so if two datasets live in different catalogs the path cannot be constructed correctly.Tests
Unit tests (
tests/dataset/)onjoin between two tables in the same datasetonjoin withotheras a Relation from a different Dataset (cross-dataset)"dataset.table"string parsing inotheraliasonomitted with cross-dataset Relation, unknown dataset stringIntegration tests with duckdb
Smoke tests on all destinations (CI matrix)
Cross-catalog tests (ducklake or similar)
These are expected to initially fail — the goal is to document binder limitations: