Snapshot-diff ingestion for append-only data warehousing #3827

@mattiasthalen

Description

Feature description

Extend the insert-only merge strategy with a configurable comparison window that checks new records against only the previous load's keys instead of all keys ever loaded to the destination.

Currently, insert-only checks if a key exists anywhere in the destination table. This means if a record's state returns to a previously seen value (A→B→A), the return to state A is silently skipped because that hash already exists historically. For data warehousing patterns where DAS/bronze needs to capture every state change as seen by each extraction, this is a critical gap.

Are you a dlt user?

Yes, I run dlt in production.

Use case

I need to capture every state change from source systems into an append-only bronze layer. Each run extracts data, hashes rows, and should append only rows whose hashes differ from the previous run.

This produces a change log where every state transition is recorded — including returns to previously seen states (A→B→A).
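As a sketch of the hashing step described above (the helper name `row_value_hash` here is illustrative, not dlt's actual implementation): key order is normalized so that two extractions of the same state hash identically, while any value change produces a new hash.

```python
import hashlib
import json

def row_value_hash(row: dict) -> str:
    """Deterministic hash of a row's values (illustrative stand-in for
    dlt's row hashing; keys are sorted so equal rows hash equally)."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same record state, different key order -> same hash
assert row_value_hash({"id": 1, "state": "A"}) == row_value_hash({"state": "A", "id": 1})
# A state change -> different hash
assert row_value_hash({"id": 1, "state": "A"}) != row_value_hash({"id": 1, "state": "B"})
```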

The current strategies don't support this:

| Strategy | Behavior | Problem |
| --- | --- | --- |
| `append` | Inserts everything | Duplicates on every run |
| `upsert` | Updates existing keys | Overwrites `_dlt_load_id`, loses ingestion history |
| `insert-only` (new) | Skips if key exists anywhere | Misses A→B→A: hash A already exists historically |
| `scd2` | Tracks validity windows | Not supported on Delta/filesystem destinations |

What I need is insert-only but scoped to the previous load — "insert if this hash didn't exist in the last run" rather than "insert if this hash has never existed."
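The difference between the two scopes can be modeled in a few lines of plain Python (a behavioral sketch only, with hypothetical names; not dlt internals). Each element of `destination_loads` is the set of hashes written by one load, oldest first:

```python
def rows_to_insert(new_hashes, destination_loads, comparison_window):
    """Model of the NOT-EXISTS check under the two comparison windows."""
    if comparison_window == "all":
        # current insert-only: a hash seen in ANY previous load is skipped
        seen = set().union(*destination_loads) if destination_loads else set()
    else:
        # proposed "previous_load": only the most recent load's hashes count
        seen = destination_loads[-1] if destination_loads else set()
    return [h for h in new_hashes if h not in seen]

# A -> B -> A: two loads already in the destination, new load returns to A
loads = [{"hash_A"}, {"hash_B"}]
assert rows_to_insert(["hash_A"], loads, "all") == []              # return to A lost
assert rows_to_insert(["hash_A"], loads, "previous_load") == ["hash_A"]  # captured
```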

Proposed solution

Add a comparison_window option to the insert-only merge strategy:

```python
@dlt.resource(
    primary_key="row_hash",
    write_disposition={
        "disposition": "merge",
        "strategy": "insert-only",
        "comparison_window": "previous_load"  # new option
    }
)
def my_source():
    ...
```

Behavior:

| `comparison_window` | NOT EXISTS check against | Use case |
| --- | --- | --- |
| `"all"` (default, current behavior) | Full destination table | Deduplication |
| `"previous_load"` | Rows from the last `_dlt_load_id` only | Snapshot diff / change detection |

Implementation approach:

The merge SQL would change from:

```sql
-- current: check against all rows
MERGE INTO target t USING staging s
ON t.row_hash = s.row_hash
WHEN NOT MATCHED THEN INSERT ...
```

To:

```sql
-- proposed: check against previous load only
MERGE INTO target t USING staging s
ON t.row_hash = s.row_hash AND t._dlt_load_id = '{last_load_id}'
WHEN NOT MATCHED THEN INSERT ...
```

The `last_load_id` can be resolved from dlt's existing pipeline state or from `MAX(_dlt_load_id)` in the destination table.
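The `MAX(_dlt_load_id)` fallback can be demonstrated with an in-memory SQLite table (a sketch, assuming load ids are strings that sort lexicographically in load order, as dlt's timestamp-based ids do within the same epoch width):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (row_hash TEXT, _dlt_load_id TEXT)")
con.executemany(
    "INSERT INTO target VALUES (?, ?)",
    [("hash_A", "1700000000.1"), ("hash_B", "1700000100.2")],
)
# MAX() on the TEXT column picks the lexicographically greatest id,
# which is the most recent load under the sorting assumption above.
last_load_id = con.execute("SELECT MAX(_dlt_load_id) FROM target").fetchone()[0]
print(last_load_id)  # 1700000100.2
```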

Alternatively, this could be implemented as a pre-step that creates a temporary hash set from the previous load, and the insert-only merge checks against that.
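The pre-step variant can also be sketched end-to-end in SQLite (illustrative only; table and load-id names are placeholders, and the real implementation would live in dlt's merge SQL generation):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE target (row_hash TEXT, _dlt_load_id TEXT);
    CREATE TABLE staging (row_hash TEXT);
    INSERT INTO target VALUES ('hash_A', 'load_1'), ('hash_B', 'load_2');
    INSERT INTO staging VALUES ('hash_A');  -- state returned to A

    -- pre-step: materialize only the previous load's hashes
    CREATE TEMP TABLE prev_hashes AS
        SELECT row_hash FROM target
        WHERE _dlt_load_id = (SELECT MAX(_dlt_load_id) FROM target);

    -- insert-only merge, scoped to the previous load
    INSERT INTO target
        SELECT s.row_hash, 'load_3'
        FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM prev_hashes p WHERE p.row_hash = s.row_hash);
""")
rows = con.execute(
    "SELECT row_hash, _dlt_load_id FROM target ORDER BY _dlt_load_id"
).fetchall()
print(rows[-1])  # ('hash_A', 'load_3') -- the return to A is captured
```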

Why this belongs in dlt (not as external processing):

- The building blocks already exist: `_dlt_load_id` tracks loads, `row_value_hash` computes hashes, `insert-only` handles the merge logic
- It pushes the comparison down to the destination engine (SQL) instead of requiring Python-side hash set management
- It works with dlt's existing state management for tracking the previous load
- It enables a common data warehousing pattern (snapshot diff / change data feed) that currently requires custom post-processing

Related issues

No response

Metadata

Labels: enhancement (New feature or request)