Feature description
Extend the insert-only merge strategy with a configurable comparison window that checks new records against only the previous load's keys instead of all keys ever loaded to the destination.
Currently, insert-only checks if a key exists anywhere in the destination table. This means if a record's state returns to a previously seen value (A→B→A), the return to state A is silently skipped because that hash already exists historically. For data warehousing patterns where DAS/bronze needs to capture every state change as seen by each extraction, this is a critical gap.
Are you a dlt user?
Yes, I run dlt in production.
Use case
I need to capture every state change from source systems into an append-only bronze layer. Each run extracts data, hashes rows, and should append only rows whose hashes differ from the previous run.
This produces a change log where every state transition is recorded — including returns to previously seen states (A→B→A).
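The per-run diff described above can be sketched in plain Python. The hashing scheme and helper names here are illustrative (dlt computes its own row hashes); the point is that each run compares only against the previous run's hashes, so the return to state A is kept:

```python
import hashlib

def row_hash(row: dict) -> str:
    """Stable hash of a row's values (illustrative; dlt has its own hashing)."""
    return hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()

def diff_against_previous(current_rows, previous_hashes):
    """Keep only rows whose hash was not present in the previous run."""
    return [r for r in current_rows if row_hash(r) not in previous_hashes]

# Three extractions of the same record: A -> B -> A
runs = [
    [{"id": 1, "state": "A"}],
    [{"id": 1, "state": "B"}],
    [{"id": 1, "state": "A"}],  # the return to A must still be captured
]

change_log = []
previous_hashes = set()
for rows in runs:
    change_log.extend(diff_against_previous(rows, previous_hashes))
    previous_hashes = {row_hash(r) for r in rows}

# change_log now records all three transitions, including the return to A
```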
The current strategies don't support this:
| Strategy | Behavior | Problem |
|---|---|---|
| `append` | Inserts everything | Duplicates on every run |
| `upsert` | Updates existing keys | Overwrites `_dlt_load_id`, loses ingestion history |
| `insert-only` (new) | Skips if key exists anywhere | Misses A→B→A — hash A already exists historically |
| `scd2` | Tracks validity windows | Not supported on Delta/filesystem destinations |
What I need is insert-only but scoped to the previous load — "insert if this hash didn't exist in the last run" rather than "insert if this hash has never existed."
Proposed solution
Add a `comparison_window` option to the `insert-only` merge strategy:

```python
@dlt.resource(
    primary_key="row_hash",
    write_disposition={
        "disposition": "merge",
        "strategy": "insert-only",
        "comparison_window": "previous_load",  # new option
    }
)
def my_source():
    ...
```
Behavior:
| `comparison_window` | NOT EXISTS check against | Use case |
|---|---|---|
| `"all"` (default, current) | Full destination table | Deduplication |
| `"previous_load"` | Rows from the last `_dlt_load_id` only | Snapshot diff / change detection |
Implementation approach:
The merge SQL would change from:

```sql
-- current: check against all rows
MERGE INTO target t USING staging s
ON t.row_hash = s.row_hash
WHEN NOT MATCHED THEN INSERT ...
```

To:

```sql
-- proposed: check against previous load only
MERGE INTO target t USING staging s
ON t.row_hash = s.row_hash AND t._dlt_load_id = '{last_load_id}'
WHEN NOT MATCHED THEN INSERT ...
```
The `last_load_id` can be resolved from dlt's existing pipeline state or from `MAX(_dlt_load_id)` in the destination table.
Alternatively, this could be implemented as a pre-step that creates a temporary hash set from the previous load, and the insert-only merge checks against that.
Why this belongs in dlt (not as external processing):
- The building blocks already exist: `_dlt_load_id` tracks loads, `row_value_hash` computes hashes, `insert-only` handles the merge logic
- It pushes the comparison down to the destination engine (SQL) instead of requiring Python-side hash set management
- It works with dlt's existing state management for tracking the previous load
- It enables a common data warehousing pattern (snapshot diff / change data feed) that currently requires custom post-processing
Related issues
No response