Snapshot-diff ingestion for append-only data warehousing #3827

@mattiasthalen

Description

Feature description

Extend the insert-only merge strategy with a configurable comparison window that checks new records against only the previous load's keys instead of all keys ever loaded to the destination.

Currently, insert-only checks if a key exists anywhere in the destination table. This means if a record's state returns to a previously seen value (A→B→A), the return to state A is silently skipped because that hash already exists historically. For data warehousing patterns where DAS/bronze needs to capture every state change as seen by each extraction, this is a critical gap.

Are you a dlt user?

Yes, I run dlt in production.

Use case

I need to capture every state change from source systems into an append-only bronze layer. Each run extracts data, hashes rows, and should append only rows whose hashes differ from the previous run.

This produces a change log where every state transition is recorded — including returns to previously seen states (A→B→A).
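As a sketch of the hashing step described above (the helper name `row_value_hash` here is illustrative, not dlt's actual implementation): key order is normalized so that two extractions of the same state hash identically, while any value change produces a new hash.

```python
import hashlib
import json

def row_value_hash(row: dict) -> str:
    """Deterministic hash of a row's values (illustrative stand-in for
    dlt's row hashing; keys are sorted so equal rows hash equally)."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same record state, different key order -> same hash
assert row_value_hash({"id": 1, "state": "A"}) == row_value_hash({"state": "A", "id": 1})
# A state change -> different hash
assert row_value_hash({"id": 1, "state": "A"}) != row_value_hash({"id": 1, "state": "B"})
```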

The current strategies don't support this:

| Strategy | Behavior | Problem |
| --- | --- | --- |
| `append` | Inserts everything | Duplicates on every run |
| `upsert` | Updates existing keys | Overwrites `_dlt_load_id`, loses ingestion history |
| `insert-only` (new) | Skips if key exists anywhere | Misses A→B→A: hash A already exists historically |
| `scd2` | Tracks validity windows | Not supported on Delta/filesystem destinations |

What I need is insert-only but scoped to the previous load — "insert if this hash didn't exist in the last run" rather than "insert if this hash has never existed."
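The difference between the two scopes can be modeled in a few lines of plain Python (a behavioral sketch only, with hypothetical names; not dlt internals). Each element of `destination_loads` is the set of hashes written by one load, oldest first:

```python
def rows_to_insert(new_hashes, destination_loads, comparison_window):
    """Model of the NOT-EXISTS check under the two comparison windows."""
    if comparison_window == "all":
        # current insert-only: a hash seen in ANY previous load is skipped
        seen = set().union(*destination_loads) if destination_loads else set()
    else:
        # proposed "previous_load": only the most recent load's hashes count
        seen = destination_loads[-1] if destination_loads else set()
    return [h for h in new_hashes if h not in seen]

# A -> B -> A: two loads already in the destination, new load returns to A
loads = [{"hash_A"}, {"hash_B"}]
assert rows_to_insert(["hash_A"], loads, "all") == []              # return to A lost
assert rows_to_insert(["hash_A"], loads, "previous_load") == ["hash_A"]  # captured
```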

Proposed solution

Add a comparison_window option to the insert-only merge strategy:

```python
@dlt.resource(
    primary_key="row_hash",
    write_disposition={
        "disposition": "merge",
        "strategy": "insert-only",
        "comparison_window": "previous_load"  # new option
    }
)
def my_source():
    ...
```

Behavior:

| `comparison_window` | NOT EXISTS check against | Use case |
| --- | --- | --- |
| `"all"` (default, current behavior) | Full destination table | Deduplication |
| `"previous_load"` | Rows from the last `_dlt_load_id` only | Snapshot diff / change detection |

Implementation approach:

The merge SQL would change from:

```sql
-- current: check against all rows
MERGE INTO target t USING staging s
ON t.row_hash = s.row_hash
WHEN NOT MATCHED THEN INSERT ...
```

To:

```sql
-- proposed: check against previous load only
MERGE INTO target t USING staging s
ON t.row_hash = s.row_hash AND t._dlt_load_id = '{last_load_id}'
WHEN NOT MATCHED THEN INSERT ...
```

The `last_load_id` can be resolved from dlt's existing pipeline state or from `MAX(_dlt_load_id)` in the destination table.
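The `MAX(_dlt_load_id)` fallback can be demonstrated with an in-memory SQLite table (a sketch, assuming load ids are strings that sort lexicographically in load order, as dlt's timestamp-based ids do within the same epoch width):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (row_hash TEXT, _dlt_load_id TEXT)")
con.executemany(
    "INSERT INTO target VALUES (?, ?)",
    [("hash_A", "1700000000.1"), ("hash_B", "1700000100.2")],
)
# MAX() on the TEXT column picks the lexicographically greatest id,
# which is the most recent load under the sorting assumption above.
last_load_id = con.execute("SELECT MAX(_dlt_load_id) FROM target").fetchone()[0]
print(last_load_id)  # 1700000100.2
```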

Alternatively, this could be implemented as a pre-step that creates a temporary hash set from the previous load, and the insert-only merge checks against that.
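The pre-step variant can also be sketched end-to-end in SQLite (illustrative only; table and load-id names are placeholders, and the real implementation would live in dlt's merge SQL generation):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE target (row_hash TEXT, _dlt_load_id TEXT);
    CREATE TABLE staging (row_hash TEXT);
    INSERT INTO target VALUES ('hash_A', 'load_1'), ('hash_B', 'load_2');
    INSERT INTO staging VALUES ('hash_A');  -- state returned to A

    -- pre-step: materialize only the previous load's hashes
    CREATE TEMP TABLE prev_hashes AS
        SELECT row_hash FROM target
        WHERE _dlt_load_id = (SELECT MAX(_dlt_load_id) FROM target);

    -- insert-only merge, scoped to the previous load
    INSERT INTO target
        SELECT s.row_hash, 'load_3'
        FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM prev_hashes p WHERE p.row_hash = s.row_hash);
""")
rows = con.execute(
    "SELECT row_hash, _dlt_load_id FROM target ORDER BY _dlt_load_id"
).fetchall()
print(rows[-1])  # ('hash_A', 'load_3') -- the return to A is captured
```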

Why this belongs in dlt (not as external processing):

- The building blocks already exist: `_dlt_load_id` tracks loads, `row_value_hash` computes hashes, `insert-only` handles the merge logic
- It pushes the comparison down to the destination engine (SQL) instead of requiring Python-side hash set management
- It works with dlt's existing state management for tracking the previous load
- It enables a common data warehousing pattern (snapshot diff / change data feed) that currently requires custom post-processing

Related issues

No response

Metadata

Labels: enhancement (New feature or request)