[Bug]: vacuum does not respect retention_hours when full=True #3989

@atzannes

Description

What happened?

My understanding of full=True is that it also vacuums orphaned files (e.g., the log entries that reference the file deletions were removed per delta.logRetentionDuration, but the tombstoned files are still around), and that it should still honor the delta.deletedFileRetentionDuration parameter (and the retention_hours passed to vacuum). This is not the case: when I call vacuum with full=True after a compaction operation, only the files referenced by the log after the compaction are kept, and all other files are vacuumed, even if they are more recent than the retention duration.
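To make the expected semantics concrete, here is a minimal pure-Python sketch of the selection rule I have in mind. It is not the delta-rs implementation; `FileInfo`, `files_eligible_for_vacuum`, and the file names are hypothetical and only illustrate that a file should be removed when it is both unreferenced by the current table state and older than the retention window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FileInfo:
    # Hypothetical record: a data file path and the time it was
    # last modified or tombstoned.
    path: str
    timestamp: datetime


def files_eligible_for_vacuum(files, active_paths, retention_hours, now):
    """Return paths that are unreferenced AND older than the retention cutoff."""
    cutoff = now - timedelta(hours=retention_hours)
    return [
        f.path
        for f in files
        if f.path not in active_paths and f.timestamp < cutoff
    ]


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
files = [
    # Tombstoned an hour ago by a compaction: must survive a 7-day retention.
    FileInfo("part-0.parquet", now - timedelta(hours=1)),
    # Orphaned a month ago: old enough to be removed.
    FileInfo("part-1.parquet", now - timedelta(days=30)),
    # Still referenced by the current table state: never removed.
    FileInfo("compacted.parquet", now - timedelta(hours=1)),
]
eligible = files_eligible_for_vacuum(files, {"compacted.parquet"}, 24 * 7, now)
```

Under this rule, full=True would only widen the set of candidate files (to include orphans that no longer appear in the log); the retention cutoff would still apply to every candidate.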

As I am filing this issue, I realize that the problem has been fixed in 1.2.0, but since I couldn't find an issue for it, even among closed issues, I'll still report it. Feel free to close it as solved.

Expected behavior

Here's a pytest unit-test that demonstrates what I expect should happen:

def test_vacuum_issue(tmp_path):
    """Test issue of underlying delta-rs::vacuum."""
    from deltalake import DeltaTable, write_deltalake
    import pyarrow as pa

    table_path = str(tmp_path / "test_table")
    empty_data = [pa.array([], type=pa.string())]
    table_schema = pa.schema([pa.field("value", pa.string())])
    empty_table = pa.Table.from_arrays(empty_data, schema=table_schema)

    # Create the table
    write_deltalake(
        table_path,
        empty_table,
        name="test",
        description="test",
        # whether or not the following configuration is present does not matter for the behavior
        configuration={
            "delta.deletedFileRetentionDuration": "interval 7 days",
        },
    )
    dt = DeltaTable(table_path)

    # Append 10 records
    for ix in range(10):
        write_deltalake(dt, pa.table([[f"{ix}"]], schema=table_schema), mode="append")

    result = dt.vacuum(retention_hours=24 * 7, dry_run=False, full=True)
    assert len(result) == 0

    result = dt.optimize.compact()
    assert result["numFilesAdded"] == 1
    assert result["numFilesRemoved"] == 10

    result = dt.vacuum(retention_hours=24 * 7, dry_run=True, full=False)
    assert len(result) == 0

    result = dt.vacuum(retention_hours=24 * 7, dry_run=True, full=True)
    # The following assertion fails because vacuum with full=True does not honor the retention_hours.
    # Instead len(result) == 10 and it encompasses all 10 files that were just tombstoned by the compaction operation
    assert len(result) == 0

Operating System

Linux

Binding

Python

Bindings Version

1.1.4

Steps to reproduce

No response

Relevant logs
