What happened?
My understanding of full=True is that it also vacuums orphaned files (e.g., the logs that referenced the file deletions were removed per delta.logRetentionDuration, but the tombstoned files are still around), but that it should still honor the delta.deletedFileRetentionDuration parameter (and the retention_hours passed to vacuum). This is not the case: when I call vacuum with full=True after a compaction operation, only the files referenced by the logs after the compaction are kept and all other files are vacuumed, even if they are more recent than the retention duration.
As I am filing this issue, I realize that the problem has been fixed in 1.2.0, but since I couldn't find an issue for it, even among closed issues, I'll still report it. Feel free to close it as solved.
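To make the rule I have in mind concrete, here is a minimal sketch of the selection logic I would expect from a retention-honoring vacuum. It is not how delta-rs implements vacuum; the function name, its parameters, and the path-to-timestamp mapping are all hypothetical and only illustrate the expectation that a file must be both unreferenced and older than the retention window before it is deleted.

```python
from datetime import datetime, timedelta, timezone


def files_eligible_for_vacuum(files, active_paths, retention_hours, now=None):
    """Return the paths a retention-honoring vacuum would delete.

    Hypothetical helper, not part of deltalake.
    files: mapping of path -> time the file was tombstoned or last modified
    active_paths: set of paths referenced by the current table version
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=retention_hours)
    # A file qualifies only if it is unreferenced AND older than the cutoff.
    return sorted(
        path
        for path, timestamp in files.items()
        if path not in active_paths and timestamp < cutoff
    )
```

Under this rule, the 10 files tombstoned by a compaction a moment ago would never be returned for retention_hours=24 * 7, whether or not full=True is set.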
Expected behavior
Here's a pytest unit-test that demonstrates what I expect should happen:
def test_vacuum_issue(tmp_path):
    """Test issue of underlying delta-rs::vacuum."""
    from deltalake import DeltaTable, write_deltalake
    import pyarrow as pa

    table_path = str(tmp_path / "test_table")
    empty_data = [pa.array([], type=pa.string())]
    table_schema = pa.schema([pa.field("value", pa.string())])
    empty_table = pa.Table.from_arrays(empty_data, schema=table_schema)

    # Create the table
    write_deltalake(
        table_path,
        empty_table,
        name="test",
        description="test",
        # whether or not the following configuration is present does not matter for the behavior
        configuration={
            "delta.deletedFileRetentionDuration": "interval 7 days",
        },
    )
    dt = DeltaTable(table_path)

    # Append 10 records
    for ix in range(10):
        write_deltalake(dt, pa.table([[f"{ix}"]], schema=table_schema), mode="append")

    result = dt.vacuum(retention_hours=24 * 7, dry_run=False, full=True)
    assert len(result) == 0

    result = dt.optimize.compact()
    assert result["numFilesAdded"] == 1
    assert result["numFilesRemoved"] == 10

    result = dt.vacuum(retention_hours=24 * 7, dry_run=True, full=False)
    assert len(result) == 0

    result = dt.vacuum(retention_hours=24 * 7, dry_run=True, full=True)
    # The following assertion fails because vacuum with full=True does not honor the retention_hours.
    # Instead len(result) == 10 and it encompasses all 10 files that were just tombstoned by the compaction operation
    assert len(result) == 0
Operating System
Linux
Binding
Python
Bindings Version
1.1.4
Steps to reproduce
No response
Relevant logs