Feature: Vacuum with version retention #3530

@corwinjoy

Description

Allow specifying which versions to retain when performing a vacuum operation.

Use Case
We have specific data history that we want to keep. Generally, we want to read either:

  1. A recentish data version (within the last 7 days).
  2. A collection of old fixed versions going back 250 days.

We cannot run a general vacuum operation, because it would remove files referenced by the old fixed versions (2).
What we would like is more control over the vacuum operation, so that we can specify which versions to keep along with their associated files.
So, our old Delta files directory might look like:

{version 0-files}
{version 1-files}
{version 2-files}
...
{version 178-files}
{version 179-files}
{version 180-files}

And then we want to perform a vacuum where we keep versions {2, 179, 180} to get a new directory like:

{version 2-files}
{version 179-files}
{version 180-files}

We propose enhancing the vacuum operation with an option, set via the VacuumBuilder, to keep specific versions. The idea would be to provide a vector of versions to keep. After VACUUM, it would then still be possible to time travel back to the specified versions.
https://docs.rs/deltalake/latest/deltalake/operations/vacuum/struct.VacuumBuilder.html
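
To make the intended semantics concrete, here is a minimal sketch of the file selection in plain Rust. It does not use the deltalake API: the `VersionFiles` type and the `files_to_vacuum` function are hypothetical names made up for illustration, and the sketch assumes we already know which data files each version references. It also ignores the retention period, and the log and checkpoint files that time travel would presumably need as well.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical model (not a deltalake type): for each table version,
/// the set of data files that version references.
type VersionFiles = BTreeMap<u64, BTreeSet<String>>;

/// Returns the files that no kept version references, i.e. the files a
/// version-retaining vacuum could delete.
/// Version numbers in `keep_versions` that are not present in `all`
/// are silently skipped in this sketch.
fn files_to_vacuum(all: &VersionFiles, keep_versions: &[u64]) -> BTreeSet<String> {
    // Union of the files referenced by the versions we want to keep.
    let retained: BTreeSet<&String> = keep_versions
        .iter()
        .filter_map(|v| all.get(v))
        .flatten()
        .collect();

    // Every known file that is not in the retained set is a candidate for removal.
    all.values()
        .flatten()
        .filter(|f| !retained.contains(f))
        .cloned()
        .collect()
}
```

Note that a file shared between a kept version and a dropped version survives, since the selection works on the union of files referenced by the kept versions rather than on per-version directories. For the example above, calling `files_to_vacuum(&all, &[2, 179, 180])` would return only the files that none of versions 2, 179, and 180 reference.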

Related Issue(s)
[Delete older files based on VERSION](#3143)

Background Reading
Vacuuming: https://delta.io/blog/remove-files-delta-lake-vacuum-command/
Time Travel: https://delta.io/blog/2023-02-01-delta-lake-time-travel/

Metadata

Labels: binding/rust (Issues for the Rust crate), enhancement (New feature or request)
