feat: expose DV metadata and payloads as Arrow streams #4168

Merged
ion-elgreco merged 3 commits into delta-io:main from
ethan-tyler:feat/python-deletion-vectors-4159
Feb 7, 2026

@ethan-tyler
Collaborator

@ethan-tyler ethan-tyler commented Feb 6, 2026

Description

Adds DeltaTable.deletion_vectors() -> RecordBatchReader returning one row per data file with a deletion vector.

Schema: filepath: utf8, selection_vector: list[bool] (true = keep, false = deleted).

Reuses the existing DataFusion replay path via replay_deletion_vectors(...). Results are deterministic and sorted by filepath.
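To illustrate the selection-vector semantics described above (true = keep, false = deleted), here is a minimal pure-Python sketch of how a consumer might apply one vector to the rows of its data file. The helper name `apply_selection_vector` and the sample data are hypothetical, and the sketch assumes the vector has exactly one entry per row, which this PR does not state.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def apply_selection_vector(rows: Sequence[T], selection_vector: Sequence[bool]) -> List[T]:
    """Keep rows flagged True; drop rows flagged False (deleted)."""
    # Assumption for this sketch: one flag per row.
    if len(rows) != len(selection_vector):
        raise ValueError("selection vector length must match the number of rows")
    return [row for row, keep in zip(rows, selection_vector) if keep]


# Hypothetical rows of a single data file and its selection vector.
rows = ["a", "b", "c", "d"]
selection_vector = [True, False, True, False]
kept = apply_selection_vector(rows, selection_vector)
```

A real consumer would read `filepath` and `selection_vector` from each record batch and apply the vector to the rows read from that file.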

Core changes:

  • DeletionVectorSelection struct, DeltaScan::deletion_vectors(), shared
    scan_metadata_stream() helper to avoid drift between scan paths
  • Replaced internal DV expect(...) with typed error propagation

Python binding:

  • cloned_table_and_state() to avoid TOCTOU on table + snapshot
  • Chunked Arrow batch output with non-nullable list items
  • Preserves without_files guard behavior
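The TOCTOU point can be sketched in plain Python: if the table handle and its snapshot are read in two separate steps, an update in between can leave them at different versions. Reading both inside one critical section avoids that. The classes below are a hypothetical model for illustration only; they are not the binding's actual types, and only the method name `cloned_table_and_state` comes from this PR.

```python
import threading
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TableHandle:  # hypothetical stand-in for the table
    version: int


@dataclass(frozen=True)
class SnapshotState:  # hypothetical stand-in for the snapshot
    version: int


class RawTable:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._table = TableHandle(version=0)
        self._state = SnapshotState(version=0)

    def update(self, version: int) -> None:
        # Writers replace both values while holding the lock.
        with self._lock:
            self._table = TableHandle(version=version)
            self._state = SnapshotState(version=version)

    def cloned_table_and_state(self) -> Tuple[TableHandle, SnapshotState]:
        # Read both values in one critical section so they always agree.
        # The dataclasses are frozen, so sharing references is as safe as copying.
        with self._lock:
            return self._table, self._state


raw = RawTable()
raw.update(version=3)
table, state = raw.cloned_table_and_state()
```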

Related Issue(s)

Documentation

cc @ion-elgreco

@github-actions github-actions Bot added labels binding/python (Issues for the Python package) and binding/rust (Issues for the Rust crate) Feb 6, 2026
@ion-elgreco
Collaborator

@ethan-tyler thanks for picking this up!

I think we can simplify the PR quite a bit. For Python we only want a RecordBatchReader with two columns: filepath, selection_vector<list[str]>

We only need a log replay, as is done here, and then convert the dv_hashmap into a RecordBatchReader:

https://github.com/delta-io/delta-rs/blob/main/crates%2Fcore%2Fsrc%2Fdelta_datafusion%2Ftable_provider%2Fnext%2Fscan%2Fmod.rs#L81-L82
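As a rough sketch of the conversion suggested here, the snippet below turns an already materialized mapping of file path to selection vector into batches of rows, ordered by file path and emitted in chunks. It uses plain Python lists in place of Arrow record batches; `dv_map_to_batches`, the batch size, and the sample paths are hypothetical names and values, not code from delta-rs.

```python
from typing import Dict, Iterator, List, Tuple

Row = Tuple[str, List[bool]]


def dv_map_to_batches(dv_map: Dict[str, List[bool]], batch_size: int) -> Iterator[List[Row]]:
    """Yield (filepath, selection_vector) rows in chunks of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Sort by filepath so the output order is deterministic.
    rows = sorted(dv_map.items(), key=lambda item: item[0])
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Hypothetical replay result: file path -> selection vector.
dv_map = {
    "part-2.parquet": [True, False],
    "part-0.parquet": [False, True, True],
    "part-1.parquet": [True],
}
batches = list(dv_map_to_batches(dv_map, batch_size=2))
```

Chunking keeps each batch bounded even when many files carry deletion vectors.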

@ethan-tyler
Collaborator Author

@ion-elgreco - Agreed on simplifying the surface and it was actually my first pass. After implementing and validating, I saw real limitations with a filepath + selection_vector[list[bool]] boundary.

If this goes through DataFusion scan internals, that's an architectural mismatch for Polars and any other non-DF consumers.

  • Cross-engine API stability is coupled to DF internals and refactors
  • Semantics are defined by one engine's execution path, not a neutral Delta boundary
  • Inherent maintenance cost increases for non-DF consumers

IMO, for cross-engine consumers, a protocol/kernel-level DV boundary (or a binding-owned transform) is a better feature.

Since we're already in the internals, I'd rather preserve this core and simplify UX on top:

  • I've done the complex parts: snapshot validation, descriptor parsing, bounded IO/concurrency, DV decoding, error handling.
  • Adding filepath + dv_roaring_bytes is a thin wrapper on the current output.
  • filepath + selection_vector[list[bool]] can layer on top by decoding/expanding, keeping the compact backend as the efficient path.

Work is reusable either way. Happy to split into smaller PRs if scope is the concern. Lmk your thoughts and happy to discuss further.
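To make the "decoding/expanding" layering concrete, here is a minimal sketch that expands a compact set of deleted row indexes into a boolean selection vector. It assumes the roaring bitmap has already been decoded into plain integers (decoding is not shown) and that the row count of the file is known; `expand_to_selection_vector` is a hypothetical helper, not an API from this PR.

```python
from typing import Iterable, List


def expand_to_selection_vector(deleted_rows: Iterable[int], num_rows: int) -> List[bool]:
    """Return one flag per row: True = keep, False = deleted."""
    selection = [True] * num_rows
    for index in deleted_rows:
        if not 0 <= index < num_rows:
            raise IndexError(f"deleted row index {index} is outside 0..{num_rows - 1}")
        selection[index] = False
    return selection


# Hypothetical file with 5 rows where rows 1 and 3 were deleted.
selection_vector = expand_to_selection_vector({1, 3}, num_rows=5)
```

The expanded form costs one entry per row, which is why the comment above treats the compact representation as the efficient path.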

@ion-elgreco
Collaborator

ion-elgreco commented Feb 7, 2026

If this goes through DataFusion scan internals, that's an architectural mismatch for Polars and any other non-DF consumers.

I don't really see how this is an issue. Consumers on the Python side, including Polars, receive a RecordBatchReader. The DV hashmap is already materialized, so we could even return an Arrow table. How it's executed in Rust should have no influence on how Python consumers take in the deletion vectors.

IMO, for cross-engine consumers, a protocol/kernel-level DV boundary (or a binding-owned transform) is a better feature.

Since we're already in the internals, I'd rather preserve this core and simplify UX on top:

  • I've done the complex parts: snapshot validation, descriptor parsing, bounded IO/concurrency, DV decoding, error handling.
  • Adding filepath + dv_roaring_bytes is a thin wrapper on the current output.
  • filepath + selection_vector[list[bool]] can layer on top by decoding/expanding, keeping the compact backend as the efficient path.

The Python binding API surface should be minimal; this custom reader should never exist solely for the Python binding. That's why I'm suggesting going through replay_files.

@ethan-tyler
Collaborator Author

OK, that's fair. I still think there's value in this approach in a different context, but I'm aligned. I'll shelve this and push the simplified version. Thanks for the discussion, appreciate you engaging on this.

@ethan-tyler ethan-tyler force-pushed the feat/python-deletion-vectors-4159 branch from e864e0d to d425cdb Compare February 7, 2026 12:23
@ethan-tyler ethan-tyler marked this pull request as ready for review February 7, 2026 12:31
Expose DeltaTable.deletion_vectors() in Python backed by a new core replay API.

- Add DeltaScan::deletion_vectors() with deterministic filepath ordering and named DeletionVectorSelection output.

- Reuse shared scan metadata stream setup and document replay/drain semantics.

- Build Arrow RecordBatchReader output in Python with filepath URI + selection_vector list[bool], chunked batching, non-null list items, and without_files guard.

- Tighten concurrency by cloning table+state under one lock and using one SessionContext state for registration + scan.

- Strengthen core/Python tests for exact DV values, determinism, eager snapshot path, URI contract, empty results, and error path.

- Replace an internal DV invariant expect() with typed error propagation instead of panic.

Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
@ethan-tyler ethan-tyler force-pushed the feat/python-deletion-vectors-4159 branch from d425cdb to df987a4 Compare February 7, 2026 13:31
Comment on lines +189 to +192
ctx = match ctx.error_or() {
    Ok(ctx) => ctx,
    Err(err) => return Poll::Ready(Some(Err(err.into()))),
};
Collaborator

Why is the same thing done here?

Collaborator Author

lol right, missed this in my refactoring. Thanks for the call out, will fix it

Comment thread python/src/lib.rs Outdated
fn deletion_vector_schema() -> Arc<arrow::datatypes::Schema> {
    use arrow::datatypes::{DataType, Field, Schema};
    Arc::new(Schema::new(vec![
        Field::new("filepath", DataType::Utf8, false),
Collaborator

Let's use LargeUtf8 or Utf8View to be on the safe side.
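For context on this suggestion: in the Arrow columnar format, Utf8 arrays use 32-bit offsets, so the string data of a single array is limited to 2^31 - 1 bytes, while LargeUtf8 uses 64-bit offsets. The sketch below only does that arithmetic in plain Python; `fits_in_utf8` and the sample path are hypothetical and not part of the PR.

```python
from typing import Iterable

# Largest value a signed 32-bit offset can hold.
INT32_MAX = 2**31 - 1


def fits_in_utf8(strings: Iterable[str]) -> bool:
    """True if the combined UTF-8 byte length fits under 32-bit offsets."""
    total_bytes = sum(len(s.encode("utf-8")) for s in strings)
    return total_bytes <= INT32_MAX


# Hypothetical file path; real tables rarely approach the limit.
paths = ["s3://bucket/table/part-0.parquet"]
small_ok = fits_in_utf8(paths)
```

File paths are unlikely to reach that limit in practice, which matches the "safe side" framing of the review comment.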

@codecov

codecov Bot commented Feb 7, 2026

Codecov Report

❌ Patch coverage is 74.04580% with 68 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.62%. Comparing base (25b4968) to head (409f93f).
⚠️ Report is 1 commits behind head on main.

Files with missing lines                               Patch %  Lines
python/src/lib.rs                                      61.46%   42 Missing ⚠️
...re/src/delta_datafusion/table_provider/next/mod.rs  77.21%   2 Missing and 16 partials ⚠️
...c/delta_datafusion/table_provider/next/scan/mod.rs  79.31%   3 Missing and 3 partials ⚠️
...elta_datafusion/table_provider/next/scan/replay.rs  95.55%   1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4168      +/-   ##
==========================================
- Coverage   76.63%   76.62%   -0.01%     
==========================================
  Files         166      166              
  Lines       46796    47030     +234     
  Branches    46796    47030     +234     
==========================================
+ Hits        35861    36036     +175     
- Misses       9211     9252      +41     
- Partials     1724     1742      +18     


- simplify replay scan-file error handling chain

- propagate file-selection path parse failures instead of swallowing

- keep filepath as LargeUtf8 with inline rationale and matching schema test

- present filepath as str in Python docstring

- apply Python formatting updates for test-minimal

Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
@ethan-tyler
Collaborator Author

ethan-tyler commented Feb 7, 2026

@ion-elgreco is codecov a blocker on this one? Happy to add more tests if needed, just lmk.

@ion-elgreco ion-elgreco enabled auto-merge (squash) February 7, 2026 16:13
@ion-elgreco ion-elgreco merged commit ecc8beb into delta-io:main Feb 7, 2026
28 of 29 checks passed
@ion-elgreco
Collaborator

@ion-elgreco is codecov a blocker on this one? Happy to add more tests if needed, just lmk.

Nope :)

Labels

binding/python Issues for the Python package binding/rust Issues for the Rust crate

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Expose deletion vectors information to Python

2 participants