feat: expose DV metadata and payloads as Arrow streams by ethan-tyler · Pull Request #4168 · delta-io/delta-rs

ethan-tyler · 2026-02-06T05:14:53Z

Description

Adds DeltaTable.deletion_vectors() -> RecordBatchReader returning one row per data file with a deletion vector.

Schema: filepath: utf8, selection_vector: list[bool] (true = keep, false = deleted).

Reuses the existing DataFusion replay path via replay_deletion_vectors(...). Results are deterministic and sorted by filepath.

Core changes:

DeletionVectorSelection struct, DeltaScan::deletion_vectors(), shared
scan_metadata_stream() helper to avoid drift between scan paths
Replaced internal DV expect(...) with typed error propagation

Python binding:

cloned_table_and_state() to avoid TOCTOU on table + snapshot
Chunked Arrow batch output with non-nullable list items
Preserves without_files guard behavior

Related Issue(s)

Closes Expose deletion vectors information to Python #4159

Documentation

cc @ion-elgreco

ion-elgreco · 2026-02-06T05:41:47Z

@ethan-tyler thanks for picking this up!

I think we can simplify the pr quite abit. For Python we only want to have a recordbatchreadwr with two columns: filepath, selection_vector<list[str]>

We only need a log replay as is done in here and then convert the dv_hashmap into a recordbatchreader:

https://github.com/delta-io/delta-rs/blob/main/crates%2Fcore%2Fsrc%2Fdelta_datafusion%2Ftable_provider%2Fnext%2Fscan%2Fmod.rs#L81-L82

ethan-tyler · 2026-02-06T14:55:41Z

@ion-elgreco - Agreed on simplifying the surface and it was actually my first pass. After implementing and validating, I saw real limitations with a filepath + selection_vector[list[bool]] boundary.

If this goes through DataFusion scan internals, that's an architectural mismatch for Polars and any other non-DF consumers.

cross engine API stability couples to DF internals/refactors
semantics defined by one engine's execution path, not a neutral Delta boundary
Inherent maintenance cost increases for non-DF consumers

IMO for cross engine consumers- a protocol/kernel level DV boundary (or binding owned transform) is a better feature.

Since we're already in the internals, I'd rather preserve this core and simplify UX on top:

I've done the complex parts with snapshot validation, descriptor parsing, bounded IO/concurrency, DV decoding, error handling.
We add a filepath + dv_roaring_bytes is a thin wrapper on current output.
filepath + selection_vector[list[bool]] can layer on top by decoding/expanding, keeping the compact backend as the efficient path.

Work is reusable either way. Happy to split into smaller PRs if scope is the concern. Lmk your thoughts and happy to discuss further.

ion-elgreco · 2026-02-07T01:43:53Z

If this goes through DataFusion scan internals, that's an architectural mismatch for Polars and any other non-DF consumers.

I dont really see how this is an issue. Consumers on the Python side receive a recordbatchreader including Polars. The dv hashmap is already materialized so we can actually even return an arrow table. How it's executed in rust should have no influence how Python consumers take in the deletion vectors

IMO for cross engine consumers- a protocol/kernel level DV boundary (or binding owned transform) is a better feature.

Since we're already in the internals, I'd rather preserve this core and simplify UX on top:

I've done the complex parts with snapshot validation, descriptor parsing, bounded IO/concurrency, DV decoding, error handling.

We add a filepath + dv_roaring_bytes is a thin wrapper on current output.

filepath + selection_vector[list[bool]] can layer on top by decoding/expanding, keeping the compact backend as the efficient path.

The Python binding api surface should be minimal, this custom reader should never exist solely for the Python binding. That's why I'm suggesting to go through the replay_files.

ethan-tyler · 2026-02-07T02:22:39Z

ok that's fair, I still think there’s value in this approach in a different context but I’m aligned. I’ll shelf this and push the simplified version. Thanks for the discussion, appreciate you engaging on this

Expose DeltaTable.deletion_vectors() in Python backed by a new core replay API. - Add DeltaScan::deletion_vectors() with deterministic filepath ordering and named DeletionVectorSelection output. - Reuse shared scan metadata stream setup and document replay/drain semantics. - Build Arrow RecordBatchReader output in Python with filepath URI + selection_vector list[bool], chunked batching, non-null list items, and without_files guard. - Tighten concurrency by cloning table+state under one lock and using one SessionContext state for registration + scan. - Strengthen core/Python tests for exact DV values, determinism, eager snapshot path, URI contract, empty results, and error path. - Replace an internal DV invariant expect() with typed error propagation instead of panic. Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>

Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>

ion-elgreco · 2026-02-07T13:19:23Z

+                ctx = match ctx.error_or() {
+                    Ok(ctx) => ctx,
+                    Err(err) => return Poll::Ready(Some(Err(err.into()))),
+                };


Why is the same thing done here?

lol right, missed this in my refactoring. Thanks for the call out, will fix it

ion-elgreco · 2026-02-07T13:21:01Z

+fn deletion_vector_schema() -> Arc<arrow::datatypes::Schema> {
+    use arrow::datatypes::{DataType, Field, Schema};
+    Arc::new(Schema::new(vec![
+        Field::new("filepath", DataType::Utf8, false),


Let's use large utf8 or utf8view to be on safer side

codecov · 2026-02-07T13:35:04Z

Codecov Report

❌ Patch coverage is 74.04580% with 68 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.62%. Comparing base (25b4968) to head (409f93f).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
python/src/lib.rs	61.46%	42 Missing ⚠️
...re/src/delta_datafusion/table_provider/next/mod.rs	77.21%	2 Missing and 16 partials ⚠️
...c/delta_datafusion/table_provider/next/scan/mod.rs	79.31%	3 Missing and 3 partials ⚠️
...elta_datafusion/table_provider/next/scan/replay.rs	95.55%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4168      +/-   ##
==========================================
- Coverage   76.63%   76.62%   -0.01%     
==========================================
  Files         166      166              
  Lines       46796    47030     +234     
  Branches    46796    47030     +234     
==========================================
+ Hits        35861    36036     +175     
- Misses       9211     9252      +41     
- Partials     1724     1742      +18

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

- simplify replay scan-file error handling chain - propagate file-selection path parse failures instead of swallowing - keep filepath as LargeUtf8 with inline rationale and matching schema test - present filepath as str in Python docstring - apply Python formatting updates for test-minimal Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>

ethan-tyler · 2026-02-07T16:02:55Z

@ion-elgreco is codecov a blocker on this one? Happy to add more tests if needed, just lmk.

ion-elgreco · 2026-02-07T16:13:30Z

@ion-elgreco is codecov a blocker on this one? Happy to add more tests if needed, just lmk.

Nope :)

github-project-automation Bot added this to delta-rust Feb 6, 2026

github-actions Bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Feb 6, 2026

ethan-tyler force-pushed the feat/python-deletion-vectors-4159 branch from e864e0d to d425cdb Compare February 7, 2026 12:23

ethan-tyler marked this pull request as ready for review February 7, 2026 12:31

ethan-tyler requested review from hntd187, ion-elgreco, roeap and rtyler as code owners February 7, 2026 12:31

ethan-tyler added 2 commits February 7, 2026 08:28

fix(core): adapt DV replay to FileSelection-aware ScanFileStream

df987a4

Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>

ethan-tyler force-pushed the feat/python-deletion-vectors-4159 branch from d425cdb to df987a4 Compare February 7, 2026 13:31

ion-elgreco reviewed Feb 7, 2026

View reviewed changes

ion-elgreco enabled auto-merge (squash) February 7, 2026 16:13

ion-elgreco approved these changes Feb 7, 2026

View reviewed changes

ion-elgreco merged commit ecc8beb into delta-io:main Feb 7, 2026
28 of 29 checks passed

github-project-automation Bot moved this to Done in delta-rust Feb 7, 2026

kdn36 mentioned this pull request Feb 26, 2026

[Bug]: Python delection_vectors() truncates bool list. #4235

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: expose DV metadata and payloads as Arrow streams#4168

feat: expose DV metadata and payloads as Arrow streams#4168
ion-elgreco merged 3 commits intodelta-io:mainfrom
ethan-tyler:feat/python-deletion-vectors-4159

ethan-tyler commented Feb 6, 2026 •

edited

Loading

Uh oh!

ion-elgreco commented Feb 6, 2026

Uh oh!

ethan-tyler commented Feb 6, 2026

Uh oh!

ion-elgreco commented Feb 7, 2026 •

edited

Loading

Uh oh!

ethan-tyler commented Feb 7, 2026

Uh oh!

ion-elgreco Feb 7, 2026

Uh oh!

ethan-tyler Feb 7, 2026

Uh oh!

ion-elgreco Feb 7, 2026

Uh oh!

codecov Bot commented Feb 7, 2026 •

edited

Loading

Uh oh!

ethan-tyler commented Feb 7, 2026 •

edited

Loading

Uh oh!

Uh oh!

ion-elgreco commented Feb 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ethan-tyler commented Feb 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issue(s)

Documentation

Uh oh!

ion-elgreco commented Feb 6, 2026

Uh oh!

ethan-tyler commented Feb 6, 2026

Uh oh!

ion-elgreco commented Feb 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ethan-tyler commented Feb 7, 2026

Uh oh!

ion-elgreco Feb 7, 2026

Choose a reason for hiding this comment

Uh oh!

ethan-tyler Feb 7, 2026

Choose a reason for hiding this comment

Uh oh!

ion-elgreco Feb 7, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Feb 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ethan-tyler commented Feb 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

ion-elgreco commented Feb 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ethan-tyler commented Feb 6, 2026 •

edited

Loading

ion-elgreco commented Feb 7, 2026 •

edited

Loading

codecov Bot commented Feb 7, 2026 •

edited

Loading

ethan-tyler commented Feb 7, 2026 •

edited

Loading