fix(datafusion): handle coalesced multi-file batches in next-scan #4112
Conversation
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
@roeap - This handles the case where DF coalesces batches across file boundaries: it splits on runs of file_id, applies DVs/transforms per chunk, and queues the fan-out output. Nice tie-in with the DML refactors, since those implicitly assume per-file processing remains correct even when DF gets aggressive with batching. Do we want "mixed file_id batches" to be a supported long-term contract for DeltaScanExec, or should the scan try to enforce single-file batches and do its own coalescing for perf?
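For readers skimming the diff, the run-splitting idea can be sketched like this (a simplified model with stand-in types, not the actual DeltaScanExec code, which operates on Arrow arrays):

```rust
// Sketch of the run-splitting idea: a coalesced batch may contain rows from
// several files, so we walk the file_id column and cut the batch at every
// point where the id changes. FileId is a stand-in for the real Arrow column.

type FileId = u64;

/// Split a batch (modeled as a slice of file ids, one per row) into
/// (file_id, start, len) runs of consecutive rows from the same file.
fn split_into_runs(file_ids: &[FileId]) -> Vec<(FileId, usize, usize)> {
    let mut runs = Vec::new();
    let mut start = 0;
    for i in 1..=file_ids.len() {
        if i == file_ids.len() || file_ids[i] != file_ids[start] {
            runs.push((file_ids[start], start, i - start));
            start = i;
        }
    }
    runs
}

fn main() {
    // Two files coalesced into one batch: 3 rows of file 7, 2 rows of file 9.
    let runs = split_into_runs(&[7, 7, 7, 9, 9]);
    assert_eq!(runs, vec![(7, 0, 3), (9, 3, 2)]);
}
```

Each run can then be sliced out of the batch and handed to the per-file DV/transform path unchanged.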
Codecov Report

@@            Coverage Diff             @@
##             main    #4112      +/-   ##
==========================================
+ Coverage   75.81%   75.86%   +0.05%
==========================================
  Files         165      165
  Lines       44437    44677     +240
==========================================
+ Hits        33689    33896     +207
- Misses       9058     9074      +16
- Partials     1690     1707      +17
There are many performance-sensitive users (myself included :)) of delta-rs, so in general performance work is always welcome :). That said, right now we have to establish a baseline through this migration, where in many places we need to clean up "what is happening" before going too deep into optimisation. So for now I would go the simpler route, and start optimizing once we have consolidated the pre- and post-kernel world :)
I agree - just wanted to see what your thoughts were on it. Thanks for the context; I'll keep this in mind post-migration.
Took a first look and it's looking really good.
Reading through this I started thinking about a related question. Originally this was built assuming that we would always have the DataSourceExec as input, when in fact an optimizer could push really anything as an input, which could produce records out of order.
I am thinking about replacing some of the log replay stuff with datafusion plans, or using datafusion plans inside the kernel engine. In these cases, however, data arriving out of order would be very bad, as it messes up log replay / action reconciliation.
Since we haven't seen any indication otherwise in as many years, I'll assume this has worked so far :).
Would it suffice to set required_input_ordering on a table provider when scanning logs, so that any optimization that wants to push down something that could produce data out of order would be prohibited?
In this specific case here we are, as you mention, looking at cross-file coalescing, so we would still assume data is in order? ... well, we kind of have to for DVs :).
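A minimal sketch of the kind of guard that ordering contract implies (hypothetical names and a stand-in `Action` type - nothing like this exists in the kernel engine today): log replay reconciles actions by commit version, newest first, so out-of-order input has to be detected or prohibited rather than silently consumed.

```rust
// Hypothetical guard illustrating the ordering contract discussed above:
// log replay / action reconciliation assumes actions arrive ordered by
// commit version (newest first), so an operator that reorders its input
// would silently corrupt the reconciled state.

struct Action {
    version: u64,
}

/// Return Err at the first position where the version order is violated.
fn check_replay_order(actions: &[Action]) -> Result<(), String> {
    for pair in actions.windows(2) {
        if pair[0].version < pair[1].version {
            return Err(format!(
                "out-of-order input: version {} after {}",
                pair[1].version, pair[0].version
            ));
        }
    }
    Ok(())
}

fn main() {
    let ordered = [Action { version: 3 }, Action { version: 2 }, Action { version: 2 }];
    assert!(check_replay_order(&ordered).is_ok());

    let shuffled = [Action { version: 1 }, Action { version: 3 }];
    assert!(check_replay_order(&shuffled).is_err());
}
```

Declaring the requirement on the plan (rather than checking at runtime) would let DF's optimizer reject or repair a violating plan up front.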
@roeap thanks for the review and merge! On the ordering question - yes, for this specific fix we're only dealing with batch boundary changes from coalescing, which preserves input order, so DV row-position semantics are maintained. The broader concern is completely valid: if an optimizer introduces operators that can reorder rows (repartition, interleave, etc.), DV application and log replay risk breaking. Making the ordering contract explicit via DF plan ordering requirements/guarantees seems like the right direction. For log replay, we'd need a total order (e.g. …). I did a lot of research on this tonight - I'm going to open an issue to document this and discuss options. Would love your input there.
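To illustrate why order-preserving coalescing keeps DV semantics intact (a toy model with hypothetical names, not the delta-rs implementation): each file's rows keep their original file-relative positions even when split across chunks, so a running per-file offset is enough to look rows up in the deletion vector.

```rust
use std::collections::{HashMap, HashSet};

/// Given a chunk of rows from one file, keep only the rows whose
/// file-relative position is not in that file's deletion vector.
/// `offsets` tracks how many rows of each file we have already seen.
fn apply_dv(
    rows: &[&str],
    file_id: u64,
    offsets: &mut HashMap<u64, usize>,
    dv: &HashSet<usize>, // deleted row positions within this file
) -> Vec<String> {
    let offset = offsets.entry(file_id).or_insert(0);
    let kept = rows
        .iter()
        .enumerate()
        .filter(|&(i, _)| !dv.contains(&(*offset + i)))
        .map(|(_, r)| r.to_string())
        .collect();
    *offset += rows.len();
    kept
}

fn main() {
    // File 7 has row 2 deleted; its rows arrive split across two chunks,
    // but order is preserved, so positions stay recoverable.
    let dv: HashSet<usize> = [2].into_iter().collect();
    let mut offsets = HashMap::new();

    let first = apply_dv(&["a", "b"], 7, &mut offsets, &dv); // positions 0, 1
    let second = apply_dv(&["c", "d"], 7, &mut offsets, &dv); // positions 2, 3
    assert_eq!(first, vec!["a", "b"]);
    assert_eq!(second, vec!["d"]); // "c" (position 2) was deleted
}
```

A reordering operator would break exactly this: the offset bookkeeping assumes each file's rows arrive in their original order.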
Description
Fix next-scan execution when upstream coalescing produces batches with rows from multiple files.
Changes:
- internal_datafusion_err! on unexpected file_id column type instead of panicking

Related Issue(s)
Documentation