
fix(datafusion): handle coalesced multi-file batches in next-scan#4112

Merged
roeap merged 1 commit into delta-io:main from
ethan-tyler:fix/next-scan-coalesced-batches
Jan 21, 2026

Conversation

@ethan-tyler
Collaborator

Description

Fix next-scan execution when upstream coalescing produces batches with rows from multiple files.

Changes:

  • Split incoming batches into contiguous file_id runs before applying DV masks/transforms
  • Buffer fan-out outputs via VecDeque to preserve row order
  • Return internal_datafusion_err! on unexpected file_id column type instead of panicking
  • Add tests for interleaved file IDs, fanout, and invalid/null file_id paths
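
The run-splitting step can be sketched in plain Rust, independent of Arrow. The real code operates on a `RecordBatch` column; the function and variable names here are illustrative only:

```rust
use std::collections::VecDeque;

/// Split a batch's file_id column into contiguous runs, returning
/// (file_id, start_row, len) for each run. A batch that was coalesced
/// across file boundaries yields more than one run.
fn file_id_runs(file_ids: &[u64]) -> Vec<(u64, usize, usize)> {
    let mut runs = Vec::new();
    let mut start = 0;
    for i in 1..=file_ids.len() {
        if i == file_ids.len() || file_ids[i] != file_ids[start] {
            runs.push((file_ids[start], start, i - start));
            start = i;
        }
    }
    runs
}

fn main() {
    // Coalesced batch with rows from two files, one of them interleaved.
    let ids = [1u64, 1, 2, 2, 2, 1];
    let runs = file_id_runs(&ids);
    assert_eq!(runs, vec![(1, 0, 2), (2, 2, 3), (1, 5, 1)]);

    // Per-run outputs are buffered in a FIFO queue so row order is kept.
    let mut out: VecDeque<(u64, usize, usize)> = VecDeque::new();
    out.extend(runs);
    assert_eq!(out.pop_front(), Some((1, 0, 2)));
    println!("ok");
}
```

Each `(start, len)` pair would then drive a zero-copy `RecordBatch::slice` before the per-file DV mask/transform is applied.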

Related Issue(s)

Documentation

Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
@github-actions github-actions Bot added the binding/rust Issues for the Rust crate label Jan 21, 2026
@ethan-tyler
Collaborator Author

@roeap - This handles the case where DF coalesces batches across file boundaries. It splits on file_id runs, applies DV masks/transforms per chunk, and queues the fan-out output.

Nice tie-in with the DML refactors, since those implicitly assume per-file processing remains correct even when DF gets aggressive with batching.

Do we want “mixed file_id batches” to be a supported long term contract for DeltaScanExec, or should the scan try to enforce single file batches and do its own coalescing for perf?

@codecov

codecov Bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 86.44068% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.86%. Comparing base (054c18a) to head (ba765ce).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
.../delta_datafusion/table_provider/next/scan/exec.rs 86.44% 20 Missing and 20 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4112      +/-   ##
==========================================
+ Coverage   75.81%   75.86%   +0.05%     
==========================================
  Files         165      165              
  Lines       44437    44677     +240     
  Branches    44437    44677     +240     
==========================================
+ Hits        33689    33896     +207     
- Misses       9058     9074      +16     
- Partials     1690     1707      +17     


@roeap
Collaborator

roeap commented Jan 21, 2026

Do we want “mixed file_id batches” to be a supported long term contract for DeltaScanExec, or should the scan try to enforce single file batches and do its own coalescing for perf?

There are many performance-sensitive users (myself included :)) of delta-rs, so in general performance is always welcome :).

That said, right now we have to establish a baseline through this migration, where in many places we need to clean up "what is happening" before going too deep into optimisation.

So for now I would go the simpler route, and start optimizing once we have consolidated the pre- and post-kernel worlds :)

@ethan-tyler
Collaborator Author

There are many performance-sensitive users (myself included :)) of delta-rs, so in general performance is always welcome :).

That said, right now we have to establish a baseline through this migration, where in many places we need to clean up "what is happening" before going too deep into optimisation.

So for now I would go the simpler route, and start optimizing once we have consolidated the pre- and post-kernel worlds :)

I agree - just wanted to see what your thoughts were on it. Thanks for the context and will keep this in mind post migration.

@ethan-tyler ethan-tyler marked this pull request as ready for review January 21, 2026 21:10
Collaborator

@roeap roeap left a comment


Took a first look and it's looking really good.

Reading through this I started thinking about a related question. Originally this was built assuming we would always have the DataSourceExec, when in fact an optimizer could push really anything as an input, which could produce records out of order.

I am thinking about replacing some of the log replay stuff with datafusion plans, or using datafusion plans inside the kernel engine. In these cases, however, data arriving out of order would be very bad, as it messes up log replay / action reconciliation.

Since we haven't seen any indication otherwise in as many years, I'll assume this has worked so far :).

Would it suffice to set required_input_ordering on a table provider when scanning logs, so that any optimization that wants to push down something that could produce data out of order would be prohibited?

In this specific case here we are, as you mention, looking at cross-file coalescing, so we would still assume data is in order? ... well, we kind of have to for DVs :).
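
Independent of what ordering DataFusion ends up guaranteeing, one cheap runtime safeguard is to verify that sort keys arrive non-decreasing and fail fast otherwise. A stdlib-only sketch (the generic key and error shape are illustrative, not the actual exec code):

```rust
/// Checks that a stream of sort keys is non-decreasing, catching
/// inputs that an upstream operator has reordered. Returns Err with
/// the index of the first violation.
fn check_non_decreasing<K: Ord + Copy>(keys: &[K]) -> Result<(), usize> {
    let mut prev: Option<K> = None;
    for (i, &k) in keys.iter().enumerate() {
        if let Some(p) = prev {
            if k < p {
                return Err(i);
            }
        }
        prev = Some(k);
    }
    Ok(())
}

fn main() {
    // Keys as (version, action_index) pairs; tuples derive lexicographic Ord.
    assert_eq!(check_non_decreasing(&[(1, 0), (1, 1), (2, 0)]), Ok(()));
    assert_eq!(check_non_decreasing(&[(1, 0), (2, 0), (1, 1)]), Err(2));
    println!("ok");
}
```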

Collaborator

@roeap roeap left a comment


👍

@roeap roeap merged commit c36d665 into delta-io:main Jan 21, 2026
26 checks passed
@ethan-tyler
Copy link
Copy Markdown
Collaborator Author

@roeap thanks for the review and merge!

On the ordering question - yes, for this specific fix we're only dealing with batch boundary changes from coalescing, which preserves input order. DV row position semantics are maintained.

The broader concern is completely valid. If an optimizer introduces operators that can reorder rows (repartition, interleave, etc.), DV application and log replay risk breaking. Making the ordering contract explicit via DF plan ordering requirements/guarantees seems like the right direction. For log replay, we'd need a total order (e.g. (version, action_index)), which likely implies single partition processing or an ordered merge.
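
A total order over (version, action_index) across several per-partition streams, each already sorted, amounts to a k-way ordered merge. A minimal stdlib sketch (names and the in-memory `Vec` inputs are illustrative; a real implementation would merge streams):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// K-way merge of per-partition streams that are each already sorted
/// by (version, action_index), producing one totally ordered stream.
/// Reverse turns std's max-heap into a min-heap.
fn merge_sorted(partitions: Vec<Vec<(u64, u64)>>) -> Vec<(u64, u64)> {
    let mut heap = BinaryHeap::new();
    for (p, part) in partitions.iter().enumerate() {
        if let Some(&key) = part.first() {
            heap.push(Reverse((key, p, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((key, p, i))) = heap.pop() {
        out.push(key);
        if let Some(&next) = partitions[p].get(i + 1) {
            heap.push(Reverse((next, p, i + 1)));
        }
    }
    out
}

fn main() {
    let parts = vec![
        vec![(1, 0), (2, 1)], // partition 0, sorted
        vec![(1, 1), (3, 0)], // partition 1, sorted
    ];
    assert_eq!(merge_sorted(parts), vec![(1, 0), (1, 1), (2, 1), (3, 0)]);
    println!("ok");
}
```

The alternative mentioned above, single-partition processing, avoids the merge entirely at the cost of parallelism.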

I did a lot of research on this tonight - I'm going to open an issue to document this and discuss options. Would love your input there.

