fix(datafusion): handle coalesced multi-file batches in next-scan #4112
Conversation
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
@roeap - This handles the case where DF coalesces batches across file boundaries: it splits on runs of file_id, applies DVs/transforms per chunk, and queues the fan-out output. Nice tie-in with the DML refactors, since those implicitly assume per-file processing remains correct even when DF gets aggressive with batching. Do we want "mixed file_id batches" to be a supported long-term contract for DeltaScanExec, or should the scan try to enforce single-file batches and do its own coalescing for perf?
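For readers skimming the diff, the run-splitting idea can be sketched like this (a simplified model with stand-in types, not the actual DeltaScanExec code, which operates on Arrow arrays):

```rust
// Sketch of the run-splitting idea: a coalesced batch may contain rows from
// several files, so we walk the file_id column and cut the batch at every
// point where the id changes. FileId is a stand-in for the real Arrow column.

type FileId = u64;

/// Split a batch (modeled as a slice of file ids, one per row) into
/// (file_id, start, len) runs of consecutive rows from the same file.
fn split_into_runs(file_ids: &[FileId]) -> Vec<(FileId, usize, usize)> {
    let mut runs = Vec::new();
    let mut start = 0;
    for i in 1..=file_ids.len() {
        if i == file_ids.len() || file_ids[i] != file_ids[start] {
            runs.push((file_ids[start], start, i - start));
            start = i;
        }
    }
    runs
}

fn main() {
    // Two files coalesced into one batch: 3 rows of file 7, 2 rows of file 9.
    let runs = split_into_runs(&[7, 7, 7, 9, 9]);
    assert_eq!(runs, vec![(7, 0, 3), (9, 3, 2)]);
}
```

Each run can then be sliced out of the batch and handed to the per-file DV/transform path unchanged.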
Codecov Report

@@            Coverage Diff             @@
##             main    #4112      +/-   ##
==========================================
+ Coverage   75.81%   75.86%   +0.05%
==========================================
  Files         165      165
  Lines       44437    44677     +240
==========================================
+ Hits        33689    33896     +207
- Misses       9058     9074      +16
- Partials     1690     1707      +17
There are many performance-sensitive users (myself included :)) of delta-rs, so in general performance work is always welcome :). That said, right now we have to establish a baseline through this migration, where in many places we need to clean up "what is happening" before going too deep into optimisation. So for now I would go the simpler route, and start optimizing once we have consolidated the pre- and post-kernel world :)
I agree - just wanted to see what your thoughts were on it. Thanks for the context; I'll keep this in mind post-migration.
Took a first look and it's looking really good.
Reading through this I started thinking about a related question. Originally this was built assuming that we would always have the DataSourceExec as input, when in fact an optimizer could push really anything as an input, which could produce records out of order.
I am thinking about replacing some of the log replay stuff with datafusion plans, or using datafusion plans inside the kernel engine. In these cases, however, data arriving out of order would be very bad, as it messes up log replay / action reconciliation.
Since we haven't seen any indication otherwise in as many years, I'll assume this has worked so far :).
Would it suffice to set required_input_ordering on a table provider when scanning logs, so that any optimization that wants to push down something that could produce data out of order would be prohibited?
In this specific case here we are, as you mention, looking at cross-file coalescing, so we would still assume data is in order? ... well, we kind of have to for DVs :).
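A minimal sketch of the kind of guard that ordering contract implies (hypothetical names and a stand-in `Action` type - nothing like this exists in the kernel engine today): log replay reconciles actions by commit version, newest first, so out-of-order input has to be detected or prohibited rather than silently consumed.

```rust
// Hypothetical guard illustrating the ordering contract discussed above:
// log replay / action reconciliation assumes actions arrive ordered by
// commit version (newest first), so an operator that reorders its input
// would silently corrupt the reconciled state.

struct Action {
    version: u64,
}

/// Return Err at the first position where the version order is violated.
fn check_replay_order(actions: &[Action]) -> Result<(), String> {
    for pair in actions.windows(2) {
        if pair[0].version < pair[1].version {
            return Err(format!(
                "out-of-order input: version {} after {}",
                pair[1].version, pair[0].version
            ));
        }
    }
    Ok(())
}

fn main() {
    let ordered = [Action { version: 3 }, Action { version: 2 }, Action { version: 2 }];
    assert!(check_replay_order(&ordered).is_ok());

    let shuffled = [Action { version: 1 }, Action { version: 3 }];
    assert!(check_replay_order(&shuffled).is_err());
}
```

Declaring the requirement on the plan (rather than checking at runtime) would let DF's optimizer reject or repair a violating plan up front.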
@roeap thanks for the review and merge! On the ordering question - yes, for this specific fix we're only dealing with batch boundary changes from coalescing, which preserves input order, so DV row-position semantics are maintained. The broader concern is completely valid: if an optimizer introduces operators that can reorder rows (repartition, interleave, etc.), DV application and log replay risk breaking. Making the ordering contract explicit via DF plan ordering requirements/guarantees seems like the right direction. For log replay, we'd need a total order (e.g. …). I did a lot of research on this tonight - I'm going to open an issue to document this and discuss options. Would love your input there.
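To illustrate why order-preserving coalescing keeps DV semantics intact (a toy model with hypothetical names, not the delta-rs implementation): each file's rows keep their original file-relative positions even when split across chunks, so a running per-file offset is enough to look rows up in the deletion vector.

```rust
use std::collections::{HashMap, HashSet};

/// Given a chunk of rows from one file, keep only the rows whose
/// file-relative position is not in that file's deletion vector.
/// `offsets` tracks how many rows of each file we have already seen.
fn apply_dv(
    rows: &[&str],
    file_id: u64,
    offsets: &mut HashMap<u64, usize>,
    dv: &HashSet<usize>, // deleted row positions within this file
) -> Vec<String> {
    let offset = offsets.entry(file_id).or_insert(0);
    let kept = rows
        .iter()
        .enumerate()
        .filter(|&(i, _)| !dv.contains(&(*offset + i)))
        .map(|(_, r)| r.to_string())
        .collect();
    *offset += rows.len();
    kept
}

fn main() {
    // File 7 has row 2 deleted; its rows arrive split across two chunks,
    // but order is preserved, so positions stay recoverable.
    let dv: HashSet<usize> = [2].into_iter().collect();
    let mut offsets = HashMap::new();

    let first = apply_dv(&["a", "b"], 7, &mut offsets, &dv); // positions 0, 1
    let second = apply_dv(&["c", "d"], 7, &mut offsets, &dv); // positions 2, 3
    assert_eq!(first, vec!["a", "b"]);
    assert_eq!(second, vec!["d"]); // "c" (position 2) was deleted
}
```

A reordering operator would break exactly this: the offset bookkeeping assumes each file's rows arrive in their original order.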
Description
Fix next-scan execution when upstream coalescing produces batches with rows from multiple files.
Changes:
- internal_datafusion_err! on unexpected file_id column type instead of panicking

Related Issue(s)
Documentation