chore(deps): upgrade datafusion to 52.0.0 (#4092)
Conversation
ACTION NEEDED: delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
"Arrow57 requires matching scalar types for comparison kernels" @alamb is there a reason why arrow 57 is more restrictive with literal comparison? |
Codecov Report ❌
Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## main #4092 +/- ##
==========================================
- Coverage 76.17% 76.07% -0.11%
==========================================
Files 165 164 -1
Lines 44692 45073 +381
Branches 44692 45073 +381
==========================================
+ Hits 34046 34291 +245
- Misses 9024 9140 +116
- Partials 1622 1642 +20

☔ View full report in Codecov by Sentry.
column_statistics: reordered_stats,
};

// Add unknown stats for file_column if present (it's added as a partition field but not in the original schema)
Just curious, what does datafusion do if you have a column in the schema but no statistics at all, not even a mention of the column?
Good question: Statistics.column_statistics is positional and must align 1:1 with the TableSchema order. Missing entries break downstream assumptions, so we remap by name and fill gaps with ColumnStatistics::new_unknown().
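The remapping described here can be sketched with a simplified model. The names and types below are illustrative stand-ins, not the actual delta-rs/DataFusion types:

```rust
use std::collections::HashMap;

// Simplified stand-in for DataFusion's per-column statistics.
#[derive(Clone, Debug, PartialEq)]
enum ColStat {
    Known(u64), // e.g. a null count
    Unknown,    // stands in for ColumnStatistics::new_unknown()
}

// Remap stats given in source order into the target schema order,
// filling any missing columns with Unknown.
fn remap_stats(source: &[(&str, ColStat)], target_order: &[&str]) -> Vec<ColStat> {
    let by_name: HashMap<&str, &ColStat> = source.iter().map(|(n, s)| (*n, s)).collect();
    target_order
        .iter()
        .map(|name| {
            by_name
                .get(name)
                .map(|s| (*s).clone())
                .unwrap_or(ColStat::Unknown)
        })
        .collect()
}
```

The key point is that the output is positional in target order, so every column in the scan schema gets an entry even when the source never mentioned it.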
base: &Schema,
) -> Result<SchemaRef> {
    // IMPORTANT: This schema is used for Parquet reading and predicate pushdown.
    // It must NOT apply view-type conversions (Utf8View/BinaryView) because:
So what happens if you do not do that?
Pushdown errors on type mismatch. Parquet stores Utf8/Binary; if we told ParquetSource to operate on Utf8View/BinaryView, the comparison kernels would see different types. We keep base types for the Parquet schema, apply view conversions only in the result schema, and normalize view-typed literals before pushdown.
| "id": Array( | ||
| ["1", "2", "3", "4", "5"], | ||
| type=ArrowField("id", DataType.string_view(), nullable=True), | ||
| type=ArrowField("id", DataType.string(), nullable=True), |
We should revert to view types for Python
Agreed. The DataType.string() changes were a workaround. The Rust-side literal normalization is already in place, so reverting to DataType.string_view() should work.
I am not sure what this is referring to; I don't think Arrow changed in this regard. DataFusion has a bunch of code to coerce literals to the correct type, and I think some of the work to change how pushdown worked means that code may no longer be called implicitly, so you may have to call it explicitly. Maybe @adriangb can point you at the correct APIs etc. (e.g. the SchemaAdapter stuff)
@ion-elgreco: With DF52, view types are more prevalent (schema_force_view_types defaults to true). In next-scan we keep the Parquet pushdown schema as base Utf8/Binary and apply view conversions only in the result schema. Predicates may contain view-typed literals from the query, so we normalize those to base types before pushdown. @alamb: In delta-rs we build pushdown predicates and call logical2physical directly, which bypasses the analyzer's coercion passes. DF's physical planning assumes coercion is already done and logical2physical won't coerce for us, so we normalize literals to match the scan schema explicitly before calling it. If there's a canonical DF52 API for predicate coercion we should use instead, happy to switch. Will update the PR description to clarify this is a pushdown plumbing issue, not Arrow strictness.
DataFusion now does something similar to this in 52. Specifically, it uses … There is a corresponding … So it sounds like (without looking at the code in detail) you may be able to use those two methods now instead of custom logic in delta-rs.
Thanks @alamb! Super helpful context, and I agree this is the right direction. Why I went with the literal-normalization shim for now: next-scan builds the Parquet pushdown predicate already in file-schema space (base Utf8/Binary), then calls logical2physical. The adapter expects a table-schema expression to translate, and using it would require restructuring the predicate-planning flow. I had adapter wiring in #4054, but per guidance I split it out to keep the DF52 upgrade reviewable. This PR is the minimal extraction; the adapter approach is planned as a focused follow-up. Question for that follow-up: would you recommend wiring DefaultPhysicalExprAdapterFactory and PhysicalExprSimplifier directly (mirroring DF's Parquet scan path), or is a custom adapter more appropriate given next-scan's kernel/DV constraints?
@roeap @ion-elgreco - confirming alignment on path forward: This PR: Minimal DF52 upgrade (API migrations, batch coalescing fix) and a narrow, temporary shim to normalize Utf8View/BinaryView literals for next-scan Parquet pushdown. Keeps base-typed Parquet read schema and avoids type mismatch failures. Proposed plan:
Does this plan work, or would you prefer different ordering or scope? |
a7b9e84 to 3e0ebec (Compare)
BTW @adriangb added a helper for batch conversion here: … This will be released as part of the 52.1.0 release, which I anticipate towards the end of this week, if that is helpful.
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Remove SchemaAdapter usage and switch scans to DF52 TableSchema + updated FileScanConfigBuilder. Update ColumnStatistics literals and align stats expectations with DF52 precision. Adjust ops session handling and python datafusion glue for DF52. Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
DataFusion-ffi 52 requires a TaskContextProvider + codec when exporting TableProvider from the Python binding. Also mark byte_size statistics as absent when building PartitionedFile stats. Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
The DF52 scan visitor expects to extract the KernelScanPlan from DeltaScanExec; keep a small accessor to avoid depending on struct field visibility. Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
2af403b to 9d125aa (Compare)
Enable view types by default and normalize predicate pushdown against Parquet base types while keeping the synthetic file_id as Dictionary<UInt16, Utf8>. Also update Python tests to expect view types, relax DF statistics assertions, and disable hash-join IN-list pushdown to avoid a DF52 dictionary panic. Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
roeap
left a comment
Sorry for so many comments @ethan-tyler - it's just that we are reverting some things that were at the top of my list when migrating schema providers, i.e. not needing to do so much schema gymnastics in larger parts of the codebase and emitting nicer physical types (i.e. string views).
Over the migration things got continuously better in that regard (not perfect though 😆), and keep getting better as we migrate.
So at least I would love to understand why we need to regress with DF52.
impl DeltaScanStream {
    /// Apply the per-file transformation to a RecordBatch.
    fn batch_project(&mut self, batch: RecordBatch) -> Result<RecordBatch> {
    fn batch_project(&mut self, batch: RecordBatch) -> Result<Vec<RecordBatch>> {
Not sure what the exact case is, but I was under the assumption that ParquetExec would not coalesce batches and we could expect that one batch always originates from one file. If this is not something we can rely on, we should at the very least be able to rely on consecutive rows being from one file. In that case we might want to look into the partition kernel, which we also use when writing data to partitions.
You're right that ParquetExec typically yields single-file batches, and the common case is one output batch. We return Vec<RecordBatch> defensively, since DF can hand us a multi-file batch (e.g., after coalescing), and DV masks and per-file transforms must apply in file context. The implementation does a linear scan for file_id run boundaries and a zero-copy slice() per run, with minimal overhead.
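The run-boundary scan described above can be sketched as follows. Assuming the batch's file-id column has been read out as a plain slice, this is the linear pass that finds (offset, length) runs to feed zero-copy RecordBatch::slice calls (illustrative, not the actual delta-rs code):

```rust
// Return (offset, length) for each run of consecutive equal file ids.
fn file_runs(file_ids: &[u32]) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start = 0;
    for i in 1..=file_ids.len() {
        // A run ends at the end of input or when the id changes.
        if i == file_ids.len() || file_ids[i] != file_ids[i - 1] {
            runs.push((start, i - start));
            start = i;
        }
    }
    runs
}
```

Each (offset, length) pair would then become a `batch.slice(offset, length)` call, which shares the underlying buffers rather than copying rows.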
statistics
    .column_statistics
    .push(ColumnStatistics::new_unknown());
In a local attempt I found it advantageous to push this right when we create the partitioned file and also assign the other stats.
Fixed. I moved the file_id stats to PartitionedFile construction time as you suggested. We were double-appending file_id stats. Now we use partition_values and with_statistics for per-file stats and call compute_all_files_statistics with the full table schema.
fn strip_view_types_from_schema(schema: &Schema) -> Schema {
    Schema::new(
        schema
            .fields()
            .iter()
            .map(|f| {
                let dt = strip_view_types_from_data_type(f.data_type());
                if &dt == f.data_type() {
                    f.as_ref().clone()
                } else {
                    f.as_ref().clone().with_data_type(dt)
                }
            })
            .collect::<Vec<_>>(),
    )
}
I somehow cannot accept that we need to do this?
What starts failing if we don't? I'm just trying to get rid of more and more of our custom casting code. What specifically starts failing so that we need to add this?
Removed strip_view_types_from_schema. The issue was that Parquet pushdown breaks when DF produces Utf8View/BinaryView literals but our read schema is base Utf8/Binary. DF's DefaultPhysicalExprAdapter casts columns but not literals (upstream TODO). We now use DF's adapter plus a literal rewrite (view to base) for pushdown. I also added regression tests in scan/mod.rs.
PhysicalExprSimplifier moves the casts from columns to literals. Maybe look at BatchAdapter (either to use it or as inspiration; it's in main and will be in 52.1.0, not released yet)?
}

#[tokio::test]
async fn test_view_literal_conversion_in_predicate_pushdown() -> TestResult {
this test name makes me think it's the new physical expression conversion we need?
Renamed. The test covers the literal rewrite for Parquet pushdown when the physical schema uses base types.
let plan = get_read_plan(
    &session.state(),
    files_by_store,
    &arrow_schema,
the schema we pass into get_read_plan should always be the most specific physical schema around ... i.e. contain view types.
If this is what we are testing here, we need to update the schema we are passing to have view types.
Intentionally passing base types here. get_read_plan builds parquet_read_schema, which is used for Parquet reads and predicate-pushdown evaluation. Pushdown happens before DF's schema casts are applied; if we pass view types but the Parquet files use base types, we hit Utf8 == Utf8View mismatches during filter evaluation. DF coercions also aren't reliably recursive for nested types.
View types are still exposed at the DeltaScan boundary via table_schema, just not in the read/pushdown path. We adapt predicates via with_expr_adapter to rewrite view literals to base for pushdown.
If you'd rather pass view types here and handle mismatches differently, I'm open to discussing; this was the least invasive fix I found.
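Since the coercion gap is worst for nested types, the view-to-base conversion has to walk the whole type tree. A minimal sketch with a mock type enum (the real code operates on arrow's DataType; this model is only illustrative):

```rust
// Mock of the relevant arrow DataType variants.
#[derive(Clone, Debug, PartialEq)]
enum Dt {
    Int32,
    Utf8,
    Utf8View,
    Binary,
    BinaryView,
    List(Box<Dt>),
    Struct(Vec<(String, Dt)>),
}

// Recursively replace view types with their base equivalents,
// including inside List and Struct children.
fn strip_views(dt: &Dt) -> Dt {
    match dt {
        Dt::Utf8View => Dt::Utf8,
        Dt::BinaryView => Dt::Binary,
        Dt::List(inner) => Dt::List(Box::new(strip_views(inner))),
        Dt::Struct(fields) => Dt::Struct(
            fields
                .iter()
                .map(|(name, child)| (name.clone(), strip_views(child)))
                .collect(),
        ),
        other => other.clone(),
    }
}
```

A flat, non-recursive mapping would leave a Utf8View buried inside a List or Struct untouched, which is exactly the nested-type gap mentioned above.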
for the partition columns yes, but especially for the read paths they are interesting? At least that's where we are getting most of the benefit, I believe.
Yep - the read paths are actually the main win here. We always dictionary-encode the synthetic file-id (file_url) when building PartitionedFile.partition_values, and since the file-id column itself is Dictionary<UInt16, Utf8>, we avoid repeating the full URL for every row. Partition cols are handled above Parquet via kernel transforms in the next provider, and they don't materially change the Parquet read plan.
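The space win from dictionary-encoding the synthetic file-id can be illustrated with a toy encoder: each distinct URL is stored once, and rows carry only a small integer key. This is a model of what Dictionary<UInt16, Utf8> buys, not arrow code:

```rust
// Encode repeated strings as (distinct values, per-row u16 keys).
fn dict_encode(rows: &[&str]) -> (Vec<String>, Vec<u16>) {
    let mut values: Vec<String> = Vec::new();
    let mut keys = Vec::with_capacity(rows.len());
    for row in rows {
        // Reuse an existing key if the value was seen before.
        let key = match values.iter().position(|v| v.as_str() == *row) {
            Some(i) => i as u16,
            None => {
                values.push(row.to_string());
                (values.len() - 1) as u16
            }
        };
        keys.push(key);
    }
    (values, keys)
}
```

For a million rows drawn from a handful of files, each URL is stored once and the per-row cost is two bytes, instead of repeating the full URL per row.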
TableProviderFilterPushDown::Exact
} else {
    TableProviderFilterPushDown::Inexact
},
Some(expr),
Do we really want to declare more as exact? While this might be true, I did not find a precedent in datafusion where we set this to be exact for anything other than partition columns.
TableProvider implementations should only declare as exact predicates they can completely evaluate during the call to scan (which should be replaced with scan_with_args, btw). Everything else gets resolved later.
Agreed with @adriangb - partition-only predicates (fully evaluated during scan) now return Exact. Everything else returns Inexact, including Parquet pushdown predicates (still attempted when safe, but DF retains a post-scan Filter as the correctness boundary). Renamed parquet_pushdown_exact to parquet_pushdown_enabled for clarity.
Note for completeness: the Inexact filters may be removed by the filter pushdown physical optimizer if the data source accepts them as exact…
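The decision rule agreed on above can be stated compactly. In this sketch (illustrative names, not the delta-rs API), a predicate is Exact only if every column it references is a partition column, i.e. the provider can fully evaluate it during scan:

```rust
#[derive(Debug, PartialEq)]
enum Pushdown {
    Exact,   // provider fully evaluates the predicate during scan
    Inexact, // engine retains a post-scan Filter as the correctness boundary
}

// Classify a predicate by the columns it references.
fn classify(predicate_columns: &[&str], partition_columns: &[&str]) -> Pushdown {
    if predicate_columns.iter().all(|c| partition_columns.contains(c)) {
        Pushdown::Exact
    } else {
        Pushdown::Inexact
    }
}
```

Anything touching a data column falls back to Inexact, so file-level stats skipping can still be attempted without promising row-level completeness.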
.set_bool("datafusion.sql_parser.enable_ident_normalization", false),
.set_bool("datafusion.sql_parser.enable_ident_normalization", false)
.set_bool("datafusion.execution.parquet.schema_force_view_types", true)
// Work around DataFusion 52.0.0 bug: hash-join dynamic filtering (IN-list
do we have a link to this bug?
I can very much believe that dictionary arrays are not handled correctly (we've also found plenty of panics in DataFusion because of it) but it would be great to track the issue upstream.
I will create an issue and link it.
// DF52's TableSchema outputs columns as: file_schema + partition_columns
// Source stats are indexed by TableConfiguration.schema() field order, which may differ
// from the scan schema order. We need name-based remapping, not index-based.
where does this reordering come in? My hope would be that the schema we push into the various operations pushes through ...
That said, the table provider needs to go, and any changes that are purely motivated by needing to migrate this provider should ideally be included in a pre-refactor or similar so that we can revert them easily. I'm just a few commits away from removing that thing entirely.
Stats reordering happens in table_provider.rs via FileScanConfigBuilder::with_statistics. DF52 expects column_statistics indexed as file_schema || partition_columns, but the kernel returns them in TableConfiguration.schema() order, so we rebuild by name lookup. This is scoped to the current provider and should be easy to drop when the provider is removed.
but parquet should have no awareness of partition columns, at the very least in the new provider. Partition columns should be 100% handled at the table-provider level.
The only partition column we assign at the parquet level is the file-id column.
At least in my mind, the parquet reader would read the file as whatever schema (physical types and ordering) we push into it, of course sans any partition column. Where these appear would ultimately be decided by the expression we apply from the scan metadata.
Come to think of it, could it be that some of the changes to the parquet schema in KernelScanPlan are used for this deprecated provider?
If so, can we keep that separate? It's just hard to believe that DF would get so much worse in one release 😆.
Agreed, Parquet shouldn't know about Delta partition columns. This remap handles DF's TableSchema/FileScanConfig expectations. DF52 appends partition fields after the file schema (filled from PartitionedFile.partition_values), and Statistics.column_statistics is positional, which requires name-based reordering when source stats are in TableConfiguration.schema() order. In next, the only Parquet partition field is file_id; Delta partition columns are handled above Parquet.
The KernelScanPlan parquet schema changes stay separate. The remapping fixes a latent stats-index vs scan-schema-index mismatch that DF52 exposes.
but parquet should have no awareness of partition columns. at the very least in the new provider. Partition columns should be 100% handled on the table provider level
ParquetOpener does have awareness of DataFusion's partition columns. Before we made that change, filters like part_col = 1 OR file_col = 2 would give wrong results when filter pushdown was enabled. Not sure if Delta partition columns are handled completely separately...
@adriangb - The ParquetOpener partition handling you're referencing (part_col = 1 OR file_col = 2 correctness) applies to DF's built-in partitioning. Next-scan sidesteps this entirely; Delta partition columns are handled outside DF's partition machinery:
- Parquet layer: only sees file_id (provider metadata). No Delta partition columns in parquet_read_schema.
- Value injection: kernel transforms add partition values post-read, bypassing FileScanConfig.partition_values.
- Filter pushdown: partition predicates use the kernel's stats-based file skipping, not ParquetOpener's partition-aware path.
The kernel owns partition semantics and injects values above the scan.
The stats remapping in the legacy provider exists because it does use DF's partition-column machinery and must align with DF52's file_schema + partition_columns index expectations. This will go away once the legacy provider is removed.
| "date": Array( | ||
| ["2021-01-01", "2021-01-02", "2021-01-03", "2021-12-31"], | ||
| ArrowField("date", type=DataType.string_view(), nullable=True), | ||
| ArrowField("date", type=DataType.string(), nullable=True), |
we really want to keep outputting StringViews; why is this change needed?
You're right, this change was an overcorrection and has been reverted. We still output StringView (via schema_force_view_types=true). The base types in parquet_read_schema are only for the pushdown path; user-visible output remains view types.
fn reset_state(&self) -> Arc<RwLock<dyn LazyBatchGenerator>> {
    Arc::new(RwLock::new(TestBatchGenerator::new(self.data.clone())))
}
where is this functionality needed (in DF)? Not sure how we would even reset that, but this just copies it?
@adriangb is right again :) - reset_state is part of DF's LazyBatchGenerator contract, and DF calls it when it needs to rerun a scan from the beginning.
For this test generator, we just clone the backing Vec<RecordBatch>. For real Python Arrow streams (which can't rewind), ArrowStreamBatchGenerator::reset_state returns an ExhaustedStreamGenerator that errors with guidance to buffer input if plan re-execution is required.
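The buffered reset strategy can be modeled without the DF types. This toy generator shows the cheap case, where reset_state just hands back a fresh generator over cloned data (illustrative only; the real trait returns Arc<RwLock<dyn LazyBatchGenerator>>):

```rust
// Toy stand-in for a buffered batch generator.
#[derive(Clone)]
struct BufferedGen {
    data: Vec<&'static str>,
    pos: usize,
}

impl BufferedGen {
    // Yield the next "batch", advancing the cursor.
    fn next_batch(&mut self) -> Option<&'static str> {
        let item = self.data.get(self.pos).copied();
        self.pos += 1;
        item
    }

    // Mirrors reset_state: a fresh generator over the same data, rewound to the start.
    fn reset_state(&self) -> BufferedGen {
        BufferedGen { data: self.data.clone(), pos: 0 }
    }
}
```

A one-shot stream has no equivalent of this clone; its reset_state can only return something that errors, which matches the ExhaustedStreamGenerator behavior described above.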
Align workspace and python bindings on DataFusion 52.1.0 to pick up bugfixes and keep versions consistent. Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Update next-scan planning/execution to handle DF52.1 schema/view-type behavior and keep predicate pushdown stable. Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Make python tests robust to DF52.1 view/base string and binary types, and work around pyarrow sort limitations for view arrays. Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Run cargo fmt to keep formatting consistent. Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
roeap
left a comment
Thanks for sticking with this @ethan-tyler!
Just flushing some comments, but the majority looks great!
The one thing I don't quite understand is where the updates to the parquet read schema in next/scan/plan.rs are coming from. It was a lot of effort to avoid such schema gymnastics this time around, as they have been haunting us for years, and I hope not to repeat old mistakes in the new provider.
Are we also using this in code paths that affect the old provider?
Realising I am mostly looking at the diff - I'll dive deeper to understand what is going on.
And thanks again for your patience 😄.
});
// NOTE: `PartitionedFile::with_statistics` appends exact stats for partition columns based
// on `partition_values`, so partition values must be set first.
partitioned_file.partition_values = vec![file_value.clone()];
So by inverting the order of the assignment we get stats auto-assigned? nice 👍
// DataFusion's DefaultPhysicalExprAdapter inserts casts on columns, but does not
// currently rewrite literals to match the physical schema. Parquet predicate
// evaluation/pushdown happens against the physical file schema prior to those casts.
//
// See upstream TODO:
// datafusion-physical-expr-adapter/src/schema_rewriter.rs
// "TODO: ... move the cast from the column to literal expressions ..." (52.1.0).
this is something we can expect in 52.1, right? i.e. the base optimizer that does this?
No, sorry, that comment was confusing; I've adjusted it. The adapter still casts columns but not literals; moving the cast from the column to the literal remains a TODO upstream (there was an earlier attempt). We should keep the workaround until there is an upstream fix.
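The direction of the rewrite matters: instead of casting the column up to the literal's view type (which the pushdown path cannot evaluate against base-typed Parquet data), the view-typed literal is rewritten down to the base type. A toy expression model of that rewrite (illustrative, not DataFusion's Expr type):

```rust
// Minimal expression tree with base and view string literals.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Column(&'static str),
    LitUtf8(String),
    LitUtf8View(String),
    Eq(Box<Expr>, Box<Expr>),
}

// Rewrite every view-typed literal in the tree to its base-typed equivalent.
fn normalize_for_pushdown(e: Expr) -> Expr {
    match e {
        Expr::LitUtf8View(v) => Expr::LitUtf8(v),
        Expr::Eq(l, r) => Expr::Eq(
            Box::new(normalize_for_pushdown(*l)),
            Box::new(normalize_for_pushdown(*r)),
        ),
        other => other,
    }
}
```

After this pass, both sides of a comparison reference the base string type, so the pushdown evaluation never sees a mixed Utf8/Utf8View comparison.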
// extended schema with missing column and different data types
let arrow_schema_extended = Arc::new(Schema::new(vec![
    Field::new("id", DataType::Int32, false),
    Field::new("value", DataType::Utf8View, true),
    Field::new("value2", DataType::Utf8View, true),
]));
let plan = get_read_plan(
    &session.state(),
    files_by_store,
    &arrow_schema_extended,
    Some(1),
    &file_id_field,
    None,
)
.await?;
let batches = collect(plan, session.task_ctx()).await?;
let expected = vec![
    "+----+-------+--------+-----------------------------+",
    "| id | value | value2 | __delta_rs_file_id__        |",
    "+----+-------+--------+-----------------------------+",
    "| 1  | a     |        | memory:///test_data.parquet |",
    "+----+-------+--------+-----------------------------+",
];
assert_batches_sorted_eq!(&expected, &batches);
assert!(matches!(
    batches[0].column(1).data_type(),
    DataType::Utf8View
));
assert!(matches!(
    batches[0].column(2).data_type(),
    DataType::Utf8View
));
can we not fulfil this test anymore?
I don't think get_read_plan() should be expected to produce Utf8View types. The next provider forces the Parquet read/predicate schema to base types to keep predicate pushdown stable, and view types are exposed only at the DeltaScan boundary.
I updated this by:
- keeping the missing-column NULL/backfill assertion in the get_read_plan() test, but dropping the Utf8View type assertions there
- adding a boundary-level assertion in KernelScanPlan that result_schema can have view types when enabled, while parquet_read_schema stays base types (crates/core/src/delta_datafusion/table_provider/next/scan/plan.rs)
// Always use base types for the Parquet read schema.
DataType::Utf8View => DataType::Utf8,
DataType::BinaryView => DataType::Binary,
I am so sorry to keep pestering about this - but we especially want them in the parquet read, no?
Why base types in parquet_read_schema are needed:
schema_force_view_types exists for the perf win, but pushdown/pruning correctness blocks it in DF 52.1.0. Utf8 == Utf8View is a hard error at the Arrow kernel level (arrow-rs #6283), and DF's physical expr adapter has known gaps around literal/cast rewriting (workaround at mod.rs:455-469, tests at :944/:1022).
For delta-rs, __delta_rs_file_id__ must be Dictionary<_, Utf8> because Arrow can't dictionary-pack to Utf8View (plan.rs:232-242, arrow-rs #9219). View-typing the read schema without addressing this reintroduces mixed base/view predicates on file-id.
Options I see:
- Keep current behavior: the Parquet read schema stays base for stable pruning/pushdown, and view types are applied at the DeltaScan boundary.
- Enable view types with a safety shim: keep file-id as Dictionary<_, Utf8>, adjust the expr adapter to normalize view literals against base columns, and add a regression test filtering on __delta_rs_file_id__ = Utf8View(...).
- Enable view types end-to-end: after arrow-rs #9220 ships, switch file-id to Dictionary<_, Utf8View> and remove mixed-schema handling.
let parquet_pushdown_enabled = scan_config.enable_parquet_pushdown
    && !config.is_feature_enabled(&TableFeature::RowTracking)
    && !config.is_feature_enabled(&TableFeature::DeletionVectors);
oh yes, this would have been a nice way to mess up data without this check!!
roeap
left a comment
I nailed it down to one line 😄
It may just be that the current upgrade changed some fundamental behaviour of the parquet reader, and then it is what it is, but trying to grok it, I still don't fully understand why this change is needed to upgrade the version.
It may also just be that we were in fact still using the old schema adapter approach by default in the parquet reader, but I think I read in the docs that the new mechanics would already be used by default in 51?
pushdown: TableProviderFilterPushDown::Inexact,
kernel_predicate: None,
parquet_predicate: Some(expr),
parquet_predicate: parquet_pushdown_enabled.then_some(expr),
Do we need to encode this twice? I.e., we also check the parquet-pushdown config when we create the scan, so I think it's fine to still pass the predicate if we can create it, and then make the decision as we set up the scan instead of here.
Agree on a single decision point. plan.rs is currently the source of truth for enable_parquet_pushdown and the DV/RowTracking gates. mod.rs has an extra RowTracking guard that doesn't mirror the config/DV checks. Removing the then_some(...) gate in plan.rs requires moving the full condition to the read-plan setup (keeping the has_selection_vectors guard).
Here are the options I can think of:
- Keep the plan.rs gate, and drop or convert the mod.rs RowTracking check to an invariant
- Always carry the predicate, centralize all gating in get_read_plan, and remove the redundant mod.rs check
result_schema.clone()
};
let parquet_read_schema = config.physical_arrow_schema(
let parquet_read_schema = config.parquet_file_schema(
Sorry for all the noise.
I went through this a bit more comprehensively, and ultimately it's this one-line change and its implications that got me worried. This is a fairly direct translation from delta kernel and might eventually contain additional metadata columns that need to get passed through that relate to specific features, so any transformation on this is worrisome. (We do transform the physical types as an optimization.) In any case this should not be processed based on partition columns; the "physical" specifically refers to being related only to physical data, and partition columns should have no role at this level.
The idea is that the schema we specify here flows through and gets amended. We do need to look into kernel expression evaluation to make fully sure we don't do additional costly transformations there.
Agreed - partition columns shouldn't influence the Parquet/physical read schema. I updated parquet_read_schema to be derived only from scan.physical_schema() and removed the partition-column-dependent wrapping from parquet_file_schema.
The only remaining transformation there is Utf8View/BinaryView -> Utf8/Binary to keep DF predicate pushdown stable, and partition columns are still injected above Parquet via the kernel transform path. This should also be safer if the kernel later adds additional metadata columns to the physical schema, since we're no longer keying off partition columns at this level.
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
We're going to clean more up in main
Description
Minimal upgrade to compile and pass existing tests on DF52.
Supersedes #4054.
Changes
Dependencies:
DF52 API migrations:
DF52/Arrow57 scalar compatibility:
Correctness fix:
DF52 enables batch coalescing by default, producing batches with rows from multiple files. Next-scan assumed one file per batch.
Test updates:
Follow Ups (PR Splits):
Verification:
cargo test -p deltalake-core --features datafusion --test integration_datafusion
Related Issue(s)