
chore: Remove config option for native_iceberg_compat #4019

Draft

andygrove wants to merge 12 commits into apache:main from andygrove:remove-iceberg-compat

Conversation

@andygrove (Member) commented Apr 21, 2026

Which issue does this PR close?

Partially addresses #4020.

Rationale for this change

Removing support for native_iceberg_compat reduces the maintenance burden of the project.

This PR makes it impossible for users to select native_iceberg_compat and stops running tests for that scan impl.

Subsequent PRs will remove the implementation code.

What changes are included in this PR?

  • COMET_NATIVE_SCAN_IMPL config now only allows auto or native_datafusion
  • CometScanRule no longer reads the COMET_NATIVE_SCAN_IMPL config option and always uses the native_datafusion scan
  • Tests have been updated to stop testing with native_iceberg_compat
  • Tests that were specific to native_iceberg_compat have been removed
  • Golden files regenerated - we no longer generate different versions per scan impl
  • Updated documentation
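As a rough illustration of the first two bullets, the restriction could look like the following sketch. This is not Comet's actual config code; the config key name, normalization, and error wording are assumptions for illustration only.

```java
// Hedged sketch, not Comet's actual config implementation: after this PR the
// scan-impl config accepts only "auto" or "native_datafusion".
// The key name "spark.comet.scan.impl" is an assumption for illustration.
import java.util.Set;

public class NativeScanImplConfig {
    static final String KEY = "spark.comet.scan.impl"; // assumed key name
    static final Set<String> ALLOWED = Set.of("auto", "native_datafusion");

    // Returns the normalized value if allowed, otherwise throws.
    static String checkValue(String value) {
        String v = value.toLowerCase();
        if (!ALLOWED.contains(v)) {
            throw new IllegalArgumentException(
                "Invalid value '" + value + "' for " + KEY
                    + "; allowed values: " + ALLOWED);
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(checkValue("native_datafusion"));
        try {
            checkValue("native_iceberg_compat"); // no longer selectable
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: native_iceberg_compat");
        }
    }
}
```

With validation like this, any job that still sets the removed value fails fast at configuration time rather than silently falling back.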

How are these changes tested?

andygrove marked this pull request as ready for review April 21, 2026 15:35
@comphead (Contributor)
Thanks @andygrove 👀

@mbutrovich (Contributor) left a comment

Thanks @andygrove, let's get that CI matrix and tech debt down.

@comphead (Contributor) left a comment

LGTM. I was wondering whether we need it for some debug/benchmark use cases, but nothing comes to mind, so let it go.

@parthchandra (Contributor)

Is there still a functionality gap between native_iceberg_compat and native_datafusion? I see some tests with IgnoreCometNativeDataFusion in the diff files.

@andygrove (Member, author) commented Apr 21, 2026

> Is there still a functionality gap between native_iceberg_compat and native_datafusion? I see some tests with IgnoreCometNativeDataFusion in the diff files.

```
% grep "IgnoreCometNativeDataFusion(\"" dev/diffs/3.5.8.diff
+    IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3442")) {
+    IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
+    IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
+    IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
+    IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
+    IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
+      IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
+    IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
```

So #3720 and #3442. We should be able to resolve #3442 now that #4011 is merged.

@andygrove (Member, author)

#3443 is fixed in #4038, so that just leaves #3720, which is just that Comet is more lenient with type widening and/or throws a different exception than Spark. Not a correctness issue.

@andygrove (Member, author)
@parthchandra ok if we merge this one?

@parthchandra (Contributor)

My general recommendation would be that we enable the ignored tests before dropping native_iceberg_compat.
Also, the description of #3720 seems to indicate that the issue is more complex than just mismatched error messages (though I can be convinced that it is not a serious problem after all). We do have the framework to match Spark error messages, but perhaps it is not necessary for error messages to match exactly.

Here's Claude's summary of ignored tests:

1. IgnoreCometNativeDataFusion — skipped for native_datafusion and auto

AdaptiveQueryExecSuite (#3321)

| Test Name | Diffs |
| --- | --- |
| join key with multiple references on the filtering plan | 4.0.1 |
| SPARK-43402: FileSourceScanExec supports push down data filter with scalar subquery | 4.0.1 |
| alter temporary view should follow current storeAnalyzedPlanForView config | 4.0.1 |

AdaptiveQueryExecSuite (#3442)

| Test Name | Diffs |
| --- | --- |
| static scan metrics | 3.4.3, 3.5.8, 4.0.1 |

FileBasedDataSourceSuite (#3321)

| Test Name | Diffs |
| --- | --- |
| Enabling/disabling ignoreMissingFiles using parquet (conditionally tagged only when format == "parquet") | 4.0.1 |
| Enabling/disabling ignoreCorruptFiles | 4.0.1 |

ParquetFilterSuite (3.4.3 only)

| Test Name | Diffs |
| --- | --- |
| filter pushdown - StringPredicate (tagged IgnoreCometNativeDataFusion in 3.4.3; IgnoreCometNativeScan in 3.5.8/4.0.1) | 3.4.3 |

ParquetSchemaSuite (#3720)

| Test Name | Diffs |
| --- | --- |
| SPARK-35640: read binary as timestamp should throw schema incompatible error | 3.4.3, 3.5.8, 4.0.1 |
| SPARK-35640: int as long should throw schema incompatible error | 3.4.3, 3.5.8 |
| SPARK-47447: read TimestampLTZ as TimestampNTZ | 4.0.1 |
| SPARK-36182: can't read TimestampLTZ as TimestampNTZ | 3.4.3, 3.5.8 |
| SPARK-34212 Parquet should read decimals correctly | 3.4.3, 3.5.8, 4.0.1 |
| row group skipping doesn't overflow when reading into larger type | 3.4.3, 3.5.8, 4.0.1 |

ParquetSchemaEvolutionSuite (#3720)

| Test Name | Diffs |
| --- | --- |
| schema mismatch failure error message for parquet vectorized reader | 3.4.3, 3.5.8, 4.0.1 |
| SPARK-45604: schema mismatch failure error on timestamp_ntz to array<timestamp_ntz> | 3.4.3, 3.5.8, 4.0.1 |

ParquetTypeWideningSuite (#3321)

| Test Name | Diffs |
| --- | --- |
| parquet widening conversion DateType -> TimestampNTZType (conditionally tagged) | 4.0.1 |
| unsupported parquet conversion $fromType -> $toType (multiple type combos) | 4.0.1 |
| unsupported parquet timestamp conversion $fromType ($outputTimestampType) -> $toType | 4.0.1 |
| parquet decimal precision change Decimal($fromPrecision, 2) -> Decimal($toPrecision, 2) | 4.0.1 |
| parquet decimal precision and scale change Decimal($fromPrecision, $fromScale) -> Decimal($toPrecision, $toScale) | 4.0.1 |

2. assume() — runtime skip

ParquetRowIndexSuite (#3886) — 4.0.1 only

| Test Name | Condition |
| --- | --- |
| invalid row index column type - ${conf.desc} | Skipped when COMET_NATIVE_SCAN_IMPL is SCAN_NATIVE_DATAFUSION or SCAN_AUTO. Comet throws RuntimeException instead of SparkException. |

CometExpressionSuite — Comet's own test suite

| Test Name | Condition |
| --- | --- |
| get_struct_field - select primitive fields | Skipped when scanImpl == SCAN_AUTO && Spark 4.0+ |
| get_struct_field - select subset of struct | Skipped when scanImpl == SCAN_AUTO && Spark 4.0+ |
| get_struct_field - read entire struct | Skipped when scanImpl == SCAN_AUTO && Spark 4.0+ |
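The runtime-skip mechanism above differs from the compile-time tags in section 1: an exclusion tag keeps the test from running at all, while assume() evaluates a condition during the run and cancels (rather than fails) the test. The sketch below illustrates that distinction; it is not the actual Comet/Spark test code, and the names (assumeThat, runTest, the scan-impl strings) are illustrative stand-ins for ScalaTest's assume().

```java
// Hedged sketch, not actual Comet/Spark test code: a minimal stand-in for an
// assume()-style runtime skip. Cancellation is modeled as a distinct exception
// type so the harness can report "canceled" instead of "failed".
public class AssumeSketch {
    static class TestCanceledException extends RuntimeException {
        TestCanceledException(String msg) { super(msg); }
    }

    // Minimal stand-in for ScalaTest's assume(): cancel rather than fail.
    static void assumeThat(boolean condition, String clue) {
        if (!condition) {
            throw new TestCanceledException("test canceled: " + clue);
        }
    }

    // Returns "ran" if the test body executed, or the cancellation message.
    static String runTest(String scanImpl) {
        try {
            assumeThat(!scanImpl.equals("native_datafusion"),
                "Comet throws RuntimeException instead of SparkException");
            return "ran";
        } catch (TestCanceledException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(runTest("native_datafusion")); // canceled at runtime
        System.out.println(runTest("jvm"));               // body runs
    }
}
```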

Summary by Tracking Issue

| Issue | Count | Description |
| --- | --- | --- |
| #3321 | ~12 | Schema evolution, corrupt/missing files, AQE, type widening |
| #3720 | ~8 | Schema mismatch errors, decimal reads, row group skipping |
| #3442 | 1 | Static scan metrics with DPP |
| #3886 | 1 | Row index column type error type mismatch |
| (no issue) | 5 | Filter pushdown / accumulator tests (IgnoreCometNativeScan) |
| (no issue) | 3 | get_struct_field tests (auto + Spark 4.0+ only) |

@parthchandra (Contributor)
More Claude analysis on schema mismatch. Claude recommends we explicitly check that the following tests fail with a different message, instead of actually succeeding (because the results will be wrong):

| Description | Test Name(s) | Diffs |
| --- | --- | --- |
| binary -> timestamp | SPARK-35640: read binary as timestamp should throw schema incompatible error | 3.4.3, 3.5.8, 4.0.1 |
| timestamp_ntz -> array<timestamp_ntz> | SPARK-45604: schema mismatch failure error on timestamp_ntz to array<timestamp_ntz> | 3.4.3, 3.5.8, 4.0.1 |
| string read as int | schema mismatch failure error message for parquet vectorized reader | 3.4.3, 3.5.8, 4.0.1 |
| decimal precision/scale overflow | SPARK-34212 Parquet should read decimals correctly | 3.4.3, 3.5.8, 4.0.1 |
| int->bigint row group overflow | row group skipping doesn't overflow when reading into larger type | 3.4.3, 3.5.8, 4.0.1 |
| int read as long (schema evolution disabled) | SPARK-35640: int as long should throw schema incompatible error | 3.4.3, 3.5.8 |
| TimestampLTZ -> TimestampNTZ | SPARK-47447: read TimestampLTZ as TimestampNTZ (4.0.1); SPARK-36182: can't read TimestampLTZ as TimestampNTZ (3.4.3, 3.5.8) | 3.4.3, 3.5.8, 4.0.1 |

@parthchandra (Contributor)

One final word from Claude:
We lose partial acceleration for queries that can't use native_datafusion:

  • Queries using input_file_name() — JVM scan + native downstream ops
  • Queries with metadata columns — JVM scan + native downstream ops
  • Queries with ignoreMissingFiles=true — JVM scan + native downstream ops
  • Queries with row indexes or field IDs — JVM scan + native downstream ops
    (I was under the impression these had been implemented.)

andygrove marked this pull request as draft April 23, 2026 12:34
@andygrove (Member, author)
> My general recommendation would be that we enable ignored tests before dropping native_iceberg_compat. Also, the description of #3720 seems to indicate that issue is more complex than just mismatched error messages (I can be convinced that it is not a serious problem after all). We do have the framework to match Spark error messages but perhaps it is not necessary to have error messages match exactly.

Alright, I have moved this to draft for now. I'll take another more detailed look at #3720 soon. Thanks @parthchandra.
