feat: add MapSort expression support for Spark 4.0 by andygrove · Pull Request #4076 · apache/datafusion-comet

andygrove · 2026-04-24T23:59:39Z

Which issue does this PR close?

Closes #1941
Closes #3171

Rationale for this change

Spark 4.0 introduces MapSort, used for normalizing map values when they appear in shuffle hash partitioning keys, in try_element_at, and in other contexts where map ordering must be deterministic. Without native support, queries that touch maps in any of these positions fall back to Spark, which forces the entire enclosing operator off Comet (e.g. an entire shuffle exchange).
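To illustrate why map ordering has to be deterministic for hash partitioning, here is a minimal sketch. It is an assumption-laden toy: `MapEntries` and both functions are hypothetical names, and Rust's `DefaultHasher` stands in for whatever hash the engine actually applies. Two maps with the same entries stored in a different physical order hash differently until their entries are sorted by key.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a map value: entries in physical storage order.
type MapEntries = Vec<(String, i32)>;

/// Hash the entries exactly as stored, so the result depends on entry order.
/// DefaultHasher is only a stand-in, not the hash used for real shuffles.
fn hash_in_storage_order(entries: &MapEntries) -> u64 {
    let mut hasher = DefaultHasher::new();
    entries.hash(&mut hasher);
    hasher.finish()
}

/// Sort a copy of the entries by key before hashing (the normalization step).
fn hash_after_key_sort(entries: &MapEntries) -> u64 {
    let mut sorted = entries.clone();
    sorted.sort_by(|a, b| a.0.cmp(&b.0));
    hash_in_storage_order(&sorted)
}
```

With the sort applied first, logically equal maps land in the same partition regardless of how their entries were laid out.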

What changes are included in this PR?

New native scalar function map_sort in native/spark-expr/src/map_funcs/map_sort.rs that sorts map entries by key in ascending order, registered via comet_scalar_funcs.rs.
Wire MapSort into the Spark 4.0 CometExprShim so the expression is converted to the new scalar function during serde.
The "columnar shuffle on map array element" test in CometColumnarShuffleSuite now expects shuffle fallback on Spark 4.0+: the new shuffle-key normalization wraps mapsort inside transform(arr, x -> mapsort(x)), and Comet does not currently support ArrayTransform with a lambda body. Answer correctness is still verified via checkSparkAnswer.
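The semantics described for map_sort (sort each map's entries by key in ascending order) can be sketched as follows. This is not the code in map_sort.rs, which presumably operates on Arrow arrays; it is a stdlib-only model in which `MapRow` and `map_sort_rows` are hypothetical names, a null map is modeled as `None`, and passing null rows through unchanged is an assumption.

```rust
/// One row of a map column: None models a null map,
/// Some holds the (key, value) entries in storage order.
type MapRow<K, V> = Option<Vec<(K, V)>>;

/// Sort the entries of every non-null row by key, ascending.
/// Null rows are passed through as null; empty maps stay empty.
fn map_sort_rows<K: Ord, V>(rows: Vec<MapRow<K, V>>) -> Vec<MapRow<K, V>> {
    rows.into_iter()
        .map(|row| {
            row.map(|mut entries| {
                // Compare on the key only; values travel with their keys.
                entries.sort_by(|a, b| a.0.cmp(&b.0));
                entries
            })
        })
        .collect()
}
```

For example, a column holding `{3: "c", 1: "a", 2: "b"}`, a null map, and an empty map would come back as `{1: "a", 2: "b", 3: "c"}`, null, and an empty map.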

How are these changes tested?

New unit tests in native/spark-expr/src/map_funcs/map_sort.rs cover sorting on each supported key type, null handling, and empty maps.
Existing CometColumnarShuffleSuite tests for map shuffle keys all pass under the Spark 4.0 profile (41/41).

Add native map_sort scalar function that sorts map entries by key in ascending order, and wire it up via the Spark 4.0 CometExprShim so that MapSort expressions are accelerated instead of falling back to Spark. Re-enable all CometColumnarShuffleSuite map tests that were skipped for Spark 4.0.

Closes apache#1941

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Spark 4.0 normalizes shuffle keys containing array<map> via transform(arr, x -> mapsort(x)), which Comet does not yet support because ArrayTransform with a lambda body has no serde. Mark the "columnar shuffle on map array element" test as expecting the fallback on Spark 4.0+ while still verifying answer correctness.
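A rough model of why a supported MapSort still ends in fallback when it sits inside an unsupported parent is sketched below. The `Expr` enum and `to_native` function are hypothetical simplifications, not Spark's or Comet's real classes: conversion returns `None` as soon as it meets a node without a serde.

```rust
/// Hypothetical, heavily simplified expression tree.
enum Expr {
    Column(String),
    LambdaVar(String),
    MapSort(Box<Expr>),
    ArrayTransform { input: Box<Expr>, lambda_body: Box<Expr> },
}

/// Returns a native expression string, or None to signal fallback to Spark.
fn to_native(expr: &Expr) -> Option<String> {
    match expr {
        Expr::Column(name) => Some(format!("col({})", name)),
        // In this sketch, lambda variables and ArrayTransform have no
        // conversion, so anything nested inside them is never reached.
        Expr::LambdaVar(_) | Expr::ArrayTransform { .. } => None,
        // MapSort converts only if its child converts.
        Expr::MapSort(child) => to_native(child).map(|c| format!("map_sort({})", c)),
    }
}
```

Under this model, map_sort applied directly to a map column converts, while transform(arr, x -> mapsort(x)) yields `None`, which corresponds to the shuffle falling back.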

The MapSort serde for Spark 4.0 called scalarFunctionExprToProto without a return type. The Rust planner then looked up "map_sort" in the session UDF registry to infer the type, but map_sort is only handled via the create_comet_physical_fun match dispatch, not registered as a UDF, causing "There is no UDF named 'map_sort' in the registry" at execution time (e.g., group-by on a map column in CollationSuite). Pass ms.dataType explicitly via scalarFunctionExprToProtoWithReturnType, matching the pattern used by ceil, floor, and other scalar functions.
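The failure mode described in this commit can be modeled with a small sketch. Everything here is hypothetical (the function name, representing types as strings, and the registry as a `HashMap` from function name to return type); it only mirrors the described control flow: an explicit return type short-circuits the lookup, while a missing one forces a registry lookup that fails for a function handled only by match dispatch.

```rust
use std::collections::HashMap;

/// Hypothetical model of return-type resolution for a scalar function call.
/// `udf_registry` maps registered function names to their return types.
fn resolve_return_type(
    func_name: &str,
    explicit_return_type: Option<&str>,
    udf_registry: &HashMap<String, String>,
) -> Result<String, String> {
    // When the serde side supplies the type, no registry lookup is needed.
    if let Some(t) = explicit_return_type {
        return Ok(t.to_string());
    }
    // Otherwise the type is inferred from the registry, which fails for
    // functions that are dispatched by name but never registered.
    udf_registry
        .get(func_name)
        .cloned()
        .ok_or_else(|| format!("There is no UDF named '{}' in the registry", func_name))
}
```

Calling it for "map_sort" with an empty registry and no explicit type reproduces the error text quoted above, and supplying the type makes the same call succeed.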

andygrove and others added 2 commits April 24, 2026 17:44

andygrove mentioned this pull request Apr 25, 2026: Expressions added via CometExprShim bypass the CometExpressionSerde framework #4077 (Open)

andygrove marked this pull request as ready for review April 25, 2026 00:11

andygrove wants to merge 3 commits into apache:main from andygrove:feat/map-sort-spark4
