feat: add MapSort expression support for Spark 4.0#4076
Open
andygrove wants to merge 3 commits into apache:main
Add native map_sort scalar function that sorts map entries by key in ascending order, and wire it up via the Spark 4.0 CometExprShim so that MapSort expressions are accelerated instead of falling back to Spark. Re-enable all CometColumnarShuffleSuite map tests that were skipped for Spark 4.0.
Closes apache#1941
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Spark 4.0 normalizes shuffle keys containing array<map> via transform(arr, x -> mapsort(x)), which Comet does not yet support because ArrayTransform with a lambda body has no serde. Mark the columnar shuffle on map array element test as expecting the fallback on Spark 4.0+ while still verifying answer correctness.
The MapSort serde for Spark 4.0 called scalarFunctionExprToProto without a return type. The Rust planner then looked up "map_sort" in the session UDF registry to infer the type, but map_sort is only handled via the create_comet_physical_fun match dispatch, not registered as a UDF, causing "There is no UDF named 'map_sort' in the registry" at execution time (e.g., group-by on a map column in CollationSuite). Pass ms.dataType explicitly via scalarFunctionExprToProtoWithReturnType, matching the pattern used by ceil, floor, and other scalar functions.
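To illustrate the failure mode described in this commit message, here is a toy model in Rust. It is not Comet's or DataFusion's code: `ToyType`, `ToyRegistry`, and `resolve_return_type` are hypothetical names invented for this sketch. It only shows why a name that exists solely in a match dispatch cannot have its return type inferred from a UDF registry, and why passing the type explicitly avoids the lookup.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a data type (not an Arrow or Comet type).
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int64,
    Map,
}

/// Hypothetical registry holding only functions registered as UDFs.
struct ToyRegistry {
    udfs: HashMap<String, ToyType>,
}

impl ToyRegistry {
    /// Look up the return type of a registered UDF by name.
    fn return_type_of(&self, name: &str) -> Result<ToyType, String> {
        self.udfs
            .get(name)
            .cloned()
            .ok_or_else(|| format!("There is no UDF named '{name}' in the registry"))
    }
}

/// Use the explicit return type when the serde layer supplied one;
/// otherwise fall back to inferring it from the registry.
fn resolve_return_type(
    registry: &ToyRegistry,
    name: &str,
    explicit: Option<ToyType>,
) -> Result<ToyType, String> {
    match explicit {
        Some(data_type) => Ok(data_type),
        None => registry.return_type_of(name),
    }
}
```

In this model, calling `resolve_return_type(&registry, "map_sort", None)` against a registry that does not contain `map_sort` returns the error, while passing `Some(ToyType::Map)` succeeds without touching the registry. That mirrors the fix of sending `ms.dataType` along with the expression.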
Which issue does this PR close?
Closes #1941
Closes #3171
Rationale for this change
Spark 4.0 introduces MapSort, used for normalizing map values when they appear in shuffle hash partitioning keys, in try_element_at, and in other contexts where map ordering must be deterministic. Without native support, queries that touch maps in any of these positions fall back to Spark, which forces the entire enclosing operator off Comet (e.g. an entire shuffle exchange).
What changes are included in this PR?
- Add a native scalar function map_sort in native/spark-expr/src/map_funcs/map_sort.rs that sorts map entries by key in ascending order, registered via comet_scalar_funcs.rs.
- Wire MapSort into the Spark 4.0 CometExprShim so the expression is converted to the new scalar function during serde.
- The "columnar shuffle on map array element" test in CometColumnarShuffleSuite now expects shuffle fallback on Spark 4.0+: the new shuffle-key normalization wraps mapsort inside transform(arr, x -> mapsort(x)), and Comet does not currently support ArrayTransform with a lambda body. Answer correctness is still verified via checkSparkAnswer.
How are these changes tested?
- Unit tests in native/spark-expr/src/map_funcs/map_sort.rs cover sorting on each supported key type, null handling, and empty maps.
- The CometColumnarShuffleSuite tests for map shuffle keys all pass under the Spark 4.0 profile (41/41).
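For readers unfamiliar with why map ordering matters for shuffle keys, the following is a minimal sketch in plain Rust. It is not the code in map_sort.rs, which operates on Arrow arrays. The `Entries` alias and the functions `map_sort_row`, `map_sort`, and `hash_entries` are hypothetical names for this illustration, and the null behavior shown (a null map stays null, null values travel with their keys) is an assumption rather than something stated in this PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One map value modelled as an ordered list of (key, value) entries.
/// A `None` value stands for a null map value (assumption for this sketch).
type Entries = Vec<(i32, Option<String>)>;

/// Sort the entries of a single map by key in ascending order.
fn map_sort_row(mut entries: Entries) -> Entries {
    entries.sort_by_key(|entry| entry.0);
    entries
}

/// Row-level wrapper: a null map (`None`) is passed through unchanged.
fn map_sort(row: Option<Entries>) -> Option<Entries> {
    row.map(map_sort_row)
}

/// Order-sensitive hash over the entries, standing in for the hash a
/// shuffle would compute over a partitioning key.
fn hash_entries(entries: &[(i32, Option<String>)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    entries.hash(&mut hasher);
    hasher.finish()
}
```

Two maps holding the same entries in different physical order are logically equal, but an order-sensitive hash may send them to different partitions. Hashing `map_sort_row(m)` instead of `m` removes that dependence on entry order, which is the normalization role the rationale above attributes to MapSort.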