Skip to content

[Feature] Support Spark expression: sort_array #3159

@andygrove

Description

@andygrove

What is the problem the feature request solves?

Note: This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.

Comet does not currently support the Spark sort_array function, causing queries using this function to fall back to Spark's JVM execution instead of running natively on DataFusion.

The SortArray expression sorts the elements of an array in either ascending or descending order. It returns a new array with the same elements sorted according to the specified ordering, with null values handled according to a consistent null-first or null-last policy.

Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.

Describe the potential solution

Spark Specification

Syntax:

sort_array(array[, ascendingOrder])
sort_array(col("array_column"))
sort_array(col("array_column"), lit(false)) // descending order

Arguments:

Argument Type Description
base ArrayType The input array to be sorted
ascendingOrder BooleanType Optional. If true (default), sorts in ascending order; if false, sorts in descending order. Must be a foldable expression (constant).

Return Type: Returns an ArrayType with the same element type and nullability as the input array.

Supported Data Types:
Supports arrays containing any orderable data types:

  • Numeric types (IntegerType, LongType, DoubleType, FloatType, etc.)
  • String types (StringType)
  • Date and timestamp types
  • Binary types
  • Does NOT support arrays containing non-orderable types like MapType or complex nested structures

Edge Cases:

  • Null arrays: Returns null if the input array is null
  • Empty arrays: Returns an empty array of the same type
  • Arrays with all nulls: Returns an array with all nulls in the same positions (nulls are equal in comparison)
  • Mixed null and non-null elements: Nulls are consistently placed at the beginning (ascending) or end (descending)
  • Non-foldable ascendingOrder: Throws DataTypeMismatch error - the ordering parameter must be a constant
  • Non-orderable element types: Throws DataTypeMismatch error during type checking

Examples:

-- Basic ascending sort (default)
SELECT sort_array(array(3, 1, 4, 1, 5)) AS sorted;
-- Result: [1, 1, 3, 4, 5]

-- Descending sort
SELECT sort_array(array('d', 'c', 'b', 'a', null), false) AS sorted_desc;
-- Result: ['d', 'c', 'b', 'a', null]

-- With null values (ascending)
SELECT sort_array(array(3, null, 1, null, 2)) AS sorted_with_nulls;
-- Result: [null, null, 1, 2, 3]
// DataFrame API usage
import org.apache.spark.sql.functions._

df.select(sort_array(col("numbers"))).show()
df.select(sort_array(col("strings"), lit(false))).show()

// With column reference for array
df.withColumn("sorted_values", sort_array(col("value_array"))).show()

Implementation Approach

See the Comet guide on adding new expressions for detailed instructions.

  1. Scala Serde: Add expression handler in spark/src/main/scala/org/apache/comet/serde/
  2. Register: Add to appropriate map in QueryPlanSerde.scala
  3. Protobuf: Add message type in native/proto/src/proto/expr.proto if needed
  4. Rust: Implement in native/spark-expr/src/ (check if DataFusion has built-in support first)

Additional context

Difficulty: Medium
Spark Expression Class: org.apache.spark.sql.catalyst.expressions.SortArray

Related:

  • array_sort - Alternative function name in some Spark versions
  • array_max, array_min - For finding extremes without full sorting
  • shuffle - For randomizing array element order
  • reverse - For reversing array element order

This issue was auto-generated from Spark reference documentation.

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions