feat: fix array_compact for Spark 4.0 and correct return type metadata#3796

Open
andygrove wants to merge 11 commits into apache:main from andygrove:fix-array-compact
Conversation

@andygrove
Member

Summary

Fixes several issues with the array_compact expression found during an audit:

  • Spark 4.0 support: On Spark 4.0, array_compact rewrites to KnownNotContainsNull(ArrayFilter(IsNotNull)). Because KnownNotContainsNull was not handled by Comet, the expression always fell back to Spark on 4.0. This PR adds a case in the Spark 4.0 shim that unwraps KnownNotContainsNull, serializes the inner ArrayFilter using DataFusion's array_remove_all, and propagates the containsNull=false return type correctly.

  • Return type bug: CometArrayCompact.convert was hardcoding ArrayType(elementType = elementType) as the return type passed to DataFusion, which always set containsNull=true. Changed to use expr.dataType so the correct nullability metadata is emitted — matching Spark's behavior where containsNull=false after compaction on Spark 4.0.

  • Incorrect Incompatible classification: CometArrayCompact.getSupportLevel was returning Incompatible(None) with no explanation. The implementation is semantically correct (DataFusion's array_remove_all(arr, null) behaves the same as Spark's ArrayFilter(IsNotNull)), so this is changed to Compatible().

  • SQL test improvements: Added string, double, and array<array<int>> (nested) type coverage; changed spark_answer_only to query to verify native execution.

  • Remove Scala test skip: The assume(!isSpark40Plus) guard with a // TODO fix for Spark 4.0.0 comment is removed now that Spark 4.0 is supported.

  • Docs: Updated expressions.md to mark ArrayCompact as supported.
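The Spark 4.0 unwrapping described above can be sketched with simplified stand-ins for Spark's expression tree. These minimal enum variants are hypothetical and only mirror the node names; the real shim works on Spark's Catalyst `Expression` classes:

```rust
// Minimal, hypothetical stand-ins for Spark's expression nodes.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    IsNotNull(Box<Expr>),
    ArrayFilter(Box<Expr>, Box<Expr>),
    KnownNotContainsNull(Box<Expr>),
}

// On Spark 4.0, array_compact(col) rewrites to
// KnownNotContainsNull(ArrayFilter(col, IsNotNull(..))); the shim unwraps
// the outer node and hands the inner filter to the existing serde path.
fn unwrap_array_compact(expr: &Expr) -> Option<&Expr> {
    if let Expr::KnownNotContainsNull(inner) = expr {
        if let Expr::ArrayFilter(_, pred) = inner.as_ref() {
            if matches!(pred.as_ref(), Expr::IsNotNull(_)) {
                return Some(inner.as_ref());
            }
        }
    }
    None // anything else stays on Spark's fallback path
}

fn main() {
    let filter = Expr::ArrayFilter(
        Box::new(Expr::Column("a".to_string())),
        Box::new(Expr::IsNotNull(Box::new(Expr::Column("x".to_string())))),
    );
    let rewritten = Expr::KnownNotContainsNull(Box::new(filter.clone()));
    assert_eq!(unwrap_array_compact(&rewritten), Some(&filter));
    assert_eq!(unwrap_array_compact(&Expr::Column("a".to_string())), None);
    println!("unwrap ok");
}
```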

- Handle KnownNotContainsNull wrapper in Spark 4.0 shim so that
  array_compact runs natively on Spark 4.0 (previously always fell back)
- Fix return type passed to DataFusion: use expr.dataType instead of
  hardcoded ArrayType(elementType) so that containsNull=false is
  correctly propagated on Spark 4.0
- Mark CometArrayCompact as Compatible() instead of Incompatible(None)
- Expand SQL test coverage: add string, double, and nested array types;
  change spark_answer_only to query to verify native execution
- Remove assume(!isSpark40Plus) skip from Scala test
- Update expressions.md: ArrayCompact is now supported
…DataFusion

DataFusion's array_remove_all function always returns a list with nullable
elements (nullable=true), but array_compact's Spark dataType has
containsNull=false. Passing the Spark type as the promised return type caused
a runtime type-mismatch assertion in DataFusion's ScalarFunctionExpr. Fix by
passing ArrayType(elementType, containsNull=true) in both the Spark 3.x
CometArrayCompact serde and the Spark 4.0 KnownNotContainsNull shim.
@andygrove andygrove marked this pull request as ready for review March 26, 2026 03:01
object CometArrayCompact extends CometExpressionSerde[Expression] {

-  override def getSupportLevel(expr: Expression): SupportLevel = Incompatible(None)
+  override def getSupportLevel(expr: Expression): SupportLevel = Compatible()
Contributor
I think we can remove getSupportLevel entirely

Member Author
Removed

andygrove and others added 5 commits April 8, 2026 12:26
- arrays.scala: use spark_array_compact UDF (main's approach) instead of
  array_remove_all with containsNull=true workaround, since DF 53 changed
  array_remove_all to return NULL when element arg is NULL
- expressions.md: both ArrayAppend and ArrayCompact are Spark-compatible (Yes)
- array_compact.sql: keep multi-column table queries from HEAD, merge in
  additional literal tests (double, nested) from main

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Arrow's NullArray::nulls() returns None, causing is_null() to return
false for all elements. Use logical_nulls() instead, which correctly
reports all elements as null for NullArray, fixing array_compact for
array<void> inputs like array_compact(array(NULL, NULL, NULL)).
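A toy model of the distinction described in this commit (the real `NullArray`, `nulls()`, and `logical_nulls()` live in arrow-rs; this sketch only mimics the behavior, not the actual API):

```rust
// Toy model: a NullArray carries no physical validity buffer, so nulls()
// is None, yet every element is logically null.
struct NullArray {
    len: usize,
}

impl NullArray {
    // Physical validity buffer: absent for NullArray.
    fn nulls(&self) -> Option<Vec<bool>> {
        None
    }
    // Logical nulls: every slot is null.
    fn logical_nulls(&self) -> Option<Vec<bool>> {
        Some(vec![true; self.len])
    }

    // Buggy check: with no physical buffer, nothing looks null.
    fn is_null_physical(&self, i: usize) -> bool {
        self.nulls().map_or(false, |b| b[i])
    }
    // Fixed check: consult logical nulls instead.
    fn is_null_logical(&self, i: usize) -> bool {
        self.logical_nulls().map_or(false, |b| b[i])
    }
}

fn main() {
    // array_compact(array(NULL, NULL, NULL)) sees an array<void> input.
    let arr = NullArray { len: 3 };
    assert!(!arr.is_null_physical(0)); // bug: element not seen as null
    assert!(arr.is_null_logical(0));   // fix: correctly reported as null
    println!("null handling ok");
}
```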
@andygrove
Member Author

andygrove commented Apr 13, 2026

Thanks for the approval @kazuyukitanimura. I had to make more changes after unmerging. Could you take another look?

edit: there are CI failures now ... let me fix those first

…ll(ArrayFilter(IsNotNull))

DataFusion's array_remove_all(arr, null) returns NULL for the whole row
on DF 53. Use spark_array_compact instead, matching what
CometArrayCompact.convert already does for Spark 3.x.
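The semantic gap driving this commit can be sketched in plain Rust, modeling arrays as `Vec<Option<i32>>`. The function names are illustrative stand-ins, not the real UDF implementations:

```rust
// Spark's array_compact semantics: drop null elements, keep the rest.
fn spark_array_compact(arr: &[Option<i32>]) -> Vec<Option<i32>> {
    arr.iter().filter(|v| v.is_some()).cloned().collect()
}

// DF 53's array_remove_all(arr, element): when the element argument is
// NULL, the whole result row is NULL (modeled here as None).
fn array_remove_all_df53(arr: &[Option<i32>], elem: Option<i32>) -> Option<Vec<Option<i32>>> {
    let e = elem?; // NULL element argument => NULL result
    Some(arr.iter().filter(|v| **v != Some(e)).cloned().collect())
}

fn main() {
    let input = vec![Some(1), None, Some(2)];
    // What array_compact must produce: nulls removed.
    assert_eq!(spark_array_compact(&input), vec![Some(1), Some(2)]);
    // What array_remove_all(arr, null) now produces: a NULL row.
    assert_eq!(array_remove_all_df53(&input, None), None);
    println!("semantics ok");
}
```

This is why the shim switched to `spark_array_compact`: on DF 53, `array_remove_all(arr, null)` no longer approximates compaction at all.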
# Conflicts:
#	docs/source/user-guide/latest/expressions.md
#	native/Cargo.lock
#	spark/src/main/scala/org/apache/comet/serde/arrays.scala
#	spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala
# Conflicts:
#	docs/source/user-guide/latest/expressions.md