
chore: audit array_intersect and expand SQL test coverage#4071

Open
andygrove wants to merge 4 commits into apache:main from andygrove:audit-array-intersect

Conversation

@andygrove
Member

Which issue does this PR close?

Closes #.

Rationale for this change

ArrayIntersect was marked Incompatible(None) with no explanation and had minimal test coverage: one SQL file exercising only array<int>, plus three Scala assertions. The audit surfaced three concerns that are worth making explicit:

  1. DataFusion's array_intersect builds its hash lookup from the shorter input and probes the longer one, so element order in the result follows the longer side. Spark follows the left side. Results therefore diverge when the right argument is longer than the left, but no test exercised this case and the Incompatible marker had no reason text, so users had to set allowIncompatible=true on faith.
  2. Neither getIncompatibleReasons() nor getUnsupportedReasons() was overridden, so even once a reason is added to getSupportLevel it would not show up in the generated Compatibility Guide.
  3. Spark 4.0 introduces collated StringType. DataFusion compares rows by bytes, so array_intersect on a column with non-UTF8_BINARY collation would silently return a wrong answer instead of falling back.
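The ordering divergence in point 1 can be modeled in a few lines of Python. This is an illustrative sketch of the two strategies as described above, not the actual Spark or DataFusion implementation; the function names are hypothetical.

```python
def spark_array_intersect(left, right):
    # Spark: keep the first occurrence of each common element, in LEFT order.
    right_set = set(right)
    seen, out = set(), []
    for x in left:
        if x in right_set and x not in seen:
            out.append(x)
            seen.add(x)
    return out

def datafusion_array_intersect(left, right):
    # DataFusion: build the hash lookup from the SHORTER input and probe the
    # LONGER one, so the result follows the longer side's order.
    shorter, longer = (left, right) if len(left) <= len(right) else (right, left)
    shorter_set = set(shorter)
    seen, out = set(), []
    for x in longer:
        if x in shorter_set and x not in seen:
            out.append(x)
            seen.add(x)
    return out

left, right = [3, 1, 2], [2, 9, 1, 3, 7]       # right is longer than left
print(spark_array_intersect(left, right))       # [3, 1, 2] -- left order
print(datafusion_array_intersect(left, right))  # [2, 1, 3] -- right (longer) order
```

Both engines return the same set of elements; only the order differs, and only when the right argument is longer than the left.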

What changes are included in this PR?

  • Serde (CometArrayIntersect): add a reason string describing the probe-order divergence and surface it through both getSupportLevel and getIncompatibleReasons. Return Unsupported when the element type (recursively) contains a StringType with non-default collation, and surface that reason through getUnsupportedReasons too.
  • Shim (CometTypeShim): add hasNonDefaultStringCollation(dt) that walks nested ArrayType / MapType / StructType. The Spark 4.0 variant returns true for any StringType whose collationId != UTF8_BINARY_COLLATION_ID; the Spark 3.x variant is stubbed to false.
  • SQL test (array_intersect.sql): extend from 4 int-only queries to ~20 queries covering boolean, tinyint, smallint, bigint, float, double (NaN / ±Infinity / -0), decimal, date, timestamp, string (incl. multibyte UTF-8), binary, nested array<array<int>>, duplicate elements, self-intersection, all-NULL and both-NULL arrays, CASE WHEN inputs, and a right-longer-than-left query wrapped in sort_array so the documented ordering divergence stays a stable assertion.
  • Support doc: add audit sub-bullets under array_intersect in spark_expressions_support.md for Spark 3.4.3 / 3.5.8 / 4.0.1 dated 2026-04-24, noting the ordering incompatibility and the 4.0 collation fallback.
  • Skill (audit-comet-expression): add a reminder to the skill instructions that whenever getSupportLevel returns a reason, the corresponding getIncompatibleReasons() / getUnsupportedReasons() override must also be added so the Compatibility Guide picks it up.
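The shape of the recursive collation walk can be sketched as follows. This is a toy Python model only; the real helper is Scala on CometTypeShim, and the type classes and the default-collation constant here are stand-ins for their Spark counterparts.

```python
from dataclasses import dataclass

UTF8_BINARY_COLLATION_ID = 0  # stand-in for Spark's default collation id

@dataclass
class StringType:
    collation_id: int = UTF8_BINARY_COLLATION_ID

@dataclass
class ArrayType:
    element_type: object

@dataclass
class MapType:
    key_type: object
    value_type: object

@dataclass
class StructType:
    field_types: list

def has_non_default_string_collation(dt) -> bool:
    """Recursively check whether dt contains a collated (non-default) string."""
    if isinstance(dt, StringType):
        return dt.collation_id != UTF8_BINARY_COLLATION_ID
    if isinstance(dt, ArrayType):
        return has_non_default_string_collation(dt.element_type)
    if isinstance(dt, MapType):
        return (has_non_default_string_collation(dt.key_type)
                or has_non_default_string_collation(dt.value_type))
    if isinstance(dt, StructType):
        return any(has_non_default_string_collation(t) for t in dt.field_types)
    return False  # non-string leaf types never carry collation

print(has_non_default_string_collation(ArrayType(StringType())))                 # False
print(has_non_default_string_collation(ArrayType(StructType([StringType(5)]))))  # True
```

The Spark 3.x shim variant can skip the walk entirely and return false, since collated strings do not exist before 4.0.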

How are these changes tested?

CometSqlFileTestSuite [array_intersect] and CometArrayExpressionSuite (39 tests) both pass locally. make succeeds on the Spark 3.4, 3.5, and 4.0 profiles, and cargo clippy --all-targets --workspace -- -D warnings is clean.

Audit the ArrayIntersect expression against Spark 3.4.3, 3.5.8, and 4.0.1. The
only cross-version change is a cosmetic argument added to an internal error
path in 4.0; the runtime semantics are unchanged.

Record why the expression is Incompatible: DataFusion's array_intersect probes
the longer of the two inputs, so element order can diverge from Spark when the
right argument is longer than the left. Expose this reason through
getSupportLevel and also getIncompatibleReasons so it appears in the generated
Compatibility Guide.

Add a Spark 4.0 fallback for collated string inputs: DataFusion compares rows
by bytes and cannot honour collation, so getSupportLevel returns Unsupported
when the element type (recursively) contains a StringType with non-UTF8_BINARY
collation. A new hasNonDefaultStringCollation helper on CometTypeShim is
stubbed to false in Spark 3.x.

Expand array_intersect.sql from four int-only queries to coverage for boolean,
tinyint, smallint, bigint, float, double (incl. NaN, Infinity, -Infinity, -0),
decimal, date, timestamp, string (incl. multibyte UTF-8), binary, nested
array<array<int>>, duplicate elements, self-intersection, all-NULL and
both-NULL arrays, CASE-WHEN inputs, and a right-longer-than-left case wrapped
in sort_array to normalise the documented ordering divergence.
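Why wrapping in sort_array makes the assertion stable, in illustrative Python terms (the two result values below are hypothetical, chosen to match the ordering behavior described above):

```python
def sort_array(arr):
    # Stand-in for Spark's sort_array(arr): ascending sort normalizes away
    # any engine-specific element ordering.
    return sorted(arr)

spark_result = [3, 1, 2]       # follows the left input's order
datafusion_result = [2, 1, 3]  # follows the longer (right) input's order
print(sort_array(spark_result) == sort_array(datafusion_result))  # True
```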

Update the audit-comet-expression skill to remind future audits to override
both getIncompatibleReasons and getUnsupportedReasons whenever getSupportLevel
returns a reason, so that the Compatibility Guide picks them up.
# Conflicts:
#	common/src/main/spark-4.0/org/apache/comet/shims/CometTypeShim.scala
@andygrove andygrove marked this pull request as draft April 24, 2026 21:18
@andygrove andygrove marked this pull request as ready for review April 24, 2026 22:21
Spark's QueryTest.prepareRow only converts top-level Array[_] to Seq, so
inner Array[Byte] elements from array<binary> columns keep their default
Java toString ([B@<hex>). The toString-based sort in prepareAnswer then
produces non-deterministic orderings between Spark and Comet runs, causing
spurious mismatches. Recurse into nested arrays so toString is stable.
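A Python analogy for the fix (the actual code is Spark's Scala QueryTest; JavaByteArray and prepare_row here are hypothetical stand-ins for byte[] and prepareRow):

```python
class JavaByteArray:
    """Stand-in for a JVM byte[] whose default toString is '[B@<identityHash>'."""
    def __init__(self, data):
        self.data = bytes(data)
    def __str__(self):
        return f"[B@{id(self):x}"  # identity-based, differs between runs

def prepare_row(value):
    # Recurse into nested sequences, converting byte arrays to a stable form
    # so that a toString-based sort key is deterministic.
    if isinstance(value, JavaByteArray):
        return list(value.data)
    if isinstance(value, (list, tuple)):
        return [prepare_row(v) for v in value]
    return value

rows = [[JavaByteArray(b"cd")], [JavaByteArray(b"ab")]]
# Without the recursion, the sort keys are identity strings like '[B@7f3a...'
# and the ordering is unstable; after prepare_row, str() is deterministic.
stable = sorted((prepare_row(r) for r in rows), key=str)
print(stable)  # [[[97, 98]], [[99, 100]]]
```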
