
chore: audit array_intersect and expand SQL test coverage#4071

Open
andygrove wants to merge 4 commits into apache:main from andygrove:audit-array-intersect

Conversation

@andygrove
Member

Which issue does this PR close?

Closes #.

Rationale for this change

ArrayIntersect was marked Incompatible(None) with no explanation and had minimal test coverage: one SQL file exercising only array<int>, plus three Scala assertions. The audit surfaced three concerns that are worth making explicit:

  1. DataFusion's array_intersect builds its hash lookup from the shorter input and probes the longer one, so element order in the result follows the longer side. Spark follows the left side. Results therefore diverge when the right argument is longer than the left, but no test exercised this case and the Incompatible marker had no reason text, so users had to set allowIncompatible=true on faith.
  2. Neither getIncompatibleReasons() nor getUnsupportedReasons() was overridden, so even once a reason is added to getSupportLevel it would not show up in the generated Compatibility Guide.
  3. Spark 4.0 introduces collated StringType. DataFusion compares rows by bytes, so array_intersect on a column with non-UTF8_BINARY collation would silently return a wrong answer instead of falling back.
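The ordering divergence in point 1 can be modeled in a few lines of Python. This is an illustrative sketch of the two strategies as described above, not the actual Spark or DataFusion implementation; the function names are hypothetical.

```python
def spark_array_intersect(left, right):
    # Spark: keep the first occurrence of each common element, in LEFT order.
    right_set = set(right)
    seen, out = set(), []
    for x in left:
        if x in right_set and x not in seen:
            out.append(x)
            seen.add(x)
    return out

def datafusion_array_intersect(left, right):
    # DataFusion: build the hash lookup from the SHORTER input and probe the
    # LONGER one, so the result follows the longer side's order.
    shorter, longer = (left, right) if len(left) <= len(right) else (right, left)
    shorter_set = set(shorter)
    seen, out = set(), []
    for x in longer:
        if x in shorter_set and x not in seen:
            out.append(x)
            seen.add(x)
    return out

left, right = [3, 1, 2], [2, 9, 1, 3, 7]       # right is longer than left
print(spark_array_intersect(left, right))       # [3, 1, 2] -- left order
print(datafusion_array_intersect(left, right))  # [2, 1, 3] -- right (longer) order
```

Both engines return the same set of elements; only the order differs, and only when the right argument is longer than the left.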

What changes are included in this PR?

  • Serde (CometArrayIntersect): add a reason string describing the probe-order divergence and surface it through both getSupportLevel and getIncompatibleReasons. Return Unsupported when the element type (recursively) contains a StringType with non-default collation, and surface that reason through getUnsupportedReasons too.
  • Shim (CometTypeShim): add hasNonDefaultStringCollation(dt) that walks nested ArrayType / MapType / StructType. The Spark 4.0 variant returns true for any StringType whose collationId != UTF8_BINARY_COLLATION_ID; the Spark 3.x variant is stubbed to false.
  • SQL test (array_intersect.sql): extend from 4 int-only queries to ~20 queries covering boolean, tinyint, smallint, bigint, float, double (NaN / ±Infinity / -0), decimal, date, timestamp, string (incl. multibyte UTF-8), binary, nested array<array<int>>, duplicate elements, self-intersection, all-NULL and both-NULL arrays, CASE WHEN inputs, and a right-longer-than-left query wrapped in sort_array so the documented ordering divergence stays a stable assertion.
  • Support doc: add audit sub-bullets under array_intersect in spark_expressions_support.md for Spark 3.4.3 / 3.5.8 / 4.0.1 dated 2026-04-24, noting the ordering incompatibility and the 4.0 collation fallback.
  • Skill (audit-comet-expression): add a reminder to the skill instructions that whenever getSupportLevel returns a reason, the corresponding getIncompatibleReasons() / getUnsupportedReasons() override must also be added so the Compatibility Guide picks it up.
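The shape of the recursive collation walk can be sketched as follows. This is a toy Python model only; the real helper is Scala on CometTypeShim, and the type classes and the default-collation constant here are stand-ins for their Spark counterparts.

```python
from dataclasses import dataclass

UTF8_BINARY_COLLATION_ID = 0  # stand-in for Spark's default collation id

@dataclass
class StringType:
    collation_id: int = UTF8_BINARY_COLLATION_ID

@dataclass
class ArrayType:
    element_type: object

@dataclass
class MapType:
    key_type: object
    value_type: object

@dataclass
class StructType:
    field_types: list

def has_non_default_string_collation(dt) -> bool:
    """Recursively check whether dt contains a collated (non-default) string."""
    if isinstance(dt, StringType):
        return dt.collation_id != UTF8_BINARY_COLLATION_ID
    if isinstance(dt, ArrayType):
        return has_non_default_string_collation(dt.element_type)
    if isinstance(dt, MapType):
        return (has_non_default_string_collation(dt.key_type)
                or has_non_default_string_collation(dt.value_type))
    if isinstance(dt, StructType):
        return any(has_non_default_string_collation(t) for t in dt.field_types)
    return False  # non-string leaf types never carry collation

print(has_non_default_string_collation(ArrayType(StringType())))                 # False
print(has_non_default_string_collation(ArrayType(StructType([StringType(5)]))))  # True
```

The Spark 3.x shim variant can skip the walk entirely and return false, since collated strings do not exist before 4.0.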

How are these changes tested?

CometSqlFileTestSuite [array_intersect] and CometArrayExpressionSuite (39 tests) both pass locally. make succeeds on the Spark 3.4, 3.5, and 4.0 profiles, and cargo clippy --all-targets --workspace -- -D warnings is clean.

Audit the ArrayIntersect expression against Spark 3.4.3, 3.5.8, and 4.0.1. The
only cross-version change is a cosmetic argument added to an internal error
path in 4.0; the runtime semantics are unchanged.

Record why the expression is Incompatible: DataFusion's array_intersect probes
the longer of the two inputs, so element order can diverge from Spark when the
right argument is longer than the left. Expose this reason through
getSupportLevel and also getIncompatibleReasons so it appears in the generated
Compatibility Guide.

Add a Spark 4.0 fallback for collated string inputs: DataFusion compares rows
by bytes and cannot honour collation, so getSupportLevel returns Unsupported
when the element type (recursively) contains a StringType with non-UTF8_BINARY
collation. A new hasNonDefaultStringCollation helper on CometTypeShim is
stubbed to false in Spark 3.x.

Expand array_intersect.sql from four int-only queries to coverage for boolean,
tinyint, smallint, bigint, float, double (incl. NaN, Infinity, -Infinity, -0),
decimal, date, timestamp, string (incl. multibyte UTF-8), binary, nested
array<array<int>>, duplicate elements, self-intersection, all-NULL and
both-NULL arrays, CASE-WHEN inputs, and a right-longer-than-left case wrapped
in sort_array to normalise the documented ordering divergence.
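Why wrapping in sort_array makes the assertion stable, in illustrative Python terms (the two result values below are hypothetical, chosen to match the ordering behavior described above):

```python
def sort_array(arr):
    # Stand-in for Spark's sort_array(arr): ascending sort normalizes away
    # any engine-specific element ordering.
    return sorted(arr)

spark_result = [3, 1, 2]       # follows the left input's order
datafusion_result = [2, 1, 3]  # follows the longer (right) input's order
print(sort_array(spark_result) == sort_array(datafusion_result))  # True
```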

Update the audit-comet-expression skill to remind future audits to override
both getIncompatibleReasons and getUnsupportedReasons whenever getSupportLevel
returns a reason, so that the Compatibility Guide picks them up.
# Conflicts:
#	common/src/main/spark-4.0/org/apache/comet/shims/CometTypeShim.scala
@andygrove andygrove marked this pull request as draft April 24, 2026 21:18
@andygrove andygrove marked this pull request as ready for review April 24, 2026 22:21
Spark's QueryTest.prepareRow only converts top-level Array[_] to Seq, so
inner Array[Byte] elements from array<binary> columns keep their default
Java toString ([B@<hex>). The toString-based sort in prepareAnswer then
produces non-deterministic orderings between Spark and Comet runs, causing
spurious mismatches. Recurse into nested arrays so toString is stable.
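A Python analogy for the fix (the actual code is Spark's Scala QueryTest; JavaByteArray and prepare_row here are hypothetical stand-ins for byte[] and prepareRow):

```python
class JavaByteArray:
    """Stand-in for a JVM byte[] whose default toString is '[B@<identityHash>'."""
    def __init__(self, data):
        self.data = bytes(data)
    def __str__(self):
        return f"[B@{id(self):x}"  # identity-based, differs between runs

def prepare_row(value):
    # Recurse into nested sequences, converting byte arrays to a stable form
    # so that a toString-based sort key is deterministic.
    if isinstance(value, JavaByteArray):
        return list(value.data)
    if isinstance(value, (list, tuple)):
        return [prepare_row(v) for v in value]
    return value

rows = [[JavaByteArray(b"cd")], [JavaByteArray(b"ab")]]
# Without the recursion, the sort keys are identity strings like '[B@7f3a...'
# and the ordering is unstable; after prepare_row, str() is deterministic.
stable = sorted((prepare_row(r) for r in rows), key=str)
print(stable)  # [[[97, 98]], [[99, 100]]]
```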
