chore: audit array_intersect and expand SQL test coverage #4071
Open
andygrove wants to merge 4 commits into apache:main from
Conversation
Audit the `ArrayIntersect` expression against Spark 3.4.3, 3.5.8, and 4.0.1. The only cross-version change is a cosmetic argument added to an internal error path in 4.0; the runtime semantics are unchanged.

Record why the expression is Incompatible: DataFusion's `array_intersect` probes the longer of the two inputs, so element order can diverge from Spark when the right argument is longer than the left. Expose this reason through `getSupportLevel` and also `getIncompatibleReasons` so it appears in the generated Compatibility Guide.

Add a Spark 4.0 fallback for collated string inputs: DataFusion compares rows by bytes and cannot honour collation, so `getSupportLevel` returns Unsupported when the element type (recursively) contains a `StringType` with non-UTF8_BINARY collation. A new `hasNonDefaultStringCollation` helper on `CometTypeShim` is stubbed to false in Spark 3.x.

Expand `array_intersect.sql` from four int-only queries to coverage for boolean, tinyint, smallint, bigint, float, double (incl. NaN, Infinity, -Infinity, -0), decimal, date, timestamp, string (incl. multibyte UTF-8), binary, nested `array<array<int>>`, duplicate elements, self-intersection, all-NULL and both-NULL arrays, CASE-WHEN inputs, and a right-longer-than-left case wrapped in `sort_array` to normalise the documented ordering divergence.

Update the audit-comet-expression skill to remind future audits to override both `getIncompatibleReasons` and `getUnsupportedReasons` whenever `getSupportLevel` returns a reason, so that the Compatibility Guide picks them up.
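The probe-order divergence can be illustrated with a small Scala sketch. These two functions are simplified models written for this illustration, not DataFusion's or Spark's actual implementations: one intersects by probing whichever input is longer, the other preserves left-side order as Spark's `array_intersect` does.

```scala
// Simplified model of an intersect that builds its lookup from the shorter
// input and probes the longer one: result order follows the LONGER side.
def intersectProbingLonger[T](left: Seq[T], right: Seq[T]): Seq[T] = {
  val (shorter, longer) =
    if (left.length <= right.length) (left, right) else (right, left)
  val lookup = shorter.toSet
  longer.distinct.filter(lookup.contains)
}

// Simplified model of Spark's semantics: result order follows the LEFT input.
def sparkIntersect[T](left: Seq[T], right: Seq[T]): Seq[T] = {
  val rightSet = right.toSet
  left.distinct.filter(rightSet.contains)
}
```

For `left = Seq(3, 1)` and `right = Seq(1, 2, 3)` the left-ordered model yields `Seq(3, 1)` while the longer-side model yields `Seq(1, 3)`, which is why the new test wraps the right-longer-than-left query in `sort_array`.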
# Conflicts: # common/src/main/spark-4.0/org/apache/comet/shims/CometTypeShim.scala
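A minimal self-contained sketch of the recursive walk that `hasNonDefaultStringCollation` performs. The `DataType` hierarchy below is a simplified stand-in for Spark's `org.apache.spark.sql.types`, with `collationId == 0` standing in for `UTF8_BINARY_COLLATION_ID`; on Spark 3.x the real helper is simply stubbed to false.

```scala
// Simplified stand-ins for Spark's DataType hierarchy (illustration only).
sealed trait DataType
case object IntegerType extends DataType
// collationId != 0 models a non-UTF8_BINARY collation.
final case class StringType(collationId: Int) extends DataType
final case class ArrayType(elementType: DataType) extends DataType
final case class MapType(keyType: DataType, valueType: DataType) extends DataType
final case class StructType(fieldTypes: Seq[DataType]) extends DataType

object CometTypeShim {
  // Recursively walk nested ArrayType/MapType/StructType looking for any
  // StringType whose collation is not the default binary collation.
  def hasNonDefaultStringCollation(dt: DataType): Boolean = dt match {
    case StringType(collationId) => collationId != 0
    case ArrayType(e)            => hasNonDefaultStringCollation(e)
    case MapType(k, v) =>
      hasNonDefaultStringCollation(k) || hasNonDefaultStringCollation(v)
    case StructType(fieldTypes)  => fieldTypes.exists(hasNonDefaultStringCollation)
    case _                       => false
  }
}
```

For example, `hasNonDefaultStringCollation(ArrayType(StringType(1)))` is true, while the same type with the default collation id is false.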
Spark's QueryTest.prepareRow only converts top-level Array[_] to Seq, so inner Array[Byte] elements from array<binary> columns keep their default Java toString ([B@<hex>). The toString-based sort in prepareAnswer then produces non-deterministic orderings between Spark and Comet runs, causing spurious mismatches. Recurse into nested arrays so toString is stable.
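The recursion described above might look roughly like the following sketch. `toComparable` is a hypothetical name for illustration; Spark's actual `prepareRow` has a different shape.

```scala
// Hypothetical sketch: recursively convert nested Array[_] values (e.g.
// Array[Byte] elements inside an array<binary> column) into Seq so that
// toString is value-based rather than identity-based ("[B@<hex>").
// Non-collection values pass through unchanged.
def toComparable(value: Any): Any = value match {
  case a: Array[_] => a.toSeq.map(toComparable)
  case s: Seq[_]   => s.map(toComparable)
  case other       => other
}
```

Two equal nested arrays then render identical strings, whereas raw `Array[Byte]` values render distinct identity hashes, which is what made the sort non-deterministic.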
Which issue does this PR close?
Closes #.
Rationale for this change
`ArrayIntersect` was marked `Incompatible(None)` with no explanation and had minimal test coverage: one SQL file exercising only `array<int>`, plus three Scala assertions. The audit surfaced three concerns that are worth making explicit:

- `array_intersect` builds its hash lookup from the shorter input and probes the longer one, so element order in the result follows the longer side. Spark follows the left side. Results therefore diverge when the right argument is longer than the left, but no test exercised this case and the `Incompatible` marker had no reason text, so users had to set `allowIncompatible=true` on faith.
- Neither `getIncompatibleReasons()` nor `getUnsupportedReasons()` was overridden, so even once a reason is added to `getSupportLevel` it would not show up in the generated Compatibility Guide.
- Spark 4.0 adds collation support to `StringType`. DataFusion compares rows by bytes, so `array_intersect` on a column with non-UTF8_BINARY collation would silently return a wrong answer instead of falling back.

What changes are included in this PR?
- `CometArrayIntersect`: add a `reason` string describing the probe-order divergence and surface it through both `getSupportLevel` and `getIncompatibleReasons`. Return `Unsupported` when the element type (recursively) contains a `StringType` with non-default collation, and surface that reason through `getUnsupportedReasons` too.
- `CometTypeShim`: add `hasNonDefaultStringCollation(dt)` that walks nested `ArrayType`/`MapType`/`StructType`. The Spark 4.0 variant returns true for any `StringType` whose `collationId != UTF8_BINARY_COLLATION_ID`; the Spark 3.x variant is stubbed to `false`.
- `array_intersect.sql`: extend from 4 int-only queries to ~20 queries covering boolean, tinyint, smallint, bigint, float, double (NaN / +-Infinity / -0), decimal, date, timestamp, string (incl. multibyte UTF-8), binary, nested `array<array<int>>`, duplicate elements, self-intersection, all-NULL and both-NULL arrays, `CASE WHEN` inputs, and a right-longer-than-left query wrapped in `sort_array` so the documented ordering divergence stays a stable assertion.
- Docs: record the `array_intersect` audit in `spark_expressions_support.md` for Spark 3.4.3 / 3.5.8 / 4.0.1, dated 2026-04-24, noting the ordering incompatibility and the 4.0 collation fallback.
- Skill (`audit-comet-expression`): add a reminder to the skill instructions that whenever `getSupportLevel` returns a reason, the corresponding `getIncompatibleReasons()`/`getUnsupportedReasons()` override must also be added so the Compatibility Guide picks it up.

How are these changes tested?
`CometSqlFileTestSuite [array_intersect]` and `CometArrayExpressionSuite` (39 tests) both pass locally. `make` succeeds on the Spark 3.4, 3.5, and 4.0 profiles, and `cargo clippy --all-targets --workspace -- -D warnings` is clean.