feat: support regular BuildRight+LeftAnti hash join#4073
Draft
viirya wants to merge 7 commits intoapache:mainfrom
Draft
feat: support regular BuildRight+LeftAnti hash join#4073viirya wants to merge 7 commits intoapache:mainfrom
viirya wants to merge 7 commits intoapache:mainfrom
Conversation
DataFusion 53.0.0 introduced null-aware anti-join support in HashJoinExec via a new boolean parameter to try_new(). This enables Comet to offload Spark's NOT IN subquery pattern (BroadcastHashJoinExec with isNullAwareAntiJoin=true) to the native layer. Changes: - Add null_aware_anti_join field (tag 6) to HashJoin proto message - Pass join.null_aware_anti_join to HashJoinExec::try_new() in planner.rs - Allow BuildRight+LeftAnti joins through when isNullAwareAntiJoin=true on BroadcastHashJoinExec; regular BuildRight+LeftAnti on ShuffledHashJoinExec remains blocked (issue apache#457) - Set NullAwareAntiJoin field in the proto builder from operators.scala Null-aware semantics: if the build side contains ANY null, all left rows are suppressed regardless of match status (SQL NOT IN with nullable column). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…join is supported Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataFusion HashJoinExec supports LeftAnti in all build-side configurations, and the planner's existing swap_inputs logic handles BuildRight correctly. The original guard in operators.scala was added without a stated reason; the referenced issue apache#457 was specifically about null-aware anti-join (fixed via DataFusion 53), not regular BuildRight+LeftAnti. Changes: - Remove the BuildRight+LeftAnti rejection in CometHashJoin - Allow SortMergeJoin+BuildRight+LeftAnti to be rewritten to ShuffledHashJoin - Re-enable the previously-disabled SHJ BuildRight+LeftAnti test Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When build_side is BuildRight, the planner calls swap_inputs() which transforms LeftAnti into RightAnti. DataFusion only supports null_aware=true for LeftAnti, so for null-aware anti-join we must skip the swap. Spark's BuildRight semantics (right side is build) already match DataFusion's default for LeftAnti, so no swap is required. Add BroadcastHashJoin tests covering: - LeftAnti + NOT IN subquery (null-aware), including NULL on right side (suppresses all left rows) and NULL on left side - LeftAnti + BROADCAST hint (non-null-aware), with and without NULLs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataFusion 53.1's JoinSelection rewrites null-aware joins to CollectLeft because Partitioned mode only tracks per-partition null/emptiness state and can produce wrong NOT IN results across partitions. Comet executes the physical plan directly (without JoinSelection), so the mode must be chosen here at plan construction. Regenerate TPC-DS plan-stability golden files for q69 and q87 under approved-plans-v1_4-spark3_5 and approved-plans-v1_4-spark4_0 now that BuildRight+LeftAnti runs natively. approved-plans-v1_4 (Spark 3.4) still needs regeneration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that BuildRight+LeftAnti runs natively, q69 and q87 produce different plans across all approved-plans variants (v1_4, v1_4-spark3_5, v1_4-spark4_0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
34b8e9d to
83f9c76
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #457.
Rationale for this change
DataFusion 53 introduced null-aware anti-join support in HashJoinExec (fixing apache/datafusion#10583), so Comet can now offload Spark's NOT IN subquery pattern (BroadcastHashJoinExec with isNullAwareAntiJoin=true) to the native layer instead of falling back to Spark. The original BuildRight + LeftAnti rejection (issue #457) was added as a workaround for that null-aware bug; with the bug fixed upstream, the rejection can be removed entirely, enabling all BuildRight + LeftAnti cases (BHJ null-aware, BHJ regular, SHJ regular) to run natively.
What changes are included in this PR?
How are these changes tested?
Unit tests