
feat: support regular BuildRight+LeftAnti hash join #4073

Draft
viirya wants to merge 7 commits into apache:main from viirya:feat-null-aware-anti-join

@viirya viirya commented Apr 24, 2026

Which issue does this PR close?

Closes #457.

Rationale for this change

DataFusion 53 introduced null-aware anti-join support in HashJoinExec (fixing apache/datafusion#10583), so Comet can now offload Spark's NOT IN subquery pattern (BroadcastHashJoinExec with isNullAwareAntiJoin=true) to the native layer instead of falling back to Spark. The original BuildRight + LeftAnti rejection (issue #457) was added as a workaround for that null-aware bug; with the bug fixed upstream, the rejection can be removed entirely, enabling all BuildRight + LeftAnti cases (BHJ null-aware, BHJ regular, SHJ regular) to run natively.

What changes are included in this PR?

  • Add null_aware_anti_join field to HashJoin proto and pass it to DataFusion's HashJoinExec::try_new()
  • Set NullAwareAntiJoin from BroadcastHashJoinExec.isNullAwareAntiJoin in the Scala serializer
  • Skip swap_inputs() for null-aware anti-join (DataFusion only allows null_aware=true with LeftAnti; swap would turn it into RightAnti)
  • Remove the blanket "BuildRight + LeftAnti is not supported" rejection in CometHashJoin and RewriteJoin
  • Add BroadcastHashJoin tests for both null-aware (NOT IN subquery) and non-null-aware (BROADCAST hint with LEFT ANTI JOIN) cases; re-enable the previously-disabled SHJ BuildRight + LeftAnti test

How are these changes tested?

Unit tests

@viirya viirya marked this pull request as draft April 24, 2026 19:36
viirya and others added 7 commits April 24, 2026 23:34
DataFusion 53.0.0 introduced null-aware anti-join support in HashJoinExec
via a new boolean parameter to try_new(). This enables Comet to offload
Spark's NOT IN subquery pattern (BroadcastHashJoinExec with
isNullAwareAntiJoin=true) to the native layer.

Changes:
- Add null_aware_anti_join field (tag 6) to HashJoin proto message
- Pass join.null_aware_anti_join to HashJoinExec::try_new() in planner.rs
- Allow BuildRight+LeftAnti joins through when isNullAwareAntiJoin=true
  on BroadcastHashJoinExec; regular BuildRight+LeftAnti on
  ShuffledHashJoinExec remains blocked (issue apache#457)
- Set NullAwareAntiJoin field in the proto builder from operators.scala

Null-aware semantics: if the build side contains ANY null, all left rows
are suppressed regardless of match status (SQL NOT IN with nullable column).
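
A minimal sketch of the semantics described above, in plain Rust with no DataFusion dependency. The function names and the single nullable `i32` key are assumptions made for illustration; this is not the HashJoinExec implementation. It contrasts the null-aware form (NOT IN) with a regular LeftAnti on the same key.

```rust
use std::collections::HashSet;

/// Models `left.key NOT IN (SELECT key FROM right)` on one nullable key.
/// Hypothetical helper, for illustration only.
fn null_aware_left_anti(left: &[Option<i32>], right: &[Option<i32>]) -> Vec<Option<i32>> {
    // An empty build side makes NOT IN true for every left row, NULL keys included.
    if right.is_empty() {
        return left.to_vec();
    }
    // Any NULL on the build side means the predicate is never TRUE: no rows survive.
    if right.iter().any(|k| k.is_none()) {
        return Vec::new();
    }
    let build: HashSet<i32> = right.iter().flatten().copied().collect();
    // A NULL left key evaluates to NULL against a non-empty build side, so it is dropped.
    left.iter()
        .filter(|k| matches!(k, Some(v) if !build.contains(v)))
        .copied()
        .collect()
}

/// Regular LeftAnti on the same key: NULL keys never match, so NULL left rows survive.
fn regular_left_anti(left: &[Option<i32>], right: &[Option<i32>]) -> Vec<Option<i32>> {
    let build: HashSet<i32> = right.iter().flatten().copied().collect();
    left.iter()
        .filter(|k| match k {
            Some(v) => !build.contains(v),
            None => true,
        })
        .copied()
        .collect()
}
```

With left keys `[1, NULL, 3]` and right keys `[1, NULL]`, the null-aware form returns nothing, while the regular form keeps the NULL row and `3`.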

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…join is supported

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataFusion HashJoinExec supports LeftAnti in all build-side configurations,
and the planner's existing swap_inputs logic handles BuildRight correctly.
The original guard in operators.scala was added without a stated reason;
the referenced issue apache#457 was specifically about null-aware anti-join
(fixed via DataFusion 53), not regular BuildRight+LeftAnti.

Changes:
- Remove the BuildRight+LeftAnti rejection in CometHashJoin
- Allow SortMergeJoin+BuildRight+LeftAnti to be rewritten to ShuffledHashJoin
- Re-enable the previously-disabled SHJ BuildRight+LeftAnti test

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When build_side is BuildRight, the planner calls swap_inputs() which
transforms LeftAnti into RightAnti. DataFusion only supports
null_aware=true for LeftAnti, so for null-aware anti-join we must
skip the swap. Spark's BuildRight semantics (right side is build) already
match DataFusion's default for LeftAnti, so no swap is required.
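
The swap decision described above can be sketched roughly as follows. The enums and the `planned_join_type` function are hypothetical stand-ins, not the actual types in planner.rs or DataFusion; the sketch only captures the rule that a null-aware join must stay LeftAnti and therefore skips the swap.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BuildSide {
    BuildLeft,
    BuildRight,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JoinType {
    Inner,
    LeftSemi,
    LeftAnti,
    RightSemi,
    RightAnti,
}

/// Join type after the two inputs trade places (illustrative subset of join types).
fn swapped(join_type: JoinType) -> JoinType {
    match join_type {
        JoinType::LeftAnti => JoinType::RightAnti,
        JoinType::RightAnti => JoinType::LeftAnti,
        JoinType::LeftSemi => JoinType::RightSemi,
        JoinType::RightSemi => JoinType::LeftSemi,
        JoinType::Inner => JoinType::Inner,
    }
}

/// Hypothetical planner step: decide which join type reaches the hash join.
fn planned_join_type(
    build_side: BuildSide,
    join_type: JoinType,
    null_aware: bool,
) -> Result<JoinType, String> {
    // Assumption from the commit message: null_aware=true is only valid with LeftAnti.
    if null_aware && join_type != JoinType::LeftAnti {
        return Err(format!("null-aware join requires LeftAnti, got {:?}", join_type));
    }
    // BuildRight normally triggers a swap; the null-aware case must skip it,
    // otherwise LeftAnti would become RightAnti.
    if build_side == BuildSide::BuildRight && !null_aware {
        Ok(swapped(join_type))
    } else {
        Ok(join_type)
    }
}
```

Under these assumptions, a regular BuildRight + LeftAnti join comes out as RightAnti, while the null-aware variant stays LeftAnti.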

Add BroadcastHashJoin tests covering:
- LeftAnti + NOT IN subquery (null-aware), including NULL on right side
  (suppresses all left rows) and NULL on left side
- LeftAnti + BROADCAST hint (non-null-aware), with and without NULLs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataFusion 53.1's JoinSelection rewrites null-aware joins to CollectLeft
because Partitioned mode only tracks per-partition null/emptiness state
and can produce wrong NOT IN results across partitions. Comet executes
the physical plan directly (without JoinSelection), so the mode must be
chosen here at plan construction.
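
To illustrate why per-partition state is not enough, here is a toy model in plain Rust. The modulo partitioning, the `not_in` helper, and the choice to send NULL keys to partition 0 are all assumptions made for this sketch; it does not reproduce DataFusion's Partitioned or CollectLeft code paths.

```rust
use std::collections::HashSet;

/// NOT IN over one nullable key, evaluated against the rows it is given.
fn not_in(left: &[Option<i32>], right: &[Option<i32>]) -> Vec<Option<i32>> {
    if right.is_empty() {
        return left.to_vec();
    }
    if right.iter().any(|k| k.is_none()) {
        return Vec::new();
    }
    let build: HashSet<i32> = right.iter().flatten().copied().collect();
    left.iter()
        .filter(|k| matches!(k, Some(v) if !build.contains(v)))
        .copied()
        .collect()
}

/// Toy partitioner: key modulo n, with NULL keys sent to partition 0 (an assumption).
fn partition_of(key: Option<i32>, n: usize) -> usize {
    key.map_or(0, |v| v.rem_euclid(n as i32) as usize)
}

fn split(rows: &[Option<i32>], n: usize) -> Vec<Vec<Option<i32>>> {
    let mut parts = vec![Vec::new(); n];
    for &row in rows {
        parts[partition_of(row, n)].push(row);
    }
    parts
}

/// Each partition only sees its own slice of the build side, so a NULL
/// (or emptiness) in one partition is invisible to the others.
fn partitioned_not_in(left: &[Option<i32>], right: &[Option<i32>], n: usize) -> Vec<Option<i32>> {
    let (l, r) = (split(left, n), split(right, n));
    l.iter()
        .zip(r.iter())
        .flat_map(|(lp, rp)| not_in(lp, rp))
        .collect()
}
```

With left keys `[1, 2, 3]`, right keys `[NULL, 2]`, and two partitions, the partitioned evaluation emits `1` and `3` because their partition never sees the NULL, whereas evaluating against the whole build side correctly returns no rows.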

Regenerate TPC-DS plan-stability golden files for q69 and q87 under
approved-plans-v1_4-spark3_5 and approved-plans-v1_4-spark4_0 now that
BuildRight+LeftAnti runs natively. approved-plans-v1_4 (Spark 3.4) still
needs regeneration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that BuildRight+LeftAnti runs natively, q69 and q87 produce different
plans across all approved-plans variants (v1_4, v1_4-spark3_5,
v1_4-spark4_0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@viirya viirya force-pushed the feat-null-aware-anti-join branch from 34b8e9d to 83f9c76 Compare April 25, 2026 06:49

Development

Successfully merging this pull request may close these issues.

Comet doesn't support Spark BroadcastHashJoinExec if it is null-aware anti-join
