
[branch-52] Cherry-pick apache/datafusion#21068 #99

Merged
LiaCastaneda merged 2 commits into branch-52 from lia.castaneda/cherry-pick/apache-pr-21068-20260406
Apr 7, 2026


@LiaCastaneda

cherry-picks apache#21068

…e#21068)

* Closes apache#20492.

`HashJoinExec` currently continues polling and consuming the probe side
even after the build side has completed with zero rows.

For join types whose output is guaranteed to be empty when the build
side is empty, this work is unnecessary. In practice, it can trigger
large avoidable scans and extra compute despite producing no output.
This applies to INNER, LEFT, LEFT SEMI, LEFT ANTI, LEFT MARK, and RIGHT
SEMI joins, all of which produce no rows when the build side is empty.

This change makes the stream state machine aware of that condition so
execution can terminate as soon as the build side is known to be empty
and no probe rows are needed to determine the final result.

The change also preserves the existing behavior for join types that
still require probe-side rows even when the build side is empty, such as
RIGHT, FULL, RIGHT ANTI, and RIGHT MARK joins.
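
As a rough illustration of the split described above, the sketch below classifies join types by whether an empty build side guarantees empty output. It uses a local stand-in `JoinType` enum rather than DataFusion's own type, and it assumes the build side is the left input; only the method name and the two lists of join types are taken from this description, the rest is illustrative.

```rust
/// Local stand-in for the join types named in the description.
/// This is not DataFusion's `JoinType`; it only mirrors the names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JoinType {
    Inner,
    Left,
    Right,
    Full,
    LeftSemi,
    RightSemi,
    LeftAnti,
    RightAnti,
    LeftMark,
    RightMark,
}

impl JoinType {
    /// Returns true when an empty build (left) side guarantees that the
    /// join produces no rows, so the probe side never needs to be read.
    pub fn empty_build_side_produces_empty_result(self) -> bool {
        matches!(
            self,
            JoinType::Inner
                | JoinType::Left
                | JoinType::LeftSemi
                | JoinType::LeftAnti
                | JoinType::LeftMark
                | JoinType::RightSemi
        )
    }
}
```

RIGHT, FULL, RIGHT ANTI, and RIGHT MARK return `false` here because each of them emits probe-side rows even when nothing matches.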

* Added `JoinType::empty_build_side_produces_empty_result` to centralize
logic determining when an empty build side guarantees empty output.
* Updated `HashJoinStream` state transitions to:
  * Skip transitioning to `FetchProbeBatch` when the build side is empty
    and output is deterministically empty.
  * Immediately complete the stream in such cases.
* Refactored logic in `build_batch_empty_build_side` to reuse the new
helper method and simplify match branches.
* Ensured probe-side consumption still occurs for join types that
require probe rows (e.g., RIGHT, FULL).
* Added helper `state_after_build_ready` to unify post-build decision
logic.
* Introduced a reusable test helper for constructing hash joins with
dynamic filters.
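
The post-build decision in the list above can be sketched roughly as follows. This is a simplified model, not the code from the PR: the two-variant `StreamState` enum and the function's parameters are assumptions made for illustration, and the real `state_after_build_ready` operates on `HashJoinStream` internals that are not shown in this description.

```rust
/// Simplified stand-in for the stream states relevant to this change.
#[derive(Debug, PartialEq, Eq)]
enum StreamState {
    /// Start pulling batches from the probe side.
    FetchProbeBatch,
    /// Finish the stream without reading the probe side.
    Completed,
}

/// Hypothetical signature: picks the next state once the build side is ready.
/// `empty_build_means_empty_output` stands for the result of
/// `JoinType::empty_build_side_produces_empty_result` for the join at hand.
fn state_after_build_ready(
    build_rows: usize,
    empty_build_means_empty_output: bool,
) -> StreamState {
    if build_rows == 0 && empty_build_means_empty_output {
        // Nothing can be emitted, so skip the probe side entirely.
        StreamState::Completed
    } else {
        // Either there are build rows to match against, or the join type
        // must still emit probe rows (e.g. RIGHT, FULL).
        StreamState::FetchProbeBatch
    }
}
```

Keeping the decision in one function means the early exit and the normal path share a single transition point instead of being spread across match branches.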

Yes, comprehensive tests have been added:

* Verified that probe side is **not consumed** when:

  * Build side is empty
  * Join type guarantees empty output
* Verified that probe side **is still consumed** when required by join
semantics (e.g., RIGHT, FULL joins)
* Covered both filtered and non-filtered joins
* Added tests ensuring correct behavior with dynamic filters
* Added a regression test ensuring correct behavior after partition bounds
reporting

These tests validate both correctness and the intended optimization
behavior.
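
One way to check "probe side is not consumed" is to wrap the probe input in a source that counts how often it is pulled. The sketch below shows the idea with plain iterators; `CountingProbe` and `run_join` are hypothetical names, and the tests in the PR use DataFusion's execution plans and streams rather than this toy model.

```rust
/// Probe-side stand-in that records how many batches were pulled from it.
struct CountingProbe {
    remaining: usize,
    pulled: usize,
}

impl Iterator for CountingProbe {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        self.pulled += 1;
        Some(1) // the payload is irrelevant; only the pull count matters
    }
}

/// Toy join driver: returns how many probe batches were pulled.
/// `skip_probe_when_build_empty` models a join type whose output is
/// guaranteed to be empty when the build side is empty.
fn run_join(
    build_rows: usize,
    skip_probe_when_build_empty: bool,
    probe_batches: usize,
) -> usize {
    let mut probe = CountingProbe { remaining: probe_batches, pulled: 0 };
    if build_rows == 0 && skip_probe_when_build_empty {
        // Early exit: the probe side is never touched.
        return probe.pulled;
    }
    // Normal path: drain the probe side.
    while probe.next().is_some() {}
    probe.pulled
}
```

Asserting on the pull count covers both directions: zero pulls for the early-exit case, and a full drain when the join type still needs probe rows.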

No API changes.

However, this introduces a performance optimization:

* Queries involving joins with empty build sides may complete
significantly faster
* Reduced unnecessary IO and compute

No behavioral changes in query results.

This PR includes LLM-generated code and comments. All LLM-generated
content has been manually reviewed and tested.

(cherry picked from commit 6c5e241)
…ng helpers

The cherry-pick of apache PR apache#21068 incorrectly included null-aware
anti-join code (referencing nonexistent fields `null_aware`,
`probe_side_non_empty`, `probe_side_has_null` on `HashJoinStream`/
`JoinLeftData`) from a different PR. Also fixes:
- `.map()` -> `.hash_map()` to match this branch's `JoinLeftData` API
- Replace `new_empty_schema_batch()` (undefined in this branch) with
  an inline `RecordBatch::try_new_with_options` equivalent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@LiaCastaneda LiaCastaneda force-pushed the lia.castaneda/cherry-pick/apache-pr-21068-20260406 branch from b682bfe to 6b9fbb8 Compare April 6, 2026 08:53
@LiaCastaneda LiaCastaneda merged commit 4a36e6b into branch-52 Apr 7, 2026
60 checks passed
@gabotechs gabotechs changed the title Cherry-pick apache/datafusion#21068 [branch-52] Cherry-pick apache/datafusion#21068 Apr 16, 2026