feat: add BuildHasher variants for hash_utils#21820
Open
xudong963 wants to merge 1 commit intoapache:mainfrom
Open
feat: add BuildHasher variants for hash_utils#21820xudong963 wants to merge 1 commit intoapache:mainfrom
xudong963 wants to merge 1 commit intoapache:mainfrom
Conversation
Member
Author
|
run benchmark with_hashes |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing xudong963/hash-buildhasher (0823c8d) to fd093fb (merge-base) diff File an issue against this benchmark runner |
|
🤖 Criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
Member
Author
|
@Dandandan, this PR resolves the regression for the #21429. It would be sweet to get your thoughts! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
This PR adds
BuildHasher-based variants forhash_utilsso callers can compute row hashes with a caller-provided hash builder instead of always using DataFusion's defaultRandomState.The main constraint is performance:
with_hashesis a hot path, especially for string, dictionary, and nested array hashing. A previous version in #21429 caused measurable regressions in the defaultRandomStatepath, for examplelarge_utf8: single, no nullsregressed from roughly26.7usto36.3us, andlarge_utf8: multiple, no nullsfrom roughly112usto127us.This version keeps the default path performance-oriented by avoiding a fully generic
BuildHasherrewrite of the existing hot loops.What changes are included in this PR?
This PR adds:
with_hashes_with_hashercreate_hashes_with_hasherThe implementation intentionally uses a hybrid design:
RandomStateleaf hot paths remain specialized.BuildHasherleaf paths live separately inhash_utils/build_hasher.rs.The trade-off is that there is still some duplication for primitive/string/binary leaf loops. That duplication is intentional: those are the hottest loops, and keeping them separate prevents the existing
RandomStatepath from becoming generic overBuildHasheror being perturbed by the custom-hasher implementation.Are these changes tested?
Yes.