Conversation
Signed-off-by: 0xPoe <techregister@pm.me>
|
Review Complete Findings: 6 issues |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughAdded a fixed-size HyperLogLog (HLL) sketch and integrated it across sampling, merge, protobuf, and statistics-building paths; also added a go.mod replace mapping for Changes
Sequence DiagramsequenceDiagram
participant RowSampler as RowSampleCollector
participant HLL as HLLSketch
participant SC as SampleCollector
participant Builder as StatisticsBuilder
RowSampler->>HLL: InsertValue / InsertRowValue (per row)
RowSampler->>SC: Attach sample (include HLLSketch)
RowSampler->>SC: Merge per-task collectors (merge HLLSketch)
SC->>Builder: Serialize/Send (includes HllSketch proto)
Builder->>SC: Call EstimatedNDV() (prefers HLL)
Builder->>Builder: Build histograms/topN using NDV
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)Command failed Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
pkg/statistics/sample.go (1)
131-135: Verify HLLSketch merge behavior when source is nil but destination exists.The current logic copies
rc.HLLSketchwhenc.HLLSketch == nil, but ifrc.HLLSketchis also nil,c.HLLSketchwill remain nil. This is likely fine, but differs from the FM sketch merge which merges unconditionally (relying on internal nil checks).For consistency, consider whether you want:
- Current behavior (skip if source is nil) - cleaner but subtly different from FM pattern
- Unconditional merge with nil checks inside MergeHLLSketch (matches FM pattern)
Both work correctly; just flagging the subtle difference.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/statistics/sample.go` around lines 131 - 135, The HLL merge branch in the sample accumulation currently only copies rc.HLLSketch when c.HLLSketch is nil, which leaves behavior asymmetric with the FM sketch; change this to perform an unconditional merge so MergeHLLSketch handles nil checks internally (or alternately add an explicit nil check on rc.HLLSketch before Copy) — update the code around c.HLLSketch and rc.HLLSketch to call MergeHLLSketch unconditionally (or modify MergeHLLSketch to be nil-safe) so behavior matches the FM sketch merge pattern and is consistent across sketches.pkg/statistics/hllsketch.go (1)
163-176: Proto deserialization silently ignoresBucketBitsfield.
HLLSketchFromProtoignoresprotoSketch.BucketBits, always using the hardcodedhllBucketCount. If a proto was serialized with a different bucket configuration, deserialization would produce incorrect results.Consider either:
- Validating that
protoSketch.BucketBits == hllBucketBits, or- Returning an error if they don't match
♻️ Proposed validation
func HLLSketchFromProto(protoSketch *tipb.HllSketch) *HLLSketch { if protoSketch == nil { return nil } + // Validate bucket configuration matches + if protoSketch.BucketBits != 0 && protoSketch.BucketBits != hllBucketBits { + // Log warning or handle mismatch + return nil + } sketch := NewHLLSketch() for i, reg := range protoSketch.Registers { if i >= hllBucketCount { break } sketch.registers[i] = uint8(reg) } return sketch }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/statistics/hllsketch.go` around lines 163 - 176, HLLSketchFromProto currently ignores protoSketch.BucketBits and always uses the package constant hllBucketCount/hllBucketBits; update the function to validate the incoming proto's BucketBits against the expected hllBucketBits and fail fast if they differ: inside HLLSketchFromProto (which constructs an HLLSketch via NewHLLSketch and fills sketch.registers) check protoSketch != nil then compare protoSketch.BucketBits to hllBucketBits and either return nil (or an error value if you prefer) or log/propagate an error when they mismatch so deserialization cannot silently produce an incorrectly sized HLLSketch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go.mod`:
- Line 373: The go.mod contains a local replace directive
"github.com/pingcap/tipb => ../tipb" which breaks builds outside your
environment; remove the local replace, update go.mod to point to a released tipb
version that includes the HLLSketch message and the new fields on
RowSampleCollector and SampleCollector, run `make bazel_prepare` to refresh
Bazel metadata, and commit the updated go.mod/go.sum plus the generated Bazel
metadata so CI and other developers can build successfully (look for references
to HLLSketch, RowSampleCollector, SampleCollector and the
github.com/pingcap/tipb module while making these changes).
In `@pkg/statistics/hllsketch.go`:
- Around line 96-107: The loop is shadowing the outer buffer variable `b` by
using `b, err := codec.EncodeValue(...)`, so `b = b[:0]` is ineffective and
causes extra allocations; change the short-declare to assignment (`b, err =
codec.EncodeValue(...)`) so the existing buffer is reused, and ensure the
surrounding loop in the function that calls sc.ErrCtx(), codec.EncodeValue, and
hashFunc.Write uses the outer `b` (and handles `err`) rather than creating a new
shadowed `b`.
In `@pkg/statistics/row_sampler.go`:
- Around line 306-316: The build fails because the tipb RowSampleCollector
protobuf lacks the HllSketch field used by code (see tipb.RowSampleCollector and
the HllSketch field referenced when assigning pbHLLSketches); add a repeated
HllSketch field (with the correct tipb.HllSketch message type) to the
RowSampleCollector proto, regenerate the Go protobuf bindings, publish the
updated tipb module version, and update go.mod to depend on that released
version (removing any local replace) so tipb.RowSampleCollector in Go includes
the new HllSketch field.
---
Nitpick comments:
In `@pkg/statistics/hllsketch.go`:
- Around line 163-176: HLLSketchFromProto currently ignores
protoSketch.BucketBits and always uses the package constant
hllBucketCount/hllBucketBits; update the function to validate the incoming
proto's BucketBits against the expected hllBucketBits and fail fast if they
differ: inside HLLSketchFromProto (which constructs an HLLSketch via
NewHLLSketch and fills sketch.registers) check protoSketch != nil then compare
protoSketch.BucketBits to hllBucketBits and either return nil (or an error value
if you prefer) or log/propagate an error when they mismatch so deserialization
cannot silently produce an incorrectly sized HLLSketch.
In `@pkg/statistics/sample.go`:
- Around line 131-135: The HLL merge branch in the sample accumulation currently
only copies rc.HLLSketch when c.HLLSketch is nil, which leaves behavior
asymmetric with the FM sketch; change this to perform an unconditional merge so
MergeHLLSketch handles nil checks internally (or alternately add an explicit nil
check on rc.HLLSketch before Copy) — update the code around c.HLLSketch and
rc.HLLSketch to call MergeHLLSketch unconditionally (or modify MergeHLLSketch to
be nil-safe) so behavior matches the FM sketch merge pattern and is consistent
across sketches.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
go.modpkg/executor/analyze_col_v2.gopkg/statistics/builder.gopkg/statistics/hllsketch.gopkg/statistics/row_sampler.gopkg/statistics/sample.go
| errCtx := sc.ErrCtx() | ||
| for _, v := range values { | ||
| b = b[:0] | ||
| b, err := codec.EncodeValue(sc.TimeZone(), b, v) | ||
| err = errCtx.HandleError(err) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| _, err = hashFunc.Write(b) | ||
| if err != nil { | ||
| return err | ||
| } |
There was a problem hiding this comment.
Variable shadowing bug causes unnecessary allocations.
Line 99 uses := which shadows the outer b variable. The b = b[:0] on line 98 becomes pointless as the := creates a new variable b, potentially causing new allocations each iteration.
🐛 Proposed fix
func (s *HLLSketch) InsertRowValue(sc *stmtctx.StatementContext, values []types.Datum) error {
b := make([]byte, 0, 8)
hashFunc := murmur3Pool.Get().(hash.Hash64)
hashFunc.Reset()
defer murmur3Pool.Put(hashFunc)
errCtx := sc.ErrCtx()
+ var err error
for _, v := range values {
b = b[:0]
- b, err := codec.EncodeValue(sc.TimeZone(), b, v)
+ b, err = codec.EncodeValue(sc.TimeZone(), b, v)
err = errCtx.HandleError(err)
if err != nil {
return err
}
_, err = hashFunc.Write(b)
if err != nil {
return err
}
}
s.insertHashValue(hashFunc.Sum64())
return nil
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| errCtx := sc.ErrCtx() | |
| for _, v := range values { | |
| b = b[:0] | |
| b, err := codec.EncodeValue(sc.TimeZone(), b, v) | |
| err = errCtx.HandleError(err) | |
| if err != nil { | |
| return err | |
| } | |
| _, err = hashFunc.Write(b) | |
| if err != nil { | |
| return err | |
| } | |
| errCtx := sc.ErrCtx() | |
| var err error | |
| for _, v := range values { | |
| b = b[:0] | |
| b, err = codec.EncodeValue(sc.TimeZone(), b, v) | |
| err = errCtx.HandleError(err) | |
| if err != nil { | |
| return err | |
| } | |
| _, err = hashFunc.Write(b) | |
| if err != nil { | |
| return err | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/statistics/hllsketch.go` around lines 96 - 107, The loop is shadowing the
outer buffer variable `b` by using `b, err := codec.EncodeValue(...)`, so `b =
b[:0]` is ineffective and causes extra allocations; change the short-declare to
assignment (`b, err = codec.EncodeValue(...)`) so the existing buffer is reused,
and ensure the surrounding loop in the function that calls sc.ErrCtx(),
codec.EncodeValue, and hashFunc.Write uses the outer `b` (and handles `err`)
rather than creating a new shadowed `b`.
Signed-off-by: 0xPoe <techregister@pm.me>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
go.mod (1)
373-373: Fork dependency is appropriate for DNM development, but plan the upstream merge path.Using
github.com/0xPoe/tipb(your fork) addresses the previous local path issue and allows CI/CD to build successfully. This approach is acceptable for a "Do Not Merge" PR while you develop and test the HLLSketch feature.Before this PR can be merged, the HLLSketch protobuf changes (new
HllSketchmessage type and fields onRowSampleCollector/SampleCollector) must be:
- Submitted and merged to the upstream
github.com/pingcap/tipbrepository- Released as an official tipb version
- Referenced here by updating this replace directive to point to the official release
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@go.mod` at line 373, The go.mod currently replaces github.com/pingcap/tipb with your fork github.com/0xPoe/tipb to allow DNM development; to prepare for merge, open a PR against upstream github.com/pingcap/tipb containing the HllSketch protobuf changes (new HllSketch message and the added fields on RowSampleCollector and SampleCollector), get that merged and released as an official tipb version, then update this module by removing or changing the replace directive for github.com/0xPoe/tipb and point the require to the official released tag (or remove the replace entirely) so the code references the upstream tipb release instead of your fork.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@go.mod`:
- Line 373: The go.mod currently replaces github.com/pingcap/tipb with your fork
github.com/0xPoe/tipb to allow DNM development; to prepare for merge, open a PR
against upstream github.com/pingcap/tipb containing the HllSketch protobuf
changes (new HllSketch message and the added fields on RowSampleCollector and
SampleCollector), get that merged and released as an official tipb version, then
update this module by removing or changing the replace directive for
github.com/0xPoe/tipb and point the require to the official released tag (or
remove the replace entirely) so the code references the upstream tipb release
instead of your fork.
Codecov Report✅ All modified and coverable lines are covered by tests. Please upload reports for the commit ff22f11 to get more accurate results. Additional details and impacted files@@ Coverage Diff @@
## master #66615 +/- ##
================================================
- Coverage 77.6757% 77.2933% -0.3824%
================================================
Files 2008 1938 -70
Lines 549227 536685 -12542
================================================
- Hits 426616 414822 -11794
- Misses 120951 121595 +644
+ Partials 1660 268 -1392
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
pkg/statistics/hllsketch.go
Outdated
| if w == 0 { | ||
| rank = uint8(64 - hllBucketBits + 1) | ||
| } else { | ||
| rank = uint8(bits.LeadingZeros64(w) + 1) |
There was a problem hiding this comment.
[P1] Incorrect HLL rank calculation causes systematic NDV over-estimation
Why: rank is computed as bits.LeadingZeros64(w) + 1 where w = hashVal >> hllBucketBits. The shift already introduces hllBucketBits (=4) implicit leading zeros, so LeadingZeros64(w) inflates every rank by +4. This systematically over-estimates NDV by a large factor, skewing query plans for all tables analyzed with this HLL feature.
Evidence: pkg/statistics/hllsketch.go:59 (w := hashVal >> hllBucketBits); pkg/statistics/hllsketch.go:64 (rank = uint8(bits.LeadingZeros64(w) + 1)). Fix: use bits.LeadingZeros64(w<<hllBucketBits) + 1 or equivalently operate on the unshifted value's low bits.
| return errors.Trace(err) | ||
| } | ||
| if c.HLLSketch != nil { | ||
| if err := c.HLLSketch.InsertValue(sc, d); err != nil { |
There was a problem hiding this comment.
[P2] Double encode+hash per value in FM and HLL sketch insertion wastes CPU
Why: Each non-null datum goes through two independent codec.EncodeValue + Murmur3 hash cycles — once for FMSketch.InsertValue and once for HLLSketch.InsertValue — because each sketch re-encodes from the raw value independently. Since NewHLLSketch() is always allocated per column (no nil bypass), this doubles encode+hash work for every sampled value across all ANALYZE collection paths.
Evidence: pkg/statistics/sample.go:201 (c.FMSketch.InsertValue(sc, d)) and pkg/statistics/sample.go:205 (c.HLLSketch.InsertValue(sc, d)); both fmsketch.go:137 and hllsketch.go:72 independently call codec.EncodeValue + hashFunc.Write + Sum64(). Same pattern at pkg/statistics/row_sampler.go:254+259. Fix: compute encoded bytes and hash once per datum and pass the shared hash into both sketches.
| return err | ||
| } | ||
| if s.HLLSketches[i] != nil { | ||
| err = s.HLLSketches[i].InsertValue(sc, col) |
There was a problem hiding this comment.
[P2] Double encode+hash per value in row sampler FM and HLL sketch insertion
Why: baseCollector.collectColumns calls both FMSketches[i].InsertValue and HLLSketches[i].InsertValue for each non-null column value, and both independently run codec.EncodeValue + Murmur3 hashing. With NewHLLSketch() always appended at row_sampler.go:168, there is no path to avoid the redundant work.
Evidence: pkg/statistics/row_sampler.go:254 (FM insert) and pkg/statistics/row_sampler.go:259 (HLL insert); both sketch InsertValue implementations re-encode+rehash independently (fmsketch.go:137, hllsketch.go:72). Fix: hash once, share between FM and HLL.
Signed-off-by: 0xPoe <techregister@pm.me>
|
@0xPoe: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@0xPoe: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
[FORMAT CHECKER NOTIFICATION] Notice: To remove the 📖 For more info, you can check the "Contribute Code" section in the development guide. Notice: To remove the For example:
📖 For more info, you can check the "Contribute Code" section in the development guide. |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/statistics/hllsketch.go (1)
98-109:⚠️ Potential issue | 🟠 MajorVariable shadowing bug causes unnecessary allocations.
Line 101 uses
:=which creates a new variablebthat shadows the outerbfrom line 93. This makesb = b[:0]on line 100 ineffective, causing a new allocation each iteration instead of reusing the buffer.🐛 Proposed fix
func (s *HLLSketch) InsertRowValue(sc *stmtctx.StatementContext, values []types.Datum) error { b := make([]byte, 0, 8) hashFunc := murmur3Pool.Get().(hash.Hash64) hashFunc.Reset() defer murmur3Pool.Put(hashFunc) errCtx := sc.ErrCtx() + var err error for _, v := range values { b = b[:0] - b, err := codec.EncodeValue(sc.TimeZone(), b, v) + b, err = codec.EncodeValue(sc.TimeZone(), b, v) err = errCtx.HandleError(err) if err != nil { return err } - _, err = hashFunc.Write(b) + _, err = hashFunc.Write(b) if err != nil { return err } } s.insertHashValue(hashFunc.Sum64()) return nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/statistics/hllsketch.go` around lines 98 - 109, The loop is shadowing the outer buffer variable b by using ":=" when calling codec.EncodeValue, preventing reuse of the slice and causing repeated allocations; change the inner declaration "b, err := codec.EncodeValue(...)" to an assignment "b, err = codec.EncodeValue(...)" so the existing b (reset by "b = b[:0]") is reused, keep the subsequent err handling via errCtx.HandleError(err) and return err as before within the loop that iterates over values and uses hashFunc.Write.
🧹 Nitpick comments (1)
pkg/statistics/hllsketch.go (1)
29-38: Consider documenting the accuracy trade-off for 16-bucket HLL.With only 16 buckets, the standard error for NDV estimates is approximately 26% (1.04/√16). This is significantly higher than typical HLL implementations (2^10 to 2^16 buckets). If this is an intentional trade-off for memory efficiency in the sampling context, a brief comment documenting the expected accuracy would help future maintainers understand the design decision.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/statistics/hllsketch.go` around lines 29 - 38, Add a brief comment next to the HLL constants and HLLSketch type documenting the accuracy trade-off of using 16 buckets: state that hllBucketBits = 4 (hllBucketCount = 16) yields a standard error ≈ 1.04/√16 ≈ 26%, and note this is an intentional memory-vs-accuracy trade-off for sampling contexts; update the comment above HLLSketch (and/or the const block referencing hllBucketBits/hllBucketCount) to include that expected accuracy figure and rationale so future maintainers can find the design decision easily.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/statistics/hllsketch.go`:
- Around line 165-178: HLLSketchFromProto must validate protoSketch.BucketBits
against the expected hllBucketBits to avoid silent misinterpretation; update
HLLSketchFromProto to check if protoSketch == nil first, then if
protoSketch.BucketBits != hllBucketBits return nil (or otherwise handle an
explicit error/early exit) before creating the sketch with NewHLLSketch and
copying protoSketch.Registers up to hllBucketCount into sketch.registers;
reference protoSketch.BucketBits, hllBucketBits, HLLSketchFromProto,
NewHLLSketch, hllBucketCount, and sketch.registers when making the change.
---
Duplicate comments:
In `@pkg/statistics/hllsketch.go`:
- Around line 98-109: The loop is shadowing the outer buffer variable b by using
":=" when calling codec.EncodeValue, preventing reuse of the slice and causing
repeated allocations; change the inner declaration "b, err :=
codec.EncodeValue(...)" to an assignment "b, err = codec.EncodeValue(...)" so
the existing b (reset by "b = b[:0]") is reused, keep the subsequent err
handling via errCtx.HandleError(err) and return err as before within the loop
that iterates over values and uses hashFunc.Write.
---
Nitpick comments:
In `@pkg/statistics/hllsketch.go`:
- Around line 29-38: Add a brief comment next to the HLL constants and HLLSketch
type documenting the accuracy trade-off of using 16 buckets: state that
hllBucketBits = 4 (hllBucketCount = 16) yields a standard error ≈ 1.04/√16 ≈
26%, and note this is an intentional memory-vs-accuracy trade-off for sampling
contexts; update the comment above HLLSketch (and/or the const block referencing
hllBucketBits/hllBucketCount) to include that expected accuracy figure and
rationale so future maintainers can find the design decision easily.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: e44c512b-230b-4eb0-af10-68595c34d044
📒 Files selected for processing (2)
pkg/statistics/fmsketch_test.gopkg/statistics/hllsketch.go
| // HLLSketchFromProto converts HLLSketch from its protobuf representation. | ||
| func HLLSketchFromProto(protoSketch *tipb.HllSketch) *HLLSketch { | ||
| if protoSketch == nil { | ||
| return nil | ||
| } | ||
| sketch := NewHLLSketch() | ||
| for i, reg := range protoSketch.Registers { | ||
| if i >= hllBucketCount { | ||
| break | ||
| } | ||
| sketch.registers[i] = uint8(reg) | ||
| } | ||
| return sketch | ||
| } |
There was a problem hiding this comment.
Validate BucketBits from proto to prevent silent data misinterpretation.
HLLSketchFromProto does not validate that protoSketch.BucketBits matches the expected hllBucketBits constant. If a proto with a different bucket configuration is received (e.g., from a different TiDB version or TiKV/TiFlash component), the register values will be misinterpreted, leading to incorrect NDV estimates.
🛡️ Proposed fix to add validation
func HLLSketchFromProto(protoSketch *tipb.HllSketch) *HLLSketch {
if protoSketch == nil {
return nil
}
+ if protoSketch.BucketBits != hllBucketBits {
+ // Incompatible bucket configuration; return nil to trigger fallback to FMSketch
+ return nil
+ }
sketch := NewHLLSketch()
for i, reg := range protoSketch.Registers {
if i >= hllBucketCount {
break
}
sketch.registers[i] = uint8(reg)
}
return sketch
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/statistics/hllsketch.go` around lines 165 - 178, HLLSketchFromProto must
validate protoSketch.BucketBits against the expected hllBucketBits to avoid
silent misinterpretation; update HLLSketchFromProto to check if protoSketch ==
nil first, then if protoSketch.BucketBits != hllBucketBits return nil (or
otherwise handle an explicit error/early exit) before creating the sketch with
NewHLLSketch and copying protoSketch.Registers up to hllBucketCount into
sketch.registers; reference protoSketch.BucketBits, hllBucketBits,
HLLSketchFromProto, NewHLLSketch, hllBucketCount, and sketch.registers when
making the change.
What problem does this PR solve?
Issue Number: close #xxx
Problem Summary:
What changed and how does it work?
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
New Features
Refactor
Tests
Chores