DNM: feat: sampling with hll by 0xPoe · Pull Request #66615 · pingcap/tidb

0xPoe · 2026-03-01T18:41:54Z

What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

What changed and how does it work?

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

New Features
- Added a HyperLogLog (HLL) sketch for improved distinct-value (NDV) estimation, with creation, update, merge and serialization support.
Refactor
- Switched NDV sourcing to prefer HLL-based estimates when available.
- Threaded HLL sketches through sampling, collectors, merging and statistics construction so they persist across serialization/deserialization.
Tests
- Added unit test coverage validating HLL sketch behavior.
Chores
- Updated module resolution mapping in go.mod.

Signed-off-by: 0xPoe <techregister@pm.me>

pantheon-ai · 2026-03-01T18:41:58Z

Review Complete

Findings: 6 issues
Posted: 3
Duplicates/Skipped: 3

ti-chi-bot · 2026-03-01T18:42:03Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bb7133, fixdb for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS
pkg/statistics/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-03-01T18:42:51Z

📝 Walkthrough

Walkthrough

Added a fixed-size HyperLogLog (HLL) sketch and integrated it across sampling, merge, protobuf, and statistics-building paths; also added a go.mod replace mapping for github.com/pingcap/tipb to github.com/0xPoe/tipb v0.0.0-20260301184210-6ae1d47e641e.

Changes

Cohort / File(s)	Summary
Module Configuration `go.mod`	Added replace directive mapping `github.com/pingcap/tipb` to `github.com/0xPoe/tipb v0.0.0-20260301184210-6ae1d47e641e`.
HLL Sketch Implementation `pkg/statistics/hllsketch.go`	New HLLSketch type (16 buckets) with constructors, insert (value/row), merge, NDV estimation, proto serialization/deserialization, copy, and memory accounting.
Row-level Samplers `pkg/statistics/row_sampler.go`	Propagated HLLSketch into base/reservoir/Bernoulli collectors: init, insert, merge, proto conversions, and memory updates; added nil-safety and pooling.
Sample Collector & Proto `pkg/statistics/sample.go`	Added `HLLSketch *HLLSketch` to SampleCollector, `EstimatedNDV()` preferring HLL, and extended merge/proto conversion to include HLL.
Executor Sampling Integration `pkg/executor/analyze_col_v2.go`	Threaded HLLSketch through per-row collectors, per-task collectors, merge workers, and final statistic composition so HLL content reaches build phase.
Statistics Builder `pkg/statistics/builder.go`	Switched NDV source from `FMSketch.NDV()` to `collector.EstimatedNDV()` where used in BuildColumn and BuildHistAndTopN.
Tests `pkg/statistics/fmsketch_test.go`	Added `TestHLLSketchRankIgnoresBucketBits` validating rank computation behavior for HLL register updates.

Sequence Diagram

sequenceDiagram
    participant RowSampler as RowSampleCollector
    participant HLL as HLLSketch
    participant SC as SampleCollector
    participant Builder as StatisticsBuilder

    RowSampler->>HLL: InsertValue / InsertRowValue (per row)
    RowSampler->>SC: Attach sample (include HLLSketch)
    RowSampler->>SC: Merge per-task collectors (merge HLLSketch)
    SC->>Builder: Serialize/Send (includes HllSketch proto)
    Builder->>SC: Call EstimatedNDV() (prefers HLL)
    Builder->>Builder: Build histograms/topN using NDV

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

statistics: simplify FMSketch by removing swiss map and pool #66590 — Related changes around FMSketch usage and NDV handling affecting stats logic and tests.
ddl, executor, statistics, planner: remove extended stats feature #66362 — Changes in statistics builder/NDV computation that overlap with EstimatedNDV usage.

Suggested reviewers

mjonss
hawkingrei
bb7133

Poem

🐰 I hash and hop through buckets small,
Registers hum and counts may sprawl,
Merge my maxes, carry them through,
Tiny sketches making NDV true.
A rabbit's nod to stats anew.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	PR description is largely incomplete; Issue Number placeholder not filled, Problem Summary and implementation details missing, all test/side-effect/documentation checkboxes unchecked, and no rationale provided.	Complete the description template: fill Issue Number, provide Problem Summary and implementation details, check relevant test boxes, document any performance/compatibility changes, and add a release note.
Title check	❓ Inconclusive	Title uses 'DNM' prefix (Do Not Merge) and references HLL feature, but is vague; 'sampling with hll' lacks specifics about which components or the primary change.	Remove 'DNM' if merging is intended, and clarify the main change (e.g., 'Add HyperLogLog sketch support for sampling statistics') for better scannability.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 81.82% which is sufficient. The required threshold is 80.00%.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Command failed

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

pkg/statistics/sample.go (1)
131-135: Verify HLLSketch merge behavior when source is nil but destination exists.

The current logic copies rc.HLLSketch when c.HLLSketch == nil, but if rc.HLLSketch is also nil, c.HLLSketch will remain nil. This is likely fine, but differs from the FM sketch merge which merges unconditionally (relying on internal nil checks).

For consistency, consider whether you want:

Current behavior (skip if source is nil) - cleaner but subtly different from FM pattern

Unconditional merge with nil checks inside MergeHLLSketch (matches FM pattern)

Both work correctly; just flagging the subtle difference.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/statistics/sample.go` around lines 131 - 135, The HLL merge branch in the
sample accumulation currently only copies rc.HLLSketch when c.HLLSketch is nil,
which leaves behavior asymmetric with the FM sketch; change this to perform an
unconditional merge so MergeHLLSketch handles nil checks internally (or
alternately add an explicit nil check on rc.HLLSketch before Copy) — update the
code around c.HLLSketch and rc.HLLSketch to call MergeHLLSketch unconditionally
(or modify MergeHLLSketch to be nil-safe) so behavior matches the FM sketch
merge pattern and is consistent across sketches.
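The second option (nil checks inside the merge helper) could look roughly like the sketch below. It is only an illustration: `sketch` and `mergeSketch` are hypothetical stand-ins for the PR's `HLLSketch` and `MergeHLLSketch`, and the register-wise maximum is the usual HyperLogLog merge rule rather than code taken from this PR.

```go
package main

import "fmt"

// bucketCount mirrors the 16-register sketch described in the walkthrough.
const bucketCount = 16

// sketch is a hypothetical stand-in for HLLSketch, not the PR's actual type.
type sketch struct {
	registers [bucketCount]uint8
}

// mergeSketch folds src into dst and returns the result. Nil on either side
// is handled here, so call sites can merge unconditionally.
func mergeSketch(dst, src *sketch) *sketch {
	if src == nil {
		return dst
	}
	if dst == nil {
		cp := *src // copy so later updates to the result do not alias src
		return &cp
	}
	for i, r := range src.registers {
		if r > dst.registers[i] {
			dst.registers[i] = r // keep the larger rank per register
		}
	}
	return dst
}

func main() {
	a := &sketch{}
	a.registers[0] = 3
	b := &sketch{}
	b.registers[0] = 5
	b.registers[1] = 2
	merged := mergeSketch(a, b)
	fmt.Println(merged.registers[0], merged.registers[1])
	fmt.Println(mergeSketch(nil, nil) == nil)
}
```

With this shape the caller writes `c.HLLSketch = mergeSketch(c.HLLSketch, rc.HLLSketch)` in every case, which is the symmetry with the FM sketch path that the comment asks about.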
pkg/statistics/hllsketch.go (1)
163-176: Proto deserialization silently ignores BucketBits field.

HLLSketchFromProto ignores protoSketch.BucketBits, always using the hardcoded hllBucketCount. If a proto was serialized with a different bucket configuration, deserialization would produce incorrect results.

Consider either:

Validating that protoSketch.BucketBits == hllBucketBits, or

Returning an error if they don't match
♻️ Proposed validation
 func HLLSketchFromProto(protoSketch *tipb.HllSketch) *HLLSketch {
 	if protoSketch == nil {
 		return nil
 	}
+	// Validate bucket configuration matches
+	if protoSketch.BucketBits != 0 && protoSketch.BucketBits != hllBucketBits {
+		// Log warning or handle mismatch
+		return nil
+	}
 	sketch := NewHLLSketch()
 	for i, reg := range protoSketch.Registers {
 		if i >= hllBucketCount {
 			break
 		}
 		sketch.registers[i] = uint8(reg)
 	}
 	return sketch
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/statistics/hllsketch.go` around lines 163 - 176, HLLSketchFromProto
currently ignores protoSketch.BucketBits and always uses the package constant
hllBucketCount/hllBucketBits; update the function to validate the incoming
proto's BucketBits against the expected hllBucketBits and fail fast if they
differ: inside HLLSketchFromProto (which constructs an HLLSketch via
NewHLLSketch and fills sketch.registers) check protoSketch != nil then compare
protoSketch.BucketBits to hllBucketBits and either return nil (or an error value
if you prefer) or log/propagate an error when they mismatch so deserialization
cannot silently produce an incorrectly sized HLLSketch.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@go.mod`:
- Line 373: The go.mod contains a local replace directive
"github.com/pingcap/tipb => ../tipb" which breaks builds outside your
environment; remove the local replace, update go.mod to point to a released tipb
version that includes the HLLSketch message and the new fields on
RowSampleCollector and SampleCollector, run `make bazel_prepare` to refresh
Bazel metadata, and commit the updated go.mod/go.sum plus the generated Bazel
metadata so CI and other developers can build successfully (look for references
to HLLSketch, RowSampleCollector, SampleCollector and the
github.com/pingcap/tipb module while making these changes).

In `@pkg/statistics/hllsketch.go`:
- Around line 96-107: The loop is shadowing the outer buffer variable `b` by
using `b, err := codec.EncodeValue(...)`, so `b = b[:0]` is ineffective and
causes extra allocations; change the short-declare to assignment (`b, err =
codec.EncodeValue(...)`) so the existing buffer is reused, and ensure the
surrounding loop in the function that calls sc.ErrCtx(), codec.EncodeValue, and
hashFunc.Write uses the outer `b` (and handles `err`) rather than creating a new
shadowed `b`.

In `@pkg/statistics/row_sampler.go`:
- Around line 306-316: The build fails because the tipb RowSampleCollector
protobuf lacks the HllSketch field used by code (see tipb.RowSampleCollector and
the HllSketch field referenced when assigning pbHLLSketches); add a repeated
HllSketch field (with the correct tipb.HllSketch message type) to the
RowSampleCollector proto, regenerate the Go protobuf bindings, publish the
updated tipb module version, and update go.mod to depend on that released
version (removing any local replace) so tipb.RowSampleCollector in Go includes
the new HllSketch field.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9246e72 and b5a3b22.

📒 Files selected for processing (6)

go.mod
pkg/executor/analyze_col_v2.go
pkg/statistics/builder.go
pkg/statistics/hllsketch.go
pkg/statistics/row_sampler.go
pkg/statistics/sample.go

go.mod

coderabbitai · 2026-03-01T18:50:37Z

pkg/statistics/hllsketch.go

+	errCtx := sc.ErrCtx()
+	for _, v := range values {
+		b = b[:0]
+		b, err := codec.EncodeValue(sc.TimeZone(), b, v)
+		err = errCtx.HandleError(err)
+		if err != nil {
+			return err
+		}
+		_, err = hashFunc.Write(b)
+		if err != nil {
+			return err
+		}


⚠️ Potential issue | 🟠 Major

Variable shadowing bug causes unnecessary allocations.

Line 99 uses := which shadows the outer b variable. The b = b[:0] on line 98 becomes pointless as the := creates a new variable b, potentially causing new allocations each iteration.

🐛 Proposed fix

 func (s *HLLSketch) InsertRowValue(sc *stmtctx.StatementContext, values []types.Datum) error {
 	b := make([]byte, 0, 8)
 	hashFunc := murmur3Pool.Get().(hash.Hash64)
 	hashFunc.Reset()
 	defer murmur3Pool.Put(hashFunc)
 	errCtx := sc.ErrCtx()
+	var err error
 	for _, v := range values {
 		b = b[:0]
-		b, err := codec.EncodeValue(sc.TimeZone(), b, v)
+		b, err = codec.EncodeValue(sc.TimeZone(), b, v)
 		err = errCtx.HandleError(err)
 		if err != nil {
 			return err
 		}
 		_, err = hashFunc.Write(b)
 		if err != nil {
 			return err
 		}
 	}
 	s.insertHashValue(hashFunc.Sum64())
 	return nil
 }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-	errCtx := sc.ErrCtx()
-	for _, v := range values {
-		b = b[:0]
-		b, err := codec.EncodeValue(sc.TimeZone(), b, v)
-		err = errCtx.HandleError(err)
-		if err != nil {
-			return err
-		}
-		_, err = hashFunc.Write(b)
-		if err != nil {
-			return err
-		}
+	errCtx := sc.ErrCtx()
+	var err error
+	for _, v := range values {
+		b = b[:0]
+		b, err = codec.EncodeValue(sc.TimeZone(), b, v)
+		err = errCtx.HandleError(err)
+		if err != nil {
+			return err
+		}
+		_, err = hashFunc.Write(b)
+		if err != nil {
+			return err
+		}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/statistics/hllsketch.go` around lines 96 - 107, The loop is shadowing the outer buffer variable `b` by using `b, err := codec.EncodeValue(...)`, so `b = b[:0]` is ineffective and causes extra allocations; change the short-declare to assignment (`b, err = codec.EncodeValue(...)`) so the existing buffer is reused, and ensure the surrounding loop in the function that calls sc.ErrCtx(), codec.EncodeValue, and hashFunc.Write uses the outer `b` (and handles `err`) rather than creating a new shadowed `b`.
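The effect of the shadowing can be reproduced in isolation. The sketch below is a minimal illustration under assumptions: `encode` is a hypothetical append-style stand-in for `codec.EncodeValue`, and `shadowed` and `reused` are invented names. It suggests the cost shows up when an encoded value outgrows the initial capacity: the grown slice is bound to the inner `b` and dropped at the end of the iteration, so the outer buffer never keeps the larger backing array.

```go
package main

import "fmt"

// encode is a hypothetical stand-in for an append-style encoder such as
// codec.EncodeValue: it appends payload to buf and returns the grown slice.
func encode(buf, payload []byte) ([]byte, error) {
	return append(buf, payload...), nil
}

// shadowed mimics the reviewed loop: := declares a new b in each iteration,
// so any growth of the buffer is discarded and the outer b keeps capacity 8.
func shadowed(payloads [][]byte) int {
	b := make([]byte, 0, 8)
	for _, p := range payloads {
		b = b[:0]
		b, err := encode(b, p) // inner b shadows the outer one
		if err != nil {
			return -1
		}
		_ = b // the encoded bytes would be hashed here
	}
	return cap(b) // capacity of the outer buffer
}

// reused applies the proposed fix: plain assignment keeps the grown buffer.
func reused(payloads [][]byte) int {
	b := make([]byte, 0, 8)
	var err error
	for _, p := range payloads {
		b = b[:0]
		b, err = encode(b, p)
		if err != nil {
			return -1
		}
	}
	return cap(b)
}

func main() {
	payloads := [][]byte{make([]byte, 64), make([]byte, 64)}
	fmt.Println(shadowed(payloads), reused(payloads))
}
```

When every encoded value fits in the initial 8 bytes, both variants write into the same backing array, so the difference only matters for larger datums.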

pkg/statistics/row_sampler.go

Signed-off-by: 0xPoe <techregister@pm.me>

coderabbitai

🧹 Nitpick comments (1)

go.mod (1)
373-373: Fork dependency is appropriate for DNM development, but plan the upstream merge path.

Using github.com/0xPoe/tipb (your fork) addresses the previous local path issue and allows CI/CD to build successfully. This approach is acceptable for a "Do Not Merge" PR while you develop and test the HLLSketch feature.

Before this PR can be merged, the HLLSketch protobuf changes (new HllSketch message type and fields on RowSampleCollector/SampleCollector) must be:

Submitted and merged to the upstream github.com/pingcap/tipb repository

Released as an official tipb version

Referenced here by updating this replace directive to point to the official release
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@go.mod` at line 373, The go.mod currently replaces github.com/pingcap/tipb
with your fork github.com/0xPoe/tipb to allow DNM development; to prepare for
merge, open a PR against upstream github.com/pingcap/tipb containing the
HllSketch protobuf changes (new HllSketch message and the added fields on
RowSampleCollector and SampleCollector), get that merged and released as an
official tipb version, then update this module by removing or changing the
replace directive for github.com/0xPoe/tipb and point the require to the
official released tag (or remove the replace entirely) so the code references
the upstream tipb release instead of your fork.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b5a3b22 and 93bd39b.

⛔ Files ignored due to path filters (1)

go.sum is excluded by !**/*.sum

📒 Files selected for processing (1)

go.mod

codecov · 2026-03-01T19:26:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.2933%. Comparing base (9246e72) to head (93bd39b).
⚠️ Report is 78 commits behind head on master.

⚠️ Current head 93bd39b differs from pull request most recent head ff22f11

Please upload reports for the commit ff22f11 to get more accurate results.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #66615        +/-   ##
================================================
- Coverage   77.6757%   77.2933%   -0.3824%     
================================================
  Files          2008       1938        -70     
  Lines        549227     536685     -12542     
================================================
- Hits         426616     414822     -11794     
- Misses       120951     121595       +644     
+ Partials       1660        268      -1392

Flag	Coverage Δ
integration	`39.1556% <ø> (-9.0408%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`56.7974% <ø> (ø)`
parser	`∅ <ø> (∅)`
br	`48.7598% <ø> (-12.1162%)`	⬇️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

pantheon-ai · 2026-03-01T19:28:17Z

pkg/statistics/hllsketch.go

+	if w == 0 {
+		rank = uint8(64 - hllBucketBits + 1)
+	} else {
+		rank = uint8(bits.LeadingZeros64(w) + 1)


[P1] Incorrect HLL rank calculation causes systematic NDV over-estimation

Why: rank is computed as bits.LeadingZeros64(w) + 1 where w = hashVal >> hllBucketBits. The shift already introduces hllBucketBits (=4) implicit leading zeros, so LeadingZeros64(w) inflates every rank by +4. This systematically over-estimates NDV by a large factor, skewing query plans for all tables analyzed with this HLL feature.

Evidence: pkg/statistics/hllsketch.go:59 (w := hashVal >> hllBucketBits); pkg/statistics/hllsketch.go:64 (rank = uint8(bits.LeadingZeros64(w) + 1)). Fix: use bits.LeadingZeros64(w<<hllBucketBits) + 1, or equivalently subtract hllBucketBits from bits.LeadingZeros64(w) before adding 1.
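The off-by-`hllBucketBits` effect can be checked with a small standalone program. This is a sketch under assumptions, not the PR's code: `bucketBits`, `inflatedRank` and `correctedRank` are hypothetical names, and the `w == 0` branch is copied from the quoted diff.

```go
package main

import (
	"fmt"
	"math/bits"
)

const bucketBits = 4 // assumed to match hllBucketBits in the PR

// inflatedRank reproduces the reviewed computation: after the shift, w has
// bucketBits guaranteed leading zeros, and they are counted as well.
func inflatedRank(hashVal uint64) uint8 {
	w := hashVal >> bucketBits
	if w == 0 {
		return uint8(64 - bucketBits + 1)
	}
	return uint8(bits.LeadingZeros64(w) + 1)
}

// correctedRank counts leading zeros only within the 60 significant bits of w.
func correctedRank(hashVal uint64) uint8 {
	w := hashVal >> bucketBits
	if w == 0 {
		return uint8(64 - bucketBits + 1)
	}
	return uint8(bits.LeadingZeros64(w<<bucketBits) + 1)
}

func main() {
	h := uint64(1) << 63 // top bit set, so the rank should be 1
	fmt.Println(inflatedRank(h), correctedRank(h)) // prints "5 1"
}
```

Since each extra rank step roughly doubles the estimate, a constant offset of 4 corresponds to about a 16x inflation, which is consistent with the "large factor" described above.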

pantheon-ai · 2026-03-01T19:39:15Z

pkg/statistics/sample.go

 			return errors.Trace(err)
 		}
+		if c.HLLSketch != nil {
+			if err := c.HLLSketch.InsertValue(sc, d); err != nil {


[P2] Double encode+hash per value in FM and HLL sketch insertion wastes CPU

Why: Each non-null datum goes through two independent codec.EncodeValue + Murmur3 hash cycles — once for FMSketch.InsertValue and once for HLLSketch.InsertValue — because each sketch re-encodes from the raw value independently. Since NewHLLSketch() is always allocated per column (no nil bypass), this doubles encode+hash work for every sampled value across all ANALYZE collection paths.

Evidence: pkg/statistics/sample.go:201 (c.FMSketch.InsertValue(sc, d)) and pkg/statistics/sample.go:205 (c.HLLSketch.InsertValue(sc, d)); both fmsketch.go:137 and hllsketch.go:72 independently call codec.EncodeValue + hashFunc.Write + Sum64(). Same pattern at pkg/statistics/row_sampler.go:254+259. Fix: compute encoded bytes and hash once per datum and pass the shared hash into both sketches.
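One possible shape for the "hash once" fix is sketched below. Everything here is hypothetical: `hashSink`, `countingSink` and `insertOnce` are invented for illustration, and FNV-1a from the Go standard library stands in for the Murmur3 hasher so that the example stays self-contained.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashSink is a hypothetical interface: any sketch that can consume a
// precomputed 64-bit hash instead of re-encoding the value itself.
type hashSink interface {
	insertHash(h uint64)
}

// countingSink records how many hashes it received and the last one seen.
type countingSink struct {
	calls int
	last  uint64
}

func (c *countingSink) insertHash(h uint64) {
	c.calls++
	c.last = h
}

// insertOnce hashes the already-encoded bytes a single time and fans the
// result out to every sink.
func insertOnce(encoded []byte, sinks ...hashSink) uint64 {
	hasher := fnv.New64a()
	_, _ = hasher.Write(encoded) // hash.Hash's Write never returns an error
	h := hasher.Sum64()
	for _, s := range sinks {
		s.insertHash(h)
	}
	return h
}

func main() {
	fm, hll := &countingSink{}, &countingSink{}
	h := insertOnce([]byte("value"), fm, hll)
	fmt.Println(fm.last == h, hll.last == h, fm.calls, hll.calls)
}
```

Whether both sketches can share one hash depends on them using the same hash function and seed, which the review implies but which should be confirmed against the actual FM and HLL implementations.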

pantheon-ai · 2026-03-01T19:39:15Z

pkg/statistics/row_sampler.go

 			return err
 		}
+		if s.HLLSketches[i] != nil {
+			err = s.HLLSketches[i].InsertValue(sc, col)


[P2] Double encode+hash per value in row sampler FM and HLL sketch insertion

Why: baseCollector.collectColumns calls both FMSketches[i].InsertValue and HLLSketches[i].InsertValue for each non-null column value, and both independently run codec.EncodeValue + Murmur3 hashing. With NewHLLSketch() always appended at row_sampler.go:168, there is no path to avoid the redundant work.

Evidence: pkg/statistics/row_sampler.go:254 (FM insert) and pkg/statistics/row_sampler.go:259 (HLL insert); both sketch InsertValue implementations re-encode+rehash independently (fmsketch.go:137, hllsketch.go:72). Fix: hash once, share between FM and HLL.

Signed-off-by: 0xPoe <techregister@pm.me>

tiprow · 2026-03-12T17:39:01Z

@0xPoe: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
fast_test_tiprow	`ff22f11`	link	true	`/test fast_test_tiprow`
tidb_parser_test	`ff22f11`	link	true	`/test tidb_parser_test`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

ti-chi-bot · 2026-03-12T17:49:49Z

@0xPoe: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-mysql-client-test	`ff22f11`	link	true	`/test pull-mysql-client-test`
pull-build-next-gen	`ff22f11`	link	true	`/test pull-build-next-gen`
idc-jenkins-ci-tidb/mysql-test	`ff22f11`	link	true	`/test mysql-test`
idc-jenkins-ci-tidb/build	`ff22f11`	link	true	`/test build`
idc-jenkins-ci-tidb/check_dev	`ff22f11`	link	true	`/test check-dev`
idc-jenkins-ci-tidb/check_dev_2	`ff22f11`	link	true	`/test check-dev2`
pull-integration-realcluster-test-next-gen	`ff22f11`	link	true	`/test pull-integration-realcluster-test-next-gen`
pull-integration-e2e-test	`ff22f11`	link	true	`/test pull-integration-e2e-test`
idc-jenkins-ci-tidb/unit-test	`ff22f11`	link	true	`/test unit-test`
pull-lightning-integration-test	`ff22f11`	link	true	`/test pull-lightning-integration-test`
pull-mysql-client-test-next-gen	`ff22f11`	link	true	`/test pull-mysql-client-test-next-gen`
pull-unit-test-next-gen	`ff22f11`	link	true	`/test pull-unit-test-next-gen`

ti-chi-bot · 2026-03-12T17:54:37Z

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

_{📖 For more info, you can check the "Contribute Code" section in the development guide.}

Notice: To remove the do-not-merge/needs-tests-checked label, please finish the tests, then check the finished items in the description.

For example:

Tests

Unit test

Integration test

Manual test (add detailed scripts or steps below)

No code

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

pkg/statistics/hllsketch.go (1)

98-109: ⚠️ Potential issue | 🟠 Major

Variable shadowing bug causes unnecessary allocations.

Line 101 uses := which creates a new variable b that shadows the outer b from line 93. This makes b = b[:0] on line 100 ineffective, causing a new allocation each iteration instead of reusing the buffer.

🐛 Proposed fix

 func (s *HLLSketch) InsertRowValue(sc *stmtctx.StatementContext, values []types.Datum) error {
 	b := make([]byte, 0, 8)
 	hashFunc := murmur3Pool.Get().(hash.Hash64)
 	hashFunc.Reset()
 	defer murmur3Pool.Put(hashFunc)

 	errCtx := sc.ErrCtx()
+	var err error
 	for _, v := range values {
 		b = b[:0]
-		b, err := codec.EncodeValue(sc.TimeZone(), b, v)
+		b, err = codec.EncodeValue(sc.TimeZone(), b, v)
 		err = errCtx.HandleError(err)
 		if err != nil {
 			return err
 		}
 		_, err = hashFunc.Write(b)
 		if err != nil {
 			return err
 		}
 	}
 	s.insertHashValue(hashFunc.Sum64())
 	return nil
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/statistics/hllsketch.go` around lines 98 - 109, The loop is shadowing the
outer buffer variable b by using ":=" when calling codec.EncodeValue, preventing
reuse of the slice and causing repeated allocations; change the inner
declaration "b, err := codec.EncodeValue(...)" to an assignment "b, err =
codec.EncodeValue(...)" so the existing b (reset by "b = b[:0]") is reused, keep
the subsequent err handling via errCtx.HandleError(err) and return err as before
within the loop that iterates over values and uses hashFunc.Write.

🧹 Nitpick comments (1)

pkg/statistics/hllsketch.go (1)
29-38: Consider documenting the accuracy trade-off for 16-bucket HLL.

With only 16 buckets, the standard error for NDV estimates is approximately 26% (1.04/√16). This is significantly higher than typical HLL implementations (2^10 to 2^16 buckets). If this is an intentional trade-off for memory efficiency in the sampling context, a brief comment documenting the expected accuracy would help future maintainers understand the design decision.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/statistics/hllsketch.go` around lines 29 - 38, Add a brief comment next
to the HLL constants and HLLSketch type documenting the accuracy trade-off of
using 16 buckets: state that hllBucketBits = 4 (hllBucketCount = 16) yields a
standard error ≈ 1.04/√16 ≈ 26%, and note this is an intentional
memory-vs-accuracy trade-off for sampling contexts; update the comment above
HLLSketch (and/or the const block referencing hllBucketBits/hllBucketCount) to
include that expected accuracy figure and rationale so future maintainers can
find the design decision easily.
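The 26% figure follows directly from the 1.04/√m approximation quoted in the comment. The short program below evaluates it for a few register counts; `standardError` is a hypothetical helper, not part of the PR.

```go
package main

import (
	"fmt"
	"math"
)

// standardError returns the approximate relative standard error 1.04/sqrt(m)
// of a HyperLogLog estimate with m = 2^bucketBits registers.
func standardError(bucketBits uint) float64 {
	m := float64(uint64(1) << bucketBits)
	return 1.04 / math.Sqrt(m)
}

func main() {
	for _, p := range []uint{4, 10, 14} {
		fmt.Printf("bits=%d registers=%d stderr=%.4f\n",
			p, uint64(1)<<p, standardError(p))
	}
}
```

For 4 bits this gives 0.26, and for 10 bits 0.0325, which is the gap the comment suggests documenting next to the constants.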

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/statistics/hllsketch.go`:
- Around line 165-178: HLLSketchFromProto must validate protoSketch.BucketBits
against the expected hllBucketBits to avoid silent misinterpretation; update
HLLSketchFromProto to check if protoSketch == nil first, then if
protoSketch.BucketBits != hllBucketBits return nil (or otherwise handle an
explicit error/early exit) before creating the sketch with NewHLLSketch and
copying protoSketch.Registers up to hllBucketCount into sketch.registers;
reference protoSketch.BucketBits, hllBucketBits, HLLSketchFromProto,
NewHLLSketch, hllBucketCount, and sketch.registers when making the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e44c512b-230b-4eb0-af10-68595c34d044

📥 Commits

Reviewing files that changed from the base of the PR and between 93bd39b and ff22f11.

📒 Files selected for processing (2)

pkg/statistics/fmsketch_test.go
pkg/statistics/hllsketch.go

coderabbitai · 2026-03-12T17:58:22Z

pkg/statistics/hllsketch.go

+// HLLSketchFromProto converts HLLSketch from its protobuf representation.
+func HLLSketchFromProto(protoSketch *tipb.HllSketch) *HLLSketch {
+	if protoSketch == nil {
+		return nil
+	}
+	sketch := NewHLLSketch()
+	for i, reg := range protoSketch.Registers {
+		if i >= hllBucketCount {
+			break
+		}
+		sketch.registers[i] = uint8(reg)
+	}
+	return sketch
+}


⚠️ Potential issue | 🟡 Minor

Validate BucketBits from proto to prevent silent data misinterpretation.

HLLSketchFromProto does not validate that protoSketch.BucketBits matches the expected hllBucketBits constant. If a proto with a different bucket configuration is received (e.g., from a different TiDB version or TiKV/TiFlash component), the register values will be misinterpreted, leading to incorrect NDV estimates.

🛡️ Proposed fix to add validation

 func HLLSketchFromProto(protoSketch *tipb.HllSketch) *HLLSketch {
 	if protoSketch == nil {
 		return nil
 	}
+	if protoSketch.BucketBits != hllBucketBits {
+		// Incompatible bucket configuration; return nil to trigger fallback to FMSketch
+		return nil
+	}
 	sketch := NewHLLSketch()
 	for i, reg := range protoSketch.Registers {
 		if i >= hllBucketCount {
 			break
 		}
 		sketch.registers[i] = uint8(reg)
 	}
 	return sketch
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/statistics/hllsketch.go` around lines 165 - 178, HLLSketchFromProto must validate protoSketch.BucketBits against the expected hllBucketBits to avoid silent misinterpretation; update HLLSketchFromProto to check if protoSketch == nil first, then if protoSketch.BucketBits != hllBucketBits return nil (or otherwise handle an explicit error/early exit) before creating the sketch with NewHLLSketch and copying protoSketch.Registers up to hllBucketCount into sketch.registers; reference protoSketch.BucketBits, hllBucketBits, HLLSketchFromProto, NewHLLSketch, hllBucketCount, and sketch.registers when making the change.
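The "explicit error" alternative mentioned in the prompt could be sketched as follows. This is an assumption-laden illustration: `protoSketch` is a hypothetical stand-in for `tipb.HllSketch` (its field types are guessed), and `sketchFromProto` and `errBucketBitsMismatch` are invented names. Returning an error lets the caller tell "no sketch was sent" apart from "a sketch was sent but cannot be interpreted".

```go
package main

import (
	"errors"
	"fmt"
)

const (
	bucketBits  = 4 // assumed to match hllBucketBits
	bucketCount = 1 << bucketBits
)

// protoSketch is a hypothetical stand-in for tipb.HllSketch; field names
// follow the review comment, field types are assumed.
type protoSketch struct {
	BucketBits uint32
	Registers  []uint32
}

// sketch is a hypothetical stand-in for HLLSketch.
type sketch struct {
	registers [bucketCount]uint8
}

var errBucketBitsMismatch = errors.New("hll sketch: unexpected bucket bits")

// sketchFromProto rejects sketches built with a different bucket layout
// instead of silently reinterpreting their registers.
func sketchFromProto(p *protoSketch) (*sketch, error) {
	if p == nil {
		return nil, nil // nothing was sent; not an error
	}
	if p.BucketBits != bucketBits {
		return nil, errBucketBitsMismatch
	}
	s := &sketch{}
	for i, reg := range p.Registers {
		if i >= bucketCount {
			break
		}
		s.registers[i] = uint8(reg)
	}
	return s, nil
}

func main() {
	_, err := sketchFromProto(&protoSketch{BucketBits: 6, Registers: make([]uint32, 64)})
	fmt.Println(err)
	s, err := sketchFromProto(&protoSketch{BucketBits: 4, Registers: []uint32{3, 1}})
	fmt.Println(s.registers[0], s.registers[1], err)
}
```

The proposed fix above returns nil instead, which keeps the existing signature and relies on the FMSketch fallback; the error-returning variant trades a signature change for visibility of the mismatch.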

feat: sampling with hll

b5a3b22

Signed-off-by: 0xPoe <techregister@pm.me>

ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/needs-tests-checked release-note-none Denotes a PR that doesn't merit a release note. labels Mar 1, 2026

ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. component/statistics sig/planner SIG: Planner labels Mar 1, 2026

coderabbitai bot reviewed Mar 1, 2026

fix: use git dep

93bd39b

Signed-off-by: 0xPoe <techregister@pm.me>

coderabbitai bot reviewed Mar 1, 2026

pantheon-ai bot reviewed Mar 1, 2026

0xPoe closed this Mar 2, 2026

fix: rank issue

ff22f11

Signed-off-by: 0xPoe <techregister@pm.me>

0xPoe reopened this Mar 12, 2026

coderabbitai bot reviewed Mar 12, 2026

0xPoe closed this Mar 13, 2026

0xPoe deleted the poe-patch-sampling-with-hll-without-f1 branch April 1, 2026 18:52
