Skip to content

DNM: feat: sampling with hll#66615

Closed
0xPoe wants to merge 3 commits intopingcap:masterfrom
0xPoe:poe-patch-sampling-with-hll-without-f1
Closed

DNM: feat: sampling with hll#66615
0xPoe wants to merge 3 commits intopingcap:masterfrom
0xPoe:poe-patch-sampling-with-hll-without-f1

Conversation

@0xPoe
Copy link
Copy Markdown
Member

@0xPoe 0xPoe commented Mar 1, 2026

What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

What changed and how does it work?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • New Features

    • Added a HyperLogLog (HLL) sketch for improved distinct-value (NDV) estimation, with creation, update, merge and serialization support.
  • Refactor

    • Switched NDV sourcing to prefer HLL-based estimates when available.
    • Threaded HLL sketches through sampling, collectors, merging and statistics construction so they persist across serialization/deserialization.
  • Tests

    • Added unit test coverage validating HLL sketch behavior.
  • Chores

    • Updated module resolution mapping in go.mod.

Signed-off-by: 0xPoe <techregister@pm.me>
@pantheon-ai
Copy link
Copy Markdown

pantheon-ai bot commented Mar 1, 2026

Review Complete

Findings: 6 issues
Posted: 3
Duplicates/Skipped: 3

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot bot commented Mar 1, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bb7133, fixdb for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. component/statistics sig/planner SIG: Planner labels Mar 1, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 1, 2026

📝 Walkthrough

Walkthrough

Added a fixed-size HyperLogLog (HLL) sketch and integrated it across sampling, merge, protobuf, and statistics-building paths; also added a go.mod replace mapping for github.com/pingcap/tipb to github.com/0xPoe/tipb v0.0.0-20260301184210-6ae1d47e641e.

Changes

Cohort / File(s) Summary
Module Configuration
go.mod
Added replace directive mapping github.com/pingcap/tipb to github.com/0xPoe/tipb v0.0.0-20260301184210-6ae1d47e641e.
HLL Sketch Implementation
pkg/statistics/hllsketch.go
New HLLSketch type (16 buckets) with constructors, insert (value/row), merge, NDV estimation, proto serialization/deserialization, copy, and memory accounting.
Row-level Samplers
pkg/statistics/row_sampler.go
Propagated HLLSketch into base/reservoir/Bernoulli collectors: init, insert, merge, proto conversions, and memory updates; added nil-safety and pooling.
Sample Collector & Proto
pkg/statistics/sample.go
Added HLLSketch *HLLSketch to SampleCollector, EstimatedNDV() preferring HLL, and extended merge/proto conversion to include HLL.
Executor Sampling Integration
pkg/executor/analyze_col_v2.go
Threaded HLLSketch through per-row collectors, per-task collectors, merge workers, and final statistic composition so HLL content reaches build phase.
Statistics Builder
pkg/statistics/builder.go
Switched NDV source from FMSketch.NDV() to collector.EstimatedNDV() where used in BuildColumn and BuildHistAndTopN.
Tests
pkg/statistics/fmsketch_test.go
Added TestHLLSketchRankIgnoresBucketBits validating rank computation behavior for HLL register updates.

Sequence Diagram

sequenceDiagram
    participant RowSampler as RowSampleCollector
    participant HLL as HLLSketch
    participant SC as SampleCollector
    participant Builder as StatisticsBuilder

    RowSampler->>HLL: InsertValue / InsertRowValue (per row)
    RowSampler->>SC: Attach sample (include HLLSketch)
    RowSampler->>SC: Merge per-task collectors (merge HLLSketch)
    SC->>Builder: Serialize/Send (includes HllSketch proto)
    Builder->>SC: Call EstimatedNDV() (prefers HLL)
    Builder->>Builder: Build histograms/topN using NDV
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • mjonss
  • hawkingrei
  • bb7133

Poem

🐰 I hash and hop through buckets small,
Registers hum and counts may sprawl,
Merge my maxes, carry them through,
Tiny sketches making NDV true.
A rabbit's nod to stats anew.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Description check ⚠️ Warning PR description is largely incomplete; Issue Number placeholder not filled, Problem Summary and implementation details missing, all test/side-effect/documentation checkboxes unchecked, and no rationale provided. Complete the description template: fill Issue Number, provide Problem Summary and implementation details, check relevant test boxes, document any performance/compatibility changes, and add a release note.
Title check ❓ Inconclusive Title uses 'DNM' prefix (Do Not Merge) and references HLL feature, but is vague; 'sampling with hll' lacks specifics about which components or the primary change. Remove 'DNM' if merging is intended, and clarify the main change (e.g., 'Add HyperLogLog sketch support for sampling statistics') for better scannability.
✅ Passed checks (1 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 81.82% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Command failed


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
pkg/statistics/sample.go (1)

131-135: Verify HLLSketch merge behavior when source is nil but destination exists.

The current logic copies rc.HLLSketch when c.HLLSketch == nil, but if rc.HLLSketch is also nil, c.HLLSketch will remain nil. This is likely fine, but differs from the FM sketch merge which merges unconditionally (relying on internal nil checks).

For consistency, consider whether you want:

  1. Current behavior (skip if source is nil) - cleaner but subtly different from FM pattern
  2. Unconditional merge with nil checks inside MergeHLLSketch (matches FM pattern)

Both work correctly; just flagging the subtle difference.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/statistics/sample.go` around lines 131 - 135, The HLL merge branch in the
sample accumulation currently only copies rc.HLLSketch when c.HLLSketch is nil,
which leaves behavior asymmetric with the FM sketch; change this to perform an
unconditional merge so MergeHLLSketch handles nil checks internally (or
alternately add an explicit nil check on rc.HLLSketch before Copy) — update the
code around c.HLLSketch and rc.HLLSketch to call MergeHLLSketch unconditionally
(or modify MergeHLLSketch to be nil-safe) so behavior matches the FM sketch
merge pattern and is consistent across sketches.
pkg/statistics/hllsketch.go (1)

163-176: Proto deserialization silently ignores BucketBits field.

HLLSketchFromProto ignores protoSketch.BucketBits, always using the hardcoded hllBucketCount. If a proto was serialized with a different bucket configuration, deserialization would produce incorrect results.

Consider either:

  1. Validating that protoSketch.BucketBits == hllBucketBits, or
  2. Returning an error if they don't match
♻️ Proposed validation
 func HLLSketchFromProto(protoSketch *tipb.HllSketch) *HLLSketch {
 	if protoSketch == nil {
 		return nil
 	}
+	// Validate bucket configuration matches
+	if protoSketch.BucketBits != 0 && protoSketch.BucketBits != hllBucketBits {
+		// Log warning or handle mismatch
+		return nil
+	}
 	sketch := NewHLLSketch()
 	for i, reg := range protoSketch.Registers {
 		if i >= hllBucketCount {
 			break
 		}
 		sketch.registers[i] = uint8(reg)
 	}
 	return sketch
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/statistics/hllsketch.go` around lines 163 - 176, HLLSketchFromProto
currently ignores protoSketch.BucketBits and always uses the package constant
hllBucketCount/hllBucketBits; update the function to validate the incoming
proto's BucketBits against the expected hllBucketBits and fail fast if they
differ: inside HLLSketchFromProto (which constructs an HLLSketch via
NewHLLSketch and fills sketch.registers) check protoSketch != nil then compare
protoSketch.BucketBits to hllBucketBits and either return nil (or an error value
if you prefer) or log/propagate an error when they mismatch so deserialization
cannot silently produce an incorrectly sized HLLSketch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@go.mod`:
- Line 373: The go.mod contains a local replace directive
"github.com/pingcap/tipb => ../tipb" which breaks builds outside your
environment; remove the local replace, update go.mod to point to a released tipb
version that includes the HLLSketch message and the new fields on
RowSampleCollector and SampleCollector, run `make bazel_prepare` to refresh
Bazel metadata, and commit the updated go.mod/go.sum plus the generated Bazel
metadata so CI and other developers can build successfully (look for references
to HLLSketch, RowSampleCollector, SampleCollector and the
github.com/pingcap/tipb module while making these changes).

In `@pkg/statistics/hllsketch.go`:
- Around line 96-107: The loop is shadowing the outer buffer variable `b` by
using `b, err := codec.EncodeValue(...)`, so `b = b[:0]` is ineffective and
causes extra allocations; change the short-declare to assignment (`b, err =
codec.EncodeValue(...)`) so the existing buffer is reused, and ensure the
surrounding loop in the function that calls sc.ErrCtx(), codec.EncodeValue, and
hashFunc.Write uses the outer `b` (and handles `err`) rather than creating a new
shadowed `b`.

In `@pkg/statistics/row_sampler.go`:
- Around line 306-316: The build fails because the tipb RowSampleCollector
protobuf lacks the HllSketch field used by code (see tipb.RowSampleCollector and
the HllSketch field referenced when assigning pbHLLSketches); add a repeated
HllSketch field (with the correct tipb.HllSketch message type) to the
RowSampleCollector proto, regenerate the Go protobuf bindings, publish the
updated tipb module version, and update go.mod to depend on that released
version (removing any local replace) so tipb.RowSampleCollector in Go includes
the new HllSketch field.

---

Nitpick comments:
In `@pkg/statistics/hllsketch.go`:
- Around line 163-176: HLLSketchFromProto currently ignores
protoSketch.BucketBits and always uses the package constant
hllBucketCount/hllBucketBits; update the function to validate the incoming
proto's BucketBits against the expected hllBucketBits and fail fast if they
differ: inside HLLSketchFromProto (which constructs an HLLSketch via
NewHLLSketch and fills sketch.registers) check protoSketch != nil then compare
protoSketch.BucketBits to hllBucketBits and either return nil (or an error value
if you prefer) or log/propagate an error when they mismatch so deserialization
cannot silently produce an incorrectly sized HLLSketch.

In `@pkg/statistics/sample.go`:
- Around line 131-135: The HLL merge branch in the sample accumulation currently
only copies rc.HLLSketch when c.HLLSketch is nil, which leaves behavior
asymmetric with the FM sketch; change this to perform an unconditional merge so
MergeHLLSketch handles nil checks internally (or alternately add an explicit nil
check on rc.HLLSketch before Copy) — update the code around c.HLLSketch and
rc.HLLSketch to call MergeHLLSketch unconditionally (or modify MergeHLLSketch to
be nil-safe) so behavior matches the FM sketch merge pattern and is consistent
across sketches.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9246e72 and b5a3b22.

📒 Files selected for processing (6)
  • go.mod
  • pkg/executor/analyze_col_v2.go
  • pkg/statistics/builder.go
  • pkg/statistics/hllsketch.go
  • pkg/statistics/row_sampler.go
  • pkg/statistics/sample.go

Comment on lines +96 to +107
errCtx := sc.ErrCtx()
for _, v := range values {
b = b[:0]
b, err := codec.EncodeValue(sc.TimeZone(), b, v)
err = errCtx.HandleError(err)
if err != nil {
return err
}
_, err = hashFunc.Write(b)
if err != nil {
return err
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Variable shadowing bug causes unnecessary allocations.

Line 99 uses := which shadows the outer b variable. The b = b[:0] on line 98 becomes pointless as the := creates a new variable b, potentially causing new allocations each iteration.

🐛 Proposed fix
 func (s *HLLSketch) InsertRowValue(sc *stmtctx.StatementContext, values []types.Datum) error {
 	b := make([]byte, 0, 8)
 	hashFunc := murmur3Pool.Get().(hash.Hash64)
 	hashFunc.Reset()
 	defer murmur3Pool.Put(hashFunc)

 	errCtx := sc.ErrCtx()
+	var err error
 	for _, v := range values {
 		b = b[:0]
-		b, err := codec.EncodeValue(sc.TimeZone(), b, v)
+		b, err = codec.EncodeValue(sc.TimeZone(), b, v)
 		err = errCtx.HandleError(err)
 		if err != nil {
 			return err
 		}
 		_, err = hashFunc.Write(b)
 		if err != nil {
 			return err
 		}
 	}
 	s.insertHashValue(hashFunc.Sum64())
 	return nil
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
errCtx := sc.ErrCtx()
for _, v := range values {
b = b[:0]
b, err := codec.EncodeValue(sc.TimeZone(), b, v)
err = errCtx.HandleError(err)
if err != nil {
return err
}
_, err = hashFunc.Write(b)
if err != nil {
return err
}
errCtx := sc.ErrCtx()
var err error
for _, v := range values {
b = b[:0]
b, err = codec.EncodeValue(sc.TimeZone(), b, v)
err = errCtx.HandleError(err)
if err != nil {
return err
}
_, err = hashFunc.Write(b)
if err != nil {
return err
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/statistics/hllsketch.go` around lines 96 - 107, The loop is shadowing the
outer buffer variable `b` by using `b, err := codec.EncodeValue(...)`, so `b =
b[:0]` is ineffective and causes extra allocations; change the short-declare to
assignment (`b, err = codec.EncodeValue(...)`) so the existing buffer is reused,
and ensure the surrounding loop in the function that calls sc.ErrCtx(),
codec.EncodeValue, and hashFunc.Write uses the outer `b` (and handles `err`)
rather than creating a new shadowed `b`.

Signed-off-by: 0xPoe <techregister@pm.me>
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
go.mod (1)

373-373: Fork dependency is appropriate for DNM development, but plan the upstream merge path.

Using github.com/0xPoe/tipb (your fork) addresses the previous local path issue and allows CI/CD to build successfully. This approach is acceptable for a "Do Not Merge" PR while you develop and test the HLLSketch feature.

Before this PR can be merged, the HLLSketch protobuf changes (new HllSketch message type and fields on RowSampleCollector/SampleCollector) must be:

  1. Submitted and merged to the upstream github.com/pingcap/tipb repository
  2. Released as an official tipb version
  3. Referenced here by updating this replace directive to point to the official release
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@go.mod` at line 373, The go.mod currently replaces github.com/pingcap/tipb
with your fork github.com/0xPoe/tipb to allow DNM development; to prepare for
merge, open a PR against upstream github.com/pingcap/tipb containing the
HllSketch protobuf changes (new HllSketch message and the added fields on
RowSampleCollector and SampleCollector), get that merged and released as an
official tipb version, then update this module by removing or changing the
replace directive for github.com/0xPoe/tipb and point the require to the
official released tag (or remove the replace entirely) so the code references
the upstream tipb release instead of your fork.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@go.mod`:
- Line 373: The go.mod currently replaces github.com/pingcap/tipb with your fork
github.com/0xPoe/tipb to allow DNM development; to prepare for merge, open a PR
against upstream github.com/pingcap/tipb containing the HllSketch protobuf
changes (new HllSketch message and the added fields on RowSampleCollector and
SampleCollector), get that merged and released as an official tipb version, then
update this module by removing or changing the replace directive for
github.com/0xPoe/tipb and point the require to the official released tag (or
remove the replace entirely) so the code references the upstream tipb release
instead of your fork.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b5a3b22 and 93bd39b.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (1)
  • go.mod

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.2933%. Comparing base (9246e72) to head (93bd39b).
⚠️ Report is 78 commits behind head on master.

⚠️ Current head 93bd39b differs from pull request most recent head ff22f11

Please upload reports for the commit ff22f11 to get more accurate results.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #66615        +/-   ##
================================================
- Coverage   77.6757%   77.2933%   -0.3824%     
================================================
  Files          2008       1938        -70     
  Lines        549227     536685     -12542     
================================================
- Hits         426616     414822     -11794     
- Misses       120951     121595       +644     
+ Partials       1660        268      -1392     
Flag Coverage Δ
integration 39.1556% <ø> (-9.0408%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 56.7974% <ø> (ø)
parser ∅ <ø> (∅)
br 48.7598% <ø> (-12.1162%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

if w == 0 {
rank = uint8(64 - hllBucketBits + 1)
} else {
rank = uint8(bits.LeadingZeros64(w) + 1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Incorrect HLL rank calculation causes systematic NDV over-estimation

Why: rank is computed as bits.LeadingZeros64(w) + 1 where w = hashVal >> hllBucketBits. The shift already introduces hllBucketBits (=4) implicit leading zeros, so LeadingZeros64(w) inflates every rank by +4. This systematically over-estimates NDV by a large factor, skewing query plans for all tables analyzed with this HLL feature.

Evidence: pkg/statistics/hllsketch.go:59 (w := hashVal >> hllBucketBits); pkg/statistics/hllsketch.go:64 (rank = uint8(bits.LeadingZeros64(w) + 1)). Fix: use bits.LeadingZeros64(w<<hllBucketBits) + 1 or equivalently operate on the unshifted value's low bits.

return errors.Trace(err)
}
if c.HLLSketch != nil {
if err := c.HLLSketch.InsertValue(sc, d); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Double encode+hash per value in FM and HLL sketch insertion wastes CPU

Why: Each non-null datum goes through two independent codec.EncodeValue + Murmur3 hash cycles — once for FMSketch.InsertValue and once for HLLSketch.InsertValue — because each sketch re-encodes from the raw value independently. Since NewHLLSketch() is always allocated per column (no nil bypass), this doubles encode+hash work for every sampled value across all ANALYZE collection paths.

Evidence: pkg/statistics/sample.go:201 (c.FMSketch.InsertValue(sc, d)) and pkg/statistics/sample.go:205 (c.HLLSketch.InsertValue(sc, d)); both fmsketch.go:137 and hllsketch.go:72 independently call codec.EncodeValue + hashFunc.Write + Sum64(). Same pattern at pkg/statistics/row_sampler.go:254+259. Fix: compute encoded bytes and hash once per datum and pass the shared hash into both sketches.

return err
}
if s.HLLSketches[i] != nil {
err = s.HLLSketches[i].InsertValue(sc, col)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Double encode+hash per value in row sampler FM and HLL sketch insertion

Why: baseCollector.collectColumns calls both FMSketches[i].InsertValue and HLLSketches[i].InsertValue for each non-null column value, and both independently run codec.EncodeValue + Murmur3 hashing. With NewHLLSketch() always appended at row_sampler.go:168, there is no path to avoid the redundant work.

Evidence: pkg/statistics/row_sampler.go:254 (FM insert) and pkg/statistics/row_sampler.go:259 (HLL insert); both sketch InsertValue implementations re-encode+rehash independently (fmsketch.go:137, hllsketch.go:72). Fix: hash once, share between FM and HLL.

@0xPoe 0xPoe closed this Mar 2, 2026
Signed-off-by: 0xPoe <techregister@pm.me>
@0xPoe 0xPoe reopened this Mar 12, 2026
@tiprow
Copy link
Copy Markdown

tiprow bot commented Mar 12, 2026

@0xPoe: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
fast_test_tiprow ff22f11 link true /test fast_test_tiprow
tidb_parser_test ff22f11 link true /test tidb_parser_test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot bot commented Mar 12, 2026

@0xPoe: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-mysql-client-test ff22f11 link true /test pull-mysql-client-test
pull-build-next-gen ff22f11 link true /test pull-build-next-gen
idc-jenkins-ci-tidb/mysql-test ff22f11 link true /test mysql-test
idc-jenkins-ci-tidb/build ff22f11 link true /test build
idc-jenkins-ci-tidb/check_dev ff22f11 link true /test check-dev
idc-jenkins-ci-tidb/check_dev_2 ff22f11 link true /test check-dev2
pull-integration-realcluster-test-next-gen ff22f11 link true /test pull-integration-realcluster-test-next-gen
pull-integration-e2e-test ff22f11 link true /test pull-integration-e2e-test
idc-jenkins-ci-tidb/unit-test ff22f11 link true /test unit-test
pull-lightning-integration-test ff22f11 link true /test pull-lightning-integration-test
pull-mysql-client-test-next-gen ff22f11 link true /test pull-mysql-client-test-next-gen
pull-unit-test-next-gen ff22f11 link true /test pull-unit-test-next-gen

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot bot commented Mar 12, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.


Notice: To remove the do-not-merge/needs-tests-checked label, please finished the tests then check the finished items in description.

For example:

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

📖 For more info, you can check the "Contribute Code" section in the development guide.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
pkg/statistics/hllsketch.go (1)

98-109: ⚠️ Potential issue | 🟠 Major

Variable shadowing bug causes unnecessary allocations.

Line 101 uses := which creates a new variable b that shadows the outer b from line 93. This makes b = b[:0] on line 100 ineffective, causing a new allocation each iteration instead of reusing the buffer.

🐛 Proposed fix
 func (s *HLLSketch) InsertRowValue(sc *stmtctx.StatementContext, values []types.Datum) error {
 	b := make([]byte, 0, 8)
 	hashFunc := murmur3Pool.Get().(hash.Hash64)
 	hashFunc.Reset()
 	defer murmur3Pool.Put(hashFunc)

 	errCtx := sc.ErrCtx()
+	var err error
 	for _, v := range values {
 		b = b[:0]
-		b, err := codec.EncodeValue(sc.TimeZone(), b, v)
+		b, err = codec.EncodeValue(sc.TimeZone(), b, v)
 		err = errCtx.HandleError(err)
 		if err != nil {
 			return err
 		}
-		_, err = hashFunc.Write(b)
+		_, err = hashFunc.Write(b)
 		if err != nil {
 			return err
 		}
 	}
 	s.insertHashValue(hashFunc.Sum64())
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/statistics/hllsketch.go` around lines 98 - 109, The loop is shadowing the
outer buffer variable b by using ":=" when calling codec.EncodeValue, preventing
reuse of the slice and causing repeated allocations; change the inner
declaration "b, err := codec.EncodeValue(...)" to an assignment "b, err =
codec.EncodeValue(...)" so the existing b (reset by "b = b[:0]") is reused, keep
the subsequent err handling via errCtx.HandleError(err) and return err as before
within the loop that iterates over values and uses hashFunc.Write.
🧹 Nitpick comments (1)
pkg/statistics/hllsketch.go (1)

29-38: Consider documenting the accuracy trade-off for 16-bucket HLL.

With only 16 buckets, the standard error for NDV estimates is approximately 26% (1.04/√16). This is significantly higher than typical HLL implementations (2^10 to 2^16 buckets). If this is an intentional trade-off for memory efficiency in the sampling context, a brief comment documenting the expected accuracy would help future maintainers understand the design decision.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/statistics/hllsketch.go` around lines 29 - 38, Add a brief comment next
to the HLL constants and HLLSketch type documenting the accuracy trade-off of
using 16 buckets: state that hllBucketBits = 4 (hllBucketCount = 16) yields a
standard error ≈ 1.04/√16 ≈ 26%, and note this is an intentional
memory-vs-accuracy trade-off for sampling contexts; update the comment above
HLLSketch (and/or the const block referencing hllBucketBits/hllBucketCount) to
include that expected accuracy figure and rationale so future maintainers can
find the design decision easily.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/statistics/hllsketch.go`:
- Around line 165-178: HLLSketchFromProto must validate protoSketch.BucketBits
against the expected hllBucketBits to avoid silent misinterpretation; update
HLLSketchFromProto to check if protoSketch == nil first, then if
protoSketch.BucketBits != hllBucketBits return nil (or otherwise handle an
explicit error/early exit) before creating the sketch with NewHLLSketch and
copying protoSketch.Registers up to hllBucketCount into sketch.registers;
reference protoSketch.BucketBits, hllBucketBits, HLLSketchFromProto,
NewHLLSketch, hllBucketCount, and sketch.registers when making the change.

---

Duplicate comments:
In `@pkg/statistics/hllsketch.go`:
- Around line 98-109: The loop is shadowing the outer buffer variable b by using
":=" when calling codec.EncodeValue, preventing reuse of the slice and causing
repeated allocations; change the inner declaration "b, err :=
codec.EncodeValue(...)" to an assignment "b, err = codec.EncodeValue(...)" so
the existing b (reset by "b = b[:0]") is reused, keep the subsequent err
handling via errCtx.HandleError(err) and return err as before within the loop
that iterates over values and uses hashFunc.Write.

---

Nitpick comments:
In `@pkg/statistics/hllsketch.go`:
- Around line 29-38: Add a brief comment next to the HLL constants and HLLSketch
type documenting the accuracy trade-off of using 16 buckets: state that
hllBucketBits = 4 (hllBucketCount = 16) yields a standard error ≈ 1.04/√16 ≈
26%, and note this is an intentional memory-vs-accuracy trade-off for sampling
contexts; update the comment above HLLSketch (and/or the const block referencing
hllBucketBits/hllBucketCount) to include that expected accuracy figure and
rationale so future maintainers can find the design decision easily.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e44c512b-230b-4eb0-af10-68595c34d044

📥 Commits

Reviewing files that changed from the base of the PR and between 93bd39b and ff22f11.

📒 Files selected for processing (2)
  • pkg/statistics/fmsketch_test.go
  • pkg/statistics/hllsketch.go

Comment on lines +165 to +178
// HLLSketchFromProto converts HLLSketch from its protobuf representation.
func HLLSketchFromProto(protoSketch *tipb.HllSketch) *HLLSketch {
if protoSketch == nil {
return nil
}
sketch := NewHLLSketch()
for i, reg := range protoSketch.Registers {
if i >= hllBucketCount {
break
}
sketch.registers[i] = uint8(reg)
}
return sketch
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Validate BucketBits from proto to prevent silent data misinterpretation.

HLLSketchFromProto does not validate that protoSketch.BucketBits matches the expected hllBucketBits constant. If a proto with a different bucket configuration is received (e.g., from a different TiDB version or TiKV/TiFlash component), the register values will be misinterpreted, leading to incorrect NDV estimates.

🛡️ Proposed fix to add validation
 func HLLSketchFromProto(protoSketch *tipb.HllSketch) *HLLSketch {
 	if protoSketch == nil {
 		return nil
 	}
+	if protoSketch.BucketBits != hllBucketBits {
+		// Incompatible bucket configuration; return nil to trigger fallback to FMSketch
+		return nil
+	}
 	sketch := NewHLLSketch()
 	for i, reg := range protoSketch.Registers {
 		if i >= hllBucketCount {
 			break
 		}
 		sketch.registers[i] = uint8(reg)
 	}
 	return sketch
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/statistics/hllsketch.go` around lines 165 - 178, HLLSketchFromProto must
validate protoSketch.BucketBits against the expected hllBucketBits to avoid
silent misinterpretation; update HLLSketchFromProto to check if protoSketch ==
nil first, then if protoSketch.BucketBits != hllBucketBits return nil (or
otherwise handle an explicit error/early exit) before creating the sketch with
NewHLLSketch and copying protoSketch.Registers up to hllBucketCount into
sketch.registers; reference protoSketch.BucketBits, hllBucketBits,
HLLSketchFromProto, NewHLLSketch, hllBucketCount, and sketch.registers when
making the change.

@0xPoe 0xPoe closed this Mar 13, 2026
@0xPoe 0xPoe deleted the poe-patch-sampling-with-hll-without-f1 branch April 1, 2026 18:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/statistics do-not-merge/needs-linked-issue do-not-merge/needs-tests-checked release-note-none Denotes a PR that doesn't merit a release note. sig/planner SIG: Planner size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant