statistics: replace separate TopN merge with combined TopN+histogram merge for global stats#67501
mjonss wants to merge 47 commits into pingcap:master
Conversation
… concurrency=0 The existing TopN merge (MergePartTopN2GlobalTopN) performs O(P^2) histogram lookups for each TopN value across all partitions, which is slow for tables with many partitions. The histogram estimates it retrieves (totalRows/NDV for columns) add noise rather than accuracy, and the subsequent BinarySearchRemoveVal calls mutate histograms before they are merged, reducing histogram quality. Add MergePartTopN2GlobalTopNV2 which merges TopN counts using a simple hash map without consulting histograms. This is O(T*P) and leaves histograms unmodified for the subsequent histogram merge step. Gate the new path behind tidb_merge_partition_stats_concurrency=0 (new default). Setting it to 1+ falls back to existing V1 paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
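The hash-map merge the commit describes can be sketched as below. `topNMeta` and `mergeTopNByMap` are illustrative stand-ins, not the actual TiDB types or function names: counts are summed per encoded value in a single map, each entry is touched once (O(T*P)), no histogram is consulted, and entries that miss the global top n are returned separately so a later step can fold them into the histogram merge.

```go
package main

import (
	"fmt"
	"sort"
)

// topNMeta mirrors the shape of a TopN entry: an encoded value and its count.
// Illustrative only, not the real TiDB struct.
type topNMeta struct {
	encoded string
	count   uint64
}

// mergeTopNByMap merges per-partition TopN lists with a plain hash map,
// without any histogram lookups. Entries that do not make the global top n
// are returned as leftovers for the histogram merge step.
func mergeTopNByMap(partTopNs [][]topNMeta, n int) (global, left []topNMeta) {
	counts := make(map[string]uint64)
	for _, part := range partTopNs {
		for _, m := range part {
			counts[m.encoded] += m.count
		}
	}
	merged := make([]topNMeta, 0, len(counts))
	for enc, cnt := range counts {
		merged = append(merged, topNMeta{enc, cnt})
	}
	// Keep the highest counts first; ties are not significant for the sketch.
	sort.Slice(merged, func(i, j int) bool { return merged[i].count > merged[j].count })
	if len(merged) <= n {
		return merged, nil
	}
	return merged[:n], merged[n:]
}

func main() {
	parts := [][]topNMeta{
		{{"a", 100}, {"b", 60}},
		{{"a", 90}, {"c", 80}},
	}
	global, left := mergeTopNByMap(parts, 2)
	// global: a=190, c=80; leftover: b=60
	fmt.Println(global, left)
}
```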
…stAndTopN

Add tests that compare V1 and V2 TopN merge behavior and accuracy:
- Comparison tests with hand-crafted histograms showing V1's histogram inflation vs V2's TopN-only counts, including full histogram merge
- Accuracy test using BuildHistAndTopN (the real ANALYZE pipeline) to build partition stats from known data, then comparing estimated counts against true counts for both V1 and V2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
V1 accumulates NotNullCount/NDV uniform estimates from every partition where a TopN value falls inside a histogram bucket. With 20 partitions this inflates val_rare from true=120 to 1920 (16x), causing V1 to pick the wrong value for the global TopN. V2 counts TopN entries only and correctly picks val_common (200, exact). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
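A rough model of that inflation, with hypothetical per-partition numbers chosen only to reproduce the 16x ratio quoted above (the real fixture values live in the tests): V1 adds a uniform estimate of NotNullCount/NDV for each partition whose histogram could contain the value, on top of the value's actual occurrences, while V2 reports the TopN sum alone.

```go
package main

import "fmt"

// v1Estimate models V1's accumulation: the value's true count plus one
// uniform histogram estimate (notNullCount/NDV) per partition.
// All inputs below are assumed, not taken from the test fixture.
func v1Estimate(trueCount float64, partitions int, notNullCount, ndv float64) float64 {
	return trueCount + float64(partitions)*(notNullCount/ndv)
}

func main() {
	// 20 partitions, each contributing a uniform estimate of 9000/100 = 90.
	est := v1Estimate(120, 20, 9000, 100) // 120 + 20*90 = 1920
	fmt.Printf("V1 estimate %.0f vs true 120 (%.0fx inflation)\n", est, est/120)
	// V2 sums only the TopN entries, so it reports the true 120.
}
```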
…d histogram sizes

All four V1/V2 comparison tests now use consistent, realistic stats:
- 2 TopN entries per partition (every partition has the same count)
- 4 histogram buckets per partition
- globalTopN=2

Updated tests:
- TestMergeTopNV1V2ComparisonHistValueNotAtUpperBound
- TestMergeTopNV1V2ComparisonHistValueAtUpperBound
- TestMergeTopNV2BetterThanV1ManyPartitionsInflation (renamed)
- TestMergeTopNV1BetterThanV2SpreadValue

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verify that MergePartTopN2GlobalTopNV2 never loses or creates counts: globalTopN.TotalCount() + sum(leftTopN counts) == sum(input TopN counts). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
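The invariant being tested can be written as a small predicate. The names here are illustrative, assuming the merge splits input entries into a kept global TopN and a leftover list, as described in the commit messages:

```go
package main

import "fmt"

// topNMeta is an illustrative stand-in for a TopN entry.
type topNMeta struct {
	encoded string
	count   uint64
}

func sumCounts(metas []topNMeta) uint64 {
	var s uint64
	for _, m := range metas {
		s += m.count
	}
	return s
}

// checkPreserved expresses the invariant: no counts are lost or created by
// the merge. globalTopN holds the kept entries, leftTopN the entries pushed
// down to the histogram merge.
func checkPreserved(inputs [][]topNMeta, globalTopN, leftTopN []topNMeta) bool {
	var inputTotal uint64
	for _, in := range inputs {
		inputTotal += sumCounts(in)
	}
	return sumCounts(globalTopN)+sumCounts(leftTopN) == inputTotal
}

func main() {
	inputs := [][]topNMeta{{{"a", 100}, {"b", 60}}, {{"a", 90}}}
	// a=190 kept globally, b=60 left over: 190+60 == 100+60+90.
	fmt.Println(checkPreserved(inputs, []topNMeta{{"a", 190}}, []topNMeta{{"b", 60}}))
}
```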
Keep V1 as the default TopN merge strategy. Users can opt in to V2 by setting tidb_merge_partition_stats_concurrency = 0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…und repeats

V2 (MergePartTopN2GlobalTopNV2) merges TopN purely from TopN lists, avoiding V1's O(P^2) histogram lookups. However, V2 underestimates global TopN counts for values that appear in some partitions' TopN and at other partitions' histogram bucket upper bounds (the "spread value" scenario).

Add MergePartTopNAndHistToGlobal which combines V2's TopN-only merge with extraction of significant Repeat counts from histogram bucket upper bounds during bucket collection. This catches the spread value scenario without V1's inflation problem, with zero extra passes.

Key design:
- Extract mergeHistBucketsToGlobalHist helper from MergePartitionHist2GlobalHist to share the ~200-line bucket merge core between the old and new paths
- During histogram bucket collection, move each bucket's upper-bound Repeat into the TopN counter and zero it in the bucket
- Skip recalculate-repeats in the shared helper (pass nil) since repeats are already handled during extraction
- Wire into callers: mergeConcurrency==0 uses the new combined merge, otherwise falls through to the existing V1/concurrent flow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
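The repeat-extraction step described above can be sketched as follows. `bucket` and `extractRepeats` are illustrative stand-ins, not the real TiDB types: while buckets are collected, each upper bound's Repeat is added to the shared TopN counter and zeroed in the bucket, so a spread value gets an exact count with no extra pass over the histograms.

```go
package main

import "fmt"

// bucket is an illustrative stand-in for a histogram bucket: count rows in
// total, repeat of them equal to the bucket's upper-bound value.
type bucket struct {
	upper  string
	count  uint64
	repeat uint64
}

// extractRepeats moves each bucket's upper-bound repeat into the TopN
// counter and zeroes it in the bucket, so a value seen in some partitions'
// TopN and at other partitions' bucket upper bounds is counted exactly.
func extractRepeats(buckets []bucket, counter map[string]uint64) []bucket {
	out := make([]bucket, len(buckets))
	for i, b := range buckets {
		counter[b.upper] += b.repeat // exact occurrences at the upper bound
		b.repeat = 0                 // repeats are now accounted for on the TopN side
		out[i] = b
	}
	return out
}

func main() {
	// v1 already appeared in another partition's TopN with count 50.
	counter := map[string]uint64{"v1": 50}
	got := extractRepeats([]bucket{{"v1", 200, 40}, {"v2", 100, 10}}, counter)
	fmt.Println(counter["v1"], counter["v2"], got[0].repeat) // 90 10 0
}
```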
…d hybrid merge Add BenchmarkFullPipeline with sub-benchmarks for all three merge approaches (V1, Concurrent, Hybrid), each running the complete TopN + histogram merge pipeline. This enables direct comparison since the existing TopN-only benchmarks excluded the histogram merge step. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s load timeout The test was intermittently failing in CI because show stats_topn reads from the in-memory stats cache, which uses lazy loading with a default sync wait of only 100ms. Under CI load this timeout can expire before all column/index stats are loaded, producing incorrect results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix typo "since since both a and be is" → "since both a and b are"
- Handle codec.EncodeKey errors with require.NoError instead of ignoring
- Use session scope for tidb_merge_partition_stats_concurrency in tests instead of global scope to avoid potential cross-test interference
- Merge testIssues24349Part1/Part2/Part2V2 back into single testIssues24349 since V1 and hybrid produce identical results
- Improve comments with accurate merge explanation and annotated output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The V2 TopN-only merge function had no production callers — only tests and benchmarks. The production path uses MergePartTopNAndHistToGlobal (hybrid, concurrency=0) or MergePartTopN2GlobalTopN (V1, concurrency>=1). Remove the function, its dedicated tests, V2 sections from V1-vs-V2 comparison tests, and its benchmark. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…st helper Subtract extracted Repeat counts from totCount when moving histogram upper-bound Repeats into the TopN counter. Without this, totCount was over-estimated by the sum of extracted Repeats that ended up in globalTopN, causing slightly suboptimal histogram bucket splitting. Also remove unused findTopNMetaCountByEncoded test helper left behind after V2 test removal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
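The totCount fix above amounts to a single subtraction; the sketch below uses assumed numbers purely for illustration. Rows whose extracted Repeat was promoted into the global TopN must be removed from the row total that drives histogram bucket splitting, or they are counted twice.

```go
package main

import "fmt"

// adjustTotCount models the fix: subtract the extracted Repeats that ended
// up in the global TopN from the total row count used for bucket splitting.
// Numbers in main are illustrative only.
func adjustTotCount(totCount, extractedRepeatsInGlobalTopN int64) int64 {
	return totCount - extractedRepeatsInGlobalTopN
}

func main() {
	// 10000 rows across partition histograms, 1500 of them now carried by
	// the global TopN instead of the histogram.
	fmt.Println(adjustTotCount(10000, 1500)) // 8500 rows remain for the histogram
}
```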
Replace version-based naming (V1, V2, Hybrid) with descriptive terms:
- "Separate" for the two-phase TopN-then-histogram merge flow
- "Combined" for the single-pass TopN+histogram merge

This avoids confusion since "V2" was already used for the concurrent merge path and for analyze version 2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pdate Replace the unreliable explain + tidb_stats_load_sync_wait approach with dom.StatsHandle().Update() to force a synchronous full stats cache refresh from storage after analyze. This is the standard pattern used by other stats tests and avoids CI timeouts with large datasets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pantheon AI review complete: 0 issues found.
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/statistics/handle/globalstats/global_stats_async.go (1)
512-520: ⚠️ Potential issue | 🟠 Major: Use the outer `ANALYZE` session's killer here.

Line 513 reads `SQLKiller` from the pooled stats session created by `util.CallWithSCtx(...)`, so canceling the user's `ANALYZE` will not interrupt `MergePartTopNAndHistToGlobal`. Capture the caller session's killer next to `stmtCtx`/`analyzeVersion` and thread that down instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/statistics/handle/globalstats/global_stats_async.go` around lines 512 - 520, the pooled stats session's SQLKiller (sctx.GetSessionVars().SQLKiller) is being used inside MergePartTopNAndHistToGlobal, so cancellations of the outer ANALYZE won't propagate; capture the caller session's killer (the outer session's SQLKiller) alongside stmtCtx and analyzeVersion before calling util.CallWithSCtx (or pass it into the closure), then replace the pooled sctx.GetSessionVars().SQLKiller reference by that captured killer when invoking statistics.MergePartTopNAndHistToGlobal (identify the call site using MergePartTopNAndHistToGlobal, wrapper.AllTopN, wrapper.AllHg, stmtCtx, analyzeVersion to locate the code).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/statistics/handle/globalstats/global_stats_test.go`:
- Around line 970-989: Pin the session to deterministic global-stats and pruning
behavior by adding tk.MustExec calls before creating the table and running
ANALYZE to set the session variables that force V2 global stats and dynamic
partition pruning and a fixed bucket/topn layout (for example, set
tidb_enable_global_stats=1, tidb_analyze_version=2,
tidb_partition_prune_mode='dynamic' and explicit stats bucket/topn values) so
the assertions about partition_name='global' and bucket layout in this test
(which uses tk.MustExec, the CREATE TABLE and ANALYZE statements) are stable and
do not depend on cluster-wide defaults.
📒 Files selected for processing (10)
- pkg/executor/analyze_test.go
- pkg/sessionctx/vardef/tidb_vars.go
- pkg/sessionctx/variable/session.go
- pkg/sessionctx/variable/sysvar.go
- pkg/statistics/handle/globalstats/global_stats.go
- pkg/statistics/handle/globalstats/global_stats_async.go
- pkg/statistics/handle/globalstats/global_stats_test.go
- pkg/statistics/handle/globalstats/topn.go
- pkg/statistics/handle/util/pool.go
- pkg/statistics/handle/util/util.go
💤 Files with no reviewable changes (5)
- pkg/statistics/handle/util/util.go
- pkg/sessionctx/variable/session.go
- pkg/sessionctx/vardef/tidb_vars.go
- pkg/statistics/handle/globalstats/topn.go
- pkg/sessionctx/variable/sysvar.go
✅ Files skipped from review due to trivial changes (1)
- pkg/statistics/handle/util/pool.go
```go
tk := testkit.NewTestKit(t, store)
tk.MustExec("use test")
tk.MustExec("drop table if exists t")
tk.MustExec(` create table t (
	a int primary key auto_increment,
	b int not null default 1,
	c int,
	d varchar(255) not null default '',
	e varchar(255),
	key idx_ab(a,b),
	key idx_be(b,e),
	unique key uidx_cd(c,d) global,
	key idx_d(d),
	unique key uidx_e(e) global,
	key idx_ec(e,c)
) partition by hash (a) partitions 7`)
tk.MustExec(`insert into t (a) values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)`)
// increase by 10 ^ 5 rows
tk.MustExec(`insert into t (a) select null from t, t t2, t t3, t t4, t t5`)
tk.MustExec(`analyze table t with 1 topn, 3 buckets`)
```
Pin the session settings this regression test depends on.
These assertions only hold for V2 global stats under dynamic pruning, but the test currently relies on whichever defaults happen to be active. If either default changes, partition_name='global' can disappear or the bucket layout can shift for reasons unrelated to this PR.
Suggested fix

```diff
 func TestGlobalStatsMergeCombined(t *testing.T) {
 	store, dom := testkit.CreateMockStoreAndDomain(t)
 	tk := testkit.NewTestKit(t, store)
 	tk.MustExec("use test")
+	tk.MustExec("set @@tidb_partition_prune_mode='dynamic'")
+	tk.MustExec("set @@tidb_analyze_version=2")
 	tk.MustExec("drop table if exists t")
```

As per coding guidelines, "Keep test changes minimal and deterministic; avoid broad golden/testdata churn unless required."
Remove MergePartTopN2GlobalTopN, MergeGlobalStatsTopNByConcurrency, the topnStatsMergeWorker infrastructure, and MaxPartitionMergeBatchSize. These were only used by the old separate TopN-then-histogram merge flow which was replaced by MergePartTopNAndHistToGlobal in the previous commit. Their tests and benchmarks are also removed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove MergePartitionHist2GlobalHist, BinarySearchRemoveVal, and DatumMapCache which are no longer used after switching to MergePartTopNAndHistToGlobal as the sole merge strategy. Also remove the recalcRepeatHists parameter from mergeHistBucketsToGlobalHist since it is always nil now. Add partition size 100000 to BenchmarkMergePartTopNAndHistToGlobal so it covers the same range as the deleted BenchmarkMergePartitionHist2GlobalHist for comparable results. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ranch comparison Rename the combined merge benchmark so it matches the same-named benchmark on master (which runs the old two-step pipeline), allowing direct benchstat comparison across branches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
♻️ Duplicate comments (2)
pkg/statistics/histogram.go (2)
1587-1597: ⚠️ Potential issue | 🔴 Critical: Skip nil partition TopN entries before calling `TotalCount()`.

`topNs` can legitimately contain `nil` for histogram-only partitions (for example `ANALYZE ... WITH 0 TOPN`). `topN.TotalCount()` will panic before the merge reaches the histogram path.

Suggested fix

```diff
 	for _, topN := range topNs {
 		if err := killer.HandleSignal(); err != nil {
 			return nil, nil, err
 		}
-		if topN.TotalCount() == 0 {
+		if topN == nil || topN.TotalCount() == 0 {
 			continue
 		}
 		for _, val := range topN.TopN {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/statistics/histogram.go` around lines 1587 - 1597, the loop over topNs must skip nil entries before calling topN.TotalCount() to avoid panics for histogram-only partitions; update the loop in the Merge/aggregation routine that iterates topNs (where killer.HandleSignal() is called and topN.TotalCount() is used) to first check if topN == nil and continue, and likewise guard the inner loop that accesses topN.TopN and uses hack.String(val.Encoded) so it never dereferences a nil topN.
1600-1684: ⚠️ Potential issue | 🟠 Major: Keep honoring `SQLKiller` after phase 1.

After the initial TopN loop there are no more cancellation points. On large partitioned tables, a killed `ANALYZE` can stay stuck through bucket extraction, sorting, and `mergeHistBucketsToGlobalHist`.

Suggested direction

```diff
 func MergePartTopNAndHistToGlobal(
 	...
 	killer *sqlkiller.SQLKiller,
 	...
 ) (*TopN, *Histogram, error) {
 	...
 	for _, hist := range hists {
+		if err := killer.HandleSignal(); err != nil {
+			return nil, nil, err
+		}
 		for _, b := range hist.buildBucket4Merging() {
+			if err := killer.HandleSignal(); err != nil {
+				return nil, nil, err
+			}
 			...
 		}
 	}
 	...
 	globalHist, err := mergeHistBucketsToGlobalHist(
-		sc, buckets, totCount, totNull, totColSize, expBucketNumber,
+		sc, killer, buckets, totCount, totNull, totColSize, expBucketNumber,
 		...
 	)
 }

 func mergeHistBucketsToGlobalHist(
 	sc *stmtctx.StatementContext,
+	killer *sqlkiller.SQLKiller,
 	buckets []*bucket4Merging,
 	...
 ) (*Histogram, error) {
 	...
 	for i := len(buckets) - 1; i >= 0; i-- {
+		if err := killer.HandleSignal(); err != nil {
+			return nil, err
+		}
 		...
 	}
```

Also applies to: 1692-1701

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/statistics/histogram.go` around lines 1600 - 1684, the loop-heavy phases after phase 1 (especially the inner loop over hist.buildBucket4Merging(), the post-TopN leftover handling before sortBucketsByUpperBound, and just before mergeHistBucketsToGlobalHist) do not check SQL cancellation; add periodic SQLKiller checks (using the session context's cancellation API / SQLKiller) at the end of the bucket-extraction loop (inside the for _, b := range hist.buildBucket4Merging() loop), after adding leftover TopN entries (the loop over leftTopN/topNMetaToDatum), and immediately before calling sortBucketsByUpperBound and before mergeHistBucketsToGlobalHist; if the context is cancelled return early with the canonical cancellation error to avoid long-blocking operations in functions like sortBucketsByUpperBound and mergeHistBucketsToGlobalHist.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/statistics/handle/globalstats/topn_test.go`:
- Around line 278-279: The test currently uses require.InDelta(t, inputTotal,
outputTotal, inputTotal*0.01, ...) which allows a 1% tolerance; change this to
an exact equality assertion so integer-count preservation can't be masked
(replace the InDelta call with require.Equal or require.EqualValues comparing
inputTotal and outputTotal and keep the same failure message). Update the
assertion in topn_test.go (the require.InDelta call referencing inputTotal and
outputTotal) to use an exact equality assertion so the totals must match
exactly.
📒 Files selected for processing (9)
- pkg/statistics/cmsketch_util.go
- pkg/statistics/handle/globalstats/global_stats.go
- pkg/statistics/handle/globalstats/merge_worker.go
- pkg/statistics/handle/globalstats/topn.go
- pkg/statistics/handle/globalstats/topn_bench_test.go
- pkg/statistics/handle/globalstats/topn_test.go
- pkg/statistics/histogram.go
- pkg/statistics/histogram_bench_test.go
- pkg/statistics/histogram_test.go
💤 Files with no reviewable changes (5)
- pkg/statistics/histogram_test.go
- pkg/statistics/cmsketch_util.go
- pkg/statistics/handle/globalstats/merge_worker.go
- pkg/statistics/histogram_bench_test.go
- pkg/statistics/handle/globalstats/topn.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/statistics/handle/globalstats/topn_bench_test.go
```go
require.InDelta(t, inputTotal, outputTotal, inputTotal*0.01,
	"total count should be preserved: input=%.0f, output=%.0f", inputTotal, outputTotal)
```
Assert count preservation exactly here.
This fixture only uses exact integer counts, so a 1% delta lets real loss/duplication bugs slip through. The merge invariant should be checked with exact equality.
Suggested fix

```diff
-	require.InDelta(t, inputTotal, outputTotal, inputTotal*0.01,
-		"total count should be preserved: input=%.0f, output=%.0f", inputTotal, outputTotal)
+	require.Equal(t, inputTotal, outputTotal,
+		"total count should be preserved exactly")
```
… sysvar Re-add the variable as a no-op that always reads as 1. Setting it to any other value issues a deprecation warning and is silently ignored. This preserves backward compatibility for existing configs/scripts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/sessionctx/variable/sysvar.go`:
- Around line 3084-3094: The TiDBMergePartitionStatsConcurrency sysvar currently
only has a Validation hook, which doesn't override persisted reads; add
GetGlobal and GetSession hooks on the vardef for
TiDBMergePartitionStatsConcurrency that unconditionally return the normalized
value "1" (and nil error) so reads via session.GetGlobalSysVar()/GetSession
always return "1" regardless of stored mysql.global_variables; keep the existing
Validation and SetSession as-is but ensure both GetGlobal and GetSession are
added to the same vardef entry to enforce the "always reads 1" contract on
upgraded clusters.
📒 Files selected for processing (4)
- pkg/sessionctx/vardef/tidb_vars.go
- pkg/sessionctx/variable/sysvar.go
- pkg/statistics/BUILD.bazel
- pkg/statistics/handle/globalstats/BUILD.bazel
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/statistics/BUILD.bazel
- pkg/sessionctx/vardef/tidb_vars.go
- pkg/statistics/handle/globalstats/BUILD.bazel
```go
Scope: vardef.ScopeGlobal | vardef.ScopeSession, Name: vardef.TiDBMergePartitionStatsConcurrency, Value: "1", Type: vardef.TypeInt, MinValue: 1, MaxValue: vardef.MaxConfigurableConcurrency,
SetSession: func(_ *SessionVars, _ string) error {
	// Deprecated: do nothing.
	return nil
},
Validation: func(vars *SessionVars, normalizedValue string, _ string, _ vardef.ScopeFlag) (string, error) {
	if normalizedValue != "1" {
		vars.StmtCtx.AppendWarning(ErrWarnDeprecatedSyntaxSimpleMsg.FastGen("tidb_merge_partition_stats_concurrency is deprecated and will be removed in a future release. It no longer affects behavior."))
	}
	return "1", nil
},
```
Add GetGlobal and GetSession hooks to enforce the "always reads 1" contract on upgraded clusters.
The Validation callback only normalizes future SET statements. When reading persisted global values via session.GetGlobalSysVar() (the fallback path when no GetGlobal hook is defined), only ValidateFromType() is applied, which skips the custom Validation() callback entirely. A cluster upgraded from an older version that had a non-1 value persisted in mysql.global_variables will return that old value instead of 1, breaking the new invariant that this deprecated variable always reads as 1.
Add GetGlobal and GetSession hooks that both unconditionally return "1" to ensure the sysvar always reads as 1 on all clusters, regardless of persisted state.
Suggested fix
{
Scope: vardef.ScopeGlobal | vardef.ScopeSession, Name: vardef.TiDBMergePartitionStatsConcurrency, Value: "1", Type: vardef.TypeInt, MinValue: 1, MaxValue: vardef.MaxConfigurableConcurrency,
SetSession: func(_ *SessionVars, _ string) error {
// Deprecated: do nothing.
return nil
},
+ GetSession: func(*SessionVars) (string, error) {
+ return "1", nil
+ },
+ GetGlobal: func(context.Context, *SessionVars) (string, error) {
+ return "1", nil
+ },
Validation: func(vars *SessionVars, normalizedValue string, _ string, _ vardef.ScopeFlag) (string, error) {
if normalizedValue != "1" {
vars.StmtCtx.AppendWarning(ErrWarnDeprecatedSyntaxSimpleMsg.FastGen("tidb_merge_partition_stats_concurrency is deprecated and will be removed in a future release. It no longer affects behavior."))
}
return "1", nil
},
},
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Scope: vardef.ScopeGlobal | vardef.ScopeSession, Name: vardef.TiDBMergePartitionStatsConcurrency, Value: "1", Type: vardef.TypeInt, MinValue: 1, MaxValue: vardef.MaxConfigurableConcurrency,
SetSession: func(_ *SessionVars, _ string) error {
	// Deprecated: do nothing.
	return nil
},
GetSession: func(*SessionVars) (string, error) {
	return "1", nil
},
GetGlobal: func(context.Context, *SessionVars) (string, error) {
	return "1", nil
},
Validation: func(vars *SessionVars, normalizedValue string, _ string, _ vardef.ScopeFlag) (string, error) {
	if normalizedValue != "1" {
		vars.StmtCtx.AppendWarning(ErrWarnDeprecatedSyntaxSimpleMsg.FastGen("tidb_merge_partition_stats_concurrency is deprecated and will be removed in a future release. It no longer affects behavior."))
	}
	return "1", nil
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/sessionctx/variable/sysvar.go` around lines 3084-3094, the
TiDBMergePartitionStatsConcurrency sysvar currently only has a Validation hook,
which doesn't override persisted reads. Add GetGlobal and GetSession hooks on
the vardef for TiDBMergePartitionStatsConcurrency that unconditionally return
the normalized value "1" (and a nil error), so reads via
session.GetGlobalSysVar()/GetSession always return "1" regardless of what is
stored in mysql.global_variables. Keep the existing Validation and SetSession
as-is, but ensure both GetGlobal and GetSession are added to the same vardef
entry to enforce the "always reads 1" contract on upgraded clusters.
…tern Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cation warning Verify that setting to 1 is silent, and setting to any other value issues a deprecation warning while the value stays at 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop empty buckets immediately after extracting their Repeat count in Phase 2 instead of carrying them through Phases 3-4 and filtering in a separate compaction pass. This reduces transient memory and eliminates an O(n) sweep. Also update test expectations for Repeat values in global stats buckets and clean up minor test issues. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
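The Phase 2 behavior described above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: the `bucket` type and `extractTopNRepeats` name are simplified stand-ins, and string upper bounds stand in for encoded datums.

```go
package main

import "fmt"

// bucket is a simplified stand-in for a partition-level histogram bucket.
type bucket struct {
	upper  string // upper bound value
	count  int64  // rows in the bucket, including repeat
	repeat int64  // rows equal to the upper bound
}

// extractTopNRepeats subtracts each bucket's exact Repeat count when its
// upper bound is a global TopN value (those rows are already counted by the
// TopN) and drops buckets that become empty, all in the same pass, so empty
// buckets are never carried into later phases.
func extractTopNRepeats(bkts []bucket, topN map[string]bool) []bucket {
	out := bkts[:0]
	for _, b := range bkts {
		if topN[b.upper] {
			b.count -= b.repeat
			b.repeat = 0
		}
		if b.count > 0 {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	bkts := []bucket{{"a", 5, 5}, {"b", 10, 3}, {"c", 8, 2}}
	// "a" becomes empty after extraction and is dropped immediately.
	fmt.Println(extractTopNRepeats(bkts, map[string]bool{"a": true, "b": true}))
}
```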
…loop Combine the stats-counting loop and the bucket-building loop into one pass. Inline the per-bucket logic from buildBucket4Merging so we only allocate a bucket4Merging when the bucket will be kept (Count > 0 after Repeat extraction), avoiding temporary slice allocations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mbined Show all partition-level histogram buckets alongside the global ones to make the merge behavior visible and easier to reason about. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace mergeHistBucketsToGlobalHist (which handled overlapping wide buckets poorly, producing badly skewed global histograms) with mergeByUpperBound, which sorts all bucket4Merging entries by upper bound and accumulates counts in a single pass, cutting at equi-depth thresholds. This fixes two issues: - Bucket skew: the old merge produced 7/33346/66656 for 3 buckets over 100K rows; the new merge produces 33347/33341/33321. - Repeat=0: the new merge naturally preserves Repeat from leftover TopN point buckets that land at global bucket upper bounds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
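The sort-and-accumulate idea behind mergeByUpperBound can be sketched as below. This is a simplified model under stated assumptions, not the PR's implementation: integer upper bounds stand in for datums, groups sharing an upper bound are merged so a boundary value never straddles two global buckets, and a bucket is cut as soon as the cumulative count reaches the next equi-depth threshold (the closest-cut refinement from later commits is omitted).

```go
package main

import (
	"fmt"
	"sort"
)

// bucket is a simplified stand-in for bucket4Merging.
type bucket struct {
	upper  int64
	count  int64
	repeat int64
}

// mergeByUpperBound sorts all partition-level buckets by upper bound and
// accumulates counts in one pass, emitting a global bucket whenever the
// cumulative count reaches the next equi-depth threshold. Repeat of each
// global bucket comes from the group at its upper bound, so point buckets
// landing at a boundary naturally contribute a non-zero Repeat.
func mergeByUpperBound(parts [][]bucket, numBuckets int) []bucket {
	var all []bucket
	var total int64
	for _, p := range parts {
		for _, b := range p {
			all = append(all, b)
			total += b.count
		}
	}
	sort.Slice(all, func(i, j int) bool { return all[i].upper < all[j].upper })

	perBucket := float64(total) / float64(numBuckets)
	var out []bucket
	var cum, cur, lastRepeat int64
	i := 0
	for i < len(all) {
		// Group all entries sharing this upper bound.
		ub := all[i].upper
		var groupCount, groupRepeat int64
		for i < len(all) && all[i].upper == ub {
			groupCount += all[i].count
			groupRepeat += all[i].repeat
			i++
		}
		cum += groupCount
		cur += groupCount
		lastRepeat = groupRepeat
		threshold := float64(len(out)+1) * perBucket
		if float64(cum) >= threshold || i == len(all) {
			out = append(out, bucket{upper: ub, count: cur, repeat: lastRepeat})
			cur = 0
		}
	}
	return out
}

func main() {
	parts := [][]bucket{
		{{upper: 10, count: 30, repeat: 3}, {upper: 30, count: 40, repeat: 4}},
		{{upper: 20, count: 30, repeat: 2}, {upper: 30, count: 20, repeat: 5}},
	}
	for _, b := range mergeByUpperBound(parts, 2) {
		fmt.Printf("upper=%d count=%d repeat=%d\n", b.upper, b.count, b.repeat)
	}
}
```

With 120 total rows and 2 target buckets this yields two global buckets of 60 rows each, illustrating the balanced counts the commit message reports for the real merge.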
Set each non-first global bucket's lower bound to the previous bucket's upper bound instead of the minimum lower of contributing bucket4Merging entries. This prevents overlapping ranges that would break calcFraction interpolation in LessRowCount estimation. Also release bucket4Merging entries eagerly during the merge loop instead of accumulating them for a batch release at the end. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ckets When the cumulative count crosses the equi-depth threshold, compare the previous and current group's distance to the threshold and pick whichever is closer. This produces better-balanced buckets, especially for small datasets where the count jumps at each upper bound are large relative to the bucket size. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Nil out allTopN[i], allHg[i], and allCms[i] after merging each column's global stats so the GC can reclaim partition-level data before processing the next column. Extends the existing pattern used for FMSketch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The tz parameter was only used by the old separate merge path (mergeGlobalStatsTopN). Remove it from dealHistogramAndTopN, cpuWorker, and the MergePartitionStats2GlobalStats call site. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TestGlobalStatsMergePathConsistency verifies that: 1. Async and blocking merge paths produce identical global stats (TopN and histogram buckets) for a 53-partition hash table with 5 columns of diverse distributions (uniform, skewed, sparse, bimodal, string). 2. After ALTER TABLE REMOVE PARTITIONING, non-partitioned stats have matching TopN values with counts within 50% tolerance and matching total row counts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of setting each non-first global bucket's lower bound to the previous bucket's upper (which duplicates the boundary value), track the minimum lower bound of contributing bucket4Merging entries and clamp it to >= the previous cut's upper bound. This produces proper gaps between consecutive buckets (e.g. [2,33348] [33349,66689]) rather than shared boundaries ([2,33348] [33348,66689]). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
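The clamping rule above can be sketched in a few lines. This is an illustration only: `lowerBound` is a hypothetical helper, and the `+1` step is only meaningful for integer-valued columns, whereas the real code compares datums.

```go
package main

import "fmt"

// lowerBound picks the lower bound for a new global bucket: the minimum
// lower bound among the contributing partition buckets, clamped so it never
// falls below the previous global bucket's upper bound. Without the clamp,
// consecutive global buckets could overlap and break calcFraction
// interpolation in LessRowCount estimation.
func lowerBound(minContribLower, prevUpper int64) int64 {
	if minContribLower <= prevUpper {
		return prevUpper + 1
	}
	return minContribLower
}

func main() {
	// Contributing buckets start at 2 but the previous cut ended at 33348,
	// so the new bucket starts at 33349: [2,33348] [33349,...].
	fmt.Println(lowerBound(2, 33348))
}
```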
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous group should only be considered as a closer cut point if its cumulative count is below the current threshold. Otherwise it was already used as a prior bucket boundary, and picking it again would create duplicate or degenerate buckets. Fixes TestMergePartTopNAndHistToGlobalCountPreservation where 200 rows were lost because the second group was never emitted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
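Taken together, the two commit messages above describe a cut-selection heuristic that can be sketched as follows. `cutAtPrev` is a hypothetical helper name; the real code works on cumulative counts inside the merge loop.

```go
package main

import "fmt"

// cutAtPrev decides whether the equi-depth bucket boundary should fall at
// the previous group (cumulative count prevCum) or the current group
// (curCum), given that curCum has just crossed threshold. The previous
// group is only eligible while it is still below the threshold; otherwise
// it already served as an earlier boundary, and reusing it would create a
// duplicate or degenerate bucket.
func cutAtPrev(prevCum, curCum int64, threshold float64) bool {
	if float64(prevCum) >= threshold {
		return false
	}
	return threshold-float64(prevCum) < float64(curCum)-threshold
}

func main() {
	// Threshold 100: prev=90 is 10 away, cur=130 is 30 away -> cut at prev.
	fmt.Println(cutAtPrev(90, 130, 100))
	// prev=100 already reached the threshold -> must cut at current group.
	fmt.Println(cutAtPrev(100, 130, 100))
}
```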
|
@mjonss: The following tests failed, say
Full PR test history. Your PR dashboard.
Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What problem does this PR solve?
Issue Number: ref #66220
Problem Summary:
The default global stats merge uses MergePartTopN2GlobalTopN, which performs O(P²) cross-partition histogram lookups per TopN value via EqualRowCount + BinarySearchRemoveVal. This is slow for tables with many partitions, and it inflates TopN counts for values that fall inside histogram buckets but are not at upper bounds, producing less accurate global statistics.
What changed and how does it work?
MergePartTopNAndHistToGlobal, a new combined TopN+histogrammerge algorithm that sums TopN counts across partitions and extracts
exact Repeat counts from histogram upper bounds in a single linear
pass, avoiding the O(P²) cross-partition lookups
global stats paths
tidb_merge_partition_stats_concurrency: the variable iskept for backward compatibility but always reads as 1; setting it to
any other value issues a deprecation warning and is silently ignored
MergePartTopN2GlobalTopN,MergeGlobalStatsTopNByConcurrency,MergePartitionHist2GlobalHist,BinarySearchRemoveVal,DatumMapCache, the topN merge workerinfrastructure, and
MaxPartitionMergeBatchSizemergeHistBucketsToGlobalHistby removing the unusedrecalcRepeatHistsparameterBenchmarkGlobalStatsMergefor cross-branch performancecomparison
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
Deprecation
Refactor