pd: report hot read cpu in heartbeat by lhy1024 · Pull Request #10178 · tikv/pd

lhy1024 · 2026-01-21T09:33:24Z

What problem does this PR solve?

Issue Number: Close #5718

What is changed and how does it work?

Simple description

This PR introduces CPU as a new dimension for the hot scheduler; it only serves the hot read scheduler.

From the store heartbeat's cpu_usages, we can get the unified read pool CPU for scheduling. Read priorities become cpu→byte when supported, otherwise fall back to query→byte (or byte→key if query isn't supported).
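The fallback chain described above can be sketched as a small selection function. This is only an illustration: `readPriorities` and its two boolean parameters are hypothetical names, not identifiers from the PR, and the real scheduler works with its own dimension constants rather than strings.

```go
package main

import "fmt"

// readPriorities is a hypothetical helper that returns the ordered
// dimensions for hot-read scheduling, following the fallback chain
// in the PR description: cpu→byte, else query→byte, else byte→key.
func readPriorities(cpuSupported, querySupported bool) []string {
	switch {
	case cpuSupported:
		return []string{"cpu", "byte"}
	case querySupported:
		return []string{"query", "byte"}
	default:
		return []string{"byte", "key"}
	}
}

func main() {
	fmt.Println(readPriorities(true, true))   // CPU reported by the cluster
	fmt.Println(readPriorities(false, true))  // no CPU, query stats available
	fmt.Println(readPriorities(false, false)) // neither available
}
```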

Check List

Tests

Unit test
Integration test

Release note

None.

Summary by CodeRabbit

New Features
- CPU-based metrics for hot-region scheduling; API now exposes per-store CPU read rates (cpu-read-rate)
- Scheduler config adds CPU thresholds and tuning (min-hot-cpu-rate, cpu-rate-rank-step-ratio)
Enhancements
- Hot-region history and statistics now include CPU flow, per-store and total CPU rates
- Grafana dashboard updated with a "Store read cpu" panel
Chores
- Version gating for CPU support (cluster version >= 8.5.7)

Signed-off-by: lhy1024 <admin@liudos.us>

ti-chi-bot · 2026-01-21T09:33:27Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-01-21T09:33:35Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Walkthrough

Adds a CPU/read-CPU dimension throughout hot-region telemetry and scheduling: new CPU stats propagation, metrics, thresholds, scheduler config and logic changes, version gating for CPU support, storage/API payload updates, and corresponding tests and dashboard/CLI adjustments.

Changes

Cohort / File(s)	Summary
Core `pkg/core/factory.go`, `pkg/core/region.go`	Add `CPUStatsFactory` and `RegionInfo.cpuStats`; populate from heartbeats; clone CPUStats; extend `GetLoads`/`GetWriteLoads` to include CPU elements.
CPU metric helpers & constants `pkg/statistics/cpu.go`, `pkg/statistics/cpu_test.go`, `pkg/statistics/utils/constant.go`, `pkg/statistics/utils/kind.go`, `pkg/statistics/utils/kind_test.go`	Introduce Store/Region CPU usage helpers, add CPU dims/priorities (`CPUDim`, `CPUPriority`, `RegionReadCPU`, `RegionWriteCPU`, `StoreReadCPU`), and add min CPU thresholds.
Hot-peer stats & cache `pkg/statistics/hot_peer.go`, `pkg/statistics/hot_peer_cache.go`, `pkg/statistics/hot_peer_cache_test.go`, `pkg/statistics/hot_peer_test.go`, `pkg/statistics/hot_cache_test.go`	Make HotPeerStat safe for out-of-range/nil rolling loads, expand rollingLoads length to `DimLen`, add CPU threshold metrics, and tests for CPU dimension behavior.
Store stats & collection `pkg/statistics/store.go`, `pkg/statistics/store_collection.go`, `pkg/statistics/store_collection_test.go`	Add CPU moving-average windows, observe/set/get store read CPU, expose CPU gauges and include them in resets and tests.
Store hot-peers & load prediction `pkg/statistics/store_load.go`, `pkg/statistics/store_hot_peers_infos.go`, `pkg/statistics/hot_regions_stat.go`	Propagate CPU into load predictions and HotPeersStat/HotPeerStatShow (`StoreCPURate`, `TotalCPURate`, `CPURate`).
Scheduler config and solver `pkg/schedule/schedulers/hot_region_config.go`, `pkg/schedule/schedulers/hot_region_config_test.go`, `pkg/schedule/schedulers/hot_region_solver.go`, `pkg/schedule/schedulers/hot_region_solver_test.go`, `pkg/schedule/schedulers/hot_region_test.go`	Add `MinHotCPURate`, `CPURateRankStepRatio`, CPU-aware priority handling, CPU support tracking (`lastCPUSupported`), validation rules, fallback logic, and tests for CPU fallbacks and behavior.
Scheduler runtime & metrics `pkg/schedule/schedulers/hot_region.go`, `pkg/schedule/schedulers/metrics.go`, `pkg/schedule/coordinator.go`, `pkg/schedule/handler/handler.go`	Load CPU config in scheduler, add CPU rank-step/step mapping, skip counter for CPU uniform-store, expose per-store CPU stats in HotStoreStats and labeled total CPU metrics.
Heartbeat handling (server & mcs) `server/cluster/cluster.go`, `pkg/mcs/scheduling/server/cluster.go`	Extract per-region read-CPU from PeerStat and include `RegionReadCPU` (scaled by interval) in the loads vector passed to hot-peer checks.
Statistics collector `pkg/statistics/collector.go`	Populate CPU dimension in `tikvCollector.getLoads` for read path from store read-CPU.
Storage, API, CLI, dashboard, tests `pkg/storage/hot_region_storage.go`, `server/handler.go`, `client/http/types.go`, `tools/pd-ctl/...`, `metrics/grafana/pd.json`, various tests under `tests/` and `server/`	Add `FlowCPU` to persisted/returned hot-region records, update API/CLI/dashboard defaults and expectations, and update tests to include CPU dimension and new scheduler config fields.
Version gating `pkg/versioninfo/versioninfo.go`, `pkg/versioninfo/versioninfo_test.go`	Add `IsHotScheduleWithCPUSupported` and min-version constants for CPU scheduling support (8.5.7, 9.0.0-beta.1+) with tests.
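As a rough sketch of what the version gate in the last row might look like: the `version` type and `cpuSchedulingSupported` function below are hypothetical stand-ins, not the PR's `IsHotScheduleWithCPUSupported`. Two assumptions are made that the summary does not confirm: pre-release tags such as `-beta.1` are ignored, and only the 8.5 line (from 8.5.7) and 9.0.0 onward are treated as supported.

```go
package main

import "fmt"

// version is a simplified stand-in for a semantic version.
// Pre-release tags are ignored in this sketch (assumption).
type version struct{ Major, Minor, Patch int }

// atLeast reports whether v >= o, comparing major, minor, patch in order.
func (v version) atLeast(o version) bool {
	if v.Major != o.Major {
		return v.Major > o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor > o.Minor
	}
	return v.Patch >= o.Patch
}

// cpuSchedulingSupported is a hypothetical gate: the 8.5 release line
// qualifies from 8.5.7, and everything from 9.0.0 onward qualifies.
// How other version lines are handled is an assumption of this sketch.
func cpuSchedulingSupported(v version) bool {
	if v.Major == 8 && v.Minor == 5 {
		return v.atLeast(version{8, 5, 7})
	}
	return v.atLeast(version{9, 0, 0})
}

func main() {
	for _, v := range []version{{8, 5, 6}, {8, 5, 7}, {9, 0, 0}} {
		fmt.Printf("%d.%d.%d supported=%v\n", v.Major, v.Minor, v.Patch, cpuSchedulingSupported(v))
	}
}
```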

Sequence Diagram

sequenceDiagram
    participant Store as Store (TiKV)
    participant HB as Heartbeat Handler
    participant Cache as HotPeerCache
    participant Stats as Statistics Collector
    participant Scheduler as Hot Region Scheduler
    participant Version as VersionInfo

    Store->>HB: send PeerStat (includes CpuStats)
    HB->>Stats: extract Store/Region CPU usages
    Stats->>Cache: build loads[] (includes RegionReadCPU)
    Cache->>Cache: update HotPeerStat rolling loads (include CPU)
    Scheduler->>Version: IsHotScheduleWithCPUSupported(clusterVersion)
    Version-->>Scheduler: cpuSupport = true/false
    Scheduler->>Cache: request hot peers (with priorities considering cpuSupport)
    Cache-->>Scheduler: return ranked hot peers (with CPURate)
    Scheduler->>Store: emit balancing decisions

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

core: region heartbeat with bucket meta #10231: Modifies pkg/core/region.go to extend RegionInfo with additional per-region telemetry; conceptually related to adding cpuStats/CPUStatsFactory.

Suggested reviewers

okJiang
bufferflies
rleungx

Poem

🐰 I hopped through heartbeats, stats, and queues,
I carried CPU counts in tiny news,
From heartbeats to scheduler, I leapt with glee,
Now hot regions dance in CPU harmony,
Hooray for balanced loads — carrot cake for me! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 31.91% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title 'pd: report hot read cpu in heartbeat' directly and concisely summarizes the main change—adding CPU reporting to heartbeats for the hot scheduler.
Description check	✅ Passed	The PR description includes a linked issue (Close `#5718`), explains the problem being solved, describes how it works, and specifies that unit/integration tests are included.
Linked Issues check	✅ Passed	The PR fully implements the objectives from `#5718`: introduces CPU as a new hot scheduler dimension, reports read CPU in heartbeats, incorporates it into store load and hot-peer statistics, uses rolling-window aggregation, and updates prioritization to prefer CPU when supported.
Out of Scope Changes check	✅ Passed	All changes are directly related to the stated objective of adding CPU as a new dimension for the hot scheduler; no unrelated modifications are present.

Signed-off-by: lhy1024 <admin@liudos.us>

codecov · 2026-02-10T12:21:18Z

Codecov Report

❌ Patch coverage is 85.99034% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.90%. Comparing base (d7b6380) to head (0289764).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10178      +/-   ##
==========================================
- Coverage   78.98%   78.90%   -0.08%     
==========================================
  Files         530      532       +2     
  Lines       71521    71802     +281     
==========================================
+ Hits        56488    56658     +170     
- Misses      11024    11109      +85     
- Partials     4009     4035      +26

Flag	Coverage Δ
unittests	`78.90% <85.99%> (-0.08%)`	⬇️

Signed-off-by: lhy1024 <admin@liudos.us>

okJiang · 2026-02-12T08:06:03Z

please link an issue and add some descriptions

rleungx · 2026-02-14T07:04:46Z

pkg/mcs/scheduling/server/cluster.go

+	storeReadQuery := core.GetReadQueryNum(stats.QueryStats)
+	storeWriteQuery := core.GetWriteQueryNum(stats.QueryStats)
+	storeTotalQuery := storeReadQuery + storeWriteQuery
+	storeGRPCCPU := statistics.StoreGRPCCPUUsage(stats.GetCpuUsages())


Here we intentionally use gRPC CPU only. Unified-read CPU is already in peerStat.CpuStats.UnifiedRead, so using store read CPU here would double count.

rleungx · 2026-02-14T07:12:00Z

pkg/statistics/cpu.go

+		return unifiedReadCPU
+	}
+	grpcCPU := float64(StoreGRPCCPUUsage(cpuUsages))
+	return unifiedReadCPU + grpcCPU*float64(readQuery)/float64(totalQuery)


Is it accurate?

This is an approximation: unified-read CPU is read-only, while grpc-server CPU is shared by read/write requests, so we apportion gRPC CPU by readQuery/totalQuery.

Please add some comments.

Do we have any research data to support this conclusion? If so, could you please include it in the PR description or issue?

Yes, that is the key assumption here.

grpcCPU * readQuery / totalQuery is a first-order approximation, not an exact CPU attribution.

We use this because current heartbeat metrics provide unified-read CPU and shared grpc CPU, but not per-request CPU split inside grpc threads.

I’ll test it in a follow-up to validate the approximation quality under mixed workloads.
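The approximation discussed in this thread can be sketched as follows. The function name `approxStoreReadCPU` is hypothetical, and the zero-query guard is an assumption, since the quoted diff only shows the early `return unifiedReadCPU` without its condition.

```go
package main

import "fmt"

// approxStoreReadCPU attributes CPU to reads: unified-read CPU is counted
// in full, while the shared gRPC CPU is apportioned by the read share of
// queries. This is a first-order approximation, not exact attribution.
func approxStoreReadCPU(unifiedReadCPU, grpcCPU float64, readQuery, totalQuery uint64) float64 {
	// Assumed guard: with no queries there is nothing to apportion.
	if totalQuery == 0 {
		return unifiedReadCPU
	}
	return unifiedReadCPU + grpcCPU*float64(readQuery)/float64(totalQuery)
}

func main() {
	// 30 of 40 queries are reads, so 75% of gRPC CPU is attributed to reads:
	// 200 + 100*0.75 = 275.
	fmt.Println(approxStoreReadCPU(200, 100, 30, 40))
}
```

The sketch makes the key limitation visible: a write-heavy workload with cheap reads would still have gRPC CPU split purely by query count, which is why the thread asks for validation under mixed workloads.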

rleungx · 2026-02-14T07:13:00Z

pkg/statistics/hot_peer_cache.go

 	rollingWindowsSize = 5
+	// It is used to moving average CPU usage,
+	// and the window size is larger than other dimensions to make the CPU usage more stable.
+	cpuRollingWindowsSize = 9


A larger window will be more stable for cpu

Then, why not 10 or bigger? I think we need to add some comments about why we made this decision.

ok, I will add a test for cpuRollingWindowsSize = 11

In other words, why does the CPU require a more stable window size, while other dimensions do not?

It is an optimization in local test, I'll remove it temporarily, and if more testing confirms that it always works, I'll add it back in another PR.
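To illustrate the trade-off debated above, here is a minimal moving-average sketch comparing window sizes 5 and 9. The `movingAvg` type is written for this example only and is not PD's implementation; it simply shows that a larger window damps a single spike more strongly, at the cost of reacting more slowly.

```go
package main

import "fmt"

// movingAvg is a simple fixed-size moving average over a ring buffer.
type movingAvg struct {
	buf   []float64
	next  int     // index of the slot to overwrite next
	count int     // number of valid samples, at most len(buf)
	sum   float64 // running sum of the valid samples
}

func newMovingAvg(size int) *movingAvg {
	return &movingAvg{buf: make([]float64, size)}
}

// Add records a sample, evicting the oldest one once the window is full.
func (m *movingAvg) Add(v float64) {
	if m.count == len(m.buf) {
		m.sum -= m.buf[m.next]
	} else {
		m.count++
	}
	m.buf[m.next] = v
	m.sum += v
	m.next = (m.next + 1) % len(m.buf)
}

// Get returns the average of the samples currently in the window.
func (m *movingAvg) Get() float64 {
	if m.count == 0 {
		return 0
	}
	return m.sum / float64(m.count)
}

func main() {
	short, long := newMovingAvg(5), newMovingAvg(9)
	// Eight steady samples followed by one spike.
	for _, s := range []float64{10, 10, 10, 10, 10, 10, 10, 10, 100} {
		short.Add(s)
		long.Add(s)
	}
	fmt.Println(short.Get(), long.Get()) // window 5 vs. window 9
}
```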

rleungx · 2026-02-14T07:13:44Z

pkg/versioninfo/versioninfo.go

+)
+
+// IsHotScheduleWithCPUSupported returns whether TiKV reports CPU info for hot scheduling.
+func IsHotScheduleWithCPUSupported(clusterVersion *semver.Version) bool {


What if we wanna cp to release 8.5?

8.5.6 or 8.5.7?

Do you think we should pick it?

Yes, I think 8.5.x may need it.

Signed-off-by: lhy1024 <admin@liudos.us>

rleungx · 2026-03-27T02:22:29Z

/retest

lhy1024 · 2026-03-27T03:47:00Z

/retest

lhy1024 · 2026-03-27T04:23:15Z

/retest

Signed-off-by: lhy1024 <admin@liudos.us>

lhy1024 · 2026-03-27T05:02:03Z

/retest

lhy1024 · 2026-03-27T05:32:22Z

/retest

lhy1024 · 2026-03-27T05:44:06Z

/retest

lhy1024 · 2026-03-27T06:10:27Z

/retest

Signed-off-by: lhy1024 <admin@liudos.us>

lhy1024 · 2026-03-28T05:08:10Z

/retest

lhy1024 · 2026-03-28T05:14:58Z

/retest

(cherry picked from commit b0e9c280efffb592a0fb5b4919eb6856c1c076dc) (cherry picked from commit f1753c859d9be369c0eae94b9e375951f27c44be) Signed-off-by: lhy1024 <admin@liudos.us>

Signed-off-by: lhy1024 <admin@liudos.us>

ti-chi-bot · 2026-04-01T07:26:43Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bufferflies, niubell, okJiang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [bufferflies,okJiang]
~~pkg/schedule/schedulers/OWNERS~~ [niubell]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: lhy1024 <admin@liudos.us>

lhy1024 · 2026-04-01T07:52:41Z

/retest

lhy1024 · 2026-04-01T08:30:56Z

/retest

lhy1024 · 2026-04-01T08:55:53Z

/retest

Signed-off-by: lhy1024 <admin@liudos.us>

lhy1024 · 2026-04-01T10:09:27Z

/retest

lhy1024 · 2026-04-01T10:19:04Z

/retest

lhy1024 · 2026-04-01T10:24:04Z

/retest

lhy1024 · 2026-04-01T10:29:05Z

/retest

use hot read cpu

77f90dd

Signed-off-by: lhy1024 <admin@liudos.us>

ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 21, 2026

lhy1024 added 3 commits January 21, 2026 22:10

fix version

5790ab6

Signed-off-by: lhy1024 <admin@liudos.us>

adjust sample windows

531ed27

Signed-off-by: lhy1024 <admin@liudos.us>

fix statistics

5b42858

Signed-off-by: lhy1024 <admin@liudos.us>

lhy1024 force-pushed the hot-read-cpu branch from b199e08 to 5b42858 Compare February 10, 2026 09:44

lhy1024 added 3 commits February 10, 2026 18:16

add comments and tests

a61cfc8

Signed-off-by: lhy1024 <admin@liudos.us>

Merge branch 'master' of github.com:tikv/pd into hot-read-cpu

7e04125

Signed-off-by: lhy1024 <admin@liudos.us>

fix lint

35223a2

Signed-off-by: lhy1024 <admin@liudos.us>

lhy1024 marked this pull request as ready for review February 10, 2026 10:36

ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 10, 2026

fix tests

d9310e1

Signed-off-by: lhy1024 <admin@liudos.us>

update kvproto

550ffd8

Signed-off-by: lhy1024 <admin@liudos.us>

lhy1024 force-pushed the hot-read-cpu branch from d0d3233 to 550ffd8 Compare February 11, 2026 12:41

ti-chi-bot bot removed the do-not-merge/needs-linked-issue label Feb 14, 2026

lhy1024 requested review from okJiang and rleungx February 14, 2026 02:24

rleungx reviewed Feb 14, 2026

pkg/statistics/cpu.go (outdated)

rleungx reviewed Feb 14, 2026

lhy1024 force-pushed the hot-read-cpu branch from c460ee4 to 7376b5e Compare March 25, 2026 16:01

*: bump kvproto to 678ff92b1edd

c843e19

Signed-off-by: lhy1024 <admin@liudos.us>

statistics: align read cpu with query hot signals

b477bcf

Signed-off-by: lhy1024 <admin@liudos.us>

lhy1024 force-pushed the hot-read-cpu branch from 6994541 to b477bcf Compare March 27, 2026 04:54

tests: align hot scheduler cpu rate expectations

27f2cf7

Signed-off-by: lhy1024 <admin@liudos.us>

lhy1024 mentioned this pull request Mar 30, 2026

pd: report hot read cpu in heartbeat #10509

Closed

lhy1024 added 2 commits April 1, 2026 14:32

hot-read-cpu: backport compatibility and review fixes

71e1e30

(cherry picked from commit b0e9c280efffb592a0fb5b4919eb6856c1c076dc) (cherry picked from commit f1753c859d9be369c0eae94b9e375951f27c44be) Signed-off-by: lhy1024 <admin@liudos.us>

add pending weight config

76d0eaa

Signed-off-by: lhy1024 <admin@liudos.us>

niubell approved these changes Apr 1, 2026

ti-chi-bot bot added the approved label Apr 1, 2026

fix lint

bd2eed8

Signed-off-by: lhy1024 <admin@liudos.us>

tests: pin split bucket priorities and cover cpu fallback

0289764

Signed-off-by: lhy1024 <admin@liudos.us>

ti-chi-bot bot merged commit 3aee2ef into tikv:master Apr 1, 2026
32 checks passed
