pd: report hot read cpu in heartbeat#10178

Merged
ti-chi-bot[bot] merged 32 commits into tikv:master from lhy1024:hot-read-cpu
Apr 1, 2026
Conversation

@lhy1024
Contributor

@lhy1024 lhy1024 commented Jan 21, 2026

What problem does this PR solve?

Issue Number: Close #5718

What is changed and how does it work?

Simple description

This PR introduces CPU as a new dimension for the hot scheduler. It only serves the hot read scheduler.

From the store heartbeat's cpu_usages, we can get the unified read pool CPU for scheduling. Read priorities become cpu→byte when CPU is supported; otherwise they fall back to query→byte (or byte→key if query isn't supported).
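
The fallback order described above can be sketched as a small selection function. This is only an illustration of the stated rule, not the scheduler's code: the name `readPriorities`, the two boolean inputs, and the string labels are assumptions made for this example.

```go
package main

import "fmt"

// Dimension labels used only in this sketch; they are not PD's constants.
const (
	dimCPU   = "cpu"
	dimQuery = "query"
	dimByte  = "byte"
	dimKey   = "key"
)

// readPriorities returns the (first, second) priority pair for the hot read
// scheduler, following the fallback order in the PR description:
// cpu→byte, else query→byte, else byte→key.
func readPriorities(cpuSupported, querySupported bool) [2]string {
	switch {
	case cpuSupported:
		return [2]string{dimCPU, dimByte}
	case querySupported:
		return [2]string{dimQuery, dimByte}
	default:
		return [2]string{dimByte, dimKey}
	}
}

func main() {
	fmt.Println(readPriorities(true, true))   // [cpu byte]
	fmt.Println(readPriorities(false, true))  // [query byte]
	fmt.Println(readPriorities(false, false)) // [byte key]
}
```

Returning a fixed-size pair keeps the "first dimension, then tie-break dimension" shape explicit; how the real scheduler stores and validates priorities is not shown here.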

Check List

Tests

  • Unit test
  • Integration test

Release note

None.

Summary by CodeRabbit

  • New Features

    • CPU-based metrics for hot-region scheduling; API now exposes per-store CPU read rates (cpu-read-rate)
    • Scheduler config adds CPU thresholds and tuning (min-hot-cpu-rate, cpu-rate-rank-step-ratio)
  • Enhancements

    • Hot-region history and statistics now include CPU flow, per-store and total CPU rates
    • Grafana dashboard updated with a "Store read cpu" panel
  • Chores

    • Version gating for CPU support (cluster version >= 8.5.7)

Signed-off-by: lhy1024 <admin@liudos.us>
@ti-chi-bot
Contributor

ti-chi-bot bot commented Jan 21, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has signed the dco. labels Jan 21, 2026
@coderabbitai

coderabbitai bot commented Jan 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a CPU/read-CPU dimension throughout hot-region telemetry and scheduling: new CPU stats propagation, metrics, thresholds, scheduler config and logic changes, version gating for CPU support, storage/API payload updates, and corresponding tests and dashboard/CLI adjustments.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core: pkg/core/factory.go, pkg/core/region.go | Add CPUStatsFactory and RegionInfo.cpuStats; populate from heartbeats; clone CPUStats; extend GetLoads/GetWriteLoads to include CPU elements. |
| CPU metric helpers & constants: pkg/statistics/cpu.go, pkg/statistics/cpu_test.go, pkg/statistics/utils/constant.go, pkg/statistics/utils/kind.go, pkg/statistics/utils/kind_test.go | Introduce Store/Region CPU usage helpers, add CPU dims/priorities (CPUDim, CPUPriority, RegionReadCPU, RegionWriteCPU, StoreReadCPU), and add min CPU thresholds. |
| Hot-peer stats & cache: pkg/statistics/hot_peer.go, pkg/statistics/hot_peer_cache.go, pkg/statistics/hot_peer_cache_test.go, pkg/statistics/hot_peer_test.go, pkg/statistics/hot_cache_test.go | Make HotPeerStat safe for out-of-range/nil rolling loads, expand rollingLoads length to DimLen, add CPU threshold metrics, and tests for CPU dimension behavior. |
| Store stats & collection: pkg/statistics/store.go, pkg/statistics/store_collection.go, pkg/statistics/store_collection_test.go | Add CPU moving-average windows, observe/set/get store read CPU, expose CPU gauges and include them in resets and tests. |
| Store hot-peers & load prediction: pkg/statistics/store_load.go, pkg/statistics/store_hot_peers_infos.go, pkg/statistics/hot_regions_stat.go | Propagate CPU into load predictions and HotPeersStat/HotPeerStatShow (StoreCPURate, TotalCPURate, CPURate). |
| Scheduler config and solver: pkg/schedule/schedulers/hot_region_config.go, pkg/schedule/schedulers/hot_region_config_test.go, pkg/schedule/schedulers/hot_region_solver.go, pkg/schedule/schedulers/hot_region_solver_test.go, pkg/schedule/schedulers/hot_region_test.go | Add MinHotCPURate, CPURateRankStepRatio, CPU-aware priority handling, CPU support tracking (lastCPUSupported), validation rules, fallback logic, and tests for CPU fallbacks and behavior. |
| Scheduler runtime & metrics: pkg/schedule/schedulers/hot_region.go, pkg/schedule/schedulers/metrics.go, pkg/schedule/coordinator.go, pkg/schedule/handler/handler.go | Load CPU config in scheduler, add CPU rank-step/step mapping, skip counter for CPU uniform-store, expose per-store CPU stats in HotStoreStats and labeled total CPU metrics. |
| Heartbeat handling (server & mcs): server/cluster/cluster.go, pkg/mcs/scheduling/server/cluster.go | Extract per-region read-CPU from PeerStat and include RegionReadCPU (scaled by interval) in the loads vector passed to hot-peer checks. |
| Statistics collector: pkg/statistics/collector.go | Populate CPU dimension in tikvCollector.getLoads for read path from store read-CPU. |
| Storage, API, CLI, dashboard, tests: pkg/storage/hot_region_storage.go, server/handler.go, client/http/types.go, tools/pd-ctl/..., metrics/grafana/pd.json, various tests under tests/ and server/ | Add FlowCPU to persisted/returned hot-region records, update API/CLI/dashboard defaults and expectations, and update tests to include CPU dimension and new scheduler config fields. |
| Version gating: pkg/versioninfo/versioninfo.go, pkg/versioninfo/versioninfo_test.go | Add IsHotScheduleWithCPUSupported and min-version constants for CPU scheduling support (8.5.7, 9.0.0-beta.1+) with tests. |
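
The version gating in the last entry can be illustrated with a simplified check. This is a sketch under stated assumptions, not the contents of pkg/versioninfo: the `version` struct, `cpuSchedulingSupported`, and the nil handling are hypothetical, pre-release tags are ignored, and only the "cluster version >= 8.5.7" rule from the summary is modeled (the separate 9.0.0-beta.1 constant is not).

```go
package main

import "fmt"

// version is a simplified stand-in for a semantic version.
// Assumption: pre-release tags such as "-beta.1" are ignored in this sketch.
type version struct{ major, minor, patch int }

// less reports whether v sorts strictly before o.
func (v version) less(o version) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	if v.minor != o.minor {
		return v.minor < o.minor
	}
	return v.patch < o.patch
}

// minCPUVersion mirrors the "cluster version >= 8.5.7" rule from the summary.
var minCPUVersion = version{8, 5, 7}

// cpuSchedulingSupported is a hypothetical analogue of
// IsHotScheduleWithCPUSupported. Assumption: an unknown (nil) cluster
// version is treated as unsupported.
func cpuSchedulingSupported(cluster *version) bool {
	if cluster == nil {
		return false
	}
	return !cluster.less(minCPUVersion)
}

func main() {
	fmt.Println(cpuSchedulingSupported(&version{8, 5, 6})) // false
	fmt.Println(cpuSchedulingSupported(&version{8, 5, 7})) // true
	fmt.Println(cpuSchedulingSupported(&version{9, 0, 0})) // true
}
```

When the check returns false, the scheduler would be expected to fall back to the non-CPU priorities, which is why the gate is evaluated against the cluster version rather than a single store's version.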

Sequence Diagram

sequenceDiagram
    participant Store as Store (TiKV)
    participant HB as Heartbeat Handler
    participant Cache as HotPeerCache
    participant Stats as Statistics Collector
    participant Scheduler as Hot Region Scheduler
    participant Version as VersionInfo

    Store->>HB: send PeerStat (includes CpuStats)
    HB->>Stats: extract Store/Region CPU usages
    Stats->>Cache: build loads[] (includes RegionReadCPU)
    Cache->>Cache: update HotPeerStat rolling loads (include CPU)
    Scheduler->>Version: IsHotScheduleWithCPUSupported(clusterVersion)
    Version-->>Scheduler: cpuSupport = true/false
    Scheduler->>Cache: request hot peers (with priorities considering cpuSupport)
    Cache-->>Scheduler: return ranked hot peers (with CPURate)
    Scheduler->>Store: emit balancing decisions

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • okJiang
  • bufferflies
  • rleungx

Poem

🐰 I hopped through heartbeats, stats, and queues,
I carried CPU counts in tiny news,
From heartbeats to scheduler, I leapt with glee,
Now hot regions dance in CPU harmony,
Hooray for balanced loads — carrot cake for me! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 31.91% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title 'pd: report hot read cpu in heartbeat' directly and concisely summarizes the main change: adding CPU reporting to heartbeats for the hot scheduler. |
| Description check | ✅ Passed | The PR description includes a linked issue (Close #5718), explains the problem being solved, describes how it works, and specifies that unit/integration tests are included. |
| Linked Issues check | ✅ Passed | The PR fully implements the objectives from #5718: introduces CPU as a new hot scheduler dimension, reports read CPU in heartbeats, incorporates it into store load and hot-peer statistics, uses rolling-window aggregation, and updates prioritization to prefer CPU when supported. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to the stated objective of adding CPU as a new dimension for the hot scheduler; no unrelated modifications are present. |

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 21, 2026
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
@lhy1024 lhy1024 marked this pull request as ready for review February 10, 2026 10:36
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 10, 2026
Signed-off-by: lhy1024 <admin@liudos.us>
@codecov

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 85.99034% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.90%. Comparing base (d7b6380) to head (0289764).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10178      +/-   ##
==========================================
- Coverage   78.98%   78.90%   -0.08%     
==========================================
  Files         530      532       +2     
  Lines       71521    71802     +281     
==========================================
+ Hits        56488    56658     +170     
- Misses      11024    11109      +85     
- Partials     4009     4035      +26     
Flag Coverage Δ
unittests 78.90% <85.99%> (-0.08%) ⬇️

Signed-off-by: lhy1024 <admin@liudos.us>
@okJiang
Member

okJiang commented Feb 12, 2026

Please link an issue and add a description.

storeReadQuery := core.GetReadQueryNum(stats.QueryStats)
storeWriteQuery := core.GetWriteQueryNum(stats.QueryStats)
storeTotalQuery := storeReadQuery + storeWriteQuery
storeGRPCCPU := statistics.StoreGRPCCPUUsage(stats.GetCpuUsages())
Member

read?

Contributor Author

Here we intentionally use gRPC CPU only. Unified-read CPU is already in peerStat.CpuStats.UnifiedRead, so using store read CPU here would double count.

return unifiedReadCPU
}
grpcCPU := float64(StoreGRPCCPUUsage(cpuUsages))
return unifiedReadCPU + grpcCPU*float64(readQuery)/float64(totalQuery)
Member

Is it accurate?

Contributor Author

This is an approximation: unified-read CPU is read-only, while grpc-server CPU is shared by read/write requests, so we apportion gRPC CPU by readQuery/totalQuery.
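
The approximation in this thread can be written out as a small worked example. This is a minimal sketch, not the PR's code: the function name `estimateStoreReadCPU`, its parameter types, and the zero-`totalQuery` guard are assumptions inferred from the quoted snippet.

```go
package main

import "fmt"

// estimateStoreReadCPU returns unified-read CPU plus the share of gRPC CPU
// attributed to reads. gRPC threads serve both reads and writes, so their
// CPU is split by the read fraction of total queries. This is a first-order
// approximation, not an exact per-request attribution.
func estimateStoreReadCPU(unifiedReadCPU, grpcCPU float64, readQuery, totalQuery uint64) float64 {
	// Assumption: with no queries there is nothing to apportion, so only the
	// read-only unified-read CPU is reported (this also avoids dividing by zero).
	if totalQuery == 0 {
		return unifiedReadCPU
	}
	return unifiedReadCPU + grpcCPU*float64(readQuery)/float64(totalQuery)
}

func main() {
	// 200 units of unified-read CPU, 100 units of gRPC CPU, 75% of queries are reads:
	// 200 + 100*750/1000 = 275.
	fmt.Println(estimateStoreReadCPU(200, 100, 750, 1000)) // 275
	// No queries observed: only unified-read CPU is returned.
	fmt.Println(estimateStoreReadCPU(200, 100, 0, 0)) // 200
}
```

The estimate is only as good as the assumption that reads and writes cost similar gRPC CPU per query, which is the point the reviewers ask to validate under mixed workloads.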

Member

Please add some comments.

Contributor Author

Done.

Member

Do we have any research data to support this conclusion? If so, could you please include it in the PR description or issue?

Contributor Author

Yes, that is the key assumption here.

grpcCPU * readQuery / totalQuery is a first-order approximation, not an exact CPU attribution.

We use this because current heartbeat metrics provide unified-read CPU and shared grpc CPU, but not per-request CPU split inside grpc threads.

I’ll test it in a follow-up to validate the approximation quality under mixed workloads.

rollingWindowsSize = 5
// It is used to moving average CPU usage,
// and the window size is larger than other dimensions to make the CPU usage more stable.
cpuRollingWindowsSize = 9
Member

Why 9?

Contributor Author

A larger window makes the CPU value more stable.

Member

Then, why not 10 or bigger? I think we need to add some comments about why we made this decision.

Contributor Author

OK, I will add a test for cpuRollingWindowsSize = 11.

Member

In other words, why does the CPU require a more stable window size, while other dimensions do not?

Contributor Author

It was an optimization that helped in local tests. I'll remove it for now, and if more testing confirms that it consistently works, I'll add it back in another PR.
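
The trade-off debated in this thread (window size 5 versus 9) can be illustrated with a plain sliding-window mean. This is a sketch for intuition only: `movingAverage` and its methods are hypothetical and are not PD's moving-average implementation, which may use a different filter.

```go
package main

import "fmt"

// movingAverage keeps the mean of the most recent `size` samples.
type movingAverage struct {
	window []float64 // ring buffer of recent samples
	next   int       // index that the next sample overwrites
	count  int       // number of samples currently held
	sum    float64   // running sum of the held samples
}

// newMovingAverage creates a window; sizes below 1 are clamped to 1.
func newMovingAverage(size int) *movingAverage {
	if size < 1 {
		size = 1
	}
	return &movingAverage{window: make([]float64, size)}
}

// add records a sample, evicting the oldest one once the window is full.
func (m *movingAverage) add(v float64) {
	if m.count == len(m.window) {
		m.sum -= m.window[m.next]
	} else {
		m.count++
	}
	m.window[m.next] = v
	m.sum += v
	m.next = (m.next + 1) % len(m.window)
}

// get returns the mean of the held samples, or 0 if there are none.
func (m *movingAverage) get() float64 {
	if m.count == 0 {
		return 0
	}
	return m.sum / float64(m.count)
}

func main() {
	small, large := newMovingAverage(5), newMovingAverage(9)
	// Nine steady samples followed by one spike.
	samples := []float64{10, 10, 10, 10, 10, 10, 10, 10, 10, 100}
	for _, s := range samples {
		small.add(s)
		large.add(s)
	}
	fmt.Println(small.get()) // 28: (4*10 + 100) / 5
	fmt.Println(large.get()) // 20: (8*10 + 100) / 9
}
```

The larger window damps the spike more, but it also takes longer to reflect a genuine change in load, which is the cost behind the reviewer's question of why CPU should be treated differently from the other dimensions.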

)

// IsHotScheduleWithCPUSupported returns whether TiKV reports CPU info for hot scheduling.
func IsHotScheduleWithCPUSupported(clusterVersion *semver.Version) bool {
Member

What if we want to cherry-pick this to release 8.5?

Contributor Author

8.5.6 or 8.5.7?

Member

Yes.

Contributor Author

Do you think we should pick it?

Member

Yes, I think 8.5.x may need it.

Signed-off-by: lhy1024 <admin@liudos.us>
@rleungx
Member

rleungx commented Mar 27, 2026

/retest

Signed-off-by: lhy1024 <admin@liudos.us>
@lhy1024
Contributor Author

lhy1024 commented Mar 27, 2026

/retest

Signed-off-by: lhy1024 <admin@liudos.us>
@lhy1024
Contributor Author

lhy1024 commented Mar 28, 2026

/retest

lhy1024 added 2 commits April 1, 2026 14:32
(cherry picked from commit b0e9c280efffb592a0fb5b4919eb6856c1c076dc)
(cherry picked from commit f1753c859d9be369c0eae94b9e375951f27c44be)
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 1, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bufferflies, niubell, okJiang

@ti-chi-bot ti-chi-bot bot added the approved label Apr 1, 2026
Signed-off-by: lhy1024 <admin@liudos.us>
@lhy1024
Contributor Author

lhy1024 commented Apr 1, 2026

/retest

Signed-off-by: lhy1024 <admin@liudos.us>
@lhy1024
Contributor Author

lhy1024 commented Apr 1, 2026

/retest

@ti-chi-bot ti-chi-bot bot merged commit 3aee2ef into tikv:master Apr 1, 2026
32 checks passed
Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

scheduler: introduce read cpu dimension

5 participants