pd: report hot read cpu in heartbeat#10178
Conversation
Signed-off-by: lhy1024 <admin@liudos.us>
|
Skipping CI for Draft Pull Request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a CPU/read-CPU dimension throughout hot-region telemetry and scheduling: new CPU stats propagation, metrics, thresholds, scheduler config and logic changes, version gating for CPU support, storage/API payload updates, and corresponding tests and dashboard/CLI adjustments. Changes
Sequence DiagramsequenceDiagram
participant Store as Store (TiKV)
participant HB as Heartbeat Handler
participant Cache as HotPeerCache
participant Stats as Statistics Collector
participant Scheduler as Hot Region Scheduler
participant Version as VersionInfo
Store->>HB: send PeerStat (includes CpuStats)
HB->>Stats: extract Store/Region CPU usages
Stats->>Cache: build loads[] (includes RegionReadCPU)
Cache->>Cache: update HotPeerStat rolling loads (include CPU)
Scheduler->>Version: IsHotScheduleWithCPUSupported(clusterVersion)
Version-->>Scheduler: cpuSupport = true/false
Scheduler->>Cache: request hot peers (with priorities considering cpuSupport)
Cache-->>Scheduler: return ranked hot peers (with CPURate)
Scheduler->>Store: emit balancing decisions
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
b199e08 to
5b42858
Compare
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #10178 +/- ##
==========================================
- Coverage 78.98% 78.90% -0.08%
==========================================
Files 530 532 +2
Lines 71521 71802 +281
==========================================
+ Hits 56488 56658 +170
- Misses 11024 11109 +85
- Partials 4009 4035 +26
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Signed-off-by: lhy1024 <admin@liudos.us>
d0d3233 to
550ffd8
Compare
|
please link an issue and add some descriptions |
pkg/mcs/scheduling/server/cluster.go
Outdated
| storeReadQuery := core.GetReadQueryNum(stats.QueryStats) | ||
| storeWriteQuery := core.GetWriteQueryNum(stats.QueryStats) | ||
| storeTotalQuery := storeReadQuery + storeWriteQuery | ||
| storeGRPCCPU := statistics.StoreGRPCCPUUsage(stats.GetCpuUsages()) |
There was a problem hiding this comment.
Here we intentionally use gRPC CPU only. Unified-read CPU is already in peerStat.CpuStats.UnifiedRead, so using store read CPU here would double count.
pkg/statistics/cpu.go
Outdated
| return unifiedReadCPU | ||
| } | ||
| grpcCPU := float64(StoreGRPCCPUUsage(cpuUsages)) | ||
| return unifiedReadCPU + grpcCPU*float64(readQuery)/float64(totalQuery) |
There was a problem hiding this comment.
This is an approximation: unified-read CPU is read-only, while grpc-server CPU is shared by read/write requests, so we apportion gRPC CPU by readQuery/totalQuery.
There was a problem hiding this comment.
Do we have any research data to support this conclusion? If so, could you please include it in the PR description or issue?
There was a problem hiding this comment.
Yes, that is the key assumption here.
grpcCPU * readQuery / totalQuery is a first-order approximation, not an exact CPU attribution.
We use this because current heartbeat metrics provide unified-read CPU and shared grpc CPU, but not per-request CPU split inside grpc threads.
I’ll test it in a follow-up to validate the approximation quality under mixed workloads.
pkg/statistics/hot_peer_cache.go
Outdated
| rollingWindowsSize = 5 | ||
| // It is used to moving average CPU usage, | ||
| // and the window size is larger than other dimensions to make the CPU usage more stable. | ||
| cpuRollingWindowsSize = 9 |
There was a problem hiding this comment.
A larger window will be more stable for cpu
There was a problem hiding this comment.
Then, why not 10 or bigger? I think we need to add some comments about why we made this decision.
There was a problem hiding this comment.
ok, I will add a test for cpuRollingWindowsSize = 11
There was a problem hiding this comment.
In other words, why does the CPU require a more stable window size, while other dimensions do not?
There was a problem hiding this comment.
It is an optimization in local test, I'll remove it temporarily, and if more testing confirms that it always works, I'll add it back in another PR.
| ) | ||
|
|
||
| // IsHotScheduleWithCPUSupported returns whether TiKV reports CPU info for hot scheduling. | ||
| func IsHotScheduleWithCPUSupported(clusterVersion *semver.Version) bool { |
There was a problem hiding this comment.
What if we wanna cp to release 8.5?
There was a problem hiding this comment.
Do you think we should pick it?
Signed-off-by: lhy1024 <admin@liudos.us>
|
/retest |
2 similar comments
|
/retest |
|
/retest |
Signed-off-by: lhy1024 <admin@liudos.us>
|
/retest |
3 similar comments
|
/retest |
|
/retest |
|
/retest |
Signed-off-by: lhy1024 <admin@liudos.us>
|
/retest |
1 similar comment
|
/retest |
(cherry picked from commit b0e9c280efffb592a0fb5b4919eb6856c1c076dc) (cherry picked from commit f1753c859d9be369c0eae94b9e375951f27c44be) Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bufferflies, niubell, okJiang The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
2 similar comments
|
/retest |
|
/retest |
Signed-off-by: lhy1024 <admin@liudos.us>
|
/retest |
3 similar comments
|
/retest |
|
/retest |
|
/retest |
What problem does this PR solve?
Issue Number: Close #5718
What is changed and how does it work?
Simple description
This pr introduces cpu as a new dimension for hot scheduler, it only serve hot read scheduler
From store heartbeat cpu_usages, we can get unfied read pool cpu for schdule. Read priorities become cpu→byte when supported, otherwise fall back to query→byte (or byte→key if query isn’t supported).
Check List
Tests
Release note
Summary by CodeRabbit
New Features
cpu-read-rate)min-hot-cpu-rate,cpu-rate-rank-step-ratio)Enhancements
Chores