Merged
Changes from 24 commits (32 commits total)
77f90dd
use hot read cpu
lhy1024 Jan 21, 2026
5790ab6
fix version
lhy1024 Jan 21, 2026
531ed27
adjust sample windows
lhy1024 Jan 28, 2026
5b42858
fix statistics
lhy1024 Feb 10, 2026
a61cfc8
add comments and tests
lhy1024 Feb 10, 2026
7e04125
Merge branch 'master' of github.com:tikv/pd into hot-read-cpu
lhy1024 Feb 10, 2026
35223a2
fix lint
lhy1024 Feb 10, 2026
d9310e1
fix tests
lhy1024 Feb 10, 2026
550ffd8
update kvproto
lhy1024 Feb 11, 2026
cbcda2a
address comments
lhy1024 Feb 14, 2026
3ab78da
address comments
lhy1024 Feb 27, 2026
7cc9995
address comments
lhy1024 Feb 28, 2026
6d78ea2
address comments
lhy1024 Mar 2, 2026
b1c86f8
remove cpu specail windows
lhy1024 Mar 2, 2026
1f3e1db
fix lint
lhy1024 Mar 2, 2026
4b2e3c8
Merge branch 'master' of github.com:tikv/pd into hot-read-cpu
lhy1024 Mar 3, 2026
a99ac37
Merge branch 'master' of github.com:tikv/pd into hot-read-cpu
lhy1024 Mar 4, 2026
5e3e458
add panel
lhy1024 Mar 5, 2026
1d18285
remove grpc
lhy1024 Mar 12, 2026
3b56493
Merge remote-tracking branch 'pingcap/master' into hot-read-cpu
lhy1024 Mar 24, 2026
0d55388
adjust version
lhy1024 Mar 24, 2026
f9ed627
add some test and comments
lhy1024 Mar 24, 2026
9fb6ae3
fallback in bucket
lhy1024 Mar 24, 2026
c384dbf
reserve write cpu
lhy1024 Mar 24, 2026
7376b5e
fix client
lhy1024 Mar 24, 2026
c843e19
*: bump kvproto to 678ff92b1edd
lhy1024 Mar 26, 2026
b477bcf
statistics: align read cpu with query hot signals
lhy1024 Mar 25, 2026
27f2cf7
tests: align hot scheduler cpu rate expectations
lhy1024 Mar 27, 2026
71e1e30
hot-read-cpu: backport compatibility and review fixes
lhy1024 Apr 1, 2026
76d0eaa
add pending weight config
lhy1024 Mar 31, 2026
bd2eed8
fix lint
lhy1024 Apr 1, 2026
0289764
tests: pin split bucket priorities and cover cpu fallback
lhy1024 Apr 1, 2026
112 changes: 112 additions & 0 deletions metrics/grafana/pd.json
@@ -6641,6 +6641,118 @@
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "${DS_TEST-CLUSTER}",
"decimals": 0,
"fill": 0,
"gridPos": {
"h": 7,
"w": 12,
"x": 0,
"y": 81
},
"id": 608,
"legend": {
"alignAsTable": true,
"avg": false,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": false,
"rightSide": true,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"paceLength": 10,
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "pd_scheduler_store_status{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", store=~\"$store\", type=\"store_read_cpu_usage\"}",
"format": "time_series",
"interval": "",
"intervalFactor": 2,
"legendFormat": "{{address}}-store-{{store}}",
"refId": "A",
"step": 4
},
{
"exemplar": true,
"expr": "pd_scheduler_hot_peers_summary{type=\"exp-cpu-rate-read-leader\"}",
"hide": true,
"interval": "",
"legendFormat": "exp-cpu-rate-read-leader-{{store}}",
"refId": "B"
},
{
"exemplar": true,
"expr": "pd_scheduler_hot_peers_summary{type=\"exp-cpu-rate-read-region\"}",
"hide": true,
"interval": "",
"legendFormat": "exp-cpu-rate-read-region-{{store}}",
"refId": "C"
}
Comment on lines +6697 to +6712

coderabbitai bot, Mar 24, 2026
⚠️ Potential issue | 🟡 Minor

Scope the hidden expected-load queries to the selected stores.

These two targets ignore $store, so the panel still fetches exp-cpu-rate-read-* for every store even when target A is narrowed to a subset. That makes the hidden debug overlay inconsistent with the visible series and adds unnecessary query load.

Suggested fix
-              "expr": "pd_scheduler_hot_peers_summary{type=\"exp-cpu-rate-read-leader\"}",
+              "expr": "pd_scheduler_hot_peers_summary{store=~\"$store\", type=\"exp-cpu-rate-read-leader\"}",
...
-              "expr": "pd_scheduler_hot_peers_summary{type=\"exp-cpu-rate-read-region\"}",
+              "expr": "pd_scheduler_hot_peers_summary{store=~\"$store\", type=\"exp-cpu-rate-read-region\"}",

lhy1024 (Contributor, Author):

You are right that these hidden pd_scheduler_hot_peers_summary queries are not scoped by $store. However, this is not specific to the new CPU panel. The existing read bytes / read keys / read query panels already use the same pattern for their hidden expected-load queries.

If we only changed the CPU panel here, it would make this panel inconsistent with the existing dashboard behavior. If we want to optimize this, I think it is better handled as a separate dashboard cleanup that updates all similar panels together.

coderabbitai bot:

The PR description image shows a Grafana time-series panel titled “Store read cpu.” The legend lists multiple stores (e.g., 192.168.8.88:20162-store-1), Y-axis ~0–15, and there’s a spike near 16:40 where store-1 peaks around 13 and store-2 around 9; other stores are near zero. A dropdown on the right shows sorting by max/current.

@lhy1024 Thanks for the context—agreed on keeping this PR consistent.

I double-checked the dashboard JSON and confirmed the pattern you mentioned:

  • Panel 602 “Store read rate bytes”: hidden exp-byte-rate-read-(leader|region) targets are not scoped by $store.
  • Panel 603 “Store read rate keys”: hidden exp-key-rate-read-(leader|region) targets are not scoped by $store.
  • Panel 604 “Store read query”: hidden exp-query-rate-read-(leader|region) targets are not scoped by $store.
  • The new Panel 608 “Store read cpu” follows the same pattern.

Let’s keep panel 608 as-is here. If you’d like, I can open a follow-up issue to update all similar panels together. Proposed scope:

  • Panels: 602, 603, 604, 608 (and any other exp-*-read-* overlays we find).
  • Change hidden expr from:
    pd_scheduler_hot_peers_summary{type="exp-…-read-…"}
    to:
    pd_scheduler_hot_peers_summary{store=~"$store", type="exp-…-read-…"}
  • Keep legendFormat as {{store}}; verify single- and multi-store selections render correctly.

Want me to file that tracking issue and assign it to you, or would you prefer a small cleanup PR after this merges?


✏️ Learnings added
Learnt from: lhy1024
Repo: tikv/pd PR: 10178
File: metrics/grafana/pd.json:6697-6712
Timestamp: 2026-03-24T13:18:55.540Z
Learning: In metrics/grafana/pd.json, the hot-read panels (ids 602 "Store read rate bytes", 603 "Store read rate keys", 604 "Store read query") intentionally leave hidden pd_scheduler_hot_peers_summary expected-load queries unscoped by $store; new panel 608 "Store read cpu" mirrors this for consistency. Any change to scope those hidden queries by $store should be done as a single dashboard cleanup affecting all such panels, not just one panel in isolation.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Store read cpu",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": null,
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
2 changes: 2 additions & 0 deletions pkg/core/factory.go
@@ -45,4 +45,6 @@ var (
TimeIntervalFactory = func() *pdpb.TimeInterval { return &pdpb.TimeInterval{} }
// QueryStatsFactory returns new query stats.
QueryStatsFactory = func() *pdpb.QueryStats { return &pdpb.QueryStats{} }
// CPUStatsFactory returns new cpu stats.
CPUStatsFactory = func() *pdpb.CPUStats { return &pdpb.CPUStats{} }
)
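The factory exists so `typeutil.DeepClone` can produce a fresh zero value to copy into. A minimal stand-alone sketch of the pattern, using a simplified `CPUStats` stand-in rather than the real pdpb type:

```go
package main

import "fmt"

// CPUStats is a simplified stand-in for pdpb.CPUStats (illustration only).
type CPUStats struct {
	UnifiedRead uint64
}

// deepClone mimics what typeutil.DeepClone does with a factory: it keeps
// nil as nil, otherwise copies src into a fresh value from the factory.
func deepClone(src *CPUStats, factory func() *CPUStats) *CPUStats {
	if src == nil {
		return nil
	}
	dst := factory()
	*dst = *src
	return dst
}

func main() {
	cpuStatsFactory := func() *CPUStats { return &CPUStats{} }

	orig := &CPUStats{UnifiedRead: 42}
	clone := deepClone(orig, cpuStatsFactory)
	clone.UnifiedRead = 7 // mutating the clone must not touch the original

	fmt.Println(orig.UnifiedRead, clone.UnifiedRead) // 42 7

	// nil input stays nil, which RegionInfo.Clone relies on when a
	// heartbeat carried no CpuStats.
	fmt.Println(deepClone(nil, cpuStatsFactory) == nil) // true
}
```

This is why `Clone` can call `typeutil.DeepClone(r.cpuStats, CPUStatsFactory)` unconditionally: a region without CPU stats clones to nil instead of an empty struct.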
29 changes: 19 additions & 10 deletions pkg/core/region.go
@@ -65,15 +65,18 @@
// the properties are Read-Only once created except buckets.
// the `buckets` could be modified by the request `report buckets` with greater version.
type RegionInfo struct {
meta *metapb.Region
learners []*metapb.Peer
witnesses []*metapb.Peer
voters []*metapb.Peer
leader *metapb.Peer
downPeers []*pdpb.PeerStats
pendingPeers []*metapb.Peer
term uint64
meta *metapb.Region
learners []*metapb.Peer
witnesses []*metapb.Peer
voters []*metapb.Peer
leader *metapb.Peer
downPeers []*pdpb.PeerStats
pendingPeers []*metapb.Peer
term uint64
// cpuUsage is deprecated and will be removed in the future.
// We should use `cpuStats` instead.
cpuUsage uint64
cpuStats *pdpb.CPUStats
Member: Do we need to mark cpuUsage as deprecated?

lhy1024 (Contributor, Author): done

writtenBytes uint64
writtenKeys uint64
readBytes uint64
@@ -257,8 +260,9 @@ func RegionFromHeartbeat(heartbeat RegionHeartbeatRequest, flowRoundDivisor uint
region.approximateKvSize = int64(h.GetApproximateKvSize() / units.MiB)
region.approximateColumnarKvSize = int64(h.GetApproximateColumnarKvSize() / units.MiB)
region.replicationStatus = h.GetReplicationStatus()
if cpuStats := h.GetCpuStats(); cpuStats != nil {
region.cpuUsage = cpuStats.GetUnifiedRead()
region.cpuStats = h.GetCpuStats()
if region.cpuStats != nil {
region.cpuUsage = region.cpuStats.GetUnifiedRead()
} else {
region.cpuUsage = h.CpuUsage
}
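The branch above keeps PD compatible with TiKV versions that only report the deprecated scalar `CpuUsage`. A sketch of the same fallback logic, with simplified stand-in types instead of the real pdpb messages:

```go
package main

import "fmt"

// cpuStats is a simplified stand-in for pdpb.CPUStats.
type cpuStats struct{ UnifiedRead uint64 }

// heartbeat is a simplified stand-in for the region heartbeat fields.
type heartbeat struct {
	CpuUsage uint64    // deprecated scalar, still sent by older TiKV
	CpuStats *cpuStats // structured stats; nil when TiKV predates them
}

// readCPU prefers the structured stats and falls back to the deprecated
// scalar, mirroring RegionFromHeartbeat's compatibility handling.
func readCPU(h heartbeat) uint64 {
	if h.CpuStats != nil {
		return h.CpuStats.UnifiedRead
	}
	return h.CpuUsage
}

func main() {
	fmt.Println(readCPU(heartbeat{CpuUsage: 11})) // 11 (old TiKV: scalar only)
	fmt.Println(readCPU(heartbeat{
		CpuUsage: 11,
		CpuStats: &cpuStats{UnifiedRead: 22},
	})) // 22 (new TiKV: structured stats win)
}
```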
@@ -329,6 +333,7 @@ func (r *RegionInfo) Clone(opts ...RegionCreateOption) *RegionInfo {
downPeers: downPeers,
pendingPeers: pendingPeers,
cpuUsage: r.cpuUsage,
cpuStats: typeutil.DeepClone(r.cpuStats, CPUStatsFactory),
writtenBytes: r.writtenBytes,
writtenKeys: r.writtenKeys,
readBytes: r.readBytes,
@@ -2019,6 +2024,8 @@ func (r *RegionInfo) GetLoads() []float64 {
float64(r.GetBytesWritten()),
float64(r.GetKeysWritten()),
float64(r.GetWriteQueryNum()),
float64(r.GetCPUUsage()),
0, // RegionWriteCPU: reserved, not yet reported by TiKV
}
}

@@ -2031,6 +2038,8 @@ func (r *RegionInfo) GetWriteLoads() []float64 {
float64(r.GetBytesWritten()),
float64(r.GetKeysWritten()),
float64(r.GetWriteQueryNum()),
0,
0, // RegionWriteCPU: reserved, not yet reported by TiKV
}
}

3 changes: 3 additions & 0 deletions pkg/mcs/scheduling/server/cluster.go
@@ -499,13 +499,16 @@ func (c *Cluster) HandleStoreHeartbeat(heartbeat *schedulingpb.StoreHeartbeatReq
continue
}
readQueryNum := core.GetReadQueryNum(peerStat.GetQueryStats())
regionReadCPU := statistics.RegionReadCPUUsage(peerStat)
loads := []float64{
utils.RegionReadBytes: float64(peerStat.GetReadBytes()),
utils.RegionReadKeys: float64(peerStat.GetReadKeys()),
utils.RegionReadQueryNum: float64(readQueryNum),
utils.RegionWriteBytes: 0,
utils.RegionWriteKeys: 0,
utils.RegionWriteQueryNum: 0,
utils.RegionReadCPU: regionReadCPU * float64(interval),
utils.RegionWriteCPU: 0,
}
checkReadPeerTask := func(cache *statistics.HotPeerCache) {
stats := cache.CheckPeerFlow(region, []*metapb.Peer{peer}, loads, interval)
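The `loads` slice in `HandleStoreHeartbeat` uses Go's indexed composite-literal syntax: each rate lands at its named dimension constant and unlisted slots default to zero. A runnable sketch (the constant names mirror the `utils` package, but their order and values here are assumptions for illustration):

```go
package main

import "fmt"

// Load dimension indices (assumed ordering; real values live in utils).
const (
	RegionReadBytes = iota
	RegionReadKeys
	RegionReadQueryNum
	RegionWriteBytes
	RegionWriteKeys
	RegionWriteQueryNum
	RegionReadCPU
	RegionWriteCPU
)

func main() {
	interval := uint64(10) // heartbeat interval in seconds
	regionReadCPU := 0.5   // per-second CPU rate from the peer stat

	// Indexed composite literal: each listed dimension lands at its slot;
	// everything unlisted (e.g. the write dimensions) is zero. The CPU rate
	// is multiplied by the interval because HotPeerCache expects per-interval
	// totals rather than per-second rates.
	loads := []float64{
		RegionReadBytes:    1024,
		RegionReadQueryNum: 8,
		RegionReadCPU:      regionReadCPU * float64(interval),
	}

	fmt.Println(len(loads), loads[RegionReadCPU]) // 7 5
}
```

Note the slice length is the highest listed index plus one, which is why the real code lists every dimension explicitly, including the zero-valued write slots.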
6 changes: 5 additions & 1 deletion pkg/schedule/coordinator.go
@@ -528,7 +528,7 @@
func collectHotMetrics(cluster sche.ClusterInformer, stores []*core.StoreInfo, typ utils.RWType) {
kind := typ.String()
hotPeerStats := cluster.GetHotPeerStats(typ)
status := statistics.CollectHotPeerInfos(stores, hotPeerStats) // only returns TotalBytesRate,TotalKeysRate,TotalQueryRate,Count
status := statistics.CollectHotPeerInfos(stores, hotPeerStats) // only returns TotalBytesRate,TotalKeysRate,TotalQueryRate,TotalCPURate,Count

for _, s := range stores {
// TODO: pre-allocate gauge metrics
@@ -540,11 +540,13 @@ func collectHotMetrics(cluster sche.ClusterInformer, stores []*core.StoreInfo, t
hotSpotStatusGauge.WithLabelValues(storeAddress, storeLabel, "total_"+kind+"_bytes_as_leader").Set(stat.TotalBytesRate)
hotSpotStatusGauge.WithLabelValues(storeAddress, storeLabel, "total_"+kind+"_keys_as_leader").Set(stat.TotalKeysRate)
hotSpotStatusGauge.WithLabelValues(storeAddress, storeLabel, "total_"+kind+"_query_as_leader").Set(stat.TotalQueryRate)
hotSpotStatusGauge.WithLabelValues(storeAddress, storeLabel, "total_"+kind+"_cpu_as_leader").Set(stat.TotalCPURate)
hotSpotStatusGauge.WithLabelValues(storeAddress, storeLabel, "hot_"+kind+"_region_as_leader").Set(float64(stat.Count))
} else {
hotSpotStatusGauge.DeleteLabelValues(storeAddress, storeLabel, "total_"+kind+"_bytes_as_leader")
hotSpotStatusGauge.DeleteLabelValues(storeAddress, storeLabel, "total_"+kind+"_keys_as_leader")
hotSpotStatusGauge.DeleteLabelValues(storeAddress, storeLabel, "total_"+kind+"_query_as_leader")
hotSpotStatusGauge.DeleteLabelValues(storeAddress, storeLabel, "total_"+kind+"_cpu_as_leader")
hotSpotStatusGauge.DeleteLabelValues(storeAddress, storeLabel, "hot_"+kind+"_region_as_leader")
}

@@ -553,11 +555,13 @@ func collectHotMetrics(cluster sche.ClusterInformer, stores []*core.StoreInfo, t
hotSpotStatusGauge.WithLabelValues(storeAddress, storeLabel, "total_"+kind+"_bytes_as_peer").Set(stat.TotalBytesRate)
hotSpotStatusGauge.WithLabelValues(storeAddress, storeLabel, "total_"+kind+"_keys_as_peer").Set(stat.TotalKeysRate)
hotSpotStatusGauge.WithLabelValues(storeAddress, storeLabel, "total_"+kind+"_query_as_peer").Set(stat.TotalQueryRate)
hotSpotStatusGauge.WithLabelValues(storeAddress, storeLabel, "total_"+kind+"_cpu_as_peer").Set(stat.TotalCPURate)
hotSpotStatusGauge.WithLabelValues(storeAddress, storeLabel, "hot_"+kind+"_region_as_peer").Set(float64(stat.Count))
} else {
hotSpotStatusGauge.DeleteLabelValues(storeAddress, storeLabel, "total_"+kind+"_bytes_as_peer")
hotSpotStatusGauge.DeleteLabelValues(storeAddress, storeLabel, "total_"+kind+"_keys_as_peer")
hotSpotStatusGauge.DeleteLabelValues(storeAddress, storeLabel, "total_"+kind+"_query_as_peer")
hotSpotStatusGauge.DeleteLabelValues(storeAddress, storeLabel, "total_"+kind+"_cpu_as_peer")
hotSpotStatusGauge.DeleteLabelValues(storeAddress, storeLabel, "hot_"+kind+"_region_as_peer")
}

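The set/delete symmetry in `collectHotMetrics` matters: when a store stops holding hot leaders, deleting the label combination keeps a stale series from lingering at its last value. A toy sketch with a map standing in for the Prometheus `GaugeVec` (the real code uses prometheus/client_golang):

```go
package main

import "fmt"

// gaugeVec is a toy stand-in for a Prometheus GaugeVec keyed by
// (address, store, type) label values.
type gaugeVec map[[3]string]float64

func (g gaugeVec) set(addr, store, typ string, v float64) { g[[3]string{addr, store, typ}] = v }
func (g gaugeVec) del(addr, store, typ string)            { delete(g, [3]string{addr, store, typ}) }

// collect mirrors the leader branch of collectHotMetrics for the new
// cpu gauge: set when the store has hot-leader stats, delete otherwise.
func collect(g gaugeVec, addr, store, kind string, cpuRate float64, hasStats bool) {
	if hasStats {
		g.set(addr, store, "total_"+kind+"_cpu_as_leader", cpuRate)
	} else {
		// Deleting (not setting zero) removes the series entirely,
		// so dashboards don't show a frozen last value.
		g.del(addr, store, "total_"+kind+"_cpu_as_leader")
	}
}

func main() {
	g := gaugeVec{}
	collect(g, "tikv-1:20160", "1", "read", 3.5, true)
	fmt.Println(len(g)) // 1
	collect(g, "tikv-1:20160", "1", "read", 0, false)
	fmt.Println(len(g)) // 0
}
```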
3 changes: 3 additions & 0 deletions pkg/schedule/handler/handler.go
@@ -994,6 +994,7 @@ type HotStoreStats struct {
KeysReadStats map[uint64]float64 `json:"keys-read-rate,omitempty"`
QueryWriteStats map[uint64]float64 `json:"query-write-rate,omitempty"`
QueryReadStats map[uint64]float64 `json:"query-read-rate,omitempty"`
CPUReadStats map[uint64]float64 `json:"cpu-read-rate,omitempty"`
}

// GetHotStores gets all hot stores stats.
@@ -1005,6 +1006,7 @@ func (h *Handler) GetHotStores() (*HotStoreStats, error) {
KeysReadStats: make(map[uint64]float64),
QueryWriteStats: make(map[uint64]float64),
QueryReadStats: make(map[uint64]float64),
CPUReadStats: make(map[uint64]float64),
}
stores, error := h.GetStores()
if error != nil {
@@ -1031,6 +1033,7 @@ func (h *Handler) GetHotStores() (*HotStoreStats, error) {
stats.KeysReadStats[id] = loads[utils.StoreReadKeys]
stats.QueryWriteStats[id] = loads[utils.StoreWriteQuery]
stats.QueryReadStats[id] = loads[utils.StoreReadQuery]
stats.CPUReadStats[id] = loads[utils.StoreReadCPU]
}
}
return stats, nil
2 changes: 2 additions & 0 deletions pkg/schedule/schedulers/hot_region.go
@@ -219,11 +219,13 @@ func (s *hotScheduler) ReloadConfig() error {
s.conf.MinHotByteRate = newCfg.MinHotByteRate
s.conf.MinHotKeyRate = newCfg.MinHotKeyRate
s.conf.MinHotQueryRate = newCfg.MinHotQueryRate
s.conf.MinHotCPURate = newCfg.MinHotCPURate
s.conf.MaxZombieRounds = newCfg.MaxZombieRounds
s.conf.MaxPeerNum = newCfg.MaxPeerNum
s.conf.ByteRateRankStepRatio = newCfg.ByteRateRankStepRatio
s.conf.KeyRateRankStepRatio = newCfg.KeyRateRankStepRatio
s.conf.QueryRateRankStepRatio = newCfg.QueryRateRankStepRatio
s.conf.CPURateRankStepRatio = newCfg.CPURateRankStepRatio
s.conf.CountRankStepRatio = newCfg.CountRankStepRatio
s.conf.GreatDecRatio = newCfg.GreatDecRatio
s.conf.MinorDecRatio = newCfg.MinorDecRatio
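`ReloadConfig` copies persisted fields one by one rather than swapping the whole config struct, which is one way to keep any in-memory-only state on the receiver intact across a reload. A sketch of that trade-off (field names `MinHotCPURate` and `CPURateRankStepRatio` mirror the PR; the types and the `lastTuned` field are invented for illustration):

```go
package main

import "fmt"

// hotRegionConfig is a trimmed, hypothetical hot scheduler config.
type hotRegionConfig struct {
	MinHotCPURate        float64
	CPURateRankStepRatio float64
	lastTuned            string // runtime-only state, not persisted
}

// reload copies persisted fields individually, as hotScheduler.ReloadConfig
// does, so fields the persisted config doesn't carry are left untouched.
func (c *hotRegionConfig) reload(newCfg *hotRegionConfig) {
	c.MinHotCPURate = newCfg.MinHotCPURate
	c.CPURateRankStepRatio = newCfg.CPURateRankStepRatio
}

func main() {
	cfg := &hotRegionConfig{lastTuned: "manual"}
	cfg.reload(&hotRegionConfig{MinHotCPURate: 100, CPURateRankStepRatio: 0.6})
	fmt.Println(cfg.MinHotCPURate, cfg.CPURateRankStepRatio, cfg.lastTuned) // 100 0.6 manual
}
```

The flip side of this style, visible in the PR itself, is that every new field (here `MinHotCPURate` and `CPURateRankStepRatio`) must be added to the copy list by hand or it silently won't reload.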