Enhancement Task
Problem
Currently, there is no metric that accurately reflects the real-time RU/s demand from clients before Resource Control throttling takes effect:
- Client-side: avgRUPerSec (group_controller.go) is computed from getRUValueFromConsumption(), that is, from actual post-throttling consumption. When a resource group is throttled, requests wait in Reserve(), consumption slows, and avgRUPerSec reflects only the throttled rate.
- Server-side: read_request_unit_max_per_sec and write_request_unit_max_per_sec are derived from the Consumption.RRU/WRU values reported by clients, which are also post-throttling values.
- Server-side: sampled_request_unit_per_sec is based on requiredToken in AcquireTokenBuckets, which is avgRUPerSec * targetPeriod * amplification - availableTokens. This is not a clean demand rate, and it lacks per-instance granularity.
This makes it impossible for operators to determine the true workload demand when Resource Control is actively throttling.
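The gap described above can be illustrated with a toy model. This is only a sketch: observedRates and requiredToken are hypothetical helpers written for this issue, not functions from the PD client, and the numbers (1000 RU/s demand, 400 RU/s limit, 5 s target period, 1.5 amplification, 500 available tokens) are made up for illustration.

```go
package main

import "fmt"

// observedRates is a hypothetical toy model, not PD code. Each second the
// workload asks for demandPerSec RU, but a limiter admits at most limitPerSec.
// It returns the average demanded rate and the average consumed rate.
func observedRates(demandPerSec, limitPerSec float64, seconds int) (demand, consumed float64) {
	var demanded, admitted float64
	for i := 0; i < seconds; i++ {
		demanded += demandPerSec
		if demandPerSec > limitPerSec {
			admitted += limitPerSec // throttled: only the limit gets through
		} else {
			admitted += demandPerSec
		}
	}
	return demanded / float64(seconds), admitted / float64(seconds)
}

// requiredToken mirrors the formula quoted in the problem statement:
// avgRUPerSec * targetPeriod * amplification - availableTokens.
// The result is a number of tokens, not a rate.
func requiredToken(avgRUPerSec, targetPeriodSec, amplification, availableTokens float64) float64 {
	return avgRUPerSec*targetPeriodSec*amplification - availableTokens
}

func main() {
	demand, consumed := observedRates(1000, 400, 10)
	fmt.Printf("demanded RU/s: %.0f, consumed RU/s: %.0f\n", demand, consumed)

	// Fed with the throttled average (400 RU/s), the formula yields a token
	// amount that also depends on how many tokens happen to be available.
	fmt.Printf("requiredToken: %.0f\n", requiredToken(consumed, 5, 1.5, 500))
}
```

In this model a consumption-based average settles at the 400 RU/s limit even though the workload asks for 1000 RU/s, and requiredToken varies with availableTokens, so neither value can be read directly as demand.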
Proposal
Add a new client-side Prometheus Gauge that tracks the exponential moving average (EMA) of demanded RU/s, sampled at the acquireTokens() entry point (before Reserve() throttling):
- Metric: resource_manager_client_resource_group_demand_ru_per_sec{resource_group="..."}
- Data source: the RU cost (v) passed to acquireTokens() in group_controller.go, which represents the true per-request demand before any token bucket throttling
- Smoothing: time-aware EMA (reuse the existing movingAvgFactor logic)
Expected Usage
# Per-instance demand
resource_manager_client_resource_group_demand_ru_per_sec{instance="tidb-0", resource_group="default"}
# Cluster-wide demand for a resource group
sum(resource_manager_client_resource_group_demand_ru_per_sec) by (resource_group)
# Peak demand over time
max_over_time(sum(resource_manager_client_resource_group_demand_ru_per_sec) by (resource_group)[1h:])
Benefits
- Accurate: samples RU cost before throttling, reflects true workload demand
- Per-instance: a client-side metric naturally carries the instance label
- Aggregatable: sum by (resource_group) in Grafana gives a cluster-wide view
- Rolling-upgrade friendly: pure client-side change, no proto or PD server changes required
Related