metrics: Add liveness check faults to grafana dashboard #65460
Conversation
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Hi @MyonKeminta. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with … I understand the commands that are listed here.

Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@             Coverage Diff              @@
##             master     #65460      +/- ##
================================================
- Coverage   77.6858%   77.2785%   -0.4073%
================================================
  Files          2007       1929        -78
  Lines        549495     542180      -7315
================================================
- Hits         426880     418989      -7891
- Misses       120955     123188      +2233
+ Partials       1660          3      -1657
```

Flags with carried forward coverage won't be shown. Click here to find out more.
/retest
@MyonKeminta: PRs from untrusted users cannot be marked as trusted with …

Details
In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/retest
@MyonKeminta: PRs from untrusted users cannot be marked as trusted with …

Details
In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
…iveness-check-result-to-metrics
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
📝 Walkthrough

Adds multiple Grafana dashboard panels to two JSON dashboard files to surface TiKV client liveness, store liveness faults, request totals, and related memory/owner metrics; changes are purely dashboard configuration (no code logic changes).

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ Passed checks (5 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/metrics/grafana/tidb.json`:
- Around line 13070-13083: The PromQL expressions using
max_over_time(tidb_tikvclient_store_liveness_state[...] ) == 1 and == 2 are
filtering series and will drop a series if the other state appears in the same
window; change the comparisons to use the bool modifier inside the range vectors
so you evaluate "did this state occur in the window" (e.g. use
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 1)[1m]) and
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 2)[1m]) ) so
both unreachable (1) and unknown (2) are counted independently even when both
occur in the 1m window.
```json
            "expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", instance=~\"$instance\"}[1m]) == 2) by (instance, store)",
            "format": "time_series",
            "hide": false,
            "interval": "",
            "intervalFactor": 2,
            "legendFormat": "{{instance}}-store-{{store}}-unknown",
            "range": true,
            "refId": "A",
            "step": 30
          },
          {
            "exemplar": true,
            "expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", instance=~\"$instance\"}[1m]) == 1) by (instance, store)",
            "hide": false,
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's check the file exists and examine the specific lines
wc -l pkg/metrics/grafana/tidb.json
```

Repository: pingcap/tidb
Length of output: 91
🏁 Script executed:
```shell
# Get the specific lines mentioned in the review
sed -n '13070,13083p' pkg/metrics/grafana/tidb.json | cat -n
```

Repository: pingcap/tidb
Length of output: 930
🏁 Script executed:
```shell
# Get more context around these lines to understand the full panel structure
sed -n '13060,13090p' pkg/metrics/grafana/tidb.json | cat -n
```

Repository: pingcap/tidb
Length of output: 1544
🌐 Web query:
PromQL max_over_time behavior with numeric state metrics and comparison operators
💡 Result:
max_over_time() with “numeric state” (0/1 or enum-like) metrics

- `max_over_time(v[5m])` looks at all float samples for each series inside the 5m range and returns the maximum sample value seen in that window. It does not mean “latest state” or “most common state”. [1]
- Therefore:
  - For a binary gauge (0/1), `max_over_time(x[5m])` returns 1 if the series was ever 1 in the last 5 minutes; otherwise 0.
  - For an enum encoded as numbers (e.g., 0=OK, 1=WARN, 2=CRIT), it returns the highest numeric code that occurred in the window (which may be misleading if the metric can go back down but you really wanted “current state”). [1]
- If you want the most recent value in the window instead, use `last_over_time(x[5m])`. [1]
How comparisons interact with _over_time() (and why bool matters)
PromQL comparisons (==, >, etc.) behave like this:

- Without `bool`: comparisons filter series/samples—false results are dropped from the output. [2]
  - If you then feed that into an `_over_time()` via a subquery, you can end up with no samples in the range (so the series may disappear entirely).
- With `bool`: comparisons return 0/1 values instead of filtering (and the metric name is dropped). [2]
  - This is usually what you want for “did it ever match in the window?” style queries, because now `max_over_time()` can safely compute a max over a continuous 0/1 signal.

Common pattern: “was condition ever true in the last 5 minutes?”

`max_over_time((my_state_metric == bool 2)[5m:])`

- `== bool 2` turns the condition into a 0/1 time series. [2]
- `max_over_time(...[5m:])` returns 1 if it was true at least once in the last 5 minutes, else 0. [1]

If you instead do `(my_state_metric == 2)` without bool, it filters, so it tends to answer “show me samples where it was 2”, not “return 0/1 over time for the condition”. [2]
Sources: Prometheus function docs for _over_time / max_over_time / last_over_time [1], and PromQL comparison operator (bool vs filtering) semantics [2].
[1] (prometheus.io)
[2] (prometheus.io)
Citations:
- 1: https://prometheus.io/docs/prometheus/3.1/querying/functions/?utm_source=openai
- 2: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
🌐 Web query:
Prometheus PromQL max_over_time == operator semantics state transitions
💡 Result:
- `max_over_time(v[5m])` takes a range vector and returns an instant vector whose sample value (at the current evaluation time) is the maximum sample value seen in the last 5 minutes for each input series. (prometheus.io)
  - The returned sample’s timestamp is the query evaluation time, not “when the max happened” in the window (PromQL can’t directly return the timestamp-of-max via `timestamp(max_over_time(...))`). (github.com)
- `==` is a comparison operator. For vectors it is filtering by default: it keeps only the series where the comparison is true, and drops the others. (prometheus.io)
  - Important: when a series is kept (true), the output sample value is the original left-hand side value (here: the `max_over_time(...)` result), not 1.
- `== bool` changes the semantics to return 0/1 instead of filtering. With `bool`, the series are not dropped; the output value becomes 1 if true, 0 if false. (prometheus.io)
What this means for “state transitions”
If you’re graphing/alerting on a “state” derived from `max_over_time(...) == ...`, the “state” can change only at evaluation instants, and it is driven by whether the max in the lookback window crosses your threshold:

- “Condition holds now” (filtering form): `max_over_time(metric[5m]) == 0` produces no sample when false (series disappears), and produces a sample valued 0 when true (because the LHS max is 0).
- “Condition as a stable 0/1 signal” (recommended for transitions): `max_over_time(metric[5m]) == bool 0` produces a continuous 1/0 series, making edges easy to detect (e.g., with `changes(...)`). (prometheus.io)
Citations:
- 1: https://prometheus.io/docs/prometheus/3.4/querying/functions?utm_source=openai
- 2: Actual timestamp() for max_over_time (and other agg functions) prometheus/prometheus#5003
- 3: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 4: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 5: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
State transitions will be undercounted: max_over_time(state[1m]) == 1 is masked by state 2
Lines 13070 and 13082 use filtering comparisons without bool. In PromQL, max_over_time(state[1m]) == 1 (without bool) filters the series—it only passes through when the max equals 1. If the state transitions 1 → 2 within the 1-minute window, max_over_time yields 2, so the == 1 filter drops the series entirely. This causes state unreachable to be undercounted whenever unknown also occurs in the same window.
To detect "did state ever occur in the window", use comparisons with the bool modifier to return 0/1 instead of filtering:
max_over_time((tidb_tikvclient_store_liveness_state{k8s_cluster="$k8s_cluster", tidb_cluster="$tidb_cluster", instance=~"$instance"} == bool 2)[1m:])
This ensures both state 1 and state 2 are counted independently even when both occur in the same window.
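The filtering-vs-bool behavior is easy to reproduce outside Prometheus. Here is a small Python sketch (toy sample data and helper names are invented for illustration) mimicking the two evaluation orders over one series' 1m window:

```python
# Toy simulation of PromQL semantics for one series' 1m window.
# Liveness states (illustrative): 0 = reachable, 1 = unreachable, 2 = unknown.

def max_then_compare(samples, state):
    """Mimics max_over_time(m[1m]) == state (filtering comparison).

    Returns the kept sample value, or None if the series is dropped.
    """
    m = max(samples)
    return m if m == state else None

def compare_then_max(samples, state):
    """Mimics max_over_time((m == bool state)[1m:]) -- compare first, then aggregate."""
    return max(1 if s == state else 0 for s in samples)

window = [0, 1, 2, 1]  # both "unreachable" (1) and "unknown" (2) occur

# The max over the window is 2, so the "== 1" filter drops the series entirely:
print(max_then_compare(window, 1))  # None -> the unreachable fault is missed
print(max_then_compare(window, 2))  # 2    -> only the highest state survives

# Comparing before aggregating detects each state independently:
print(compare_then_max(window, 1))  # 1
print(compare_then_max(window, 2))  # 1
```

This is only a model of the semantics, not Prometheus itself, but it shows why the `== 1` target goes dark exactly when both fault states land in the same window.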
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/metrics/grafana/tidb.json` around lines 13070 - 13083, The PromQL
expressions using max_over_time(tidb_tikvclient_store_liveness_state[...] ) == 1
and == 2 are filtering series and will drop a series if the other state appears
in the same window; change the comparisons to use the bool modifier inside the
range vectors so you evaluate "did this state occur in the window" (e.g. use
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 1)[1m]) and
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 2)[1m]) ) so
both unreachable (1) and unknown (2) are counted independently even when both
occur in the 1m window.
The first PR description image shows a Grafana panel titled "Store Liveness Check Faults" displaying a stacked bar chart. The chart shows a spike around 15:40 with multiple colored bars representing different store instances (from various IP addresses like 192.168.178.12 and 192.168.180.11/13) that are unreachable for store-4. The legend on the right shows max, avg, and current values of 1 for each series.
The second PR description image shows another "Store Liveness Check Faults" panel with a bar chart spike around 18:00. This displays two series: tc-tidb-0-store-5-unknown and tc-tidb-1-store-5-unknown, both with max, avg, and current values of 1.
@MyonKeminta, I'm replying to your latest comment without being aware of the entire conversation thread (as the earlier comments are too long to be provided).
Your comment "¿" is unclear to me. Could you please clarify what you're asking about? Are you questioning:
- The Prometheus query expressions in the new panel?
- The visualization configuration (bars, stacking, etc.)?
- The metric name or description?
- Something else about this monitoring addition?
I'm happy to help once I understand what you'd like to know!
zimulala
left a comment
Do we need to add corresponding metrics for the next-gen version as well?
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/metrics/nextgengrafana/tidb_with_keyspace_name.json`:
- Around line 12961-12977: The liveness queries for
tidb_tikvclient_store_liveness_state (refId "A" and "B") are masking values
because they compare after max_over_time; update both expressions to apply the
comparison inside the time aggregation using the "== bool" pattern, e.g.
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 1)[1m]) and
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 2)[1m]) (or
wrap with count()/max as needed to preserve counting semantics), and ensure
target B also includes "step": 30 to match target A for consistent sampling
cadence.
```json
            "expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == 2) by (instance, store)",
            "format": "time_series",
            "hide": false,
            "interval": "",
            "intervalFactor": 2,
            "legendFormat": "{{instance}}-store-{{store}}-unknown",
            "range": true,
            "refId": "A",
            "step": 30
          },
          {
            "exemplar": true,
            "expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == 1) by (instance, store)",
            "hide": false,
            "interval": "",
            "legendFormat": "{{instance}}-store-{{store}}-unreachable",
            "refId": "B"
```
🧩 Analysis chain
🌐 Web query:
PromQL max_over_time comparison operator enum states masking
💡 Result:
In Prometheus, an “enum/state” metric is typically exported as one time series per state (e.g., my_task_state{state="running"}), where the current state has value 1 and the others 0. (prometheus.github.io)
1) “Was in state X at any time during the last N?”
Use max_over_time() (max value in the lookback window). If the series is 0/1, then max_over_time(...)=1 means “it happened at least once” in the window. (prometheus.io)
max_over_time(my_task_state{state="running"}[10m])
2) Comparison operator behavior (filtering vs keeping 0/1)
PromQL comparison operators filter by default (false samples are dropped). Add the bool modifier to keep the series and return 0/1 instead. (prometheus.io)
Filter to only series that were running sometime in the last 10m:
max_over_time(my_task_state{state="running"}[10m]) > 0
Keep all series, but as a 0/1 “mask”:
max_over_time(my_task_state{state="running"}[10m]) > bool 0
3) “Masking” another metric by that enum-state condition
Multiply by the 0/1 mask to zero-out values when the condition is false:
some_metric
* on(job, instance)
(max_over_time(my_task_state{state="running"}[10m]) > bool 0)
(Use on(...) labels that uniquely match your series.) (prometheus.io)
Citations:
- 1: https://prometheus.github.io/client_python/instrumenting/enum/?utm_source=openai
- 2: https://prometheus.io/docs/prometheus/3.4/querying/functions?utm_source=openai
- 3: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 4: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
🌐 Web query:
PromQL max_over_time == operator behavior gauge with multiple values in range
💡 Result:
- `max_over_time(gauge[5m])` looks at all samples of each individual time series in the last 5 minutes and returns an instant vector whose value is the maximum sample seen in that range (per series). It does not return multiple values; it collapses the range to one value per series. [2]
- When you then do `max_over_time(gauge[5m]) == X`, the `==` comparison is applied to that single max value (per series) at the evaluation time:
  - Without `bool`: it filters—you only get the series whose max equals X, and the resulting sample value is the LHS value (i.e., X). [1]
  - With `bool`: it returns 1 (true) or 0 (false) for each matched series instead of filtering. [1]

Common “I expected it to check any sample equals X in the range” pattern

To test whether the gauge equaled X at any point in the last 5 minutes, compare first (with bool), then aggregate over time:

`max_over_time((gauge == bool X)[5m]) == 1`

Because `(gauge == bool X)` produces 0/1 at each eval, `max_over_time(...[5m])` becomes 1 if it was ever true in the window. [1] [2]

If you also meant “max across multiple series”

- `max_over_time(...)` is per-series over time; to then take the max across series, wrap with the `max()` aggregation, e.g. `max(max_over_time(gauge[5m]))`. [2] [3]
🌐 Web query:
PromQL == bool operator vs == comparison Prometheus
💡 Result:
In PromQL, == is a comparison operator whose default behavior is filtering, while == bool applies the bool modifier to change the output into 0/1 instead of filtering.
a == b (no bool): filters

- For vector comparisons, samples where the comparison is false are dropped from the result (only “true” matches remain). (prometheus.io)
- For scalar-to-scalar comparisons, PromQL requires the `bool` modifier. (prometheus.io)

Example: `up == 1` returns only series where up equals 1; others disappear.

a == bool b: returns 0/1

- For vector comparisons, keeps matched series and sets the sample value to 1 if true, 0 if false. (Series with no match are still dropped.) (prometheus.io)
- Also drops the metric name in the output. (prometheus.io)
- For scalar-to-scalar, `bool` is mandatory and the result is scalar 0 or 1. (prometheus.io)

Example: `up == bool 1` returns the same label sets (where matching exists), but values become 1 or 0 instead of dropping the false ones. (prometheus.io)
Citations:
- 1: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 2: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 3: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 4: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 5: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 6: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
🌐 Web query:
PromQL how to detect if value 1 appeared in time window without being masked by higher value
💡 Result:
To detect whether a series ever had the exact value 1 within a lookback window (even if it also had higher values), compare first, then aggregate over time:
Return 1 if m was ever 1 in the last 10 minutes, else 0
max_over_time((m == bool 1)[10m])
== bool turns each sample into 1 (true) or 0 (false) instead of filtering, and max_over_time over the window becomes “did it ever happen?”. [1]
Or: return the count of samples equal to 1 in the last 10 minutes
count_over_time((m == 1)[10m])
Then “did it happen?” is:
count_over_time((m == 1)[10m]) > 0
count_over_time counts samples in the range. [2]
Tip: use m == bool 1 if you want a stable 0/1 output per series; using m == 1 filters samples and can yield “no series returned” when there were no matches. [1]
Sources: Prometheus comparison operators and bool modifier [1]; Prometheus _over_time functions including count_over_time / max_over_time [2].
Fix state-value masking in liveness fault detection
The query max_over_time(...state...[1m]) == 1 will miss unreachable faults when unknown (2) also occurs in the same 1-minute window. This is because max_over_time() returns the maximum value in the window (here: 2), so the comparison 2 == 1 filters out the series incorrectly. The correct pattern is to apply the comparison before the time aggregation: max_over_time((state == bool 1)[1m]) to detect whether value 1 appeared at any point in the window.
Additionally, target B lacks the explicit step: 30 present in target A, which can cause inconsistent evaluation cadences. Both targets should use matching sampling parameters for alignment.
The suggested fix below uses the correct masking-prevention pattern with == bool, though the change from count() to max() aggregation should be reviewed to ensure it preserves the original intent of counting faults per instance/store:
Suggested query pattern update
- "expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == 2) by (instance, store)",
+ "expr": "count((max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == bool 2)) by (instance, store)",
@@
- "expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == 1) by (instance, store)",
+ "expr": "count((max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == bool 1)) by (instance, store)",
@@
- "refId": "B"
+ "step": 30,
+ "refId": "B"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/metrics/nextgengrafana/tidb_with_keyspace_name.json` around lines 12961 -
12977, The liveness queries for tidb_tikvclient_store_liveness_state (refId "A"
and "B") are masking values because they compare after max_over_time; update
both expressions to apply the comparison inside the time aggregation using the
"== bool" pattern, e.g. max_over_time((tidb_tikvclient_store_liveness_state{...}
== bool 1)[1m]) and max_over_time((tidb_tikvclient_store_liveness_state{...} ==
bool 2)[1m]) (or wrap with count()/max as needed to preserve counting
semantics), and ensure target B also includes "step": 30 to match target A for
consistent sampling cadence.
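As a sanity check of the compare-before-aggregate pattern recommended above, here is a rough Python analogue (the scrape data and helper name are hypothetical, and this only approximates the per-series counting semantics) of counting, per (instance, store), the series in which a given state appeared during the window:

```python
from collections import defaultdict

# Hypothetical scrape data: (instance, store) -> state samples in a 1m window.
# States: 0 = reachable, 1 = unreachable, 2 = unknown.
windows = {
    ("tc-tidb-0", "5"): [0, 1, 2],  # both unreachable (1) and unknown (2) seen
    ("tc-tidb-1", "5"): [0, 0, 2],  # only unknown seen
}

def count_state(windows, state):
    """Rough analogue of counting series where
    max_over_time((m == bool state)[1m:]) evaluates to 1, grouped by
    (instance, store): each series in which `state` appeared at least once."""
    counts = defaultdict(int)
    for (instance, store), samples in windows.items():
        # compare first (0/1 per sample), then take the max over the window
        if max(1 if s == state else 0 for s in samples) == 1:
            counts[(instance, store)] += 1
    return dict(counts)

print(count_state(windows, 1))  # {('tc-tidb-0', '5'): 1}
print(count_state(windows, 2))  # {('tc-tidb-0', '5'): 1, ('tc-tidb-1', '5'): 1}
```

Note that tc-tidb-0's unreachable fault is counted even though an unknown (2) sample occurs in the same window, which is exactly what the max-then-compare form would miss.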
Added. PTAL again.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk, ekexium, zimulala

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details
Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing …
|
/retest
/test unit-test
@MyonKeminta: The specified target(s) for … Use …

Details
In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What problem does this PR solve?
Issue Number: close #65459
Problem Summary:
What changed and how does it work?
This PR shows abnormal `checkLiveness` results on the grafana dashboard.

Check List
Tests
Examples:
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit