metrics: Add liveness check faults to grafana dashboard #65460
Conversation
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Hi @MyonKeminta. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with … I understand the commands that are listed here.

Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@             Coverage Diff              @@
##             master     #65460      +/- ##
================================================
- Coverage   77.6858%   77.2785%   -0.4073%
================================================
  Files          2007       1929        -78
  Lines        549495     542180      -7315
================================================
- Hits         426880     418989      -7891
- Misses       120955     123188      +2233
+ Partials       1660          3      -1657
```

Flags with carried forward coverage won't be shown. Click here to find out more.
/retest
@MyonKeminta: PRs from untrusted users cannot be marked as trusted with …

Details
In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/retest
@MyonKeminta: PRs from untrusted users cannot be marked as trusted with …

Details
In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
…iveness-check-result-to-metrics
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
📝 Walkthrough

Adds multiple Grafana dashboard panels to two JSON dashboard files to surface TiKV client liveness, store liveness faults, request totals, and related memory/owner metrics; changes are purely dashboard configuration (no code logic changes).

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ Passed checks (5 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/metrics/grafana/tidb.json`:
- Around line 13070-13083: The PromQL expressions using
max_over_time(tidb_tikvclient_store_liveness_state[...] ) == 1 and == 2 are
filtering series and will drop a series if the other state appears in the same
window; change the comparisons to use the bool modifier inside the range vectors
so you evaluate "did this state occur in the window" (e.g. use
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 1)[1m]) and
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 2)[1m]) ) so
both unreachable (1) and unknown (2) are counted independently even when both
occur in the 1m window.
```json
            "expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", instance=~\"$instance\"}[1m]) == 2) by (instance, store)",
            "format": "time_series",
            "hide": false,
            "interval": "",
            "intervalFactor": 2,
            "legendFormat": "{{instance}}-store-{{store}}-unknown",
            "range": true,
            "refId": "A",
            "step": 30
          },
          {
            "exemplar": true,
            "expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", instance=~\"$instance\"}[1m]) == 1) by (instance, store)",
            "hide": false,
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's check the file exists and examine the specific lines
wc -l pkg/metrics/grafana/tidb.json
```

Repository: pingcap/tidb
Length of output: 91
🏁 Script executed:
```shell
# Get the specific lines mentioned in the review
sed -n '13070,13083p' pkg/metrics/grafana/tidb.json | cat -n
```

Repository: pingcap/tidb
Length of output: 930
🏁 Script executed:
```shell
# Get more context around these lines to understand the full panel structure
sed -n '13060,13090p' pkg/metrics/grafana/tidb.json | cat -n
```

Repository: pingcap/tidb
Length of output: 1544
🌐 Web query:
PromQL max_over_time behavior with numeric state metrics and comparison operators
💡 Result:
max_over_time() with “numeric state” (0/1 or enum-like) metrics

- `max_over_time(v[5m])` looks at all float samples for each series inside the 5m range and returns the maximum sample value seen in that window. It does not mean “latest state” or “most common state”. [1]
- Therefore:
  - For a binary gauge (0/1), `max_over_time(x[5m])` returns 1 if the series was ever 1 in the last 5 minutes; otherwise 0.
  - For an enum encoded as numbers (e.g., 0=OK, 1=WARN, 2=CRIT), it returns the highest numeric code that occurred in the window (which may be misleading if the metric can go back down but you really wanted “current state”). [1]
- If you want the most recent value in the window instead, use `last_over_time(x[5m])`. [1]
How comparisons interact with _over_time() (and why bool matters)
PromQL comparisons (==, >, etc.) behave like this:

- Without `bool`: comparisons filter series/samples—false results are dropped from the output. [2]
  - If you then feed that into an `_over_time()` via a subquery, you can end up with no samples in the range (so the series may disappear entirely).
- With `bool`: comparisons return 0/1 values instead of filtering (and the metric name is dropped). [2]
  - This is usually what you want for “did it ever match in the window?” style queries, because now `max_over_time()` can safely compute a max over a continuous 0/1 signal.

Common pattern: “was condition ever true in the last 5 minutes?”

`max_over_time((my_state_metric == bool 2)[5m:])`

- `== bool 2` turns the condition into a 0/1 time series. [2]
- `max_over_time(...[5m:])` returns 1 if it was true at least once in the last 5 minutes, else 0. [1]

If you instead do `(my_state_metric == 2)` without bool, it filters, so it tends to answer “show me samples where it was 2”, not “return 0/1 over time for the condition”. [2]
Sources: Prometheus function docs for _over_time / max_over_time / last_over_time [1], and PromQL comparison operator (bool vs filtering) semantics [2].
[1] (prometheus.io)
[2] (prometheus.io)
Citations:
- 1: https://prometheus.io/docs/prometheus/3.1/querying/functions/?utm_source=openai
- 2: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
🌐 Web query:
Prometheus PromQL max_over_time == operator semantics state transitions
💡 Result:
- `max_over_time(v[5m])` takes a range vector and returns an instant vector whose sample value (at the current evaluation time) is the maximum sample value seen in the last 5 minutes for each input series. (prometheus.io)
  - The returned sample’s timestamp is the query evaluation time, not “when the max happened” in the window (PromQL can’t directly return the timestamp-of-max via `timestamp(max_over_time(...))`). (github.com)
- `==` is a comparison operator. For vectors it is filtering by default: it keeps only the series where the comparison is true, and drops the others. (prometheus.io)
  - Important: when a series is kept (true), the output sample value is the original left-hand side value (here: the `max_over_time(...)` result), not 1.
- `== bool` changes the semantics to return 0/1 instead of filtering. With `bool`, the series are not dropped; the output value becomes 1 if true, 0 if false. (prometheus.io)
What this means for “state transitions”
If you’re graphing/alerting on a “state” derived from `max_over_time(...) == ...`, the “state” can change only at evaluation instants, and it is driven by whether the max in the lookback window crosses your threshold:

- “Condition holds now” (filtering form): `max_over_time(metric[5m]) == 0` produces no sample when false (series disappears), and produces a sample valued 0 when true (because the LHS max is 0).
- “Condition as a stable 0/1 signal” (recommended for transitions): `max_over_time(metric[5m]) == bool 0` produces a continuous 1/0 series, making edges easy to detect (e.g., with `changes(...)`). (prometheus.io)
Citations:
- 1: https://prometheus.io/docs/prometheus/3.4/querying/functions?utm_source=openai
- 2: Actual timestamp() for max_over_time (and other agg functions) prometheus/prometheus#5003
- 3: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 4: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 5: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
State transitions will be undercounted: max_over_time(state[1m]) == 1 is masked by state 2
Lines 13070 and 13082 use filtering comparisons without bool. In PromQL, max_over_time(state[1m]) == 1 (without bool) filters the series—it only passes through when the max equals 1. If the state transitions 1 → 2 within the 1-minute window, max_over_time yields 2, so the == 1 filter drops the series entirely. This causes state unreachable to be undercounted whenever unknown also occurs in the same window.
To detect "did state ever occur in the window", use comparisons with the bool modifier to return 0/1 instead of filtering:
max_over_time((tidb_tikvclient_store_liveness_state{k8s_cluster="$k8s_cluster", tidb_cluster="$tidb_cluster", instance=~"$instance"} == bool 2)[1m:])
This ensures both state 1 and state 2 are counted independently even when both occur in the same window.
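The filtering-vs-bool behavior is easy to reproduce outside Prometheus. Here is a small Python sketch (toy sample data and helper names are invented for illustration) mimicking the two evaluation orders over one series' 1m window:

```python
# Toy simulation of PromQL semantics for one series' 1m window.
# Liveness states (illustrative): 0 = reachable, 1 = unreachable, 2 = unknown.

def max_then_compare(samples, state):
    """Mimics max_over_time(m[1m]) == state (filtering comparison).

    Returns the kept sample value, or None if the series is dropped.
    """
    m = max(samples)
    return m if m == state else None

def compare_then_max(samples, state):
    """Mimics max_over_time((m == bool state)[1m:]) -- compare first, then aggregate."""
    return max(1 if s == state else 0 for s in samples)

window = [0, 1, 2, 1]  # both "unreachable" (1) and "unknown" (2) occur

# The max over the window is 2, so the "== 1" filter drops the series entirely:
print(max_then_compare(window, 1))  # None -> the unreachable fault is missed
print(max_then_compare(window, 2))  # 2    -> only the highest state survives

# Comparing before aggregating detects each state independently:
print(compare_then_max(window, 1))  # 1
print(compare_then_max(window, 2))  # 1
```

This is only a model of the semantics, not Prometheus itself, but it shows why the `== 1` target goes dark exactly when both fault states land in the same window.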
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/metrics/grafana/tidb.json` around lines 13070 - 13083, The PromQL
expressions using max_over_time(tidb_tikvclient_store_liveness_state[...] ) == 1
and == 2 are filtering series and will drop a series if the other state appears
in the same window; change the comparisons to use the bool modifier inside the
range vectors so you evaluate "did this state occur in the window" (e.g. use
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 1)[1m]) and
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 2)[1m]) ) so
both unreachable (1) and unknown (2) are counted independently even when both
occur in the 1m window.
The first PR description image shows a Grafana panel titled "Store Liveness Check Faults" displaying a stacked bar chart. The chart shows a spike around 15:40 with multiple colored bars representing different store instances (from various IP addresses like 192.168.178.12 and 192.168.180.11/13) that are unreachable for store-4. The legend on the right shows max, avg, and current values of 1 for each series.
The second PR description image shows another "Store Liveness Check Faults" panel with a bar chart spike around 18:00. This displays two series: tc-tidb-0-store-5-unknown and tc-tidb-1-store-5-unknown, both with max, avg, and current values of 1.
@MyonKeminta, I'm replying to your latest comment without being aware of the entire conversation thread (as the earlier comments are too long to be provided).
Your comment "¿" is unclear to me. Could you please clarify what you're asking about? Are you questioning:
- The Prometheus query expressions in the new panel?
- The visualization configuration (bars, stacking, etc.)?
- The metric name or description?
- Something else about this monitoring addition?
I'm happy to help once I understand what you'd like to know!
zimulala
left a comment
Do we need to add corresponding metrics for the next-gen version as well?
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/metrics/nextgengrafana/tidb_with_keyspace_name.json`:
- Around line 12961-12977: The liveness queries for
tidb_tikvclient_store_liveness_state (refId "A" and "B") are masking values
because they compare after max_over_time; update both expressions to apply the
comparison inside the time aggregation using the "== bool" pattern, e.g.
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 1)[1m]) and
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 2)[1m]) (or
wrap with count()/max as needed to preserve counting semantics), and ensure
target B also includes "step": 30 to match target A for consistent sampling
cadence.
```json
            "expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == 2) by (instance, store)",
            "format": "time_series",
            "hide": false,
            "interval": "",
            "intervalFactor": 2,
            "legendFormat": "{{instance}}-store-{{store}}-unknown",
            "range": true,
            "refId": "A",
            "step": 30
          },
          {
            "exemplar": true,
            "expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == 1) by (instance, store)",
            "hide": false,
            "interval": "",
            "legendFormat": "{{instance}}-store-{{store}}-unreachable",
            "refId": "B"
```
🧩 Analysis chain
🌐 Web query:
PromQL max_over_time comparison operator enum states masking
💡 Result:
In Prometheus, an “enum/state” metric is typically exported as one time series per state (e.g., my_task_state{state="running"}), where the current state has value 1 and the others 0. (prometheus.github.io)
1) “Was in state X at any time during the last N?”
Use max_over_time() (max value in the lookback window). If the series is 0/1, then max_over_time(...)=1 means “it happened at least once” in the window. (prometheus.io)
max_over_time(my_task_state{state="running"}[10m])
2) Comparison operator behavior (filtering vs keeping 0/1)
PromQL comparison operators filter by default (false samples are dropped). Add the bool modifier to keep the series and return 0/1 instead. (prometheus.io)
Filter to only series that were running sometime in the last 10m:
max_over_time(my_task_state{state="running"}[10m]) > 0
Keep all series, but as a 0/1 “mask”:
max_over_time(my_task_state{state="running"}[10m]) > bool 0
3) “Masking” another metric by that enum-state condition
Multiply by the 0/1 mask to zero-out values when the condition is false:
some_metric
* on(job, instance)
(max_over_time(my_task_state{state="running"}[10m]) > bool 0)
(Use on(...) labels that uniquely match your series.) (prometheus.io)
Citations:
- 1: https://prometheus.github.io/client_python/instrumenting/enum/?utm_source=openai
- 2: https://prometheus.io/docs/prometheus/3.4/querying/functions?utm_source=openai
- 3: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 4: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
🌐 Web query:
PromQL max_over_time == operator behavior gauge with multiple values in range
💡 Result:
- `max_over_time(gauge[5m])` looks at all samples of each individual time series in the last 5 minutes and returns an instant vector whose value is the maximum sample seen in that range (per series). It does not return multiple values; it collapses the range to one value per series. [2]
- When you then do `max_over_time(gauge[5m]) == X`, the `==` comparison is applied to that single max value (per series) at the evaluation time:
  - Without `bool`: it filters—you only get the series whose max equals X, and the resulting sample value is the LHS value (i.e., X). [1]
  - With `bool`: it returns 1 (true) or 0 (false) for each matched series instead of filtering. [1]

Common “I expected it to check any sample equals X in the range” pattern

To test whether the gauge equaled X at any point in the last 5 minutes, compare first (with bool), then aggregate over time:

`max_over_time((gauge == bool X)[5m]) == 1`

Because `(gauge == bool X)` produces 0/1 at each eval, `max_over_time(...[5m])` becomes 1 if it was ever true in the window. [1] [2]

If you also meant “max across multiple series”

- `max_over_time(...)` is per-series over time; to then take the max across series, wrap with the `max()` aggregation, e.g. `max(max_over_time(gauge[5m]))`. [2] [3]
🌐 Web query:
PromQL == bool operator vs == comparison Prometheus
💡 Result:
In PromQL, == is a comparison operator whose default behavior is filtering, while == bool applies the bool modifier to change the output into 0/1 instead of filtering.
a == b (no bool): filters

- For vector comparisons, samples where the comparison is false are dropped from the result (only “true” matches remain). (prometheus.io)
- For scalar-to-scalar comparisons, PromQL requires the `bool` modifier. (prometheus.io)

Example: `up == 1` returns only series where up equals 1; others disappear.

a == bool b: returns 0/1

- For vector comparisons, keeps matched series and sets the sample value to 1 if true, 0 if false. (Series with no match are still dropped.) (prometheus.io)
- Also drops the metric name in the output. (prometheus.io)
- For scalar-to-scalar, `bool` is mandatory and the result is scalar 0 or 1. (prometheus.io)

Example: `up == bool 1` returns the same label sets (where matching exists), but values become 1 or 0 instead of dropping the false ones. (prometheus.io)
Citations:
- 1: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 2: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 3: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 4: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 5: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
- 6: https://prometheus.io/docs/prometheus/latest/querying/operators/?utm_source=openai
🌐 Web query:
PromQL how to detect if value 1 appeared in time window without being masked by higher value
💡 Result:
To detect whether a series ever had the exact value 1 within a lookback window (even if it also had higher values), compare first, then aggregate over time:
Return 1 if m was ever 1 in the last 10 minutes, else 0
max_over_time((m == bool 1)[10m])
== bool turns each sample into 1 (true) or 0 (false) instead of filtering, and max_over_time over the window becomes “did it ever happen?”. [1]
Or: return the count of samples equal to 1 in the last 10 minutes
count_over_time((m == 1)[10m])
Then “did it happen?” is:
count_over_time((m == 1)[10m]) > 0
count_over_time counts samples in the range. [2]
Tip: use m == bool 1 if you want a stable 0/1 output per series; using m == 1 filters samples and can yield “no series returned” when there were no matches. [1]
Sources: Prometheus comparison operators and bool modifier [1]; Prometheus _over_time functions including count_over_time / max_over_time [2].
Fix state-value masking in liveness fault detection
The query max_over_time(...state...[1m]) == 1 will miss unreachable faults when unknown (2) also occurs in the same 1-minute window. This is because max_over_time() returns the maximum value in the window (here: 2), so the comparison 2 == 1 filters out the series incorrectly. The correct pattern is to apply the comparison before the time aggregation: max_over_time((state == bool 1)[1m]) to detect whether value 1 appeared at any point in the window.
Additionally, target B lacks the explicit step: 30 present in target A, which can cause inconsistent evaluation cadences. Both targets should use matching sampling parameters for alignment.
The suggested fix below uses the correct masking-prevention pattern with == bool, though the change from count() to max() aggregation should be reviewed to ensure it preserves the original intent of counting faults per instance/store:
Suggested query pattern update
- "expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == 2) by (instance, store)",
+ "expr": "count((max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == bool 2)) by (instance, store)",
@@
- "expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == 1) by (instance, store)",
+ "expr": "count((max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == bool 1)) by (instance, store)",
@@
- "refId": "B"
+ "step": 30,
+ "refId": "B"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/metrics/nextgengrafana/tidb_with_keyspace_name.json` around lines 12961 -
12977, The liveness queries for tidb_tikvclient_store_liveness_state (refId "A"
and "B") are masking values because they compare after max_over_time; update
both expressions to apply the comparison inside the time aggregation using the
"== bool" pattern, e.g. max_over_time((tidb_tikvclient_store_liveness_state{...}
== bool 1)[1m]) and max_over_time((tidb_tikvclient_store_liveness_state{...} ==
bool 2)[1m]) (or wrap with count()/max as needed to preserve counting
semantics), and ensure target B also includes "step": 30 to match target A for
consistent sampling cadence.
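As a sanity check of the compare-before-aggregate pattern recommended above, here is a rough Python analogue (the scrape data and helper name are hypothetical, and this only approximates the per-series counting semantics) of counting, per (instance, store), the series in which a given state appeared during the window:

```python
from collections import defaultdict

# Hypothetical scrape data: (instance, store) -> state samples in a 1m window.
# States: 0 = reachable, 1 = unreachable, 2 = unknown.
windows = {
    ("tc-tidb-0", "5"): [0, 1, 2],  # both unreachable (1) and unknown (2) seen
    ("tc-tidb-1", "5"): [0, 0, 2],  # only unknown seen
}

def count_state(windows, state):
    """Rough analogue of counting series where
    max_over_time((m == bool state)[1m:]) evaluates to 1, grouped by
    (instance, store): each series in which `state` appeared at least once."""
    counts = defaultdict(int)
    for (instance, store), samples in windows.items():
        # compare first (0/1 per sample), then take the max over the window
        if max(1 if s == state else 0 for s in samples) == 1:
            counts[(instance, store)] += 1
    return dict(counts)

print(count_state(windows, 1))  # {('tc-tidb-0', '5'): 1}
print(count_state(windows, 2))  # {('tc-tidb-0', '5'): 1, ('tc-tidb-1', '5'): 1}
```

Note that tc-tidb-0's unreachable fault is counted even though an unknown (2) sample occurs in the same window, which is exactly what the max-then-compare form would miss.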
Added. PTAL again.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk, ekexium, zimulala

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details
Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing …
|
/retest
/test unit-test
@MyonKeminta: The specified target(s) for … Use …

Details
In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What problem does this PR solve?
Issue Number: close #65459
Problem Summary:
What changed and how does it work?
This PR shows abnormal `checkLiveness` results on the grafana dashboard.

Check List
Tests
Examples:
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit