
metrics: Add liveness check faults to grafana dashboard #65460

Merged
ti-chi-bot[bot] merged 4 commits into pingcap:master from MyonKeminta:m/add-liveness-check-result-to-metrics
Mar 4, 2026

Conversation

@MyonKeminta
Contributor

@MyonKeminta MyonKeminta commented Jan 7, 2026

What problem does this PR solve?

Issue Number: close #65459

Problem Summary:

What changed and how does it work?

This PR shows abnormal checkLiveness results on the Grafana dashboard.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Examples:

(two screenshots of the new "Store Liveness Check Faults" panels)

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • New Features
    • Expanded monitoring dashboards with many new panels for TiKV client observability: liveness check faults and success rates, KV request totals, abnormal liveness results, request latency and failure metrics.
    • Added owner/memory-focused panels: events, memory quota/pool stats, and runtime memory pressure to improve resource and store-health visibility.

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 7, 2026
@tiprow

tiprow bot commented Jan 7, 2026

Hi @MyonKeminta. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov

codecov bot commented Jan 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.2785%. Comparing base (75d41dc) to head (9e00888).
⚠️ Report is 20 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #65460        +/-   ##
================================================
- Coverage   77.6858%   77.2785%   -0.4073%     
================================================
  Files          2007       1929        -78     
  Lines        549495     542180      -7315     
================================================
- Hits         426880     418989      -7891     
- Misses       120955     123188      +2233     
+ Partials       1660          3      -1657     
Flag Coverage Δ
unit 76.4922% <ø> (+0.1510%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 56.7974% <ø> (ø)
parser ∅ <ø> (∅)
br 48.7337% <ø> (-12.1518%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Feb 25, 2026
@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Feb 25, 2026
@ti-chi-bot

ti-chi-bot bot commented Feb 25, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-02-25 07:03:10.307281076 +0000 UTC m=+252062.822075686: ☑️ agreed by ekexium.
  • 2026-02-25 07:32:42.187599046 +0000 UTC m=+253834.702393645: ☑️ agreed by cfzjywxk.

@MyonKeminta
Contributor Author

/retest

@tiprow

tiprow bot commented Feb 25, 2026

@MyonKeminta: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@MyonKeminta
Contributor Author

/retest

@tiprow

tiprow bot commented Feb 27, 2026

@MyonKeminta: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@MyonKeminta
Contributor Author

/ok-to-test

@ti-chi-bot ti-chi-bot bot added the ok-to-test Indicates a PR is ready to be tested. label Feb 27, 2026
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@coderabbitai

coderabbitai bot commented Feb 27, 2026

📝 Walkthrough

Walkthrough

Adds multiple Grafana dashboard panels to two JSON dashboard files to surface TiKV client liveness, store liveness faults, request totals, and related memory/owner metrics; changes are purely dashboard configuration (no code logic changes).

Changes

  • Grafana: tidb dashboard (pkg/metrics/grafana/tidb.json): Inserted many new panels (IDs ~320–421) under existing rows (Resource Manager, KV Errors) tracking TiKV client liveness faults, liveness success ratios, KV request totals, abnormal liveness results, and related request/fault metrics using Prometheus queries and configured display options.
  • Grafana: tidb with keyspace (pkg/metrics/nextgengrafana/tidb_with_keyspace_name.json): Added four panels (Owner / Memory Arbitrator rows) for Events, Mem Quota Stats, Mem Pool Stats, and Runtime Mem Pressure; includes metrics tied to liveness checks, store liveness states, heartbeat delays, and memory-related stats with visualization settings.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I hopped through JSON, nibbling bits of code,
New panels sprout where metrics flowed.
Liveness and faults now brightly shown,
I watch TiKV's heartbeat from my burrowed home,
Hop-hop — dashboards bloom on the monitoring road.

🚥 Pre-merge checks | ✅ 5 passed

  • Title check (✅ Passed): The title clearly summarizes the main change: adding liveness check fault metrics to the Grafana dashboard, which aligns with the file modifications and PR objectives.
  • Description check (✅ Passed): The PR description includes the required Issue Number referencing #65459, explains what changed (adding abnormal checkLiveness results to the dashboard), provides manual testing evidence with images, and fills out the checklist appropriately.
  • Linked Issues check (✅ Passed): The PR successfully implements the main objective from #65459: adding liveness check faults to the Grafana dashboard for observability of TiKV health status and client-go replica selection impact.
  • Out of Scope Changes check (✅ Passed): All changes are scoped to Grafana dashboard JSON files (tidb.json and tidb_with_keyspace_name.json), adding new monitoring panels for liveness checks as required by the linked issue.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/metrics/grafana/tidb.json`:
- Around line 13070-13083: The PromQL expressions using
max_over_time(tidb_tikvclient_store_liveness_state[...] ) == 1 and == 2 are
filtering series and will drop a series if the other state appears in the same
window; change the comparisons to use the bool modifier inside the range vectors
so you evaluate "did this state occur in the window" (e.g. use
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 1)[1m]) and
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 2)[1m]) ) so
both unreachable (1) and unknown (2) are counted independently even when both
occur in the 1m window.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75d41dc and b60d949.

📒 Files selected for processing (1)
  • pkg/metrics/grafana/tidb.json

Comment on lines +13070 to +13083
"expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", instance=~\"$instance\"}[1m]) == 2) by (instance, store)",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 2,
"legendFormat": "{{instance}}-store-{{store}}-unknown",
"range": true,
"refId": "A",
"step": 30
},
{
"exemplar": true,
"expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", instance=~\"$instance\"}[1m]) == 1) by (instance, store)",
"hide": false,

@coderabbitai coderabbitai bot Feb 27, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the file exists and examine the specific lines
wc -l pkg/metrics/grafana/tidb.json

Repository: pingcap/tidb

Length of output: 91


🏁 Script executed:

# Get the specific lines mentioned in the review
sed -n '13070,13083p' pkg/metrics/grafana/tidb.json | cat -n

Repository: pingcap/tidb

Length of output: 930


🏁 Script executed:

# Get more context around these lines to understand the full panel structure
sed -n '13060,13090p' pkg/metrics/grafana/tidb.json | cat -n

Repository: pingcap/tidb

Length of output: 1544


🌐 Web query:

PromQL max_over_time behavior with numeric state metrics and comparison operators

💡 Result:

max_over_time() with “numeric state” (0/1 or enum-like) metrics

  • max_over_time(v[5m]) looks at all float samples for each series inside the 5m range and returns the maximum sample value seen in that window. It does not mean “latest state” or “most common state”. [1]
  • Therefore:
    • For a binary gauge (0/1), max_over_time(x[5m]) returns 1 if the series was ever 1 in the last 5 minutes; otherwise 0.
    • For an enum encoded as numbers (e.g., 0=OK, 1=WARN, 2=CRIT), it returns the highest numeric code that occurred in the window (which may be misleading if the metric can go back down but you really wanted “current state”). [1]
  • If you want the most recent value in the window instead, use last_over_time(x[5m]). [1]

How comparisons interact with _over_time() (and why bool matters)

PromQL comparisons (==, >, etc.) behave like this:

  • Without bool: comparisons filter series/samples—false results are dropped from the output. [2]
    • If you then feed that into an _over_time() via a subquery, you can end up with no samples in the range (so the series may disappear entirely).
  • With bool: comparisons return 0/1 values instead of filtering (and the metric name is dropped). [2]
    • This is usually what you want for “did it ever match in the window?” style queries, because now max_over_time() can safely compute a max over a continuous 0/1 signal.

Common pattern: “was condition ever true in the last 5 minutes?”

max_over_time((my_state_metric == bool 2)[5m:])
  • == bool 2 turns the condition into a 0/1 time series. [2]
  • max_over_time(...[5m:]) returns 1 if it was true at least once in the last 5 minutes, else 0. [1]

If you instead do (my_state_metric == 2) without bool, it filters, so it tends to answer “show me samples where it was 2”, not “return 0/1 over time for the condition”. [2]

Sources: Prometheus function docs for _over_time / max_over_time / last_over_time [1], and PromQL comparison operator (bool vs filtering) semantics [2].

[1] (prometheus.io)
[2] (prometheus.io)


🌐 Web query:

Prometheus PromQL max_over_time == operator semantics state transitions

💡 Result:

  • max_over_time(v[5m]) takes a range vector and returns an instant vector whose sample value (at the current evaluation time) is the maximum sample value seen in the last 5 minutes for each input series. (prometheus.io)

    • The returned sample’s timestamp is the query evaluation time, not “when the max happened” in the window (PromQL can’t directly return the timestamp-of-max via timestamp(max_over_time(...))). (github.com)
  • == is a comparison operator. For vectors it is filtering by default: it keeps only the series where the comparison is true, and drops the others. (prometheus.io)

    • Important: when a series is kept (true), the output sample value is the original left-hand side value (here: the max_over_time(...) result), not 1.
  • == bool changes the semantics to return 0/1 instead of filtering. With bool, the series are not dropped; the output value becomes 1 if true, 0 if false. (prometheus.io)

What this means for “state transitions”

If you’re graphing/alerting on a “state” derived from max_over_time(...) == ..., the “state” can change only at evaluation instants, and it is driven by whether the max in the lookback window crosses your threshold:

  • “Condition holds now” (filtering form):

    max_over_time(metric[5m]) == 0
    

    Produces no sample when false (series disappears), and produces a sample valued 0 when true (because LHS max is 0).

  • “Condition as a stable 0/1 signal” (recommended for transitions):

    max_over_time(metric[5m]) == bool 0
    

    Produces a continuous 1/0 series, making edges easy to detect (e.g., with changes(...)). (prometheus.io)


State transitions will be undercounted: max_over_time(state[1m]) == 1 is masked by state 2

Lines 13070 and 13082 use filtering comparisons without bool. In PromQL, max_over_time(state[1m]) == 1 (without bool) filters the series: it only passes through when the max equals 1. If the state transitions 1 → 2 within the 1-minute window, max_over_time yields 2, so the == 1 filter drops the series entirely. This causes the unreachable state to be undercounted whenever the unknown state also occurs in the same window.

To detect "did state ever occur in the window", use comparisons with the bool modifier to return 0/1 instead of filtering:

max_over_time((tidb_tikvclient_store_liveness_state{k8s_cluster="$k8s_cluster", tidb_cluster="$tidb_cluster", instance=~"$instance"} == bool 2)[1m:])

This ensures both state 1 and state 2 are counted independently even when both occur in the same window.
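
The filtering-vs-bool distinction above can be sketched with a toy Python model of one series' samples inside a 1m window (plain Python standing in for PromQL evaluation; the 0/1/2 encoding follows the panel's reachable/unreachable/unknown legend):

```python
# Toy model of why `max_over_time(state[1m]) == 1` misses a state that
# occurred in the window. `window` holds the samples of one series of
# tidb_tikvclient_store_liveness_state within a single 1m range:
# 0 = reachable, 1 = unreachable, 2 = unknown.
window = [0, 1, 2, 2]  # store was unreachable, then unknown, in the same window

# Filtering form: aggregate first, compare after. The max is 2, so `== 1`
# is false, the series is filtered out, and the unreachable fault vanishes.
agg_then_compare = max(window) == 1  # False -> series dropped

# `== bool` form: compare each sample first, producing a 0/1 signal, then
# take max_over_time over that signal. 1 means "state occurred in window".
ever_unreachable = max(1 if v == 1 else 0 for v in window)
ever_unknown = max(1 if v == 2 else 0 for v in window)

print(agg_then_compare, ever_unreachable, ever_unknown)  # False 1 1
```

With the bool form, both the unreachable (1) and unknown (2) signals report 1 for this window, instead of unreachable being masked by the higher state value.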

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/grafana/tidb.json` around lines 13070 - 13083, The PromQL
expressions using max_over_time(tidb_tikvclient_store_liveness_state[...] ) == 1
and == 2 are filtering series and will drop a series if the other state appears
in the same window; change the comparisons to use the bool modifier inside the
range vectors so you evaluate "did this state occur in the window" (e.g. use
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 1)[1m]) and
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 2)[1m]) ) so
both unreachable (1) and unknown (2) are counted independently even when both
occur in the 1m window.

Contributor Author


¿



The first PR description image shows a Grafana panel titled "Store Liveness Check Faults" displaying a stacked bar chart. The chart shows a spike around 15:40 with multiple colored bars representing different store instances (from various IP addresses like 192.168.178.12 and 192.168.180.11/13) that are unreachable for store-4. The legend on the right shows max, avg, and current values of 1 for each series.

The second PR description image shows another "Store Liveness Check Faults" panel with a bar chart spike around 18:00. This displays two series: tc-tidb-0-store-5-unknown and tc-tidb-1-store-5-unknown, both with max, avg, and current values of 1.


@MyonKeminta, I'm replying to your latest comment without being aware of the entire conversation thread (as the earlier comments are too long to be provided).

Your comment "¿" is unclear to me. Could you please clarify what you're asking about? Are you questioning:

  • The Prometheus query expressions in the new panel?
  • The visualization configuration (bars, stacking, etc.)?
  • The metric name or description?
  • Something else about this monitoring addition?

I'm happy to help once I understand what you'd like to know!

Contributor

@zimulala zimulala left a comment


Do we need to add corresponding metrics for the next-gen version as well?

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/metrics/nextgengrafana/tidb_with_keyspace_name.json`:
- Around line 12961-12977: The liveness queries for
tidb_tikvclient_store_liveness_state (refId "A" and "B") are masking values
because they compare after max_over_time; update both expressions to apply the
comparison inside the time aggregation using the "== bool" pattern, e.g.
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 1)[1m]) and
max_over_time((tidb_tikvclient_store_liveness_state{...} == bool 2)[1m]) (or
wrap with count()/max as needed to preserve counting semantics), and ensure
target B also includes "step": 30 to match target A for consistent sampling
cadence.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b60d949 and 9e00888.

📒 Files selected for processing (1)
  • pkg/metrics/nextgengrafana/tidb_with_keyspace_name.json

Comment on lines +12961 to +12977
"expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == 2) by (instance, store)",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 2,
"legendFormat": "{{instance}}-store-{{store}}-unknown",
"range": true,
"refId": "A",
"step": 30
},
{
"exemplar": true,
"expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == 1) by (instance, store)",
"hide": false,
"interval": "",
"legendFormat": "{{instance}}-store-{{store}}-unreachable",
"refId": "B"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

PromQL max_over_time comparison operator enum states masking

💡 Result:

In Prometheus, an “enum/state” metric is typically exported as one time series per state (e.g., my_task_state{state="running"}), where the current state has value 1 and the others 0. (prometheus.github.io)

1) “Was in state X at any time during the last N?”

Use max_over_time() (max value in the lookback window). If the series is 0/1, then max_over_time(...)=1 means “it happened at least once” in the window. (prometheus.io)

max_over_time(my_task_state{state="running"}[10m])

2) Comparison operator behavior (filtering vs keeping 0/1)

PromQL comparison operators filter by default (false samples are dropped). Add the bool modifier to keep the series and return 0/1 instead. (prometheus.io)

Filter to only series that were running sometime in the last 10m:

max_over_time(my_task_state{state="running"}[10m]) > 0

Keep all series, but as a 0/1 “mask”:

max_over_time(my_task_state{state="running"}[10m]) > bool 0

3) “Masking” another metric by that enum-state condition

Multiply by the 0/1 mask to zero-out values when the condition is false:

some_metric
  * on(job, instance)
    (max_over_time(my_task_state{state="running"}[10m]) > bool 0)

(Use on(...) labels that uniquely match your series.) (prometheus.io)


🌐 Web query:

PromQL max_over_time == operator behavior gauge with multiple values in range

💡 Result:

  • max_over_time(gauge[5m]) looks at all samples of each individual time series in the last 5 minutes and returns an instant vector whose value is the maximum sample seen in that range (per series). It does not return multiple values; it collapses the range to one value per series. [2]
  • When you then do max_over_time(gauge[5m]) == X, the == comparison is applied to that single max value (per series) at the evaluation time:
    • Without bool: it filters—you only get the series whose max equals X, and the resulting sample value is the LHS value (i.e., X). [1]
    • With bool: it returns 1 (true) or 0 (false) for each matched series instead of filtering. [1]

Common “I expected it to check any sample equals X in the range” pattern

To test whether the gauge equaled X at any point in the last 5 minutes, compare first (with bool), then aggregate over time:

max_over_time((gauge == bool X)[5m]) == 1

Because (gauge == bool X) produces 0/1 at each eval, max_over_time(...[5m]) becomes 1 if it was ever true in the window. [1] [2]

If you also meant “max across multiple series”

  • max_over_time(...) is per-series over time; to then take the max across series, wrap with the max() aggregation, e.g. max(max_over_time(gauge[5m])). [2] [3]

🌐 Web query:

PromQL == bool operator vs == comparison Prometheus

💡 Result:

In PromQL, == is a comparison operator whose default behavior is filtering, while == bool applies the bool modifier to change the output into 0/1 instead of filtering.

a == b (no bool): filters

  • For vector comparisons, samples where the comparison is false are dropped from the result (only “true” matches remain). (prometheus.io)
  • For scalar-to-scalar comparisons, PromQL requires the bool modifier. (prometheus.io)

Example:

up == 1

Returns only series where up equals 1; others disappear.

a == bool b: returns 0/1

  • For vector comparisons, keeps matched series and sets the sample value to:
    • 1 if true
    • 0 if false
      (Series with no match are still dropped.) (prometheus.io)
  • Also drops the metric name in the output. (prometheus.io)
  • For scalar-to-scalar, bool is mandatory and the result is scalar 0 or 1. (prometheus.io)

Example:

up == bool 1

Returns the same label sets (where matching exists), but values become 1 or 0 instead of dropping the false ones. (prometheus.io)


🌐 Web query:

PromQL how to detect if value 1 appeared in time window without being masked by higher value

💡 Result:

To detect whether a series ever had the exact value 1 within a lookback window (even if it also had higher values), compare first, then aggregate over time:

Return 1 if m was ever 1 in the last 10 minutes, else 0

max_over_time((m == bool 1)[10m])

== bool turns each sample into 1 (true) or 0 (false) instead of filtering, and max_over_time over the window becomes “did it ever happen?”. [1]

Or: return the count of samples equal to 1 in the last 10 minutes

count_over_time((m == 1)[10m])

Then “did it happen?” is:

count_over_time((m == 1)[10m]) > 0

count_over_time counts samples in the range. [2]

Tip: use m == bool 1 if you want a stable 0/1 output per series; using m == 1 filters samples and can yield “no series returned” when there were no matches. [1]

Sources: Prometheus comparison operators and bool modifier [1]; Prometheus _over_time functions including count_over_time / max_over_time [2].


Fix state-value masking in liveness fault detection

The query max_over_time(...state...[1m]) == 1 will miss unreachable faults when unknown (2) also occurs in the same 1-minute window. This is because max_over_time() returns the maximum value in the window (here: 2), so the comparison 2 == 1 filters out the series incorrectly. The correct pattern is to apply the comparison before the time aggregation: max_over_time((state == bool 1)[1m]) to detect whether value 1 appeared at any point in the window.

Additionally, target B lacks the explicit step: 30 present in target A, which can cause inconsistent evaluation cadences. Both targets should use matching sampling parameters for alignment.

The suggested fix below uses the correct masking-prevention pattern with == bool, though the change from count() to max() aggregation should be reviewed to ensure it preserves the original intent of counting faults per instance/store:

Suggested query pattern update
-              "expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == 2) by (instance, store)",
+              "expr": "count((max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == bool 2)) by (instance, store)",
@@
-              "expr": "count(max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == 1) by (instance, store)",
+              "expr": "count((max_over_time(tidb_tikvclient_store_liveness_state{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", instance=~\"$instance\"}[1m]) == bool 1)) by (instance, store)",
@@
-              "refId": "B"
+              "step": 30,
+              "refId": "B"
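
One caveat with the suggested rewrite, sketched here in toy Python (the per-series values are hypothetical, not real query output): because == bool keeps non-matching series with value 0, wrapping the result in count() tallies every series in the group, not only the faulting ones; summing the 0/1 values preserves the original "number of faulting series" semantics.

```python
# Hypothetical per-(instance, store) results of
# max_over_time((state == bool 2)[1m]): 1 = state 2 seen in window, 0 = not.
mask = {("tidb-0", "s1"): 1, ("tidb-0", "s2"): 0, ("tidb-1", "s1"): 1}

# Without bool, false series are dropped before count(), so count()
# effectively counted only the matching (faulting) series.
filtered = {k: v for k, v in mask.items() if v == 1}
count_filtered = len(filtered)  # 2 faulting series

# With bool, series valued 0 remain in the result, so count() counts all
# series and no longer means "faults only"...
count_bool = len(mask)  # 3

# ...while sum() over the 0/1 values keeps the intended fault count.
sum_bool = sum(mask.values())  # 2

print(count_filtered, count_bool, sum_bool)  # 2 3 2
```

This is why the review note says the switch between count() and other aggregations should be checked: with == bool, sum() by (instance, store) is the drop-in way to keep the original counting intent.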
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/nextgengrafana/tidb_with_keyspace_name.json` around lines 12961 -
12977, The liveness queries for tidb_tikvclient_store_liveness_state (refId "A"
and "B") are masking values because they compare after max_over_time; update
both expressions to apply the comparison inside the time aggregation using the
"== bool" pattern, e.g. max_over_time((tidb_tikvclient_store_liveness_state{...}
== bool 1)[1m]) and max_over_time((tidb_tikvclient_store_liveness_state{...} ==
bool 2)[1m]) (or wrap with count()/max as needed to preserve counting
semantics), and ensure target B also includes "step": 30 to match target A for
consistent sampling cadence.

@MyonKeminta
Contributor Author

Do we need to add corresponding metrics for the next-gen version as well?

Added. PTAL again.

@ti-chi-bot

ti-chi-bot bot commented Mar 3, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk, ekexium, zimulala

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the approved label Mar 3, 2026
@MyonKeminta
Contributor Author

/retest

@MyonKeminta
Contributor Author

/test unit-test

@tiprow

tiprow bot commented Mar 4, 2026

@MyonKeminta: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test fast_test_tiprow
/test tidb_parser_test

Use /test all to run all jobs.

Details

In response to this:

/test unit-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot merged commit a833b14 into pingcap:master Mar 4, 2026
31 checks passed
@MyonKeminta MyonKeminta deleted the m/add-liveness-check-result-to-metrics branch March 5, 2026 09:03

Labels

  • approved
  • lgtm
  • ok-to-test: Indicates a PR is ready to be tested.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add metrics about store liveness which helps in diagnosing failover and replica selection

4 participants