metrics: enhance diagnostic capabilities for gRPC network issues #67811

Open
zyguan wants to merge 7 commits into pingcap:master from zyguan:dev/bump-client-go
Conversation

@zyguan
Contributor

@zyguan zyguan commented Apr 16, 2026

What problem does this PR solve?

Issue Number: close #67810

Problem Summary: ref #67810

What changed and how does it work?

Bump client-go and register channelz collector.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • Chores

    • Updated pinned third-party Go/Bazel dependency version.
  • New Features

    • Added gRPC Channelz metrics collection to report channel/socket health.
  • Tests

    • Added unit tests for the Channelz collector and metrics gathering.
    • Improved test cleanup and harness (explicit collector teardown, added leak-ignore rules for gRPC/bufconn, increased some test shard counts).

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@ti-chi-bot ti-chi-bot bot added the release-note-none Denotes a PR that doesn't merit a release note. label Apr 16, 2026
@pantheon-ai

pantheon-ai bot commented Apr 16, 2026

@zyguan I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot ti-chi-bot bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 16, 2026
@tiprow

tiprow bot commented Apr 16, 2026

Hi @zyguan. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai

coderabbitai bot commented Apr 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a mutex-guarded gRPC Channelz Prometheus collector (with bufconn server/client and test cleanup), updates pinned github.com/tikv/client-go/v2 pseudo-version/sha256, adjusts Bazel build/test deps and sharding, and adds unit tests and goleak ignore entries across tests.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Dependency Metadata | DEPS.bzl, go.mod | Bumped pinned pseudo-version for github.com/tikv/client-go/v2 (updated strip_prefix, urls, and sha256) and updated go.mod requirement to the new pseudo-version. |
| Metrics Implementation | pkg/metrics/metrics.go | Introduces Channelz collector: mutex-guarded singleton state, bufconn-based local gRPC server, client dialer, collector creation with filters, Prometheus registration, and stop/cleanup helpers (test short-circuit). |
| Metrics Tests & Hooks | pkg/metrics/main_test.go, pkg/metrics/metrics_internal_test.go | Adds goleak cleanup hook and new unit tests for singleton init, test-mode skipping, cleanup/reset, and Prometheus Gather assertions; includes helper functions to inspect metric families. |
| Build / Test Config | pkg/importsdk/BUILD.bazel, pkg/metrics/BUILD.bazel, br/pkg/metautil/BUILD.bazel | Added //pkg/parser/ast to importsdk_test; added //pkg/util/intest, tikv Channelz collectors and gRPC bufconn/insecure/channelz deps to pkg/metrics library; increased shard_count for metrics_test 5→8 and metautil_test 13→15. |
| goleak Ignore Additions | multiple test mains (e.g., br/cmd/br/main_test.go, pkg/server/.../main_test.go, pkg/server/tests/.../main_test.go) | Added goleak.IgnoreTopFunction entries for google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run and google.golang.org/grpc/test/bufconn.(*Listener).Accept across many TestMain files to suppress bufconn/grpc-related goroutine reports. |

Sequence Diagram(s)

sequenceDiagram
    participant App as Application / Test
    participant Metrics as pkg/metrics
    participant Mutex as grpcChannelzCollector.mu
    participant Server as bufconn gRPC Server
    participant Client as gRPC ClientConn
    participant Channelz as ChannelzCollector
    participant Prom as Prometheus Registry

    App->>Metrics: setupChannelzCollector()
    rect rgba(100,150,200,0.5)
        Metrics->>Mutex: Lock
        Metrics->>Metrics: check intest.InTest / registered
    end

    alt Not in test and not registered
        Metrics->>Server: start bufconn server + register Channelz service
        Metrics->>Client: dial via bufconn dialer
        Metrics->>Channelz: NewChannelzCollector(Client, opts)
        Metrics->>Prom: prometheus.MustRegister(Channelz)
        Metrics->>Metrics: set registered = true
    end

    rect rgba(100,150,200,0.5)
        Metrics->>Mutex: Unlock
    end

    App->>Prom: Gather()
    Prom->>Channelz: Collect()
    Channelz-->>Prom: MetricFamilies
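
The locking flow in the diagram can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: the collectorState type, its fields, and the setup/teardown helpers are hypothetical names, and the inTest argument stands in for the intest.InTest check. The start callback represents whatever the real setup launches (the bufconn server, the client, and the Prometheus registration).

```go
package main

import (
	"fmt"
	"sync"
)

// collectorState is a hypothetical stand-in for the package-level
// collector state described in the walkthrough.
type collectorState struct {
	mu         sync.Mutex
	registered bool
	stop       func() // releases whatever start launched
}

// setup runs start at most once while holding the mutex.
// When inTest is true it does nothing, mirroring the test short-circuit.
// It reports whether this call actually performed the setup.
func (c *collectorState) setup(inTest bool, start func() func()) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if inTest || c.registered {
		return false
	}
	c.stop = start()
	c.registered = true
	return true
}

// teardown undoes a previous setup so that a later setup can run again.
// Calling it when nothing is registered is a no-op.
func (c *collectorState) teardown() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.registered {
		return
	}
	if c.stop != nil {
		c.stop()
	}
	c.stop = nil
	c.registered = false
}

func main() {
	var c collectorState
	starts, stops := 0, 0
	start := func() func() {
		starts++
		return func() { stops++ }
	}

	fmt.Println("in test:", c.setup(true, start))
	fmt.Println("first call:", c.setup(false, start))
	fmt.Println("second call:", c.setup(false, start))
	c.teardown()
	fmt.Println("starts:", starts, "stops:", stops)
}
```

Holding the mutex across both the check and the registration is what prevents two concurrent callers from registering the collector twice; a sync.Once would also work for the one-shot case but would not allow the reset that the unit tests rely on.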

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

component/statistics, ok-to-test

Suggested reviewers

  • yibin87

Poem

🐰 I tunneled bufconn lanes below,
I guarded metrics with a mutex glow,
Prom counts hops where diagnostics go,
I clean the burrow after tests run slow,
Hop — channelz stories start to show.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title is specific and directly related to the main changeset, which implements a gRPC channelz collector for improved diagnostics of network issues. |
| Description check | ✅ Passed | The description follows the template and includes a properly formatted issue reference (close #67810), a concise explanation of changes, and appropriate test/release note declarations. |
| Linked Issues check | ✅ Passed | The PR implements both key objectives from #67810: exporting gRPC internal metrics to Prometheus via channelz collector registration and improving observability for connection-level diagnostics. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the objectives: bumping client-go dependency and implementing channelz collector setup with supporting test infrastructure and goleak configuration updates. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
pkg/metrics/metrics.go (1)

499-531: Consider adding a brief startup synchronization or documenting the bufconn behavior.

The goroutine starting the gRPC server (line 509-513) runs asynchronously. While bufconn makes this safe because listener.DialContext will work immediately, a brief comment explaining why no explicit synchronization is needed would help future readers understand the design choice.

📝 Suggested documentation improvement
 	grpcChannelzCollector.server = grpc.NewServer()
 	service.RegisterChannelzServiceToServer(grpcChannelzCollector.server)
+	// The server is started asynchronously, but bufconn.Listener.DialContext works
+	// immediately without waiting for Serve() to be called, so no synchronization is needed.
 	go func(listener *bufconn.Listener, server *grpc.Server) {
 		if err := server.Serve(listener); err != nil {
 			logutil.BgLogger().Warn("internal channelz grpc server stopped", zap.Error(err))
 		}
 	}(grpcChannelzCollector.listener, grpcChannelzCollector.server)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/metrics.go` around lines 499 - 531, Add a short inline comment in
initGrpcChannelzCollectorLocked near the goroutine that starts the in-memory
gRPC server explaining that no explicit synchronization is required because
bufconn.Listen returns a ready listener and listener.DialContext will succeed
immediately (so dialing from the client goroutine is safe), and note that the
goroutine is only for Serve's lifecycle and errors are logged — reference the
goroutine that launches server.Serve(listener), the local variable listener
(grpcChannelzCollector.listener), and the DialContext usage in the
grpc.WithContextDialer closure to make the rationale easy to find.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 15a22892-57e7-441c-88eb-fb0d72cefd36

📥 Commits

Reviewing files that changed from the base of the PR and between 7762bc6 and 0a2b8ef.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (7)
  • DEPS.bzl
  • go.mod
  • pkg/importsdk/BUILD.bazel
  • pkg/metrics/BUILD.bazel
  • pkg/metrics/main_test.go
  • pkg/metrics/metrics.go
  • pkg/metrics/metrics_internal_test.go

@codecov

codecov bot commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.3104%. Comparing base (65d9fb6) to head (7974954).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67811        +/-   ##
================================================
+ Coverage   77.5964%   78.3104%   +0.7140%     
================================================
  Files          1982       1983         +1     
  Lines        548885     549403       +518     
================================================
+ Hits         425915     430240      +4325     
+ Misses       122165     118118      -4047     
- Partials        805       1045       +240     
Flag Coverage Δ
integration 44.2919% <ø> (+9.9518%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 61.5065% <ø> (+0.0901%) ⬆️
parser ∅ <ø> (∅)
br 65.9812% <ø> (+5.4568%) ⬆️

@zyguan
Contributor Author

zyguan commented Apr 16, 2026

/retest

@tiprow

tiprow bot commented Apr 16, 2026

@zyguan: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@zyguan
Contributor Author

zyguan commented Apr 16, 2026

/retest

@tiprow

tiprow bot commented Apr 16, 2026

@zyguan: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 16, 2026
Comment thread pkg/metrics/metrics.go
prometheus.MustRegister(StmtSummaryWindowEvictedCount)

// Channelz
setupChannelzCollector()
Contributor

Would other goleak-based suites that call metrics.RegisterMetrics() directly hit goleak check errors in some cases?

Contributor Author

resolved by c33b6e1

Contributor Author

In case some integration tests call RegisterMetrics but are NOT compiled with -tags=intest, a6303d8 added the goroutine to the goleak whitelist.

Comment thread pkg/metrics/metrics.go
Comment on lines +538 to +542
func channelzCollectorOpts() tikvcollectors.ChannelzCollectorOpts {
	return tikvcollectors.ChannelzCollectorOpts{
		Namespace:         namespace,
		DisableLocalLabel: true,
		Filter: func(node any) (collect bool, walkChildren bool) {
Contributor

This filter seems to include the collector’s own internal bufnet connection, so scraping may itself inflate tidb_grpc_channelz_*. Should we exclude it here?

Contributor Author

resolved by c5f3b5b

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 16, 2026
zyguan added 3 commits April 16, 2026 10:27
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Contributor

@cfzjywxk cfzjywxk left a comment

Mostly LGTM

Comment thread pkg/metrics/metrics.go
grpcChannelzCollector.listener = bufconn.Listen(1 << 20)
grpcChannelzCollector.server = grpc.NewServer()
service.RegisterChannelzServiceToServer(grpcChannelzCollector.server)
go func(listener *bufconn.Listener, server *grpc.Server) {
Contributor

Does graceful shutdown of TiDB need to be considered here so that this background thread is closed properly?

Contributor Author

I don't think so. It only collects channelz data and won't block graceful shutdown.

Comment thread pkg/metrics/metrics.go
	return target == "bufnet" || target == "passthrough:///bufnet"
}

func isInternalChannelzSocket(socket *grpc_channelz_v1.Socket) bool {
Contributor

It would be better to add comments explaining what an internal channel means.
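
As a rough illustration of the kind of commented filter discussed in these threads, the sketch below excludes the collector's own in-memory channel by its target string. It is written under stated assumptions: channelNode is a hypothetical stand-in for the channelz channel message, the filter function only mirrors the signature of the Filter field shown earlier, and treating (false, false) as "skip this node and its children" is an assumption about the collector's semantics. Only the two target strings come from the reviewed code.

```go
package main

import "fmt"

// channelNode is a hypothetical stand-in for a channelz channel entry;
// the real filter receives channelz protobuf messages instead.
type channelNode struct {
	Target string
}

// isInternalTarget reports whether target belongs to the collector's own
// in-memory (bufconn) connection. That connection exists only so the
// collector can query the channelz service, so counting it would make the
// exported metrics include the collector's own traffic.
func isInternalTarget(target string) bool {
	return target == "bufnet" || target == "passthrough:///bufnet"
}

// filter has the same shape as the Filter callback in the snippet above:
// it decides whether a node is collected and whether its children are walked.
func filter(node any) (collect bool, walkChildren bool) {
	if ch, ok := node.(*channelNode); ok && isInternalTarget(ch.Target) {
		// Assumed semantics: drop the internal channel and everything below it.
		return false, false
	}
	return true, true
}

func main() {
	// "tikv-0:20160" is a made-up example target for an ordinary channel.
	for _, target := range []string{"bufnet", "passthrough:///bufnet", "tikv-0:20160"} {
		collect, walk := filter(&channelNode{Target: target})
		fmt.Printf("%-24s collect=%v walkChildren=%v\n", target, collect, walk)
	}
}
```

Matching on the target string keeps the check cheap, but it relies on no real upstream ever being named bufnet; that trade-off is worth stating in the comment next to the helper.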

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Contributor

@cfzjywxk cfzjywxk left a comment

LGTM

@ti-chi-bot

ti-chi-bot bot commented Apr 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cfzjywxk
Once this PR has been reviewed and has the lgtm label, please assign 3pointer, nolouch for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot

ti-chi-bot bot commented Apr 17, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-17 10:48:44.710005708 +0000 UTC m=+1730929.915365755: ☑️ agreed by cfzjywxk.

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 17, 2026
@ti-chi-bot

ti-chi-bot bot commented Apr 17, 2026

@zyguan: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| idc-jenkins-ci-tidb/unit-test | 7974954 | link | true | /test unit-test |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

  • needs-1-more-lgtm: Indicates a PR needs 1 more LGTM.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

metrics: enhance diagnostic capabilities for gRPC network issues

2 participants