metrics: enhance diagnostic capabilities for gRPC network issues#67811
metrics: enhance diagnostic capabilities for gRPC network issues#67811zyguan wants to merge 7 commits intopingcap:masterfrom
Conversation
Signed-off-by: zyguan <zhongyangguan@gmail.com>
|
@zyguan I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI. |
|
Hi @zyguan. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a mutex-guarded gRPC Channelz Prometheus collector (with bufconn server/client and test cleanup), updates pinned Changes
Sequence Diagram(s)sequenceDiagram
participant App as Application / Test
participant Metrics as pkg/metrics
participant Mutex as grpcChannelzCollector.mu
participant Server as bufconn gRPC Server
participant Client as gRPC ClientConn
participant Channelz as ChannelzCollector
participant Prom as Prometheus Registry
App->>Metrics: setupChannelzCollector()
rect rgba(100,150,200,0.5)
Metrics->>Mutex: Lock
Metrics->>Metrics: check intest.InTest / registered
end
alt Not in test and not registered
Metrics->>Server: start bufconn server + register Channelz service
Metrics->>Client: dial via bufconn dialer
Metrics->>Channelz: NewChannelzCollector(Client, opts)
Metrics->>Prom: prometheus.MustRegister(Channelz)
Metrics->>Metrics: set registered = true
end
rect rgba(100,150,200,0.5)
Metrics->>Mutex: Unlock
end
App->>Prom: Gather()
Prom->>Channelz: Collect()
Channelz-->>Prom: MetricFamilies
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/metrics/metrics.go (1)
499-531: Consider adding a brief startup synchronization or documenting the bufconn behavior.The goroutine starting the gRPC server (line 509-513) runs asynchronously. While
bufconnmakes this safe becauselistener.DialContextwill work immediately, a brief comment explaining why no explicit synchronization is needed would help future readers understand the design choice.📝 Suggested documentation improvement
grpcChannelzCollector.server = grpc.NewServer() service.RegisterChannelzServiceToServer(grpcChannelzCollector.server) + // The server is started asynchronously, but bufconn.Listener.DialContext works + // immediately without waiting for Serve() to be called, so no synchronization is needed. go func(listener *bufconn.Listener, server *grpc.Server) { if err := server.Serve(listener); err != nil { logutil.BgLogger().Warn("internal channelz grpc server stopped", zap.Error(err)) } }(grpcChannelzCollector.listener, grpcChannelzCollector.server)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/metrics/metrics.go` around lines 499 - 531, Add a short inline comment in initGrpcChannelzCollectorLocked near the goroutine that starts the in-memory gRPC server explaining that no explicit synchronization is required because bufconn.Listen returns a ready listener and listener.DialContext will succeed immediately (so dialing from the client goroutine is safe), and note that the goroutine is only for Serve's lifecycle and errors are logged — reference the goroutine that launches server.Serve(listener), the local variable listener (grpcChannelzCollector.listener), and the DialContext usage in the grpc.WithContextDialer closure to make the rationale easy to find.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/metrics/metrics.go`:
- Around line 499-531: Add a short inline comment in
initGrpcChannelzCollectorLocked near the goroutine that starts the in-memory
gRPC server explaining that no explicit synchronization is required because
bufconn.Listen returns a ready listener and listener.DialContext will succeed
immediately (so dialing from the client goroutine is safe), and note that the
goroutine is only for Serve's lifecycle and errors are logged — reference the
goroutine that launches server.Serve(listener), the local variable listener
(grpcChannelzCollector.listener), and the DialContext usage in the
grpc.WithContextDialer closure to make the rationale easy to find.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 15a22892-57e7-441c-88eb-fb0d72cefd36
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (7)
DEPS.bzlgo.modpkg/importsdk/BUILD.bazelpkg/metrics/BUILD.bazelpkg/metrics/main_test.gopkg/metrics/metrics.gopkg/metrics/metrics_internal_test.go
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #67811 +/- ##
================================================
+ Coverage 77.5964% 78.3104% +0.7140%
================================================
Files 1982 1983 +1
Lines 548885 549403 +518
================================================
+ Hits 425915 430240 +4325
+ Misses 122165 118118 -4047
- Partials 805 1045 +240
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
/retest |
|
@zyguan: PRs from untrusted users cannot be marked as trusted with DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/retest |
|
@zyguan: PRs from untrusted users cannot be marked as trusted with DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Signed-off-by: zyguan <zhongyangguan@gmail.com>
| prometheus.MustRegister(StmtSummaryWindowEvictedCount) | ||
|
|
||
| // Channelz | ||
| setupChannelzCollector() |
There was a problem hiding this comment.
Would other goleak-based suites calling metrics.RegisterMetrics() directly get go leak check error in some caces?
There was a problem hiding this comment.
In case some of integration tests which call RegisterMetrics but are NOT compiled with -tags=intest, a6303d8 added the goroutine to the goleak whitelist.
| func channelzCollectorOpts() tikvcollectors.ChannelzCollectorOpts { | ||
| return tikvcollectors.ChannelzCollectorOpts{ | ||
| Namespace: namespace, | ||
| DisableLocalLabel: true, | ||
| Filter: func(node any) (collect bool, walkChildren bool) { |
There was a problem hiding this comment.
This filter seems to include the collector’s own internal bufnet connection, so scraping
may inflate tidb_grpc_channelz_* by itself. Should we exclude it here?
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
| grpcChannelzCollector.listener = bufconn.Listen(1 << 20) | ||
| grpcChannelzCollector.server = grpc.NewServer() | ||
| service.RegisterChannelzServiceToServer(grpcChannelzCollector.server) | ||
| go func(listener *bufconn.Listener, server *grpc.Server) { |
There was a problem hiding this comment.
Is graceful shutdown of tidb needs to be considered here to close this background thread properly?
There was a problem hiding this comment.
I don't think so, it's just for collecting channelz data and won't block graceful shutdown.
| return target == "bufnet" || target == "passthrough:///bufnet" | ||
| } | ||
|
|
||
| func isInternalChannelzSocket(socket *grpc_channelz_v1.Socket) bool { |
There was a problem hiding this comment.
Better adding comments to explain the meaning of internal channel.
Signed-off-by: zyguan <zhongyangguan@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: cfzjywxk The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
|
@zyguan: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What problem does this PR solve?
Issue Number: close #67810
Problem Summary: ref #67810
What changed and how does it work?
Bump client-go and register channelz collector.
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
Chores
New Features
Tests