
tso: validate callee id for tso requests#10600

Open
bufferflies wants to merge 7 commits into tikv:master from bufferflies:pr-merge/88e4f7d1-validate-callee-id

Conversation

@bufferflies
Contributor

@bufferflies bufferflies commented Apr 14, 2026

Issue Number

ref #10516, close #10552

What problem does this PR solve?

The TSO client can keep using a stale gRPC connection after service endpoint changes. This patch carries a callee ID in TSO requests and lets the server reject requests that land on the wrong endpoint, so the client can drop the stale connection and reconnect.

What is changed and how does it work?

  • attach CalleeId to TSO request headers
  • validate callee ID in the TSO gRPC service against the advertised listen address
  • treat callee-ID mismatch as a reconnect signal in the TSO client
  • add RemoveClientConn to service discovery implementations so stale gRPC connections can be closed and recreated
  • upgrade github.com/pingcap/kvproto to latest v0.0.0-20260414083400-4388bfaaedab
  • tidy affected submodules after the kvproto upgrade
  • include the minimal gofmt/test-stub follow-up required for make check
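
The server-side check described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: it assumes the callee ID is the host[:port] part of the server URL, and `hostOf`, `validateCalleeID`, and `errMismatchCalleeID` are hypothetical names. The real service reports the mismatch as a gRPC FailedPrecondition status; a plain Go error stands in here so the sketch needs only the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// errMismatchCalleeID is a hypothetical sentinel; the PR reports the
// mismatch as a gRPC FailedPrecondition status instead.
var errMismatchCalleeID = errors.New("mismatch callee id")

// hostOf returns the host[:port] part of a URL such as "http://tso-0:3379".
// Assumption: the callee ID is this host part; the PR may normalize it differently.
func hostOf(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("no host in %q", rawURL)
	}
	return u.Host, nil
}

// validateCalleeID rejects a request whose callee ID differs from the
// server's advertised host. An empty callee ID is accepted so that older
// clients keep working (an assumption, not confirmed by the PR).
func validateCalleeID(advertiseHost, calleeID string) error {
	if calleeID == "" || calleeID == advertiseHost {
		return nil
	}
	return fmt.Errorf("%w: want %s, got %s", errMismatchCalleeID, advertiseHost, calleeID)
}

func main() {
	advertised, _ := hostOf("http://tso-0.example:3379")
	stale, _ := hostOf("http://tso-1.example:3379")
	fmt.Println(validateCalleeID(advertised, advertised)) // <nil>
	fmt.Println(validateCalleeID(advertised, stale))      // a mismatch error
}
```

Parsing the advertised address once, when the service is constructed, keeps the per-request check down to a string comparison.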

Check List

  • Tests
  • No release note

Validation

  • go test ./pkg/mcs/tso/server -count=1
  • cd client && go test ./errs ./servicediscovery ./clients/tso -run TestDoesNotExist -count=1
  • make check

author: @iosmanthus
cp 88e4f7d1

Summary by CodeRabbit

  • New Features

    • Callee ID is now propagated and validated to detect stale/mismatched endpoints; servers reject mismatched requests.
  • Bug Fixes

    • Stale gRPC connections are proactively removed and distinguished from leader-change errors to improve recovery and routing.
  • Tests

    • Added tests covering callee-mismatch handling and connection removal behavior.
  • Chores

    • Updated kvproto dependency to a newer commit.

Signed-off-by: bufferflies <1045931706@qq.com>
@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 14, 2026

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the following labels on Apr 14, 2026:

  • dco-signoff: yes. Indicates the PR's author has signed the dco.
  • do-not-merge/release-note-label-needed. Indicates that a PR should not merge because it's missing one of the release note labels.
  • size/L. Denotes a PR that changes 100-499 lines, ignoring generated files.
@coderabbitai

coderabbitai bot commented Apr 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 53003799-7db7-4120-a979-52b16ac2f595

📥 Commits

Reviewing files that changed from the base of the PR and between eeaa902 and 1540767.

📒 Files selected for processing (5)
  • client/clients/tso/client.go
  • client/clients/tso/dispatcher.go
  • client/clients/tso/dispatcher_test.go
  • pkg/mcs/tso/server/grpc_service.go
  • pkg/mcs/tso/server/server.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/mcs/tso/server/server.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • client/clients/tso/client.go
  • pkg/mcs/tso/server/grpc_service.go

📝 Walkthrough

Walkthrough

Clients now include a callee ID in TSO requests, and the server validates it against its advertised listen host. On a callee mismatch, the client removes the stale gRPC connection and retries with a fresh one. ServiceDiscovery gained RemoveClientConn to support selective connection cleanup.
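
A per-URL connection cache with selective eviction, the idea behind RemoveClientConn, might look roughly like the sketch below. It is an illustration under stated assumptions: `connCache`, `clientConn`, and `fakeConn` are hypothetical names, and `io.Closer` stands in for `*grpc.ClientConn` so that only the standard library is needed.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// clientConn is a stand-in for *grpc.ClientConn (assumption for this sketch).
type clientConn = io.Closer

// connCache is a hypothetical per-URL connection cache, similar in spirit to
// the one a service discovery implementation keeps.
type connCache struct {
	mu    sync.Mutex
	conns map[string]clientConn
	dial  func(url string) (clientConn, error)
}

func newConnCache(dial func(string) (clientConn, error)) *connCache {
	return &connCache{conns: make(map[string]clientConn), dial: dial}
}

// GetOrCreate returns the cached connection for url, dialing on a miss.
func (c *connCache) GetOrCreate(url string) (clientConn, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cc, ok := c.conns[url]; ok {
		return cc, nil
	}
	cc, err := c.dial(url)
	if err != nil {
		return nil, err
	}
	c.conns[url] = cc
	return cc, nil
}

// Remove evicts the connection for url and closes it. It is a no-op when
// nothing is cached for that URL.
func (c *connCache) Remove(url string) {
	c.mu.Lock()
	cc, ok := c.conns[url]
	delete(c.conns, url)
	c.mu.Unlock()
	if ok {
		_ = cc.Close() // close outside the lock
	}
}

// fakeConn records whether Close was called.
type fakeConn struct {
	closed bool
}

func (f *fakeConn) Close() error {
	f.closed = true
	return nil
}

func main() {
	dials := 0
	cache := newConnCache(func(string) (clientConn, error) {
		dials++
		return &fakeConn{}, nil
	})
	first, _ := cache.GetOrCreate("http://tso-0.example:3379")
	cache.Remove("http://tso-0.example:3379")
	second, _ := cache.GetOrCreate("http://tso-0.example:3379")
	fmt.Println(first.(*fakeConn).closed, first != second, dials) // true true 2
}
```

Evicting only the affected URL leaves connections to other endpoints untouched, which is why a per-URL removal method is more targeted than resetting the whole cache.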

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| TSO client connection & dispatcher | client/clients/tso/client.go, client/clients/tso/dispatcher.go | Use GetOrCreateGRPCConn(url) with error handling; dispatcher construction simplified to newTSODispatcher(c.ctx, c) and error branch now handles callee-mismatch case by removing stale connection. |
| TSO stream / callee ID | client/clients/tso/stream.go | Add calleeID to TSO stream adapter, derive from server URL and include in outgoing TsoRequest header. |
| Error definitions & helpers | client/errs/errno.go, client/errs/errs.go | Add MismatchCalleeIDErr constant and IsCalleeMismatch(err error) bool helper for detecting callee-mismatch errors. |
| ServiceDiscovery surface & implementations | client/servicediscovery/service_discovery.go, client/servicediscovery/tso_service_discovery.go, client/servicediscovery/router_service_discovery.go, client/servicediscovery/mock_service_discovery.go | Add RemoveClientConn(url string) to the interface and implement it across service discovery types; implementations evict and close per-URL gRPC connections. |
| Tests & test doubles | client/clients/tso/dispatcher_test.go, client/resource_manager_client_test.go | Update tests to inject svcDiscovery, add countingServiceDiscovery, test removal on callee mismatch, and add RemoveClientConn in test service discovery. |
| Server-side validation | pkg/mcs/tso/server/grpc_service.go, pkg/mcs/tso/server/server.go | Add advertise-listen-host parsing and validate incoming request CalleeId against it; return FailedPrecondition on mismatch. |
| Module updates | client/go.mod, go.mod, tests/integrations/go.mod, tools/go.mod | Bump github.com/pingcap/kvproto version to include updated TSO proto/header support. |
| Misc / formatting | server/cluster/cluster.go | Whitespace/field alignment reformatting only (no functional change). |

Sequence Diagram(s)

sequenceDiagram
    participant Client as TSO Client
    participant SD as ServiceDiscovery
    participant Conn as gRPC Conn
    participant Server as TSO Server

    rect rgba(100,200,100,0.5)
    Note over Client,Server: Initial request with callee ID
    Client->>Client: extract calleeID from serverURL
    Client->>SD: GetOrCreateGRPCConn(url)
    SD-->>Client: *grpc.ClientConn
    Client->>Server: TsoRequest (CalleeId header)
    Server->>Server: parse advertise host & compare CalleeId
    alt match
        Server-->>Client: Response
    else mismatch
        Server-->>Client: FailedPrecondition (mismatch)
    end
    end

    rect rgba(100,150,200,0.5)
    Note over Client,SD: Recovery on mismatch
    Client->>Client: IsCalleeMismatch(err)
    Client->>SD: RemoveClientConn(url)
    SD->>Conn: Close() and delete cache entry
    Client->>SD: GetOrCreateGRPCConn(url) [retry]
    SD-->>Client: new *grpc.ClientConn
    Client->>Server: TsoRequest (retry)
    Server-->>Client: Response
    end
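
The recovery path in the sequence diagram can be sketched as a small retry loop. This is a simplified illustration rather than the dispatcher's actual logic: `discovery`, `requestWithReconnect`, `fakeDiscovery`, and `errCalleeMismatch` are hypothetical names, the connection is reduced to an integer handle, and the bounded attempt count is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errCalleeMismatch is a hypothetical sentinel for the server's rejection.
var errCalleeMismatch = errors.New("mismatch callee id")

// discovery is a trimmed, hypothetical view of the service discovery
// surface; the connection is reduced to an opaque int handle.
type discovery interface {
	GetOrCreateConn(url string) (int, error)
	RemoveConn(url string)
}

// requestWithReconnect sends one request and, on a callee mismatch, evicts
// the cached connection and retries on a fresh one. maxAttempts should be
// at least 1.
func requestWithReconnect(sd discovery, url string, maxAttempts int, send func(conn int) error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		var conn int
		if conn, err = sd.GetOrCreateConn(url); err != nil {
			return err
		}
		if err = send(conn); !errors.Is(err, errCalleeMismatch) {
			return err // success, or an error that reconnecting will not fix
		}
		sd.RemoveConn(url) // stale connection: drop it before retrying
	}
	return err
}

// fakeDiscovery hands out a new handle after every removal.
type fakeDiscovery struct {
	next    int
	cached  map[string]int
	removed int
}

func (f *fakeDiscovery) GetOrCreateConn(url string) (int, error) {
	if c, ok := f.cached[url]; ok {
		return c, nil
	}
	f.next++
	f.cached[url] = f.next
	return f.next, nil
}

func (f *fakeDiscovery) RemoveConn(url string) {
	delete(f.cached, url)
	f.removed++
}

func main() {
	sd := &fakeDiscovery{cached: map[string]int{}}
	// Handle 1 plays the stale connection; later handles reach the right server.
	err := requestWithReconnect(sd, "http://tso-0.example:3379", 3, func(conn int) error {
		if conn == 1 {
			return errCalleeMismatch
		}
		return nil
	})
	fmt.Println(err, sd.removed) // <nil> 1
}
```

Only the mismatch error triggers eviction here; other failures are returned unchanged so that leader-change handling and similar paths stay separate.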

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

release-note-none, lgtm, approved, needs-cherry-pick-release-8.5

Suggested reviewers

  • okJiang

Poem

🐰 I sniffed the server URL today,

CalleeId in paw, I hop away.
Connections stale fall with a thunk—
Fresh hops run fast, no more DNS funk! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: adding callee ID validation for TSO requests to prevent stale connections. |
| Description check | ✅ Passed | The description covers the problem, changes, and validation steps, following the template structure with issue references and a detailed explanation. |
| Linked Issues check | ✅ Passed | The PR implements all requirements from the linked issues: attaching CalleeId to TSO request headers, validating on server-side, handling mismatches in client, and adding RemoveClientConn to service discovery. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: kvproto dependency upgrade, callee ID implementation across client/server, service discovery enhancements, and associated tests and formatting adjustments. |

@bufferflies
Contributor Author

/ping @iosmanthus

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
client/errs/errs.go (1)

42-48: Harden callee-mismatch detection to include gRPC status code check.

Line 47 currently relies only on substring matching against error messages. This can produce false positives if other gRPC errors coincidentally contain "mismatch callee id". Adding a status code check first narrows the scope to only codes.FailedPrecondition errors that match the message.

♻️ Proposed refactor
 import (
 	"strings"
 
 	"go.uber.org/zap"
 	"go.uber.org/zap/zapcore"
 	"google.golang.org/grpc/codes"
+	"google.golang.org/grpc/status"
 
 	"github.com/pingcap/errors"
 )
@@
 func IsCalleeMismatch(err error) bool {
 	if err == nil {
 		return false
 	}
-	return strings.Contains(err.Error(), MismatchCalleeIDErr)
+	cause := errors.Cause(err)
+	if status.Code(cause) != codes.FailedPrecondition {
+		return false
+	}
+	return strings.Contains(status.Convert(cause).Message(), MismatchCalleeIDErr)
 }
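
The false positive this refactor guards against can be reproduced without gRPC. The sketch below is only an analogy: `code`, `codedError`, `looseMatch`, and `strictMatch` are hypothetical stand-ins for `codes.Code`, a gRPC status error, and the two versions of `IsCalleeMismatch`, chosen so the example needs only the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// code and codedError are hypothetical stand-ins for gRPC's codes.Code and
// a status error.
type code int

const (
	unavailable code = iota
	failedPrecondition
)

type codedError struct {
	code code
	msg  string
}

func (e *codedError) Error() string { return e.msg }

const mismatchCalleeIDErr = "mismatch callee id"

// looseMatch mirrors the original check: substring matching only.
func looseMatch(err error) bool {
	return err != nil && strings.Contains(err.Error(), mismatchCalleeIDErr)
}

// strictMatch follows the review suggestion: require the expected code
// first, then inspect the message.
func strictMatch(err error) bool {
	var ce *codedError
	if !errors.As(err, &ce) || ce.code != failedPrecondition {
		return false
	}
	return strings.Contains(ce.msg, mismatchCalleeIDErr)
}

func main() {
	genuine := fmt.Errorf("tso: %w", &codedError{code: failedPrecondition, msg: "mismatch callee id"})
	lookalike := &codedError{code: unavailable, msg: "proxy: upstream said mismatch callee id"}
	fmt.Println(looseMatch(genuine), strictMatch(genuine))     // true true
	fmt.Println(looseMatch(lookalike), strictMatch(lookalike)) // true false
}
```

With the substring check alone, the lookalike error would cause a healthy connection to be evicted; gating on the code first avoids that.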
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/errs/errs.go` around lines 42 - 48, IsCalleeMismatch currently only
does substring matching on err.Error(), which can false-positive; update
IsCalleeMismatch to first extract a gRPC status via status.FromError(err),
verify the status.Code() == codes.FailedPrecondition, and only then check that
st.Message() (or err.Error()) contains the MismatchCalleeIDErr constant;
reference the IsCalleeMismatch function, MismatchCalleeIDErr constant, and use
status.FromError and codes.FailedPrecondition in the check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 96598a2f-f433-46f3-8eb3-101e98530b4f

📥 Commits

Reviewing files that changed from the base of the PR and between b0a3c90 and eeaa902.

⛔ Files ignored due to path filters (4)
  • client/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
  • tests/integrations/go.sum is excluded by !**/*.sum
  • tools/go.sum is excluded by !**/*.sum
📒 Files selected for processing (16)
  • client/clients/tso/client.go
  • client/clients/tso/dispatcher.go
  • client/clients/tso/stream.go
  • client/errs/errno.go
  • client/errs/errs.go
  • client/go.mod
  • client/resource_manager_client_test.go
  • client/servicediscovery/mock_service_discovery.go
  • client/servicediscovery/router_service_discovery.go
  • client/servicediscovery/service_discovery.go
  • client/servicediscovery/tso_service_discovery.go
  • go.mod
  • pkg/mcs/tso/server/grpc_service.go
  • server/cluster/cluster.go
  • tests/integrations/go.mod
  • tools/go.mod

@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 14, 2026

@iosmanthus: adding LGTM is restricted to approvers and reviewers in OWNERS files.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 14, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: iosmanthus
Once this PR has been reviewed and has the lgtm label, please assign rleungx for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

 svcDiscovery := td.provider.getServiceDiscovery()
-if errs.IsLeaderChange(err) {
+switch {
+case errs.IsCalleeMismatch(err):
Member

We need a test to cover this change.

Signed-off-by: bufferflies <1045931706@qq.com>
@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 14, 2026

@bufferflies: The following test failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-error-log-review | 1540767 | link | true | /test pull-error-log-review |

Full PR test history. Your PR dashboard.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 14, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456, multiple issues should use full syntax for each issue and be separated by a comma, like: Issue Number: close #123, ref #456.

📖 For more info, you can check the "Linking issues" section in the CONTRIBUTING.md.

 return &Service{
-	Server: server,
+	Server:              server,
+	advertiseListenHost: getAdvertiseListenHost(server.GetAdvertiseListenAddr()),
Member

Will startServer initialize advertiseListenHost?

Labels

  • dco-signoff: yes. Indicates the PR's author has signed the dco.
  • do-not-merge/needs-linked-issue
  • do-not-merge/release-note-label-needed. Indicates that a PR should not merge because it's missing one of the release note labels.
  • size/L. Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

88e4f7d1 validate callee ID for TSO requests

3 participants