client: add lock in tso.Request to avoid data race#10130

Open
okJiang wants to merge 6 commits into tikv:master from okJiang:fix-datarace-10124

Conversation

@okJiang
Member

@okJiang okJiang commented Jan 5, 2026

What problem does this PR solve?

Issue Number: Close #10124

What is changed and how does it work?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

Release note

None.

Summary by CodeRabbit

  • Chores
    • Improved concurrency handling in the TSO client to reduce rare race conditions and increase stability.
    • Made timing, timeout handling, and tracing more consistent, leading to more reliable metrics and fewer spurious errors.

Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-triage-completed, release-note-none, and dco-signoff: yes labels Jan 5, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Jan 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign huachaohuang for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/M label Jan 5, 2026
@okJiang okJiang requested a review from JmPotato January 5, 2026 06:55
func (c *Cli) GetTSORequest(ctx context.Context) *Request {
	req := c.tsoReqPool.Get().(*Request)
	// Set needed fields in the request before using it.
	req.mu.Lock()
Member

@rleungx rleungx Jan 5, 2026

Will it cause performance regression?

Member Author

Let's wait for a test result. I will run one.

Member Author

@okJiang okJiang Jan 6, 2026

Roughly a 2% performance regression. Can we accept it? cc @JmPotato

./bin/pd-tso-bench -client 1 -duration 10m -c 100

Master
Total:
count: 174720411, max: 10.5724ms, min: 0.0911ms, avg: 0.3338ms
<1ms: 174699770, >1ms: 19262, >2ms: 579, >5ms: 700, >10ms: 100, >30ms: 0, >50ms: 0, >100ms: 0, >200ms: 0, >400ms: 0, >800ms: 0, >1s: 0
count: 174720411, <1ms: 99.99%, >1ms: 0.01%, >2ms: 0.00%, >5ms: 0.00%, >10ms: 0.00%, >30ms: 0.00%, >50ms: 0.00%, >100ms: 0.00%, >200ms: 0.00%, >400ms: 0.00%, >800ms: 0.00%, >1s: 0.00%
P0.5: 0.3348ms, P0.8: 0.3789ms, P0.9: 0.4039ms, P0.99: 0.4830ms

PR
Total:
count: 171009435, max: 10.3831ms, min: 0.0940ms, avg: 0.3415ms
<1ms: 170987346, >1ms: 20519, >2ms: 871, >5ms: 599, >10ms: 100, >30ms: 0, >50ms: 0, >100ms: 0, >200ms: 0, >400ms: 0, >800ms: 0, >1s: 0
count: 171009435, <1ms: 99.99%, >1ms: 0.01%, >2ms: 0.00%, >5ms: 0.00%, >10ms: 0.00%, >30ms: 0.00%, >50ms: 0.00%, >100ms: 0.00%, >200ms: 0.00%, >400ms: 0.00%, >800ms: 0.00%, >1s: 0.00%
P0.5: 0.3420ms, P0.8: 0.3875ms, P0.9: 0.4135ms, P0.99: 0.4959ms
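
For reference, the size of the regression can be checked from the two runs above with a little arithmetic. The snippet below is only a sketch; `percentChange` is a hypothetical helper, and the inputs are the count, avg, and P0.99 values copied from the Master and PR outputs.

```go
package main

import "fmt"

// percentChange returns the relative change from base to value, in percent.
// Negative means value is lower than base.
func percentChange(base, value float64) float64 {
	return (value - base) / base * 100
}

func main() {
	// Request count over the same 10m window (higher is better).
	fmt.Printf("count: %.2f%%\n", percentChange(174720411, 171009435))
	// Average and P0.99 latency in ms (lower is better).
	fmt.Printf("avg:   %.2f%%\n", percentChange(0.3338, 0.3415))
	fmt.Printf("P0.99: %.2f%%\n", percentChange(0.4830, 0.4959))
}
```

The count drops by a little over 2%, and the latency figures rise by a similar margin, which is consistent with the estimate given in the comment.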

Member Author

@okJiang okJiang Jan 6, 2026

One more test, run after applying this comment: #10130 (comment). The conclusion remains unchanged.

Total:
count: 170459945, max: 10.4290ms, min: 0.0907ms, avg: 0.3432ms
<1ms: 170427514, >1ms: 28694, >2ms: 2732, >5ms: 953, >10ms: 52, >30ms: 0, >50ms: 0, >100ms: 0, >200ms: 0, >400ms: 0, >800ms: 0, >1s: 0
count: 170459945, <1ms: 99.98%, >1ms: 0.02%, >2ms: 0.00%, >5ms: 0.00%, >10ms: 0.00%, >30ms: 0.00%, >50ms: 0.00%, >100ms: 0.00%, >200ms: 0.00%, >400ms: 0.00%, >800ms: 0.00%, >1s: 0.00%
P0.5: 0.3429ms, P0.8: 0.3899ms, P0.9: 0.4173ms, P0.99: 0.5068ms

Signed-off-by: okjiang <819421878@qq.com>
Comment on lines +105 to +107
req.mu.Lock()
physical, logical = req.physical, req.logical
req.mu.Unlock()
Contributor

Lock or RLock?

Signed-off-by: okjiang <819421878@qq.com>
@codecov

codecov bot commented Jan 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.96%. Comparing base (205352f) to head (c133631).
⚠️ Report is 107 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10130      +/-   ##
==========================================
+ Coverage   78.38%   78.96%   +0.58%     
==========================================
  Files         518      532      +14     
  Lines       69476    71882    +2406     
==========================================
+ Hits        54456    56759    +2303     
- Misses      11066    11098      +32     
- Partials     3954     4025      +71     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 78.96% <100.00%> (+0.58%) ⬆️ |

Member

@JmPotato JmPotato left a comment

I don’t understand how the race in the original issue happened—why would a request that already has a result still be written during the doneCollectedRequests processing? In theory, a request should only be returned to the request pool for reuse after it has been completely used.

@okJiang
Member Author

okJiang commented Jan 12, 2026

I don’t understand how the race in the original issue happened—why would a request that already has a result still be written during the doneCollectedRequests processing? In theory, a request should only be returned to the request pool for reuse after it has been completely used.

  • Request is an object reused from a pool, and done is a channel with a buffer of 1 that is not reset or drained on reuse (see GetTSORequest() in client.go, which originally resets only physical, logical, streamID, and other fields, and does not touch done).
  • If a value is written to done for the previous generation of a request (the last use of the same object), but the corresponding Wait() does not consume it because of a timeout or cancellation (the select takes the ctx.Done() branch), that value stays in done.
  • The next time the object is taken from the pool and used as a new request, Wait() immediately reads the leftover value from done, concludes that the result is ready, and goes on to read req.physical/req.logical (see waitCtx() in request.go).
  • Meanwhile, this new request is still going through the normal TSO flow, so the dispatcher's finisher still writes req.physical/req.logical/streamID while processing doneCollectedRequests (see tsoRequestFinisher() in dispatcher.go).
  • Hence the reported symptom: Wait() reads physical/logical while the finisher writes them, and the race detector reports this read/write pair.
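
The sequence above can be reproduced in miniature. The following is a minimal sketch, not the PD client code: the `request` type, `wait`, and `staleWakeup` are hypothetical stand-ins, and pool reuse is simulated by reusing one pointer so that the outcome is deterministic. It shows only the stale wakeup (the second wait returns before any new result is written), not the concurrent write itself.

```go
package main

import (
	"context"
	"fmt"
)

// request is a simplified, hypothetical stand-in for the pooled TSO request.
type request struct {
	done     chan error // buffer of 1, never drained on reuse
	physical int64
}

// wait blocks until a result is signalled on done or ctx is cancelled.
func (r *request) wait(ctx context.Context) (int64, error) {
	select {
	case err := <-r.done:
		return r.physical, err
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}

// staleWakeup walks one object through two "generations" of use.
func staleWakeup() (firstErr error, second int64, secondErr error) {
	req := &request{done: make(chan error, 1)}

	// Generation 1: the caller gives up before the result arrives.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, firstErr = req.wait(ctx)

	// The result for generation 1 arrives late; nobody consumes it.
	req.physical = 100
	req.done <- nil

	// Generation 2: the object is reused and only the data fields are reset.
	req.physical = 0
	// No result has been produced for generation 2, yet wait returns at once.
	second, secondErr = req.wait(context.Background())
	return firstErr, second, secondErr
}

func main() {
	firstErr, second, secondErr := staleWakeup()
	fmt.Println("first wait error:", firstErr)
	fmt.Println("second wait result:", second, secondErr)
}
```

In this sketch the second wait returns with physical equal to 0 and a nil error, because it consumed the value left over from the first generation. In the real client the finisher would still be about to write the fields at that point, which is where the reported read/write pair comes from.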

@wk989898

wk989898 commented Mar 3, 2026

@okJiang
Member Author

okJiang commented Mar 5, 2026

@JmPotato ptal again

Member

req.requestCtx is read here without acquiring the lock.

Member

Ditto.

Comment on lines 67 to 74
func (req *Request) TryDone(err error) {
	req.mu.RLock()
	defer req.mu.RUnlock()
	select {
	case req.done <- err:
	default:
	}
}
Member

This lock seems unnecessary, because once done is created it won’t be modified, and the channel itself is already thread-safe.
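
To illustrate the point, here is a minimal sketch of a lock-free non-blocking send, assuming (as the comment states) that done is created once and never reassigned. The `request` type, `tryDone`, and `deliverOnce` are hypothetical names used only for this example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// request is a hypothetical stand-in; done is created once and never reassigned.
type request struct {
	done chan error
}

// tryDone performs a non-blocking send. Channel operations are safe for
// concurrent use, so no mutex is taken here.
func (r *request) tryDone(err error) {
	select {
	case r.done <- err:
	default: // buffer already full: drop the value instead of blocking
	}
}

// deliverOnce has n goroutines call tryDone concurrently and reports how
// many values ended up buffered in done.
func deliverOnce(n int) int {
	req := &request{done: make(chan error, 1)}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			req.tryDone(errors.New("stream closed"))
		}()
	}
	wg.Wait()
	return len(req.done)
}

func main() {
	fmt.Println("buffered values:", deliverOnce(8))
}
```

With a buffer of 1 and no receiver, exactly one send succeeds and the rest fall through to the default branch, so the sketch runs cleanly under `go run -race` without any locking around the channel.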

@@ -174,13 +174,15 @@ func (c *Cli) scheduleUpdateTSOConnectionCtxs() {
func (c *Cli) GetTSORequest(ctx context.Context) *Request {
Member

Here the physical, logical, streamID, and other fields are reset, but the done channel is not reset. A req taken from the pool may have leftover data in its done channel. However, this isn’t an issue introduced by this PR, but it’s worth keeping an eye on.
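
One possible way to deal with the leftover value would be to drain done as part of the reset. The sketch below is hypothetical and is not something this PR does: `request` and `resetForReuse` are illustrative names, and the real reset happens in GetTSORequest.

```go
package main

import "fmt"

// request is a simplified, hypothetical stand-in for the pooled TSO request.
type request struct {
	done     chan error
	physical int64
	logical  int64
}

// resetForReuse clears the data fields and drains at most one leftover value
// from done without blocking. It reports whether a stale value was found.
func resetForReuse(r *request) bool {
	r.physical, r.logical = 0, 0
	select {
	case <-r.done:
		return true
	default:
		return false
	}
}

func main() {
	req := &request{done: make(chan error, 1)}
	req.done <- nil // leftover from a previous use that timed out

	fmt.Println("stale value drained:", resetForReuse(req))
	fmt.Println("values still buffered:", len(req.done))
}
```

Note that draining on reuse narrows the window but does not close it on its own: if a result for the previous use is sent after the drain, the stale value reappears, so some way of telling generations apart would still be needed.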

Signed-off-by: okjiang <819421878@qq.com>
@coderabbitai

coderabbitai bot commented Mar 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 99647f3f-edce-4a24-bb78-8d58943fdf7d

📥 Commits

Reviewing files that changed from the base of the PR and between 6173d50 and c133631.

📒 Files selected for processing (3)
  • client/clients/tso/client.go
  • client/clients/tso/dispatcher.go
  • client/clients/tso/request.go

📝 Walkthrough

Adds a mutex to TSO Request and applies locking around request initialization, finalization, and waiting to synchronize concurrent access to fields (start, requestCtx, clientCtx, physical, logical, streamID) between client, dispatcher, and stream goroutines.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| TSO Request sync & init: client/clients/tso/client.go | Acquire req.mu while initializing pooled Request fields in GetTSORequest. |
| TSO Request finalization: client/clients/tso/dispatcher.go | Lock tsoReq.mu when reading requestCtx and updating physical, logical, streamID in tsoRequestFinisher before unlocking and finishing. |
| Request concurrency protections: client/clients/tso/request.go | Added mu sync.RWMutex to Request; use RLock/RUnlock for reads in IsFrom, Wait, waitCtx, and waitTimeout; capture fields under lock and operate on local copies to avoid races. |
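
The pattern summarized above (write under Lock, read under RLock, then work on local copies) might look roughly like the following. This is a sketch under assumptions, not the code in the PR: `request`, `finish`, `snapshot`, and `consistentSnapshots` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// request is a hypothetical stand-in for a Request guarded by an RWMutex.
type request struct {
	mu       sync.RWMutex
	physical int64
	logical  int64
}

// finish plays the role of the writer: it updates both fields under the
// exclusive lock so readers never observe a half-written pair.
func (r *request) finish(physical, logical int64) {
	r.mu.Lock()
	r.physical, r.logical = physical, logical
	r.mu.Unlock()
}

// snapshot plays the role of the reader: it copies the fields under the
// shared lock and returns the local copies.
func (r *request) snapshot() (int64, int64) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.physical, r.logical
}

// consistentSnapshots runs a writer and a reader concurrently and reports
// whether every snapshot saw a matching physical/logical pair.
func consistentSnapshots(rounds int) bool {
	req := &request{}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := int64(1); i <= int64(rounds); i++ {
			req.finish(i, i)
		}
	}()
	ok := true
	for i := 0; i < rounds; i++ {
		if p, l := req.snapshot(); p != l {
			ok = false
		}
	}
	wg.Wait()
	return ok
}

func main() {
	fmt.Println("snapshots consistent:", consistentSnapshots(1000))
}
```

RLock allows several readers to hold the lock at once, which is why it is the usual choice for read-only paths; the writer still needs the exclusive Lock.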

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • JmPotato
  • lhy1024

Poem

🐰 With a hop and a careful twitch,
I wrapped each field in a little stitch.
Locks snug tight, no racey stitch,
TSO races fixed—now isn't that rich? 🐇🔒

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding lock synchronization in tso.Request to prevent data race, which directly matches the changeset. |
| Description check | ✅ Passed | The description includes the required Issue Number (Close #10124) and completes most checklist sections, though the 'What is changed and how it works' section lacks detailed explanation of the mutex implementation. |
| Linked Issues check | ✅ Passed | The code changes add synchronization (RWMutex) around concurrent access to Request fields in client.go, dispatcher.go, and request.go, directly addressing the data race between tsoDispatcher and Request.Wait paths described in issue #10124. |
| Out of Scope Changes check | ✅ Passed | All changes are focused on adding mutex protection to tso.Request fields (client.go, dispatcher.go, request.go) and are directly scoped to resolve the data race in issue #10124 without introducing unrelated modifications. |

@okJiang
Member Author

okJiang commented Mar 5, 2026

/retest

@okJiang
Member Author

okJiang commented Mar 15, 2026

/retest

Signed-off-by: okjiang <819421878@qq.com>
@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 8, 2026

@okJiang: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-unit-test-next-gen-3 | c133631 | link | true | /test pull-unit-test-next-gen-3 |

Labels

  • dco-signoff: yes (indicates the PR's author has signed the DCO)
  • do-not-merge/needs-triage-completed
  • release-note-none (denotes a PR that doesn't merit a release note)
  • size/M (denotes a PR that changes 30-99 lines, ignoring generated files)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Data race in tso

5 participants