Skip to content

schedule/labeler: Optimize locking to fix high-concurrency goroutine surge#9857

Merged
ti-chi-bot[bot] merged 2 commits intotikv:masterfrom
lhy1024:fix-label-lock
Oct 30, 2025
Merged

schedule/labeler: Optimize locking to fix high-concurrency goroutine surge#9857
ti-chi-bot[bot] merged 2 commits intotikv:masterfrom
lhy1024:fix-label-lock

Conversation

@lhy1024
Copy link
Copy Markdown
Contributor

@lhy1024 lhy1024 commented Oct 22, 2025

…surge

What problem does this PR solve?

Issue Number: Close #9854

What is changed and how does it work?

  1. Remove GC in the get query
  2. Reduce lock time for storage and only lock for memory state

Check List

Tests

  • Unit test

Release note

None.

…surge

Signed-off-by: lhy1024 <admin@liudos.us>
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 22, 2025
Signed-off-by: lhy1024 <admin@liudos.us>
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Oct 22, 2025
}

// only Lock for in-memory update
l.Lock()
Copy link
Copy Markdown
Contributor

@HaoW30 HaoW30 Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my local experiments, this lock can still experience heavy contention because the range list is built while holding the write lock.
One possible improvement is:

  1. Make a copy of the rules under a read lock.
  2. Build the range list without holding any lock.
  3. Use write lock to set the new range list in RegionLabeler.

This approach avoids holding the write lock for an extended period and should significantly reduce contention.

Copy link
Copy Markdown
Contributor

@HaoW30 HaoW30 Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something this

-       // only Lock for in-memory update
+       // Build range list outside the write lock to reduce contention
+       l.RLock()
+       rulesCopy := make(map[string]*LabelRule, len(l.labelRules)+1)
+       for k, v := range l.labelRules {
+               rulesCopy[k] = v
+       }
+       l.RUnlock()
+       rulesCopy[rule.ID] = rule
+
+       // Build the new range list without holding any lock (expensive operation)
+       newRangeList, newMinExpire := buildRangeListFrom(rulesCopy)
+
+       // Quick atomic swap under write lock
        l.Lock()
-       defer l.Unlock()
        l.labelRules[rule.ID] = rule
-       l.buildRangeList()
+       l.rangeList = newRangeList
+       l.minExpire = newMinExpire
+       l.Unlock()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above code may not be bulletproof — it’s just to show the idea.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need compare and swap to avoid losing updates.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m totally fine with handling the further optimization in a separate PR, since the compare-and-swap part could be a bit tricky.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job. I think it will be helpful and we could do it in another pr.

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot bot commented Oct 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bufferflies, okJiang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Oct 28, 2025
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot bot commented Oct 28, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-10-22 06:32:07.282367601 +0000 UTC m=+853433.359620151: ☑️ agreed by bufferflies.
  • 2025-10-28 07:38:50.716186795 +0000 UTC m=+1375836.793439345: ☑️ agreed by okJiang.

@okJiang
Copy link
Copy Markdown
Member

okJiang commented Oct 28, 2025

/retest

@codecov
Copy link
Copy Markdown

codecov bot commented Oct 28, 2025

Codecov Report

❌ Patch coverage is 62.26415% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.69%. Comparing base (9003262) to head (7c8034e).
⚠️ Report is 18 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9857      +/-   ##
==========================================
+ Coverage   76.82%   78.69%   +1.87%     
==========================================
  Files         491      491              
  Lines       78473    66137   -12336     
==========================================
- Hits        60283    52048    -8235     
+ Misses      14486    10397    -4089     
+ Partials     3704     3692      -12     
Flag Coverage Δ
unittests 78.69% <62.26%> (+1.87%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@okJiang
Copy link
Copy Markdown
Member

okJiang commented Oct 29, 2025

/test pull-unit-test-next-gen

2 similar comments
@lhy1024
Copy link
Copy Markdown
Contributor Author

lhy1024 commented Oct 30, 2025

/test pull-unit-test-next-gen

@lhy1024
Copy link
Copy Markdown
Contributor Author

lhy1024 commented Oct 30, 2025

/test pull-unit-test-next-gen

@lhy1024
Copy link
Copy Markdown
Contributor Author

lhy1024 commented Oct 30, 2025

/test pull-integration-realcluster-test

@ti-chi-bot ti-chi-bot bot merged commit 36f7836 into tikv:master Oct 30, 2025
36 of 40 checks passed
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Oct 30, 2025
@ti-chi-bot
Copy link
Copy Markdown
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #9892.

ti-chi-bot bot pushed a commit that referenced this pull request Oct 31, 2025
…surge (#9857) (#9892)

close #9854

Signed-off-by: lhy1024 <admin@liudos.us>

Co-authored-by: lhy1024 <admin@liudos.us>
@King-Dylan
Copy link
Copy Markdown

King-Dylan commented Oct 31, 2025

/label needs-cherry-pick-release-7.5

@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. label Oct 31, 2025
@ti-chi-bot
Copy link
Copy Markdown
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #9897.
But this PR has conflicts, please resolve them!

lhy1024 added a commit to ti-chi-bot/pd that referenced this pull request Nov 24, 2025
…surge (tikv#9857)

close tikv#9854

Signed-off-by: lhy1024 <admin@liudos.us>
lhy1024 added a commit to lhy1024/pd that referenced this pull request Nov 24, 2025
…surge (tikv#9857)

close tikv#9854

Signed-off-by: lhy1024 <admin@liudos.us>
ti-chi-bot bot pushed a commit that referenced this pull request Dec 18, 2025
…surge (#9857) (#9897)

close #9854

Signed-off-by: lhy1024 <admin@liudos.us>

Co-authored-by: lhy1024 <liuhanyang@pingcap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

High goroutine surge and server instability when SetRegionLabelRule is called concurrently during Lightning import

6 participants