client: add backoff mechanism for service URL updates #10540

Merged
ti-chi-bot[bot] merged 4 commits into tikv:master from bufferflies:pr-merge/6e2a4d20-service-url-backoff
Apr 2, 2026
Conversation

@bufferflies
Contributor

@bufferflies bufferflies commented Apr 1, 2026

Summary

  • add a backoff interval before repeated RM service URL refreshes triggered by error paths
  • keep the wait context-aware so shutdown can still exit the update loop promptly

Issue Number

Issue Number: ref #10516, close #10539

Validation

  • cd client && go test . -run 'Test.*ResourceManager.*' -count=1
  • cd client && go test . -run 'TestTryResourceManagerConnectUsesRMForTokenAndFallbackToPD' -count=1
  • make check

Release Note

None

Summary by CodeRabbit

  • Bug Fixes

    • Introduced a backoff to service discovery updates so repeated update signals respect a minimum interval, reducing rapid repeat updates and improving stability and logging around wait durations.
  • Tests

    • Added a test that validates the update backoff behavior, ensures a delayed second update, and confirms the update loop exits promptly when canceled.

Signed-off-by: bufferflies <1045931706@qq.com>
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has signed the dco. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 1, 2026
@coderabbitai

coderabbitai bot commented Apr 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4056277d-a93c-4355-b7d5-ae4174b76545

📥 Commits

Reviewing files that changed from the base of the PR and between dbc6db0 and dc612b5.

📒 Files selected for processing (1)
  • client/servicediscovery/resource_manager_service_discovery.go

📝 Walkthrough

Walkthrough

Adds a time-based backoff to updateServiceURLLoop: track last update time, delay discoverAndUpdate() when updates arrive sooner than the retry interval, and allow early exit if the context is canceled.
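
The context-aware wait described in the walkthrough can be sketched as a small helper. This is a minimal illustration, not the code merged in this PR: `waitForBackoff`, its signature, and the demo function are hypothetical names, and the intervals are shortened so the example finishes quickly.

```go
package main

import (
	"context"
	"time"
)

// waitForBackoff blocks until at least interval has passed since lastUpdate,
// or until ctx is canceled. It returns false if the context ended first.
// The name and signature are hypothetical, not taken from the PD client.
func waitForBackoff(ctx context.Context, lastUpdate time.Time, interval time.Duration) bool {
	// A zero lastUpdate means no update has run yet, so there is nothing to wait for.
	if lastUpdate.IsZero() {
		return ctx.Err() == nil
	}
	remaining := interval - time.Since(lastUpdate)
	if remaining <= 0 {
		return ctx.Err() == nil
	}
	timer := time.NewTimer(remaining)
	defer timer.Stop()
	select {
	case <-ctx.Done():
		return false
	case <-timer.C:
		return true
	}
}

// demoWaitForBackoff exercises three paths: an expired window, a full wait,
// and a context that is already canceled.
func demoWaitForBackoff() (expiredOK, waitedOK, canceledOK bool) {
	const interval = 30 * time.Millisecond
	ctx := context.Background()

	// Window already elapsed: returns true without blocking.
	expiredOK = waitForBackoff(ctx, time.Now().Add(-time.Second), interval)

	// Fresh update: the call blocks until the interval has passed.
	start := time.Now()
	ok := waitForBackoff(ctx, start, interval)
	waitedOK = ok && time.Since(start) >= interval

	// Canceled context: returns false instead of waiting out a long window.
	cctx, cancel := context.WithCancel(ctx)
	cancel()
	start = time.Now()
	canceledOK = !waitForBackoff(cctx, start, time.Hour) && time.Since(start) < time.Second
	return
}
```

Calling `demoWaitForBackoff()` from a `main` function or a test should report `true` for all three paths. Selecting on `ctx.Done()` alongside the timer is what lets shutdown interrupt the wait.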

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Service URL backoff: client/servicediscovery/resource_manager_service_discovery.go | Introduce serviceURLRetryInterval and lastUpdateTime in updateServiceURLLoop; when triggered, wait the remaining backoff before calling discoverAndUpdate(), and respect context cancellation during the wait. |
| Backoff unit test: client/servicediscovery/resource_manager_service_discovery_test.go | Add TestResourceManagerServiceURLUpdateBackoff and a countingMetaStorageClient to assert that consecutive update signals are rate-limited and that the loop exits promptly on context cancel. |
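
The two assertions named in the summary (consecutive signals are rate-limited, and the loop exits promptly on cancel) might be checked along the following lines. This is a sketch under assumptions: `countingDiscoverer` is a hypothetical stand-in for the PR's countingMetaStorageClient, `updateLoop` only approximates updateServiceURLLoop, and recording the update time after the call returns is a choice made here, not something the PR text confirms.

```go
package main

import (
	"context"
	"sync/atomic"
	"time"
)

// countingDiscoverer is a hypothetical stub that only counts discovery calls.
type countingDiscoverer struct{ calls int32 }

func (c *countingDiscoverer) discoverAndUpdate() { atomic.AddInt32(&c.calls, 1) }
func (c *countingDiscoverer) count() int32       { return atomic.LoadInt32(&c.calls) }

// updateLoop approximates the throttled loop: each signal triggers one update,
// but never sooner than interval after the previous one.
func updateLoop(ctx context.Context, signal <-chan struct{}, interval time.Duration, d *countingDiscoverer) {
	var lastUpdate time.Time
	for {
		select {
		case <-ctx.Done():
			return
		case <-signal:
		}
		if !lastUpdate.IsZero() {
			if wait := interval - time.Since(lastUpdate); wait > 0 {
				timer := time.NewTimer(wait)
				select {
				case <-ctx.Done():
					timer.Stop()
					return
				case <-timer.C:
				}
			}
		}
		d.discoverAndUpdate()
		lastUpdate = time.Now()
	}
}

// checkBackoffAndCancel reports whether the second update was held back and
// whether the loop exited before the backoff window ended once canceled.
func checkBackoffAndCancel() (throttled, exitedPromptly bool) {
	const interval = 500 * time.Millisecond
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	d := &countingDiscoverer{}
	signal := make(chan struct{}, 2)
	done := make(chan struct{})
	go func() {
		defer close(done)
		updateLoop(ctx, signal, interval, d)
	}()

	// Two back-to-back signals: only the first should run immediately.
	signal <- struct{}{}
	signal <- struct{}{}
	deadline := time.Now().Add(2 * time.Second)
	for d.count() == 0 && time.Now().Before(deadline) {
		time.Sleep(time.Millisecond)
	}
	throttled = d.count() == 1

	// Cancel while the loop is waiting on the backoff timer.
	cancel()
	select {
	case <-done:
		exitedPromptly = d.count() == 1
	case <-time.After(interval / 2):
	}
	return
}
```

In a real test the same checks would typically be written as assertions on a `*testing.T`. Polling for the first call avoids relying on a fixed sleep to guess when the goroutine has started.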

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

size/S, lgtm

Suggested reviewers

  • lhy1024

Poem

🐇 I pause on a pebble, three heartbeats I keep,
When updates come tumbling, I hush them to sleep.
A patient small hop, then I fetch what is due,
Calm steps, steady timing — a rabbit's review. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main change: adding a backoff mechanism for service URL updates. It is concise, specific, and directly relates to the implemented functionality. |
| Description check | ✅ Passed | The PR description includes issue references, a clear summary of changes, validation steps, and a release note section. All required template sections are either completed or appropriately marked as 'None'. |
| Linked Issues check | ✅ Passed | The changes implement the backoff mechanism objective from #10539: adding throttling for repeated RM service URL updates while maintaining context-aware interruption for prompt shutdown. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to implementing the backoff mechanism for service URL updates. The test file validates the new backoff behavior without introducing unrelated modifications. |


@bufferflies
Contributor Author

Re-reviewed the change on top of commit 31c1608aa15de85a7ca2a1784e1b5407c8c22c31.

I do not have a new code finding on this delta. The backoff is confined to the error-triggered refresh path, and the wait stays context-aware so shutdown can still break out promptly.

Residual risk: I do not see a targeted regression test for the new backoff behavior itself in this PR, so the current validation still relies on the existing RM discovery test surface plus make check.

Signed-off-by: bufferflies <1045931706@qq.com>
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 1, 2026

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/servicediscovery/resource_manager_service_discovery.go`:
- Line 265: The log reads r.serviceURL without holding r.mu, causing a data
race; fix by reading the protected value while holding the mutex (e.g., use
r.mu.RLock()/RUnlock() around the read) or call the existing GetServiceURL() to
obtain the URL safely, then pass that safe value into log.Info (references:
r.serviceURL, r.mu, GetServiceURL(), resetConn()).
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4ba4e40f-ef93-49e9-a208-3417e5ed7d4a

📥 Commits

Reviewing files that changed from the base of the PR and between 048f0d8 and 31c1608.

📒 Files selected for processing (1)
  • client/servicediscovery/resource_manager_service_discovery.go

Signed-off-by: bufferflies <1045931706@qq.com>
@bufferflies
Contributor Author

Re-reviewed the latest follow-up on top of commit dbc6db0aa6372194908783f7012dfe341af814cc.

The previous test finding is now addressed:

  • the production default backoff remains unchanged
  • the test overrides the interval locally, so it no longer waits on the real 3s window
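
One common way to let a test shorten an interval without touching the production default is a package-level variable plus a restore function. The sketch below assumes that pattern; whether the PR uses a package variable, a struct field, or something else is not shown in this conversation, and `retryInterval` and `overrideRetryInterval` are hypothetical names.

```go
package main

import "time"

// retryInterval is a hypothetical package-level variable standing in for
// serviceURLRetryInterval; 3s matches the window mentioned in the review.
var retryInterval = 3 * time.Second

// overrideRetryInterval swaps in a different interval and returns a function
// that restores the previous value, intended for use with defer or t.Cleanup.
func overrideRetryInterval(d time.Duration) (restore func()) {
	prev := retryInterval
	retryInterval = d
	return func() { retryInterval = prev }
}

// demoOverride returns the interval in milliseconds while the override is
// active and again after it has been restored.
func demoOverride() (during, after int64) {
	restore := overrideRetryInterval(20 * time.Millisecond)
	during = retryInterval.Milliseconds()
	restore()
	after = retryInterval.Milliseconds()
	return
}
```

Tests that mutate a shared variable like this should not run in parallel with other tests that read it.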

I do not have a new finding on this follow-up delta.

@bufferflies bufferflies requested review from disksing and okJiang April 1, 2026 08:42
@codecov

codecov bot commented Apr 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.96%. Comparing base (3eb99ae) to head (dc612b5).
⚠️ Report is 13 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10540      +/-   ##
==========================================
+ Coverage   78.88%   78.96%   +0.07%     
==========================================
  Files         530      532       +2     
  Lines       71548    71818     +270     
==========================================
+ Hits        56439    56709     +270     
+ Misses      11092    11078      -14     
- Partials     4017     4031      +14     
Flag Coverage Δ
unittests 78.96% <100.00%> (+0.07%) ⬆️



@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
client/servicediscovery/resource_manager_service_discovery.go (1)

266-266: ⚠️ Potential issue | 🔴 Critical

Read serviceURL under lock (or via accessor) in this log path.

This line reads r.serviceURL directly and can race with resetConn() writes. Use r.GetServiceURL() (or RLock) before logging.

Suggested fix
-			log.Info("[resource-manager] updating service URL", zap.String("old-url", r.serviceURL))
+			log.Info("[resource-manager] updating service URL", zap.String("old-url", r.GetServiceURL()))

As per coding guidelines: Guard shared state with mutex/RWMutex; keep lock ordering consistent.
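
The race described here, and the accessor-based fix, can be illustrated with a trimmed-down struct. This is a hypothetical sketch: the `discovery` type keeps only the two fields relevant to the finding, and the real resetConn() presumably does more than assign the URL.

```go
package main

import "sync"

// discovery is a hypothetical, trimmed-down stand-in for the resource manager
// service discovery struct.
type discovery struct {
	mu         sync.RWMutex
	serviceURL string
}

// GetServiceURL returns the current URL under a read lock.
func (r *discovery) GetServiceURL() string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.serviceURL
}

// resetConn replaces the URL under the write lock.
func (r *discovery) resetConn(url string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.serviceURL = url
}

// demoConcurrentAccess runs a writer and a reader concurrently and returns
// the final URL. The reader never touches r.serviceURL directly.
func demoConcurrentAccess() string {
	r := &discovery{serviceURL: "http://rm-0"}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			r.resetConn("http://rm-1")
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_ = r.GetServiceURL()
		}
	}()
	wg.Wait()
	return r.GetServiceURL()
}
```

Replacing the `r.GetServiceURL()` call in the reader with a direct `r.serviceURL` read is the kind of access `go test -race` is designed to flag.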

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/servicediscovery/resource_manager_service_discovery.go` at line 266,
The log reads r.serviceURL without holding the lock, which can race with
resetConn() writes. Wrap the read in the same synchronization used elsewhere:
either call r.GetServiceURL() (the safe accessor) or acquire
r.mu.RLock()/RUnlock() around the read before calling log.Info; ensure you use
the same lock ordering as other methods (e.g., methods that call resetConn())
to avoid deadlocks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4e2e33ec-de2f-4ab8-bea1-79b978b7aca7

📥 Commits

Reviewing files that changed from the base of the PR and between 31c1608 and dbc6db0.

📒 Files selected for processing (2)
  • client/servicediscovery/resource_manager_service_discovery.go
  • client/servicediscovery/resource_manager_service_discovery_test.go

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Apr 1, 2026
Signed-off-by: bufferflies <1045931706@qq.com>
@bufferflies
Contributor Author

Re-reviewed the latest follow-up on top of commit dc612b5dc3811ccff02cf1c43e772a1f7d9f9e67.

The accepted r.serviceURL suggestion is now addressed: the log path reads the old URL through GetServiceURL() instead of reaching into the protected field directly.

I do not have a new finding on this follow-up delta.

@bufferflies
Contributor Author

/retest

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Apr 2, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 2, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-01 09:47:31.15574243 +0000 UTC m=+344856.361102487: ☑️ agreed by okJiang.
  • 2026-04-02 03:48:45.196218891 +0000 UTC m=+409730.401578938: ☑️ agreed by lhy1024.

@bufferflies
Contributor Author

/retest

@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: disksing, lhy1024, okJiang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [disksing,lhy1024,okJiang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bufferflies
Contributor Author

/retest by AI

@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 2, 2026

@bufferflies: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test pull-build
/test pull-build-next-gen
/test pull-check-deps
/test pull-error-log-review
/test pull-integration-realcluster-test
/test pull-unit-test-next-gen-1
/test pull-unit-test-next-gen-2
/test pull-unit-test-next-gen-3

The following commands are available to trigger optional jobs:

/test pull-integration-copr-test
/test pull-integration-realcluster-test-next-gen
/test pull-unit-test

Use /test all to run the following jobs that were automatically triggered:

pull-build
pull-build-next-gen
pull-check-deps
pull-error-log-review
pull-unit-test-next-gen-1
pull-unit-test-next-gen-2
pull-unit-test-next-gen-3
tikv/pd/pull_integration_realcluster_test
tikv/pd/pull_integration_realcluster_test_next_gen
Details

In response to this:

/retest by AI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@bufferflies
Contributor Author

/retest

@ti-chi-bot ti-chi-bot bot merged commit 5cb07cf into tikv:master Apr 2, 2026
41 of 43 checks passed

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6e2a4d20 add backoff mechanism for service URL updates

4 participants