util: prioritize cancellations in retry loop #154782

Merged: craig[bot] merged 1 commit into cockroachdb:master from sanki92:fix-retry-test-timing-154764 on Oct 6, 2025

Conversation

@sanki92

@sanki92 sanki92 commented Oct 3, 2025

Contributor

This commit teaches util.Retry to prioritize context cancellations and stoppers over retry attempts. This ensures more consistent behaviors and reduces test flakes.

Fixes: #154764

Release note: None

@blathers-crl

blathers-crl Bot commented Oct 3, 2025


Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

Before a member of our team reviews your PR, I have some potential action items for you:

  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

I was unable to automatically find a reviewer. You can try CCing one of the following members:

  • A person you worked with closely on this PR.
  • The person who created the ticket, or a CRDB organization member involved with the ticket (author, commenter, etc.).
  • Join our community slack channel and ask on #contributors.
  • Try to find someone else from here.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl Bot added the O-community (Originated from the community) and X-blathers-untriaged (blathers was unable to find an owner) labels Oct 3, 2025
@cockroach-teamcity

Member

This change is Reviewable

@cockroachlabs-cla-agent

cockroachlabs-cla-agent Bot commented Oct 3, 2025


CLA assistant check
All committers have signed the CLA.

@yuzefovich

Member

Thanks for opening a PR! I'm not sure if the fix is quite right, I'll defer to @kev-cao who authored the test for that.

Generally, we don't expect community members to look into test failures, so I'd encourage you to take a look at #41815 as a starting point to find an interesting feature issue to work on.

@yuzefovich removed the X-blathers-untriaged (blathers was unable to find an owner) label Oct 3, 2025

@kev-cao kev-cao left a comment

Contributor

This LGTM! While this test does use a manually controlled clock, these two particular subtests have some variability due to the select statement randomly picking between the stopped channels versus the clock's timer.

Increasing the timing tolerance like this will decrease the number of flakes without affecting the correctness of the test, although to fully resolve the issue we'd probably want to repeatedly advance the manual clock by some fraction of the backoff to avoid ties.

@sanki92

sanki92 commented Oct 3, 2025

Contributor Author

@yuzefovich Thanks for the feedback! I appreciate @kev-cao confirming the fix approach.

I notice the CI failures seem to be infrastructure-related rather than issues with the code changes. Should I wait for these to resolve, or is there anything specific I should address?

Also, thank you for pointing me toward #41815 for feature work - I'll definitely explore those opportunities for more substantial contributions!

@kev-cao kev-cao left a comment

Contributor

Well, the CI issues were a little weird, since those subtests were failing under duress even though they should be skipped under duress.

That being said, while I was investigating this, I realized that there is a better solution here that eliminates the need to skip running these tests under duress. In retry.Next, we currently perform a blocking select on the backoff timer, context cancellation, and stopper.

This does mean that with shorter backoffs, context cancellation/stoppers are not prioritized in the event that all three channels are ready. This is the reason why we are running into these flakes.

If we instead do a two-stage select (a non-blocking select on the context/stopper first, followed by our blocking select), context cancellations and stops are correctly prioritized. This prevents these flakes entirely, and we can delete the code for skipping under duress.

@sanki92 If you'd like to give this a try, you are more than welcome to. Otherwise I can put up a quick PR for it.

@sanki92

sanki92 commented Oct 3, 2025

Contributor Author

@kev-cao I'd love to give this a try! Thank you for the detailed explanation of the two-stage select approach - it makes perfect sense and is a much more elegant solution than the timing tolerance bandaid.

I'll implement the non-blocking select for context/closer prioritization followed by the blocking select with timer, and remove the duress skipping logic as you described.

This is exactly the kind of meaningful contribution I was hoping to make. I appreciate you taking the time to mentor me through the proper solution!

@blathers-crl

blathers-crl Bot commented Oct 3, 2025


Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try to break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force.
  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@sanki92

sanki92 commented Oct 3, 2025

Contributor Author

@kev-cao Implementation complete! Two-stage select with context/closer prioritization is now in place, and duress skipping logic has been removed. Ready for your review!

Comment thread pkg/util/retry/retry_test.go Outdated
// Under duress, closing a channel will not necessarily stop the retry
// loop immediately, so we skip this test under duress.
skipUnderDuress: true,
expectedTimeSpent: time.Millisecond,

Contributor

Since the context/stopper is canceled before the retry loop, the expected time would be 0 instead of 1.

Contributor Author

Fixed ✅

Comment thread pkg/util/retry/retry_test.go Outdated
@@ -549,13 +534,6 @@ func TestRetryWithMaxDuration(t *testing.T) {
t, tc.expectedTimeSpent, timeSource.Since(start), "expected time does not match actual spent time",

Contributor

We'll also need to get rid of the condition or else the tests are essentially a no-op.

Contributor Author

Done! Let me know if you spot anything else.

@kev-cao

kev-cao commented Oct 3, 2025

Contributor

Does the test still pass?

@sanki92

sanki92 commented Oct 3, 2025

Contributor Author

I can't test this locally because the generated code requires the Bazel build system. The logic should be correct, though!

@kev-cao

kev-cao commented Oct 4, 2025

Contributor

Hmm, you should be able to build following the instructions here. In any case, the test fails because the two subtests report that 1 millisecond has passed.

@sanki92

sanki92 commented Oct 4, 2025

Contributor Author

@kev-cao Fixed the expected times back to 1ms. Based on your feedback that the test was failing because it reported 1ms elapsed (when we expected 0), I analyzed the code and realized the backingOffHook advances manual time even when cancellation is detected immediately.

Also my system isn't able to handle the build process properly to test locally, but the logic should be correct now.

@kev-cao

kev-cao commented Oct 4, 2025

Contributor

I think the implementation needs updating. If the context is canceled before the retry loop ever runs, then no backoff will have been performed.

@sanki92

sanki92 commented Oct 4, 2025

Contributor Author

I've made the changes. Could you check it locally when you get a chance? My setup can't handle the build. Thanks!

This commit teaches `util.Retry` to prioritize context cancellations and
stoppers over retry attempts. This ensures more consistent behaviors and
reduces test flakes.

Fixes: cockroachdb#154764

Release note: None
@kev-cao force-pushed the fix-retry-test-timing-154764 branch from 7a49f1b to 8c3abab October 6, 2025 15:44
@kev-cao changed the title from "util/retry: increase timing tolerance for cancellation tests" to "util: prioritize cancellations in retry loop" Oct 6, 2025

@kev-cao kev-cao left a comment

Contributor

This LGTM! Updated the commit message and PR to follow our conventions. I'll get one more set of eyes on this before we merge it. Thanks for the contribution!

@kev-cao requested review from dt and yuzefovich and removed request for dt October 6, 2025 15:48

@yuzefovich yuzefovich left a comment

Member

:lgtm: Probably worth backporting to 25.4?

@yuzefovich reviewed 1 of 2 files at r3, 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kev-cao and @sanki92)

@kev-cao added the backport-25.4.x (Flags PRs that need to be backported to 25.4) label Oct 6, 2025
@kev-cao

kev-cao commented Oct 6, 2025

Contributor

bors r=kev-cao,yuzefovich

@craig

craig Bot commented Oct 6, 2025

Contributor

@craig craig Bot merged commit 52833a4 into cockroachdb:master Oct 6, 2025
27 checks passed

Labels

  • backport-25.4.x (Flags PRs that need to be backported to 25.4)
  • O-community (Originated from the community)
  • v26.1.0-prerelease

Development

Successfully merging this pull request may close these issues.

util/retry: TestRetryWithMaxDuration failed

4 participants