[IMPROVED] NRG: Step down/pause quorum if we're being overrun#7853
[IMPROVED] NRG: Step down/pause quorum if we're being overrun#7853MauriceVanVeen wants to merge 1 commit intomainfrom
Conversation
a9c46e6 to
d98f0dd
Compare
d98f0dd to
04cc280
Compare
04cc280 to
cbf0c97
Compare
|
@codex, please review. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cbf0c97b3e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
18686ac to
316d247
Compare
316d247 to
8fb6410
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8fb6410331
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Need to check other places that use these proposals, as they likely can't handle proposals to be dropped. |
8fb6410 to
8f1427d
Compare
|
Updated to now step down if the leader is falling behind/being overrun. That's loads safer, and gives some "breathing room". A new leader election will only start if all committed entries have been applied, and the new leader will only allow new stream writes/consumer operations to get added until after all uncommitted entries in the log got quorum and applied. |
8f1427d to
b0d4f33
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b0d4f3392b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
b0d4f33 to
ccdfc67
Compare
…g several concerns Apply the changes from PR nats-io#7853 (NRG: Step down/pause quorum if overrun) and add comprehensive tests that reproduce potential issues: BUG - uint64 underflow in quorum pause diff calculation: When quorumPaused=true and max(applied, papplied) > commit (e.g. after snapshot install), `diff := n.commit - applied` underflows to a huge uint64 value, permanently locking the follower in paused state. CONCERN - quorumPaused not reset on state transitions: The quorumPaused flag persists across candidate/leader/follower transitions. A node that was paused as follower remains paused through election cycles without going through the threshold check. CONCERN - Missing overrun checks on peer proposals: ProposeAddPeer, ProposeRemovePeer, and handleForwardedRemovePeerProposal do not have overrun protection, allowing membership changes to grow the WAL even when the leader should be stepping down. CONCERN - Catchup subs blocked by quorum pause: The pause check uses `sub != nil` not `isNew`, so catchup entries are also blocked, preventing recovery through catchup. CONCERN - quorumPaused persists across leader changes: When a new leader takes over, a paused follower stays paused from the previous leader's overrun state. https://claude.ai/code/session_01W9U7SsaXRxehB9Cw34jHuW
ccdfc67 to
0843178
Compare
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
0843178 to
7798592
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7798592a2d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Sync local raft.go with PR's latest force-push (March 6): - Rename stepDownIfOverrun -> isLeaderOverrun - Add commit := max(n.commit, n.papplied) to fix uint64 underflow - Reset quorumPaused in switchToCandidate Update tests: - TestNRGQuorumPausedResetOnCandidateTransition: now asserts fix works - TestNRGQuorumPausedNoUnderflowWhenAppliedExceedsCommit: asserts fix - Remaining concerns (peer proposals, catchup sub, leader change) unchanged https://claude.ai/code/session_01W9U7SsaXRxehB9Cw34jHuW
This PR adds a protective measure to ensure we can guard against unbounded WAL growth. Currently, overloaded servers could see their meta log (or any replicated stream/consumer log) grow well over several GBs, eventually requiring the log to be manually deleted on the server in order to recover.
The threshold is reasonably high. We keep incoming append entries cached in
n.paeand this starts logging a warning atpaeWarnThreshold: 10kand eventually caps the cache size atpaeDropThreshold: 20kat which point new entries aren't cached and need to be loaded from disk instead when they are committed. Both the above protective measures only kick in when going overpauseQuorumThreshold: 100kappend entries that haven't gotten quorum on the leader, or that have been committed but not yet applied on the follower. Slow followers that are not required for quorum will bound their log growth. If the leader itself is being overrun/slow we step it down. Under normal circumstances the natural flow control will ensure neither the followers nor the leader can fall behind. However, followers outside of quorum should "pause quorum" earlier than the leader steps down. The latter might help finding another server to become leader that's faster than the previous one, or the system as a whole is overloaded and we slow down to protect ourselves.Signed-off-by: Maurice van Veen github@mauricevanveen.com