[IMPROVED] NRG: Step down/pause quorum if we're being overrun#7853

Open
MauriceVanVeen wants to merge 1 commit into main from maurice/nrg-overrun-protection

Member

@MauriceVanVeen MauriceVanVeen commented Feb 19, 2026

This PR adds a protective measure to guard against unbounded WAL growth. Currently, overloaded servers could see their meta log (or any replicated stream/consumer log) grow to several GBs, eventually requiring the log to be manually deleted on the server in order to recover.

  • The leader will step down if it has reached a certain threshold of uncommitted and unapplied entries. We already wait to apply all entries in the log before we signal to the upper layer that we're the leader, so this has no impact during leader changes. This protection ensures that once we're leader we're not being spammed with proposals faster than we can commit and apply them.
  • The followers will store entries in their logs before the leader can mark them as having quorum/being committed. If a follower applies entries more slowly than the leader makes it append new entries and marks them as committed, the WAL on this follower will grow unbounded. And since all to-be-applied entries are pushed into the apply queue, this eventually makes the server go OOM. This protection ensures the follower temporarily stops accepting new writes so it can work through the apply backlog first. This bounds the total number of committed but not-yet-applied entries, and allows the follower to be caught up by the leader from a snapshot instead of continuously storing new append entries and growing its log indefinitely.
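
The two checks described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `node` type, its field names, and both method names are hypothetical, and the point at which a paused follower resumes (as soon as the backlog is back at or under the threshold) is an assumption. Only the `pauseQuorumThreshold` name and its 100k value come from the description.

```go
package main

import "fmt"

// pauseQuorumThreshold mirrors the 100k figure quoted in the PR description.
const pauseQuorumThreshold = 100_000

// node is a hypothetical, stripped-down stand-in for a Raft node.
// The field names are illustrative and not the real NRG struct.
type node struct {
	pindex  uint64 // index of the last entry stored in the log
	commit  uint64 // index of the last entry known to have quorum
	applied uint64 // index of the last entry applied by the upper layer
	paused  bool   // follower only: true while new append entries are refused
}

// leaderOverrun reports whether the leader holds more uncommitted plus
// unapplied entries than the threshold, in which case it should step down.
func (n *node) leaderOverrun() bool {
	if n.pindex <= n.applied {
		return false
	}
	return n.pindex-n.applied > pauseQuorumThreshold
}

// followerShouldPause updates and returns the paused flag based on the
// backlog of committed but not-yet-applied entries.
// Assumption: the follower resumes once the backlog is no longer above
// the threshold.
func (n *node) followerShouldPause() bool {
	var backlog uint64
	if n.commit > n.applied {
		backlog = n.commit - n.applied
	}
	n.paused = backlog > pauseQuorumThreshold
	return n.paused
}

func main() {
	leader := &node{pindex: 250_000, commit: 120_000, applied: 100_000}
	fmt.Println("leader overrun:", leader.leaderOverrun())

	follower := &node{pindex: 300_000, commit: 300_000, applied: 150_000}
	fmt.Println("follower paused:", follower.followerShouldPause())

	// The follower works through part of its apply backlog.
	follower.applied = 250_000
	fmt.Println("follower paused:", follower.followerShouldPause())
}
```

Both checks guard the subtraction so that an unsigned difference can never wrap around.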

The threshold is reasonably high. We keep incoming append entries cached in n.pae; this starts logging a warning at paeWarnThreshold: 10k and eventually caps the cache size at paeDropThreshold: 20k, at which point new entries aren't cached and need to be loaded from disk when they are committed. Both of the above protective measures only kick in when going over pauseQuorumThreshold: 100k append entries that haven't gotten quorum on the leader, or that have been committed but not yet applied on the follower. Slow followers that are not required for quorum will bound their log growth. If the leader itself is being overrun/slow, we step it down. Under normal circumstances the natural flow control will ensure neither the followers nor the leader can fall behind. However, followers outside of quorum should "pause quorum" earlier than the leader steps down. Stepping the leader down might help find another server to become leader that's faster than the previous one; otherwise the system as a whole is overloaded and we slow down to protect ourselves.
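
To make the tiers concrete, here is a small sketch of how the pae cache thresholds relate to each other. The constant names and values are taken from the description above; the `paeAction` helper, its return strings, and the exact boundary semantics (`>=` rather than `>`) are assumptions made for illustration.

```go
package main

import "fmt"

// Threshold names and values as quoted in the PR description.
const (
	paeWarnThreshold     = 10_000
	paeDropThreshold     = 20_000
	pauseQuorumThreshold = 100_000
)

// paeAction is a hypothetical helper that describes what happens to an
// incoming append entry given the current size of the pae cache.
// Assumption: each threshold takes effect once the size reaches it.
func paeAction(cached int) string {
	switch {
	case cached >= paeDropThreshold:
		return "not cached, loaded from disk on commit"
	case cached >= paeWarnThreshold:
		return "cached, warning logged"
	default:
		return "cached"
	}
}

func main() {
	for _, size := range []int{500, 10_000, 20_000} {
		fmt.Printf("%d cached entries: %s\n", size, paeAction(size))
	}
	// The overrun protection only kicks in well past the cache cap.
	fmt.Println("pause threshold / drop threshold:", pauseQuorumThreshold/paeDropThreshold)
}
```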

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

@MauriceVanVeen
Member Author

@codex, please review.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cbf0c97b3e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@MauriceVanVeen MauriceVanVeen force-pushed the maurice/nrg-overrun-protection branch 2 times, most recently from 18686ac to 316d247 Compare March 5, 2026 16:05
@MauriceVanVeen MauriceVanVeen marked this pull request as ready for review March 5, 2026 16:25
@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner March 5, 2026 16:25
chatgpt-codex-connector[bot]

This comment was marked as resolved.

@MauriceVanVeen MauriceVanVeen requested a review from sciascid March 5, 2026 17:05
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/nrg-overrun-protection branch from 316d247 to 8fb6410 Compare March 5, 2026 17:05
Contributor

@sciascid sciascid left a comment

LGTM

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8fb6410331

@MauriceVanVeen
Member Author

Need to check other places that use these proposals, as they likely can't handle proposals being dropped.
The "pausing quorum" on followers will be fine, but dropping proposals on the leader might not be. Perhaps it's better to step down when that happens, as we'll stop accepting new changes and the new leader will need to get quorum on uncommitted entries first too. Need to think about this more.

@MauriceVanVeen MauriceVanVeen marked this pull request as draft March 6, 2026 06:46
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/nrg-overrun-protection branch from 8fb6410 to 8f1427d Compare March 6, 2026 08:28
@MauriceVanVeen
Member Author

Updated to now step down if the leader is falling behind/being overrun. That's loads safer, and gives some "breathing room". A new leader election will only start if all committed entries have been applied, and the new leader will only allow new stream writes/consumer operations to be added after all uncommitted entries in the log have gotten quorum and been applied.

@MauriceVanVeen MauriceVanVeen marked this pull request as ready for review March 6, 2026 08:56
chatgpt-codex-connector[bot]

This comment was marked as resolved.

@MauriceVanVeen MauriceVanVeen force-pushed the maurice/nrg-overrun-protection branch from 8f1427d to b0d4f33 Compare March 6, 2026 09:05
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0d4f3392b

@MauriceVanVeen MauriceVanVeen changed the title from "[IMPROVED] NRG: Drop proposals/pause quorum if we're being overrun" to "[IMPROVED] NRG: Step down/pause quorum if we're being overrun" Mar 6, 2026
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/nrg-overrun-protection branch from b0d4f33 to ccdfc67 Compare March 6, 2026 10:22
wallyqs pushed a commit to wallyqs/nats-server that referenced this pull request Mar 6, 2026
…g several concerns

Apply the changes from PR nats-io#7853 (NRG: Step down/pause quorum if overrun)
and add comprehensive tests that reproduce potential issues:

BUG - uint64 underflow in quorum pause diff calculation:
  When quorumPaused=true and max(applied, papplied) > commit (e.g. after
  snapshot install), `diff := n.commit - applied` underflows to a huge
  uint64 value, permanently locking the follower in paused state.

CONCERN - quorumPaused not reset on state transitions:
  The quorumPaused flag persists across candidate/leader/follower
  transitions. A node that was paused as follower remains paused
  through election cycles without going through the threshold check.

CONCERN - Missing overrun checks on peer proposals:
  ProposeAddPeer, ProposeRemovePeer, and handleForwardedRemovePeerProposal
  do not have overrun protection, allowing membership changes to grow
  the WAL even when the leader should be stepping down.

CONCERN - Catchup subs blocked by quorum pause:
  The pause check uses `sub != nil` not `isNew`, so catchup entries
  are also blocked, preventing recovery through catchup.

CONCERN - quorumPaused persists across leader changes:
  When a new leader takes over, a paused follower stays paused
  from the previous leader's overrun state.

https://claude.ai/code/session_01W9U7SsaXRxehB9Cw34jHuW
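
The underflow reported in the commit message above is ordinary unsigned wrap-around in Go. The following standalone sketch reproduces the arithmetic only; `naiveBacklog` and the sample values are hypothetical, and the threshold constant reuses the 100k figure from the PR description.

```go
package main

import "fmt"

// pauseQuorumThreshold mirrors the 100k figure from the PR description.
const pauseQuorumThreshold = 100_000

// naiveBacklog is a hypothetical helper that subtracts without a guard,
// the same shape as `diff := n.commit - applied`.
func naiveBacklog(commit, applied uint64) uint64 {
	return commit - applied
}

func main() {
	// After a snapshot install the applied index can be ahead of commit.
	commit, applied := uint64(1_000), uint64(1_500)

	diff := naiveBacklog(commit, applied)
	fmt.Println("diff:", diff) // wraps to 2^64 - 500 = 18446744073709551116

	// The wrapped value is always above the threshold, so a check of the
	// form `diff > threshold` would keep the follower paused.
	fmt.Println("still paused:", diff > pauseQuorumThreshold)
}
```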
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/nrg-overrun-protection branch from ccdfc67 to 0843178 Compare March 6, 2026 11:09
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/nrg-overrun-protection branch from 0843178 to 7798592 Compare March 6, 2026 11:47
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7798592a2d

wallyqs pushed a commit to wallyqs/nats-server that referenced this pull request Mar 6, 2026
Sync local raft.go with PR's latest force-push (March 6):
- Rename stepDownIfOverrun -> isLeaderOverrun
- Add commit := max(n.commit, n.papplied) to fix uint64 underflow
- Reset quorumPaused in switchToCandidate

Update tests:
- TestNRGQuorumPausedResetOnCandidateTransition: now asserts fix works
- TestNRGQuorumPausedNoUnderflowWhenAppliedExceedsCommit: asserts fix
- Remaining concerns (peer proposals, catchup sub, leader change) unchanged

https://claude.ai/code/session_01W9U7SsaXRxehB9Cw34jHuW
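
A rough sketch of the two fixes listed in this commit message, under stated assumptions: the `node` type and `unappliedBacklog` are hypothetical, `applied` is assumed to be computed as `max(n.applied, n.papplied)` as the earlier bug report suggests, and `maxU64` stands in for Go's built-in `max` so the example also compiles on older toolchains. Only the `commit := max(n.commit, n.papplied)` clamp and the reset of `quorumPaused` in `switchToCandidate` are taken from the text.

```go
package main

import "fmt"

// node is a hypothetical stand-in holding only the fields named in the
// commit message.
type node struct {
	commit       uint64
	applied      uint64
	papplied     uint64 // index covered by the last installed snapshot
	quorumPaused bool
}

// maxU64 stands in for the built-in max (Go 1.21+).
func maxU64(a, b uint64) uint64 {
	if a > b {
		return a
	}
	return b
}

// unappliedBacklog clamps both sides to papplied before subtracting, so
// the difference cannot wrap when a snapshot is ahead of commit.
func (n *node) unappliedBacklog() uint64 {
	applied := maxU64(n.applied, n.papplied)
	commit := maxU64(n.commit, n.papplied)
	if commit < applied {
		return 0 // defensive; assumes n.applied <= n.commit normally holds
	}
	return commit - applied
}

// switchToCandidate clears the pause flag on the state transition.
func (n *node) switchToCandidate() {
	n.quorumPaused = false
}

func main() {
	// Snapshot installed ahead of commit: previously this underflowed.
	n := &node{commit: 1_000, applied: 900, papplied: 1_500, quorumPaused: true}
	fmt.Println("backlog:", n.unappliedBacklog())

	n.switchToCandidate()
	fmt.Println("paused after candidate transition:", n.quorumPaused)
}
```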