
Raft batching #7890

Draft
sciascid wants to merge 4 commits into main from raft-batching-opt

Conversation

Contributor

@sciascid sciascid commented Mar 2, 2026

Optimized Raft batching

Signed-off-by: Daniele Sciascia <daniele@nats.io>

For performance evaluation only.

Signed-off-by: Daniele Sciascia <daniele@nats.io>
// We can recover from the snapshot and the tail
// of the log.
fsCfg.SyncAlways = false
fsCfg.SyncOnFlush = true
Member

The stream can be scaled down and up at any time, would these need to be updated then as well?

Mentioned this also in another comment relating to fsyncing in FlushAllPending, but perhaps this improved fsyncing strategy should be in its own PR?

n.RUnlock()
return werr
}
n.RUnlock()
Member

ProposeMulti below is still holding the n.RLock() when pushing the proposal, should that also be the case here? Then it could just be a defer n.RUnlock() above.
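A minimal sketch of the suggested shape (illustrative `node`/`propose` names, with a buffered channel standing in for the internally synchronized proposal queue; not the actual raft.go code):

```go
package main

import (
	"fmt"
	"sync"
)

// Sketch of the suggested pattern: hold the read lock for the whole
// operation, including the queue push, and release it via a single defer
// instead of unlocking on each branch.
type node struct {
	mu   sync.RWMutex
	prop chan string // stand-in for the internally synchronized proposal queue
	werr error       // sticky write error, checked under the lock
}

func (n *node) propose(entry string) error {
	n.mu.RLock()
	defer n.mu.RUnlock() // one unlock path, as with ProposeMulti
	if n.werr != nil {
		return n.werr
	}
	// Push while still holding the read lock.
	n.prop <- entry
	return nil
}

func main() {
	n := &node{prop: make(chan string, 8)}
	if err := n.propose("entry-1"); err != nil {
		panic(err)
	}
	fmt.Println(len(n.prop)) // 1
}
```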

defer n.Unlock()
// Don't check if we're leader before sending and storing, this is used on scaleup.
n.sendAppendEntryLocked([]*Entry{{EntrySnapshot, data}}, false)
n.prop.push(newProposedEntry(newEntry(EntrySnapshot, data), _EMPTY_))
Member

Could this be reverted or was there a specific reason to potentially batch this?

SendSnapshot isn't used often enough to be eligible for batching, is what I'm thinking (it's only called once during a stream/consumer scaleup). Also, as the use of SendSnapshot has led to issues in the past, I'd rather not touch it unless we're sure about it.

Comment on lines +2957 to +2959
"index": index,
"reserved": reserved,
"membershipChangeIndex": n.membChangeIndex,
Member

Could you also add the account, group and ID here?

Suggested change
+ "n.accName": n.accName,
+ "n.group": n.group,
+ "n.id": n.id,
  "index": index,
  "reserved": reserved,
  "membershipChangeIndex": n.membChangeIndex,

Comment on lines +4435 to +4441
l := len(ae.entries)
if l == 0 {
return false
}
if l == 1 {
return ae.entries[0].Type != EntryLeaderTransfer
}
Member

Could optionally shorten:

Suggested change
- l := len(ae.entries)
- if l == 0 {
- 	return false
- }
- if l == 1 {
- 	return ae.entries[0].Type != EntryLeaderTransfer
- }
+ if l := len(ae.entries); l == 0 {
+ 	return false
+ } else if l == 1 {
+ 	return ae.entries[0].Type != EntryLeaderTransfer
+ }

// Specifically done while holding the lock to not race.
n.RLock()
if checkLeader && n.State() != Leader {
n.RUnlock()
Member

The lock needs to be held when calling n.debug below, as its value may otherwise race if the config is reloaded.
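As a toy illustration of the hazard (hypothetical `srv` type, not the server's actual fields), reading the flag only while the caller's read lock is held avoids the race with a config reload:

```go
package main

import (
	"fmt"
	"sync"
)

// Sketch of why the lock must be held across the debug call: if the flag
// can be rewritten on a config reload, reading it outside the lock races.
type srv struct {
	mu    sync.RWMutex
	debug bool
}

// reload models a config reload flipping the flag under the write lock.
func (s *srv) reload(debug bool) {
	s.mu.Lock()
	s.debug = debug
	s.mu.Unlock()
}

// logfLocked reads the flag under the read lock the caller already holds.
func (s *srv) logfLocked(msg string) {
	if s.debug {
		fmt.Println("DBG:", msg)
	}
}

func (s *srv) doWork() {
	s.mu.RLock()
	defer s.mu.RUnlock()
	s.logfLocked("working") // safe: flag read under the same RLock
}

func main() {
	s := &srv{debug: true}
	s.doWork() // prints "DBG: working"
}
```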

n.aflr = n.pindex
index, err := n.sendAppendEntry([]*Entry{peerState}, true)
if err != nil {
return
Member

The error is ignored here, should we log about this and step down?

fs.mu.Lock()
defer fs.mu.Unlock()
fs.checkAndFlushLastBlock()
if fs.fcfg.SyncOnFlush {
Member

I'm not sure this change should be made as part of this PR, or whether it's actually safe, although I do agree that fsync-ing for every append entry and then once again for every single message is quite slow.

Imagine 5 message blocks that were written to but not synced; then this should be fine. But if the OS did decide to sync some blocks or only the last block, we'd come up and not be able to recover the data written in the previous blocks? (As we check for the recovered last sequence and, when replaying from the snapshot during recovery, skip over any messages below this last sequence.)

A safer change, for the time being, would be to immediately sync only the last block when we flush and close the last message block to open the next in fs.newMsgBlockForWrite.

Contributor Author

We call FlushAllPending before a stream snapshot is created; if SyncOnFlush is set then we will call into syncBlocks, causing all message blocks that are marked as needSync to be synced to stable storage. My understanding is that if a snapshot was installed in Raft, then all indexes up to the last sequence in the snapshot will have been synced. Upon recovery, we seem to restore the last snapshot, and reset the raft state starting from the last sequence found in that snapshot. All subsequent sequences are then replayed.

A safer change, for the time being, would be to immediately sync only the last block when we flush and close the last message block to open the next in fs.newMsgBlockForWrite.

I don't think that would be enough. Suppose we have one full block b1 and are then creating a new block b2, which is only partially full. b1 was synced when b2 was created. Taking a snapshot at this point would include b1, but could possibly lose b2. Meaning that we still need to sync b2, i.e. the last block, before we take the snapshot.
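A toy model of this point (illustrative names, not the filestore API): syncing blocks only when they are rotated still leaves the open last block at risk, so a pre-snapshot sync of all pending blocks is needed:

```go
package main

import "fmt"

// Toy model: syncing full blocks only on rotation (the "sync in
// newMsgBlockForWrite" proposal) leaves the partially filled last block
// unsynced, so a snapshot taken at that moment could lose it.
type block struct {
	full   bool
	synced bool
}

// rotateSync models syncing blocks only when a new block is opened:
// every block except the currently open last one gets synced.
func rotateSync(blocks []block) {
	for i := range blocks[:len(blocks)-1] {
		blocks[i].synced = true
	}
}

// syncAllPending models the FlushAllPending/syncBlocks call before a snapshot.
func syncAllPending(blocks []block) {
	for i := range blocks {
		blocks[i].synced = true
	}
}

func unsynced(blocks []block) int {
	n := 0
	for _, b := range blocks {
		if !b.synced {
			n++
		}
	}
	return n
}

func main() {
	blocks := []block{{full: true}, {full: false}} // b1 full, b2 partial
	rotateSync(blocks)
	fmt.Println(unsynced(blocks)) // 1: b2 is still at risk
	syncAllPending(blocks)
	fmt.Println(unsynced(blocks)) // 0: safe to snapshot
}
```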

Member

@MauriceVanVeen MauriceVanVeen Mar 3, 2026

Upon recovery, we seem to restore the last snapshot, and reset the raft state starting from the last sequence found in that snapshot. All subsequent sequences are then replayed.

Indeed, but replaying for the stream does not mean we start off with the snapshot only and then try persisting the replayed messages again.

Imagine we have no snapshot and 100 messages spread across various append entries. Couldn't the OS choose to sync the last message block on its own before we tried to sync? In that case our last sequence of 100 survives the restart, but all prior messages don't.

The following lines would then see you skipping messages 1-99 since our last sequence is 100:

if lseq-clfs < last {
s.Debugf("Apply stream entries for '%s > %s' skipping message with sequence %d with last of %d",
mset.accountLocked(needLock), mset.nameLocked(needLock), lseq+1-clfs, last)
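The scenario can be sketched numerically (a simplified model of the quoted check, with clfs omitted):

```go
package main

import "fmt"

// Simplified model of the replay-skip check quoted above: any replayed
// sequence at or below the recovered last sequence is skipped. So if only
// the block holding seq 100 survived a crash, seqs 1-99 are never
// rewritten during replay.
func replay(recoveredLast uint64, replayed []uint64) (applied, skipped []uint64) {
	for _, seq := range replayed {
		if seq <= recoveredLast {
			skipped = append(skipped, seq)
		} else {
			applied = append(applied, seq)
		}
	}
	return applied, skipped
}

func main() {
	// 100 messages in the log, but recovery found last sequence 100
	// because only the final block was synced before the crash.
	seqs := make([]uint64, 0, 100)
	for s := uint64(1); s <= 100; s++ {
		seqs = append(seqs, s)
	}
	applied, skipped := replay(100, seqs)
	fmt.Println(len(applied), len(skipped)) // 0 100
}
```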

Member

@MauriceVanVeen MauriceVanVeen Mar 3, 2026

I don't think that would be enough. Suppose we have one full block b1 and are then creating a new block b2, which is only partially full. b1 was synced when b2 was created. Taking a snapshot at this point would include b1, but could possibly lose b2. Meaning that we still need to sync b2, i.e. the last block, before we take the snapshot.

Yep, would be okay by me to also sync b2 when we try to install the snapshot. I was mostly highlighting that we wouldn't get into a situation where the OS chose to sync b2 first before we got to b1, then being restarted and only recovering b2 but not b1 (since that wasn't synced yet). The logic outlined in my previous message would then see you replay but skip the lost messages since the last sequence is higher.

Contributor Author

Notice that the OS/filesystem could choose to reorder writes within a file. In that case, I believe this optimized syncing strategy, combined with the above skipping, wouldn't work nicely. If we want to optimize this, I think we need a way to replay all entries in the Raft WAL in an idempotent way.

Member

Yeah, I had been thinking about that as well. Instead of having replay "skip" messages that were already processed (due to a high last seq), truncate based on the snapshot instead. That would reliably clean up anything in the stream "after" the snapshot, and replaying will write all the messages in the same order again. But I'm still wondering whether that should be in its own PR, so this PR remains focused primarily on the Raft batching in raft.go.
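A rough sketch of that truncate-then-replay idea (hypothetical types, not the server's code): drop everything above the snapshot's last sequence, then re-apply the log in order, which repairs holes left by out-of-order syncs.

```go
package main

import "fmt"

// Sketch of truncate-then-replay recovery: instead of skipping
// already-seen sequences, truncate the stream to the snapshot's last
// sequence and rewrite everything after it from the log.
type stream struct {
	msgs map[uint64]string // seq -> payload
	last uint64
}

// truncateTo removes all messages above the snapshot's last sequence.
func (s *stream) truncateTo(snapLast uint64) {
	for seq := range s.msgs {
		if seq > snapLast {
			delete(s.msgs, seq)
		}
	}
	s.last = snapLast
}

// replay re-applies contiguous log entries in order after the truncate.
func (s *stream) replay(entries map[uint64]string) {
	for seq := s.last + 1; ; seq++ {
		payload, ok := entries[seq]
		if !ok {
			break
		}
		s.msgs[seq] = payload
		s.last = seq
	}
}

func main() {
	// Crash left a hole: seq 2 lost, seq 3 survived an out-of-order sync.
	s := &stream{msgs: map[uint64]string{1: "a", 3: "c"}, last: 3}
	log := map[uint64]string{2: "b", 3: "c", 4: "d"}
	s.truncateTo(1) // snapshot covered up to seq 1
	s.replay(log)   // rewrites 2, 3 and 4 in order, closing the hole
	fmt.Println(s.last, len(s.msgs)) // 4 4
}
```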

Contributor Author

Did some experiments with lazyfs. I can confirm pretty much everything that was discussed here. One thing to note, relevant to this discussion: if the filesystem reorders writes within a block file, leaving a hole, recovery will set the last sequence for that block file to whatever it found before the hole (if anything).
That's still problematic: we can end up with gaps between files, or entire blocks missing.
The truncate strategy is apparently working, though.

Previously with SyncAlways enabled we had one fd.Sync()
call per raft append entry, plus one for each message
batched in the append entry. This resulted in unnecessary
calls to fd.Sync().
A replicated stream can recover from the stream snapshot
and the Raft log. So we keep syncing the Raft log as
usual, and introduce syncing the stream's filestore prior
to taking a snapshot.

Signed-off-by: Daniele Sciascia <daniele@nats.io>
Reduce contention between Propose() and sendAppendEntry() by changing
Propose() to take a read lock and by avoiding holding raft.mu across
storeToWAL(), which may block on IO.

This requires stored append entries to be emitted only from the leader
goroutine. To enforce that:

* do not store EntryLeaderTransfer in the Raft log
* push EntryPeerState and EntrySnapshot through the proposal queue
* batch normal proposals in runAsLeader()
* keep EntrySnapshot and membership changes standalone so they are not
  batched with normal entries
* let EntrySnapshot skip the leader check when appended from the leader loop

Signed-off-by: Daniele Sciascia <daniele@nats.io>
Cap batches to the configured max payload size and to 65535
entries, the maximum representable in an AppendEntry message.

Signed-off-by: Daniele Sciascia <daniele@nats.io>
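A rough sketch of how such caps might be applied when draining the proposal queue (illustrative names and constants, not the actual implementation):

```go
package main

import "fmt"

// Sketch of the batch caps described in the commit message: gather queued
// entries into one append entry, stopping at the payload-size cap or at
// 65535 entries, the most that fits in a 16-bit entry count.
const maxBatchEntries = 65535

// batch returns how many of the queued entry sizes fit in one append
// entry under both caps, and the total payload size taken.
func batch(sizes []int, maxPayload int) (n, total int) {
	for _, sz := range sizes {
		if n == maxBatchEntries || total+sz > maxPayload {
			break
		}
		n++
		total += sz
	}
	return n, total
}

func main() {
	// Five queued entries of 400 bytes each with a 1 KiB payload cap:
	// only the first two fit in this batch.
	n, total := batch([]int{400, 400, 400, 400, 400}, 1024)
	fmt.Println(n, total) // 2 800
}
```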
@sciascid sciascid force-pushed the raft-batching-opt branch from 82a15b0 to 5a0b857 on March 10, 2026 09:02
