Skip to content

Fix DecisionsInView reset to zero during same-height sync#663

Open
ivan-atme wants to merge 3 commits intohyperledger-labs:mainfrom
ivan-atme:fix-orderer-stuck
Open

Fix DecisionsInView reset to zero during same-height sync#663
ivan-atme wants to merge 3 commits intohyperledger-labs:mainfrom
ivan-atme:fix-orderer-stuck

Conversation

@ivan-atme
Copy link
Copy Markdown

When sync() is called but the synchronizer returns the same block height as the controller ("already at target height"), newDecisionsInView stays at its zero-initialized value because the update at line 641 is guarded by latestDecisionSeq > controllerSequence, which is false when equal.

This zero is passed to changeView(), resetting DecisionsInView from its correct value (e.g. 9578) to 0. The next proposal from the leader carries the correct DecisionsInView, fails validation at view.go:577, and is rejected as "bad proposal", triggering a recovery sync loop that costs ~10 seconds per cycle and can cause Hyperledger Fabric orderers to fall permanently behind.

Fix: decouple the DecisionsInView update from the checkpoint update by moving it to a separate if latestDecisionMetadata != nil block, so decisions are always derived from sync response metadata regardless of whether the sequence advanced.

When sync() is called but the synchronizer returns the same block height
as the controller ("already at target height"), newDecisionsInView stays
at its zero-initialized value because the update at line 641 is guarded
by `latestDecisionSeq > controllerSequence`, which is false when equal.

This zero is passed to changeView(), resetting DecisionsInView from its
correct value (e.g. 9578) to 0. The next proposal from the leader carries
the correct DecisionsInView, fails validation at view.go:577, and is
rejected as "bad proposal", triggering a recovery sync loop that costs
~10 seconds per cycle and can cause orderers to fall permanently behind.

Fix: decouple the DecisionsInView update from the checkpoint update by
moving it to a separate `if latestDecisionMetadata != nil` block, so
decisions are always derived from sync response metadata regardless of
whether the sequence advanced.

Signed-off-by: ivan-atme <ivan.laishevskiy@atme.com>
@HagarMeir
Copy link
Copy Markdown
Contributor

Thank you for discovering this bug and providing a fix.
Can you please attach a log with the events you described?

@ivan-atme
Copy link
Copy Markdown
Author

@HagarMeir I am attaching orderers' logs: org0-orderer-1-stuck.zip .

Sorry, logs are quite long (05:59:38-06:02:31). There is a reason for this (see below).

Additional context about the problem:

  • New blocks are stable arriving with approximate speed 15 blocks/min.
  • The problem recurring weekly when one of orderers suddenly gets stuck.
  • Stuck orderer restart returns it to be in sync.
  • The problem in these logs I am going to solve is that org0-orderer-1 got stuck in "otf" channel at height 2365915.

Timeline

Time Event
05:59:39–05:59:50 The bug I fixed currently happened here: DecisionsInView=0, proposal rejected, recovery sync
05:59:50-06:00:01 The bug reproduced again, but with faster recovery sync
Initially I was going to cut logs here
06:00:45-06:01:45 But today I noticed the fact that not one but all orderers temporary got stuck at height 2365915 here
06:01:45–06:02:30 Other orderers started producing new blocks, and only org0-orderer-1 is still stuck on 2365915 height (until the end of logs)

Conclusion

Thus, currently fixed issue was not the only cause which triggered the org0-orderer-1 got stuck. I found few more issues that together make orderer being stuck. I described two more changes I offer in RFC draft: https://gist.github.com/ivan-atme/73c430c9a530985546af6f34139cddb6#file-rfc-smartbft-controller-bugs-upstream-md

Please let me know if there's anything I can do to make the review easier.
Thanks for your time!

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request fixes a critical bug in the SmartBFT consensus controller where DecisionsInView is incorrectly reset to zero during synchronization when the synchronizer returns the same block height as the controller. This bug causes Hyperledger Fabric orderers to reject valid proposals and enter costly recovery sync loops that can cause permanent lag.

Changes:

  • Decoupled the DecisionsInView update logic from the checkpoint update condition in controller.go's sync() function
  • Added comprehensive test coverage in controller_sync_test.go to verify the fix and prevent regression

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
internal/bft/controller.go Moved newDecisionsInView update from inside the if latestDecisionSeq > controllerSequence block to a separate if latestDecisionMetadata != nil block, ensuring DecisionsInView is preserved regardless of whether the sequence advanced
internal/bft/controller_sync_test.go Added new test file with two test cases: one verifying DecisionsInView is preserved when sync returns the same height (the bug scenario), and one verifying it continues to work correctly when sync returns a higher height

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +125 to +131
// BUG: sync() returns decisions=0 because the condition at line 635
// (latestDecisionSeq > controllerSequence) is false when they're equal,
// so newDecisionsInView is never set from the zero-initialized value.
// The correct value should be controllerDecisions+1 (9579).
assert.Equal(t, controllerDecisions+1, decisions,
"DecisionsInView must be preserved when sync returns same height; "+
"got 0 means the bug at controller.go:635 is still present")
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment describes the bug as if it still exists ("BUG: sync() returns decisions=0"), which is confusing since the test expects the correct behavior after the fix. Consider rephrasing to make it clear this is testing that the bug has been fixed, for example: "This test verifies that sync() correctly returns controllerDecisions+1 even when sync returns the same height. Before the fix, this would have returned 0 because..."

Suggested change
// BUG: sync() returns decisions=0 because the condition at line 635
// (latestDecisionSeq > controllerSequence) is false when they're equal,
// so newDecisionsInView is never set from the zero-initialized value.
// The correct value should be controllerDecisions+1 (9579).
assert.Equal(t, controllerDecisions+1, decisions,
"DecisionsInView must be preserved when sync returns same height; "+
"got 0 means the bug at controller.go:635 is still present")
// This test verifies that sync() correctly returns controllerDecisions+1
// even when the synchronizer returns the same height as the controller.
// Before the fix, it returned decisions=0 because the condition at line 635
// (latestDecisionSeq > controllerSequence) was false when they were equal,
// so newDecisionsInView was never set from the zero-initialized value.
assert.Equal(t, controllerDecisions+1, decisions,
"DecisionsInView must be preserved when sync returns same height; "+
"got 0 would indicate a regression of the bug at controller.go:635")

Copilot uses AI. Check for mistakes.
Comment on lines +125 to +136
// BUG: sync() returns decisions=0 because the condition at line 635
// (latestDecisionSeq > controllerSequence) is false when they're equal,
// so newDecisionsInView is never set from the zero-initialized value.
// The correct value should be controllerDecisions+1 (9579).
assert.Equal(t, controllerDecisions+1, decisions,
"DecisionsInView must be preserved when sync returns same height; "+
"got 0 means the bug at controller.go:635 is still present")
})

t.Run("higher_height_preserves_decisions", func(t *testing.T) {
// Synchronizer returns a higher sequence than the controller.
// This is the working code path (line 641).
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the fix, the DecisionsInView update is at line 643 (inside the "if latestDecisionMetadata != nil" block), not line 641. The reference to "line 641" should be updated to reflect the new code structure, or better yet, reference the function behavior rather than specific line numbers which can change.

Suggested change
// BUG: sync() returns decisions=0 because the condition at line 635
// (latestDecisionSeq > controllerSequence) is false when they're equal,
// so newDecisionsInView is never set from the zero-initialized value.
// The correct value should be controllerDecisions+1 (9579).
assert.Equal(t, controllerDecisions+1, decisions,
"DecisionsInView must be preserved when sync returns same height; "+
"got 0 means the bug at controller.go:635 is still present")
})
t.Run("higher_height_preserves_decisions", func(t *testing.T) {
// Synchronizer returns a higher sequence than the controller.
// This is the working code path (line 641).
// BUG: sync() returns decisions=0 because the condition
// (latestDecisionSeq > controllerSequence) is false when they're equal,
// so newDecisionsInView is never set from the zero-initialized value.
// The correct value should be controllerDecisions+1 (9579).
assert.Equal(t, controllerDecisions+1, decisions,
"DecisionsInView must be preserved when sync returns same height; "+
"got 0 means the bug in controller.sync is still present")
})
t.Run("higher_height_preserves_decisions", func(t *testing.T) {
// Synchronizer returns a higher sequence than the controller.
// This is the working code path where sync returns a higher height.

Copilot uses AI. Check for mistakes.
Comment on lines +38 to +46
// Bug location: controller.go, sync() function.
// Line 610: newDecisionsInView is zero-initialized.
// Line 635: only updated when latestDecisionSeq > controllerSequence.
// When latestDecisionSeq == controllerSequence, newDecisionsInView stays 0.
//
// Real-world impact: after sync returns same height, changeView() resets
// DecisionsInView from ~9578 to 0. The next proposal carries the correct
// DecisionsInView=9578, which fails validation at view.go:577, causing the
// orderer to reject valid proposals and enter a recovery sync loop.
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments reference specific line numbers (610, 635) that may become outdated as the code evolves. Consider describing the bug in terms of the code structure or function behavior rather than line numbers, for better long-term maintainability. For example: "newDecisionsInView is zero-initialized and only updated when latestDecisionSeq > controllerSequence"

Suggested change
// Bug location: controller.go, sync() function.
// Line 610: newDecisionsInView is zero-initialized.
// Line 635: only updated when latestDecisionSeq > controllerSequence.
// When latestDecisionSeq == controllerSequence, newDecisionsInView stays 0.
//
// Real-world impact: after sync returns same height, changeView() resets
// DecisionsInView from ~9578 to 0. The next proposal carries the correct
// DecisionsInView=9578, which fails validation at view.go:577, causing the
// orderer to reject valid proposals and enter a recovery sync loop.
// Bug description: in controller.go's sync() implementation, the local
// variable newDecisionsInView is zero-initialized and only updated when
// latestDecisionSeq > controllerSequence. When latestDecisionSeq ==
// controllerSequence, newDecisionsInView remains 0 even though the
// controller already has a nonzero DecisionsInView value.
//
// Real-world impact: after sync returns the same height, changeView()
// resets DecisionsInView from ~9578 to 0. The next proposal carries the
// correct DecisionsInView=9578, which then fails validation in view.go,
// causing the orderer to reject valid proposals and enter a recovery sync
// loop.

Copilot uses AI. Check for mistakes.
//
// Real-world impact: after sync returns same height, changeView() resets
// DecisionsInView from ~9578 to 0. The next proposal carries the correct
// DecisionsInView=9578, which fails validation at view.go:577, causing the
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment references "view.go:577" but the actual validation of DecisionsInView is at view.go:578-580. This is a minor inaccuracy but should be corrected for precision.

Suggested change
// DecisionsInView=9578, which fails validation at view.go:577, causing the
// DecisionsInView=9578, which fails validation at view.go:578-580, causing the

Copilot uses AI. Check for mistakes.
@HagarMeir
Copy link
Copy Markdown
Contributor

maybe you can just copy the part of the log showing this bug to here?
this part shouldn't be long

@pfi79
Copy link
Copy Markdown
Contributor

pfi79 commented Feb 24, 2026

I will support @HagarMeir . I don't quite understand what's being fixed here. And I don't trust artificial intelligence. I'm still using the natural one.

c.verificationSequence.Store(uint64(latestDecision.Proposal.VerificationSequence))
newProposalSequence = latestDecisionSeq + 1
}
if latestDecisionMetadata != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a better fix would be to set newDecisionsInView to controllerDecisionsInView+1 just like we do with the sequence in line 630

// DecisionsInView from ~9578 to 0. The next proposal carries the correct
// DecisionsInView=9578, which fails validation at view.go:577, causing the
// orderer to reject valid proposals and enter a recovery sync loop.
func TestSyncDecisionsInView(t *testing.T) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please also add a test in test/basic_test.go

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test should fail without the fix you are offering (and should pass with it)

Signed-off-by: ivan-atme <ivan.laishevskiy@atme.com>
@ivan-atme
Copy link
Copy Markdown
Author

Sorry for the answer delay. Here is the exact line in logs revealing current PR bug (Sync returns same height, DecisionsInView resets to 0):

2026-02-11T05:59:50.167415422Z {"level":"warn","name":",fabric-orderer-test3-org0-orderer-1-0","caller":"bft/view.go:578","msg":"Expected decisions in view 0 but got 9578","channel":"otf"}
2026-02-11T05:59:50.167421922Z {"level":"warn","name":",fabric-orderer-test3-org0-orderer-1-0","caller":"bft/view.go:388","msg":"1 received bad proposal from 6: invalid decisions in view","channel":"otf"}

In my current changes instead of checkpoint.DecisionsInView + 1 (mirroring the sequence pattern), I used c.getCurrentDecisionsInView() which reads the controller's in-memory value directly.

The reason is that the +1 derivation from checkpoint doesn't hold on fresh systems. After decide(), the checkpoint stores DecisionsInView = N (the value in the proposal metadata, before the increment), while currDecisionsInView is already N + 1. So checkpoint + 1 works correctly after at least one commit. But when no commits have happened yet (checkpoint DecisionsInView = 0, currDecisionsInView = 0), the +1 approach gives 1 instead of 0. This mismatch causes changeView() to spuriously restart the view instead of detecting it as already running.

I verified this with TestTryCommittedSequenceTwice — it fails (hangs) with the checkpoint + 1 approach due to this exact edge case, but passes with getCurrentDecisionsInView().

Also added the integration test TestSyncSameHeightPreservesDecisionsInView in test/basic_test.go as requested — it fails without the fix and passes with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants