refactor: use beacon chain deadlines for QBFT instance cleanup by diegomrsantos · Pull Request #719 · sigp/anchor

diegomrsantos · 2025-10-25T16:18:27Z

Problem, Evidence, and Context (Required)

qbft_manager was cleaning instances up on a fixed slot-based TTL instead of the actual duty validity window, which could remove live or already-decided instances before their beacon-chain deadline.
For SyncCommittee, the previous same-slot cleanup was also tighter than the next-slot inclusion window that the client already models by starting consensus at 2/3 of the slot.
This is worth doing now because it addresses the behavior reported in #720 and keeps the PR scoped to a bounded qbft_manager lifecycle improvement.
Evidence: the previous retention model was based on an internal slot count rather than per-role beacon-chain inclusion windows.
Relevant links: #720, #914 for the pre-existing late-caller race that remains out of scope here, and #916 for the broader timing-policy follow-up.

Change Overview (Required)

Replace fixed slot-based cleanup in qbft_manager with role-specific beacon-chain deadlines.
Keep completed instances registered until their deadline instead of dropping them immediately.
Keep sync-committee instances alive through the next slot rather than deleting them at the slot boundary.
Add and tighten retention-focused tests so they prove the instance actually completes before asserting it remains registered.
Add end-to-end sync-committee coverage using the real late-in-slot start offset and deadline cleanup path.
Reviewer reading order: start with the deadline-based retention change, then the sync-committee deadline adjustment, then the retention tests.
Intentionally did not change validator/message-validation timing policy outside qbft_manager, and did not try to solve the pre-existing late-caller race in #914. The broader timing-policy alignment work is tracked separately in #916.

Risks, Trade-offs, and Mitigations (Required)

Main risk is keeping QBFT instances alive longer than before and relying on deadline-based cleanup instead of the old fixed TTL.
Trade-off: this improves qbft_manager retention semantics without yet centralizing all timing policy across other components.
Risk is mitigated by keeping the behavior change scoped to qbft_manager, aligning SyncCommittee cleanup with the modeled next-slot inclusion window, and adding deadline-focused coverage, including an end-to-end sync-committee scenario.

Validation (Required)

cargo test -p qbft_manager
cargo fmt --all --check
make lint
Commit hooks on the follow-up commits also ran:
- cargo fmt --all
- cargo clippy --all
- cargo sort workspace

Rollback (Required for behavior or runtime changes; optional otherwise)

Safe to revert by reverting this PR's commits.
No schema, config, or data migration impact.
Revert would restore the old slot-based cleanup behavior.

Blockers / Dependencies (Optional)

N/A

Additional Info / Next Steps (Optional)

The late-caller race is tracked separately in #914.
The broader timing-policy/source-of-truth follow-up is tracked in #916.
The tester guidance now explicitly prefers #[tokio::test(start_paused = true)] when simulated time is safe, which matches the testing approach used in the new retention tests.

anchor/qbft_manager/src/tests.rs

anchor/qbft_manager/src/lib.rs

dknopik · 2025-12-09T13:47:31Z

anchor/qbft_manager/src/lib.rs

+                // Branch 1: Instance completed - clean immediately
+                Some(id) = completion_rx.recv() => {
+                    match id {
+                        InstanceId::BeaconVote(id) => {
+                            self.beacon_vote_instances.remove(&id);
+                        }
+                        InstanceId::ValidatorConsensus(id) => {
+                            self.validator_consensus_data_instances.remove(&id);
+                        }
+                    }
+                }


There is a problem with this approach:

In theory, there might be a race condition where some tasks try to register their oneshot channel to an instance after it has completed. This might e.g. be the case if multiple validator attestation duties wait for the same committee instance. If we start the instance late (e.g. because of a struggling BN), the first thread will start the instance, which might complete immediately due to replayed messages, giving no opportunity for the other tasks to register their listeners. This is why the current code cleans up at a fixed time regardless of completion.

Instead, we could delay the cleanup in this branch to give callers some time (until the end of the next slot?) to fetch the instance result. Wdyt?

I need more context to understand what's described in the first paragraph.

Good catch. This race exists, but it is pre-existing, not introduced by this PR.

Before this PR, the instance task stays alive in Decided state and the registry entry lingers until the slot-based cleaner removes it. Once the cleaner drops the entry, the tx is dropped, rx.recv() returns None, and the task exits. A late caller after that point hits Vacant, spawns a new instance, and hangs until timeout. The grace window is larger under the old cleanup scheme, but the underlying late-caller behavior is still there.

This PR does make that behavior easier to hit by breaking out of the loop on Decided and removing the entry immediately via completion notification.

The broader issue is that independent code paths can call decide_instance for the same CommitteeInstanceId at different times. Grouping validators per committee, as in #834, improves that by reducing duplicate local callers within a duty path. Applying the same pattern to sync committee signing would help for the same reason.

But that change alone does not fully eliminate the late-caller behavior; that still depends on the cleanup / lifecycle semantics in qbft_manager.
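The late-caller path described above can be sketched with a plain map. This is only an illustration: `Registry`, `get_or_spawn`, and `cleanup` are hypothetical stand-ins, not the qbft_manager API, and the "instance" is reduced to a generation counter so the respawn is visible.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical stand-in for the manager's registry: instance id -> spawn generation.
#[derive(Default)]
struct Registry {
    instances: HashMap<u64, u32>,
    spawned: u32,
}

impl Registry {
    /// Returns the generation of the instance serving `id`, spawning one if absent.
    fn get_or_spawn(&mut self, id: u64) -> u32 {
        match self.instances.entry(id) {
            // An existing entry is reused, so the caller reaches the same instance.
            Entry::Occupied(entry) => *entry.get(),
            // A missing entry means a fresh instance is started for this caller.
            Entry::Vacant(entry) => {
                self.spawned += 1;
                *entry.insert(self.spawned)
            }
        }
    }

    /// Models the cleaner dropping the registry entry.
    fn cleanup(&mut self, id: u64) {
        self.instances.remove(&id);
    }
}

fn main() {
    let mut registry = Registry::default();
    let first = registry.get_or_spawn(7);
    let same = registry.get_or_spawn(7);
    registry.cleanup(7);
    // A caller arriving after cleanup hits the Vacant branch and gets a new instance.
    let late = registry.get_or_spawn(7);
    println!("first={first} same={same} late={late}");
}
```

In the sketch, the caller that arrives after `cleanup` is served by a different instance than the earlier callers, which is the situation the comment describes as hanging until timeout.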

We could modify ManagedInstance to either hold a channel sender to a running instance, or a resulting value D obtained from a finished instance to accommodate late callers. This would allow us to clean up the instance as soon as it is finished by dropping the sender and storing the finished value. I am unsure how long the resulting value should be kept - 1 slot?

As you said, the underlying root cause is the possibility of late callers, but I am unsure how to prevent this without a major refactor.
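One possible shape for that suggestion, as a rough sketch only. The variant names, the `String` message type, and the use of `std::sync::mpsc` are assumptions made for illustration; the real ManagedInstance and its channel types are not shown in this thread.

```rust
use std::sync::mpsc::{channel, Sender};

/// Hypothetical shape: either a handle to a running instance or its stored result.
enum ManagedInstance<D> {
    Running(Sender<String>),
    Finished(D),
}

impl<D: Clone> ManagedInstance<D> {
    /// Forward a message to the running instance; returns false once finished.
    fn forward(&self, msg: &str) -> bool {
        match self {
            ManagedInstance::Running(tx) => tx.send(msg.to_owned()).is_ok(),
            ManagedInstance::Finished(_) => false,
        }
    }

    /// Store the decided value. Overwriting `self` drops the sender,
    /// which closes the channel so the instance task can exit.
    fn finish(&mut self, value: D) {
        *self = ManagedInstance::Finished(value);
    }

    /// A late caller gets the stored value if the instance has already finished.
    fn decided(&self) -> Option<D> {
        match self {
            ManagedInstance::Running(_) => None,
            ManagedInstance::Finished(value) => Some(value.clone()),
        }
    }
}

fn main() {
    let (tx, rx) = channel::<String>();
    let mut managed: ManagedInstance<u64> = ManagedInstance::Running(tx);
    assert!(managed.forward("prepare"));
    assert_eq!(rx.recv().unwrap(), "prepare");

    managed.finish(42);
    // The sender is gone, so the receiving side observes a closed channel.
    assert!(rx.recv().is_err());
    assert_eq!(managed.decided(), Some(42));
}
```

How long the `Finished` value should be retained is the open question raised above; the sketch does not answer it.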

Good suggestion. What we can do in this PR is remove the immediate cleanup on completion (Branch 1 in the cleaner) and the break after Decided in the instance task, so the instance stays alive until its beacon chain deadline expires. The Decided state already handles late Initialize messages by returning the cached result (see instance.rs lines 79-83), so late callers would get the decided value immediately.

The pre-existing late-caller race (where callers arrive after the entry is removed entirely) is a separate concern. I'll create an issue to track that for a follow-up PR.

Adds test to verify QBFT Committee instances can reach late rounds (9+) as configured with max_round=12. The test creates a Committee instance, forces round changes by keeping operators offline, then advances through multiple slots while verifying the instance survives to reach round 10. Currently fails - instance is cleaned up after 2 slots, reaching round 9 but unable to complete it (needs 120s, gets 8s).

Replace slot-based cleanup with duty-specific beacon chain inclusion deadlines. This allows QBFT instances to progress through all configured rounds without premature removal.

Key changes:
- Separate instance identity from manager metadata using ManagedInstance wrapper
- Calculate duty-specific deadlines per EIP-7045 (attestations valid until end of epoch E+1)
- Add slots_per_epoch configuration parameter
- Implement dual-trigger cleaner (completion notification + deadline timeout)

Fixes instances being cleaned after 2 slots, now properly respecting beacon chain inclusion windows (32-63 slots for attestations).

Improve test readability by applying Setup/Execute/Assert structure:
- Replace magic numbers with named constants (SINGLE_INSTANCE, TWO_INSTANCES, etc.)
- Add mandatory section comments (// SETUP, // EXECUTE, // ASSERT) to all new tests
- Split oversized test_role_based_deadline_calculations into 6 focused tests (one per role)
- Add descriptive assertion messages explaining what must be true
- Named all literals in new tests (OLD_CLEANUP_SLOT, BEACON_DEADLINE_SLOT, etc.)

All 23 tests pass (up from 18 due to role deadline test split).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Add detailed explanation for Committee/Aggregator deadline calculation:
- Document the calculation formula: (E+2) * slots_per_epoch - 1
- Explain that this represents the last slot for on-chain inclusion
- Reference EIP-7045 specification

Enhance ManagedInstance documentation:
- Convert to doc comment for better API documentation
- Clarify that it tracks both channel and beacon chain deadline
- Explain its role in the cleanup task

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Resolve conflicts in qbft_manager/src/lib.rs:
- Keep deadline calculation code from this branch
- Use upstream's fixed comment typo ("Generate" not "General")
- Drop redundant with_quorum_size (default already matches)
- Use pre-extracted role variable for with_max_rounds

…rounds

diegomrsantos · 2026-03-26T17:54:38Z

@claude do a final review

diegomrsantos · 2026-03-26T18:48:30Z

@claude It looks like the latest versions of both branches weren't used in your review. Do it again

claude-code-actions-sigp · 2026-03-26T18:49:00Z

Claude finished @diegomrsantos's task in 2m 19s

Final Review of PR #719

- Read the full diff against origin/unstable
- Review qbft_manager/src/lib.rs changes
- Review qbft_manager/src/instance.rs changes
- Review qbft_manager/src/tests.rs changes
- Review tester-subagent.md changes
- Post final review

Summary

The PR replaces fixed slot-based cleanup (QBFT_RETAIN_SLOTS = 1) with role-specific beacon-chain deadlines computed by calculate_deadline. Decided instances now stay alive in the registry until the cleaner removes them at deadline expiry, serving late Initialize messages via the existing Decided handler (instance.rs:79-83).

Changes are well-scoped to qbft_manager with good test coverage across deadline calculation, retention semantics, epoch boundaries, and late-round survival.

Issues Found

1. `pass_to_instance` now spawns instances on inbound network messages (behavioral change)

lib.rs:378-402 — In the base code, pass_to_instance already called get_or_spawn_instance, so the spawning behavior on network messages is not new — it existed before this PR. The only change here is adding the deadline parameter to the existing call. I retract the concern from the prior review; this is not a behavioral change introduced by this PR.

2. `Proposer` / `SyncCommittee` same-slot deadline: confirm this is intentional

lib.rs:61-63 — Proposer and SyncCommittee get deadline = slot. The cleaner runs at the start of each new slot and checks managed.deadline >= current_slot. When current_slot becomes slot + 1, the instance is removed.

For Proposer this is correct — proposals are strictly slot-bound.

For SyncCommittee, sync committee contributions are aggregated at slot_start + 2/3 * slot_duration, and the aggregate is included in the next slot's block. The same-slot deadline means the instance will be cleaned at the start of slot + 1, which should be fine since the QBFT consensus for a sync committee contribution should complete well within the slot. However, if there's any risk of the contribution consensus spanning past the slot boundary (e.g., due to round changes), the instance would be cleaned before completion. This seems acceptable for now given the PR scope, but worth monitoring.

Verdict: non-blocking — the comment in the code ("Must be in the same slot") correctly documents the intent.
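The per-slot retention check described in this item can be sketched as follows. This is a simplified model, not the code under review: `Managed` and `clean` are hypothetical names, the map is keyed by a label instead of a real instance id, and only the `deadline >= current_slot` condition is taken from the review text.

```rust
use std::collections::HashMap;

/// Hypothetical registry entry: only the deadline matters for this sketch.
struct Managed {
    deadline: u64,
}

/// Keep entries whose deadline has not passed; returns how many were removed.
fn clean(instances: &mut HashMap<&'static str, Managed>, current_slot: u64) -> usize {
    let before = instances.len();
    instances.retain(|_, managed| managed.deadline >= current_slot);
    before - instances.len()
}

fn main() {
    let slot = 100;
    let mut instances = HashMap::new();
    // Same-slot deadline, as described for Proposer in this review.
    instances.insert("proposer", Managed { deadline: slot });
    // A later deadline, standing in for a role with a longer inclusion window.
    instances.insert("committee", Managed { deadline: 159 });

    // Nothing is removed while the duty slot is still current.
    assert_eq!(clean(&mut instances, slot), 0);
    // At the start of the next slot the same-slot entry is removed.
    assert_eq!(clean(&mut instances, slot + 1), 1);
    assert!(instances.contains_key("committee"));
}
```

With a same-slot deadline, an instance still running round changes at the slot boundary would be dropped by the first cleaner pass of `slot + 1`, which is the risk flagged above.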

Observations (non-blocking)

- calculate_deadline is correct for EIP-7045: (epoch + 2) * slots_per_epoch - 1 correctly yields the last slot of epoch E+1 for attestations in epoch E. The comment is clear and well-documented.
- VoluntaryExit / ValidatorRegistration deadline (slot + slots_per_epoch): This is a reasonable heuristic. These aren't time-critical operations, and one epoch provides a generous window.
- instance.rs change is minimal and correct: the only change is adding a comment on the RecvResult::Closed branch (instance.rs:360). The break after Closed already existed in the base. The comment clarifies that no notification is needed because either the Decided transition already sent it, or the cleaner dropped the registry entry (which closes rx).
- Test quality is good: tests use start_paused = true with tokio::time::advance() + yield_now() for deterministic timer control. The deadline calculation unit tests cover all roles. The integration tests (test_committee_can_reach_late_rounds, test_cleanup_removes_only_expired_instances, test_cleanup_across_epoch_boundary) verify the actual retention behavior end-to-end.
- Clone bound addition on QbftDecidable::Id (lib.rs:460): Required because get_or_spawn_instance now needs to clone the id for map.entry(id.clone()) since it's used after the entry call. This is a minor trait signature change that correctly supports the new ManagedInstance insertion pattern.
- Late-caller race acknowledged as pre-existing and out of scope per the review discussion thread and PR description, tracked in #914 (fix: late-caller race in QBFT instance lifecycle).
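As a worked example of the formula in the first observation, here is a small sketch. The function name `attestation_deadline` is made up for illustration and is not the signature of calculate_deadline in the PR; 32 slots per epoch is the mainnet value.

```rust
/// Last slot of epoch E+1 for a duty in epoch E: (E + 2) * slots_per_epoch - 1.
fn attestation_deadline(slot: u64, slots_per_epoch: u64) -> u64 {
    let epoch = slot / slots_per_epoch;
    (epoch + 2) * slots_per_epoch - 1
}

fn main() {
    let slots_per_epoch = 32;
    // First, middle, and last slot of epoch 2.
    for slot in [64, 80, 95] {
        let deadline = attestation_deadline(slot, slots_per_epoch);
        println!(
            "slot {slot}: deadline {deadline}, window {} slots",
            deadline - slot
        );
    }
}
```

Slots 64 and 95 both belong to epoch 2, so both get deadline 127. The remaining window is 63 slots for the first slot of the epoch and 32 for the last, which matches the 32-63 slot range quoted in the commit message.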

Verdict

No blocking issues. The core changes are correct and well-tested. The only item to keep an eye on is the SyncCommittee same-slot deadline if round changes push sync committee consensus past the slot boundary — but this is acceptable for the current scope.

diegomrsantos · 2026-03-26T20:03:45Z

Follow-up to Claude's note about the SyncCommittee same-slot deadline: this is now fixed on the PR branch.

0968f4950 changes SyncCommittee cleanup to keep instances alive through the next-slot inclusion window while leaving Proposer same-slot.
0968f4950 also adds an end-to-end paused-time test that uses the real late-in-slot start offset, crosses the first slot boundary without quorum, then proves the instance still reaches consensus and is cleaned at the following deadline boundary.
1035f58bd fixes the resulting CI clippy::clone_on_copy failure in the new test helper.
2681708e8 rewrites the new sync-committee test into a clearer Arrange / Act / Assert structure.

Validation after the follow-ups:

make lint
cargo test -p qbft_manager
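The difference between the two roles after this follow-up can be illustrated with a short sketch. It assumes that "alive through the next slot" corresponds to a deadline of `slot + 1` under the `deadline >= current_slot` check quoted in the review; the actual value chosen in 0968f4950 is not shown in this thread, and `Role`, `deadline`, and `is_retained` are hypothetical names.

```rust
/// Only the two roles discussed in the follow-up are modeled here.
enum Role {
    Proposer,
    SyncCommittee,
}

/// Deadline slot for a duty at `slot`.
fn deadline(role: Role, slot: u64) -> u64 {
    match role {
        // Proposals stay strictly slot-bound.
        Role::Proposer => slot,
        // Assumption: next-slot inclusion window modeled as slot + 1.
        Role::SyncCommittee => slot + 1,
    }
}

/// Retention check as quoted in the review: keep while deadline >= current_slot.
fn is_retained(deadline: u64, current_slot: u64) -> bool {
    deadline >= current_slot
}

fn main() {
    let slot = 100;
    let proposer = deadline(Role::Proposer, slot);
    let sync = deadline(Role::SyncCommittee, slot);

    // Crossing the first slot boundary removes the proposer instance only.
    assert!(!is_retained(proposer, slot + 1));
    assert!(is_retained(sync, slot + 1));
    // The sync-committee instance is removed at the following boundary.
    assert!(!is_retained(sync, slot + 2));
}
```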


claude-code-actions-sigp bot reviewed Oct 25, 2025 (four reviews, each commenting on anchor/qbft_manager/src/tests.rs; the first is marked outdated)

diegomrsantos mentioned this pull request Oct 25, 2025

QBFT instances cleaned up too aggressively, preventing late rounds #720

Open

diegomrsantos changed the base branch from stable to unstable October 25, 2025 16:35

diegomrsantos force-pushed the test/qbft-late-rounds branch from 4cd49f3 to 1d995b4 Compare October 28, 2025 20:47

diegomrsantos marked this pull request as draft October 28, 2025 20:48

diegomrsantos force-pushed the test/qbft-late-rounds branch 2 times, most recently from 2cc3d0d to 4767f9d Compare October 28, 2025 23:26

diegomrsantos added the claude-recheck triggers claude review workflow to re-run label Oct 29, 2025

diegomrsantos self-assigned this Oct 29, 2025

diegomrsantos added QBFT and removed claude-recheck triggers claude review workflow to re-run labels Oct 29, 2025

diegomrsantos marked this pull request as ready for review October 29, 2025 19:23


claude-code-actions-sigp bot reviewed Oct 29, 2025 (four reviews: two on anchor/qbft_manager/src/tests.rs, both marked outdated, and two on anchor/qbft_manager/src/lib.rs, the second marked outdated)

dknopik added the v1.2.0 label Nov 7, 2025

diegomrsantos changed the title ~~test: add test for Committee instances reaching late rounds~~ refactor: use beacon chain deadlines for QBFT instance cleanup Nov 12, 2025

diegomrsantos requested a review from dknopik November 12, 2025 13:08

dknopik removed the v1.2.0 label Dec 9, 2025

dknopik reviewed Dec 9, 2025


diegomrsantos and others added 4 commits March 9, 2026 23:35

style: apply formatting after rebase

8347f06

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

diegomrsantos force-pushed the test/qbft-late-rounds branch from 739902c to 8347f06 Compare March 9, 2026 23:19

diegomrsantos marked this pull request as draft March 10, 2026 21:47

diegomrsantos added 2 commits March 25, 2026 14:34

Merge remote-tracking branch 'upstream/unstable' into test/qbft-late-…

c0d4c92

…rounds

diegomrsantos force-pushed the test/qbft-late-rounds branch from de8b9d5 to c0d4c92 Compare March 26, 2026 12:28

diegomrsantos mentioned this pull request Mar 26, 2026

fix: late-caller race in QBFT instance lifecycle #914

Open

diegomrsantos added 4 commits March 26, 2026 15:21

qbft_manager: keep instances alive until deadline

261537a

test: tighten qbft_manager retention coverage

caaf7b4

chore: satisfy qbft_manager check-fmt

021acee

docs: prefer paused tokio time in tester guidance

7b26f65

diegomrsantos mentioned this pull request Mar 26, 2026

centralize QBFT timing policy across duties, validation, and retention #916

Open


diegomrsantos added 4 commits March 26, 2026 20:19

docs: generalize tester invariant guidance

e1495f2

qbft_manager: align sync committee deadline

0968f49

test: fix clippy clone-on-copy

1035f58

test: clarify sync committee coverage

2681708

diegomrsantos added 2 commits March 26, 2026 21:04

test: satisfy rustfmt

051656e

Merge branch 'unstable' into test/qbft-late-rounds

0f1a28a

diegomrsantos mentioned this pull request Mar 27, 2026

test: clarify qbft_manager harness setup #917

Draft



