refactor: use beacon chain deadlines for QBFT instance cleanup #719

diegomrsantos wants to merge 17 commits into sigp:unstable
Conversation
anchor/qbft_manager/src/lib.rs
```rust
// Branch 1: Instance completed - clean immediately
Some(id) = completion_rx.recv() => {
    match id {
        InstanceId::BeaconVote(id) => {
            self.beacon_vote_instances.remove(&id);
        }
        InstanceId::ValidatorConsensus(id) => {
            self.validator_consensus_data_instances.remove(&id);
        }
    }
}
```
There is a problem with this approach:
In theory, there might be a race condition where some tasks try to register their oneshot channel to an instance after it has completed. This might e.g. be the case if multiple validator attestation duties wait for the same committee instance. If we start the instance late (e.g. because of a struggling BN), the first thread will start the instance, which might complete immediately due to replayed messages, giving no opportunity for the other tasks to register their listeners. This is why the current code cleans up at a fixed time regardless of completion.
Instead, we could delay the cleanup in this branch - giving some time (until the end of the next slot?) for other tasks to get the instance result. Wdyt?
I need more context to understand what's described in the first paragraph.
Good catch. This race exists, but it is pre-existing, not introduced by this PR.
Before this PR, the instance task stays alive in Decided state and the registry entry lingers until the slot-based cleaner removes it. Once the cleaner drops the entry, the tx is dropped, rx.recv() returns None, and the task exits. A late caller after that point hits Vacant, spawns a new instance, and hangs until timeout. The grace window is larger under the old cleanup scheme, but the underlying late-caller behavior is still there.
This PR does make that behavior easier to hit by breaking out of the loop on Decided and removing the entry immediately via completion notification.
The broader issue is that independent code paths can call decide_instance for the same CommitteeInstanceId at different times. Grouping validators per committee, as in #834, improves that by reducing duplicate local callers within a duty path. Applying the same pattern to sync committee signing would help for the same reason.
But that change alone does not fully eliminate the late-caller behavior; that still depends on the cleanup / lifecycle semantics in qbft_manager.
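To make the late-caller path concrete, here is a minimal sketch (all names hypothetical, not the actual `qbft_manager` API): once the cleaner has removed the entry, a late caller finds the slot vacant and spawns a fresh instance that can never decide, so it waits until timeout.

```rust
use std::collections::HashMap;

// Hypothetical outcome of a registry lookup; the real registry maps
// instance ids to channel senders.
#[derive(Debug, PartialEq)]
enum Lookup {
    Existing, // joined a live (or still-cached) instance
    Spawned,  // late caller: entry was gone, a fresh instance was spawned
}

// Sketch of the join-or-spawn path a caller takes.
fn join_or_spawn(registry: &mut HashMap<u64, ()>, id: u64) -> Lookup {
    if registry.contains_key(&id) {
        Lookup::Existing
    } else {
        // The decided value is lost with the removed entry, so this
        // new instance has no peers replaying it and hangs until timeout.
        registry.insert(id, ());
        Lookup::Spawned
    }
}
```

The earlier the cleaner drops the entry, the larger the window in which callers hit the `Spawned` branch.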
We could modify ManagedInstance to either hold a channel sender to a running instance, or a resulting value D obtained from a finished instance to accommodate late callers. This would allow us to clean up the instance as soon as it is finished by dropping the sender and storing the finished value. I am unsure how long the resulting value should be kept - 1 slot?
As you said, the underlying root cause is the possibility of late callers, but I am unsure how to prevent this without a major refactor.
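A rough sketch of that shape (hypothetical names, std channel instead of tokio): the registry entry either holds a sender to a running instance or the value it decided on, kept around briefly for late callers.

```rust
use std::sync::mpsc::Sender;

// Hypothetical: either a live instance's message sender, or the
// decided value retained after the instance finished.
enum ManagedInstance<M, D> {
    Running(Sender<M>),
    Decided(D),
}

impl<M, D: Clone> ManagedInstance<M, D> {
    // Late callers read the cached value; callers on a running
    // instance get None and register on its channel instead.
    fn decided_value(&self) -> Option<D> {
        match self {
            ManagedInstance::Running(_) => None,
            ManagedInstance::Decided(v) => Some(v.clone()),
        }
    }
}
```

Dropping the `Running` sender on completion still lets the instance task exit, while the `Decided` arm keeps the result available for whatever retention window (one slot?) is chosen.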
Good suggestion. What we can do in this PR is remove the immediate cleanup on completion (Branch 1 in the cleaner) and the break after Decided in the instance task — so the instance stays alive until its beacon chain deadline expires. The Decided state already handles late Initialize messages by returning the cached result (see instance.rs line 79-83), so late callers would get the decided value immediately.
The pre-existing late-caller race (where callers arrive after the entry is removed entirely) is a separate concern — I'll create an issue to track that for a follow-up PR.
Adds a test to verify QBFT Committee instances can reach late rounds (9+) as configured with max_round=12. The test creates a Committee instance, forces round changes by keeping operators offline, then advances through multiple slots while verifying the instance survives to reach round 10. Currently fails: the instance is cleaned up after 2 slots, reaching round 9 but unable to complete it (needs 120s, gets 8s).
Replace slot-based cleanup with duty-specific beacon chain inclusion deadlines. This allows QBFT instances to progress through all configured rounds without premature removal.

Key changes:
- Separate instance identity from manager metadata using a ManagedInstance wrapper
- Calculate duty-specific deadlines per EIP-7045 (attestations valid until end of epoch E+1)
- Add a slots_per_epoch configuration parameter
- Implement a dual-trigger cleaner (completion notification + deadline timeout)

Fixes instances being cleaned after 2 slots; beacon chain inclusion windows (32-63 slots for attestations) are now properly respected.
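The dual-trigger cleaner can be sketched roughly like this (hypothetical names; the real loop is an async select over a completion channel and a timer): completed instances are removed immediately, and everything else only once the chain moves past its inclusion deadline.

```rust
use std::collections::HashMap;

// Hypothetical per-instance metadata: the beacon-chain deadline slot.
struct Managed {
    deadline_slot: u64,
}

// One cleaner pass: completed ids go immediately; remaining entries
// are kept until the current slot passes their deadline.
fn clean(instances: &mut HashMap<u64, Managed>, completed: &[u64], current_slot: u64) {
    for id in completed {
        instances.remove(id);
    }
    instances.retain(|_, m| m.deadline_slot >= current_slot);
}
```

The two triggers are independent: completion keeps the map small in the common case, while the deadline sweep bounds memory for instances that never decide.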
Improve test readability by applying Setup/Execute/Assert structure:
- Replace magic numbers with named constants (SINGLE_INSTANCE, TWO_INSTANCES, etc.)
- Add mandatory section comments (// SETUP, // EXECUTE, // ASSERT) to all new tests
- Split the oversized test_role_based_deadline_calculations into 6 focused tests (one per role)
- Add descriptive assertion messages explaining what must be true
- Name all literals in new tests (OLD_CLEANUP_SLOT, BEACON_DEADLINE_SLOT, etc.)

All 23 tests pass (up from 18 due to the role deadline test split).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add a detailed explanation for the Committee/Aggregator deadline calculation:
- Document the calculation formula: (E+2) * slots_per_epoch - 1
- Explain that this represents the last slot for on-chain inclusion
- Reference the EIP-7045 specification

Enhance ManagedInstance documentation:
- Convert to a doc comment for better API documentation
- Clarify that it tracks both the channel and the beacon chain deadline
- Explain its role in the cleanup task

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
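Under EIP-7045, an attestation for a slot in epoch E remains includable through the end of epoch E+1, so the last valid slot works out to (E+2) * slots_per_epoch - 1. A hedged sketch of that arithmetic (function name hypothetical):

```rust
// Last slot at which an attestation from `duty_slot` can still be
// included on chain: the final slot of epoch E+1 (per EIP-7045).
fn attestation_deadline_slot(duty_slot: u64, slots_per_epoch: u64) -> u64 {
    let epoch = duty_slot / slots_per_epoch;
    (epoch + 2) * slots_per_epoch - 1
}
```

With 32 slots per epoch, this yields a retention window between 32 slots (duty in the epoch's last slot) and 63 slots (duty in its first slot), matching the 32-63 range cited in the commit message.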
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts in qbft_manager/src/lib.rs:
- Keep deadline calculation code from this branch
- Use upstream's fixed comment typo ("Generate" not "General")
- Drop redundant with_quorum_size (default already matches)
- Use pre-extracted role variable for with_max_rounds
@claude do a final review
@claude It looks like the latest versions of both branches weren't used in your review. Do it again
Claude finished @diegomrsantos's task in 2m 19s - Final Review of PR #719

Summary: The PR replaces fixed slot-based cleanup with duty-specific beacon chain deadlines; the changes are well-scoped to `qbft_manager`.

Issues Found:
Follow-up to Claude's note above.

Validation after the follow-ups:
Problem, Evidence, and Context (Required)
- `qbft_manager` was cleaning instances up on a fixed slot-based TTL instead of the actual duty validity window, which could remove live or already-decided instances before their beacon-chain deadline.
- For `SyncCommittee`, the previous same-slot cleanup was also tighter than the next-slot inclusion window that the client already models by starting consensus at `2/3` of the slot.
- This is a `qbft_manager` lifecycle improvement.

Change Overview (Required)
- Replaces fixed slot-based cleanup in `qbft_manager` with role-specific beacon-chain deadlines.
- Keeps the change scoped to `qbft_manager`, and does not try to solve the pre-existing late-caller race in #914. The broader timing-policy alignment work is tracked separately in #916.

Risks, Trade-offs, and Mitigations (Required)
- Changes `qbft_manager` retention semantics without yet centralizing all timing policy across other components.
- Mitigated by keeping the change scoped to `qbft_manager`, aligning `SyncCommittee` cleanup with the modeled next-slot inclusion window, and adding deadline-focused coverage, including an end-to-end sync-committee scenario.

Validation (Required)
- `cargo test -p qbft_manager`
- `cargo fmt --all --check`
- `make lint`
- `cargo fmt --all`
- `cargo clippy --all`
- `cargo sort workspace`

Rollback (Required for behavior or runtime changes; optional otherwise)
Blockers / Dependencies (Optional)
Additional Info / Next Steps (Optional)
- Use `#[tokio::test(start_paused = true)]` when simulated time is safe, which matches the testing approach used in the new retention tests.