refactor: use beacon chain deadlines for QBFT instance cleanup #719

Draft
diegomrsantos wants to merge 17 commits into sigp:unstable from diegomrsantos:test/qbft-late-rounds

@diegomrsantos
Member

@diegomrsantos diegomrsantos commented Oct 25, 2025

Problem, Evidence, and Context (Required)

  • qbft_manager was cleaning instances up on a fixed slot-based TTL instead of the actual duty validity window, which could remove live or already-decided instances before their beacon-chain deadline.
  • For SyncCommittee, the previous same-slot cleanup was also tighter than the next-slot inclusion window that the client already models by starting consensus at 2/3 of the slot.
  • This is worth doing now because it addresses the behavior reported in #720 and keeps the PR scoped to a bounded qbft_manager lifecycle improvement.
  • Evidence: the previous retention model was based on an internal slot count rather than per-role beacon-chain inclusion windows.
  • Relevant links: #720, #914 for the pre-existing late-caller race that remains out of scope here, and #916 for the broader timing-policy follow-up.

Change Overview (Required)

  • Replace fixed slot-based cleanup in qbft_manager with role-specific beacon-chain deadlines.
  • Keep completed instances registered until their deadline instead of dropping them immediately.
  • Keep sync-committee instances alive through the next slot rather than deleting them at the slot boundary.
  • Add and tighten retention-focused tests so they prove the instance actually completes before asserting it remains registered.
  • Add end-to-end sync-committee coverage using the real late-in-slot start offset and deadline cleanup path.
  • Reviewer reading order: start with the deadline-based retention change, then the sync-committee deadline adjustment, then the retention tests.
  • Intentionally did not change validator/message-validation timing policy outside qbft_manager, and did not try to solve the pre-existing late-caller race in #914. The broader timing-policy alignment work is tracked separately in #916.
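
For reviewers who want the retention rules in one place, here is a minimal sketch of role-specific deadlines as this PR describes them. The `Role` enum, the `deadline_slot` name, and the signature are assumptions for illustration and are not copied from `qbft_manager`; the `SyncCommittee` arm reflects my reading of "alive through the next slot" as an inclusive deadline of `slot + 1`.

```rust
/// Hypothetical duty roles; the real enum in the client may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Role {
    Committee,
    Aggregator,
    Proposer,
    SyncCommittee,
    VoluntaryExit,
    ValidatorRegistration,
}

/// Last slot (inclusive) through which an instance started at `slot` is kept.
/// Illustrative only: the name and signature are assumptions.
pub fn deadline_slot(role: Role, slot: u64, slots_per_epoch: u64) -> u64 {
    match role {
        // EIP-7045: attestations from epoch E stay includable until the
        // last slot of epoch E+1.
        Role::Committee | Role::Aggregator => {
            let epoch = slot / slots_per_epoch;
            (epoch + 2) * slots_per_epoch - 1
        }
        // Proposals are strictly bound to their own slot.
        Role::Proposer => slot,
        // Assumed: keep the instance through the next-slot inclusion window.
        Role::SyncCommittee => slot + 1,
        // Not time-critical; one epoch is used as a generous window.
        Role::VoluntaryExit | Role::ValidatorRegistration => slot + slots_per_epoch,
    }
}
```

With 32 slots per epoch, a Committee duty at slot 100 (epoch 3) would be retained through slot 159 under this sketch, while a Proposer duty at the same slot would expire with slot 100.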

Risks, Trade-offs, and Mitigations (Required)

  • Main risk is keeping QBFT instances alive longer than before and relying on deadline-based cleanup instead of the old fixed TTL.
  • Trade-off: this improves qbft_manager retention semantics without yet centralizing all timing policy across other components.
  • Risk is mitigated by keeping the behavior change scoped to qbft_manager, aligning SyncCommittee cleanup with the modeled next-slot inclusion window, and adding deadline-focused coverage, including an end-to-end sync-committee scenario.

Validation (Required)

  • cargo test -p qbft_manager
  • cargo fmt --all --check
  • make lint
  • Commit hooks on the follow-up commits also ran:
    • cargo fmt --all
    • cargo clippy --all
    • cargo sort workspace

Rollback (Required for behavior or runtime changes; optional otherwise)

  • Safe to revert by reverting this PR's commits.
  • No schema, config, or data migration impact.
  • Revert would restore the old slot-based cleanup behavior.

Blockers / Dependencies (Optional)

  • N/A

Additional Info / Next Steps (Optional)

  • The late-caller race is tracked separately in #914.
  • The broader timing-policy/source-of-truth follow-up is tracked in #916.
  • The tester guidance now explicitly prefers #[tokio::test(start_paused = true)] when simulated time is safe, which matches the testing approach used in the new retention tests.

@claude-code-actions-sigp

This comment was marked as outdated.

@diegomrsantos diegomrsantos changed the base branch from stable to unstable October 25, 2025 16:35
@diegomrsantos diegomrsantos marked this pull request as draft October 28, 2025 20:48
@diegomrsantos diegomrsantos force-pushed the test/qbft-late-rounds branch 2 times, most recently from 2cc3d0d to 4767f9d October 28, 2025 23:26
@diegomrsantos diegomrsantos added the claude-recheck label (triggers claude review workflow to re-run) Oct 29, 2025
@diegomrsantos diegomrsantos self-assigned this Oct 29, 2025
@diegomrsantos diegomrsantos added QBFT and removed claude-recheck labels Oct 29, 2025
@diegomrsantos diegomrsantos marked this pull request as ready for review October 29, 2025 19:23
@claude-code-actions-sigp

This comment was marked as outdated.

@dknopik dknopik added the v1.2.0 label Nov 7, 2025
@diegomrsantos diegomrsantos changed the title from "test: add test for Committee instances reaching late rounds" to "refactor: use beacon chain deadlines for QBFT instance cleanup" Nov 12, 2025
@dknopik dknopik removed the v1.2.0 label Dec 9, 2025
Comment on lines +342 to +352
    // Branch 1: Instance completed - clean immediately
    Some(id) = completion_rx.recv() => {
        match id {
            InstanceId::BeaconVote(id) => {
                self.beacon_vote_instances.remove(&id);
            }
            InstanceId::ValidatorConsensus(id) => {
                self.validator_consensus_data_instances.remove(&id);
            }
        }
    }

Member

There is a problem with this approach:

In theory, there might be a race condition where some tasks try to register their oneshot channel to an instance after it has completed. This might e.g. be the case if multiple validator attestation duties wait for the same committee instance. If we start the instance late (e.g. because of a struggling BN), the first thread will start the instance, which might complete immediately due to replayed messages, giving no opportunity for the other tasks to register their listeners. This is why the current code cleans up at a fixed time regardless of completion.

Instead, we could defer the cleanup in this branch to give callers some time (until the end of the next slot?) to get the instance result. Wdyt?

Member Author

I need more context to understand what's described in the first paragraph.

Member Author

@diegomrsantos diegomrsantos Mar 10, 2026

Good catch. This race exists, but it is pre-existing, not introduced by this PR.

Before this PR, the instance task stays alive in Decided state and the registry entry lingers until the slot-based cleaner removes it. Once the cleaner drops the entry, the tx is dropped, rx.recv() returns None, and the task exits. A late caller after that point hits Vacant, spawns a new instance, and hangs until timeout. The grace window is larger under the old cleanup scheme, but the underlying late-caller behavior is still there.

This PR does make that behavior easier to hit by breaking out of the loop on Decided and removing the entry immediately via completion notification.

The broader issue is that independent code paths can call decide_instance for the same CommitteeInstanceId at different times. Grouping validators per committee, as in #834, improves that by reducing duplicate local callers within a duty path. Applying the same pattern to sync committee signing would help for the same reason.

But that change alone does not fully eliminate the late-caller behavior; that still depends on the cleanup / lifecycle semantics in qbft_manager.
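
To make the channel-close sequence above concrete, here is a small sketch using `std::sync::mpsc` and a thread in place of the real async task. All names here (`spawn_instance`, `run_and_clean`, the `u32` message type) are hypothetical. Note that `std`'s `recv` returns `Err` on a closed channel, whereas the `rx.recv()` described above returns `None`; the lifecycle is the same.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Spawns a hypothetical instance task that counts messages until its
/// channel closes.
pub fn spawn_instance() -> (Sender<u32>, JoinHandle<usize>) {
    let (tx, rx) = channel::<u32>();
    let handle = thread::spawn(move || {
        let mut handled = 0;
        // `recv` fails once every sender is dropped, which ends the loop.
        while let Ok(_msg) = rx.recv() {
            handled += 1;
        }
        handled
    });
    (tx, handle)
}

/// Registers an instance, sends it one message, then removes the registry
/// entry the way a cleaner would. Returns the number of messages the task
/// handled and whether the id is still registered.
pub fn run_and_clean(id: u64) -> (usize, bool) {
    let mut registry: HashMap<u64, Sender<u32>> = HashMap::new();
    let (tx, handle) = spawn_instance();
    registry.insert(id, tx);
    registry[&id]
        .send(1)
        .expect("task is alive while the entry exists");
    // Removing the entry drops the only sender, which closes the channel.
    registry.remove(&id);
    let handled = handle.join().expect("task should exit cleanly");
    // A late caller would now find the entry vacant and spawn a new instance.
    (handled, registry.contains_key(&id))
}
```

After `run_and_clean` returns, the task has exited and the id is no longer registered, which is the state a late caller runs into.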

Member

We could modify ManagedInstance to either hold a channel sender to a running instance, or a resulting value D obtained from a finished instance to accommodate late callers. This would allow us to clean up the instance as soon as it is finished by dropping the sender and storing the finished value. I am unsure how long the resulting value should be kept - 1 slot?

As you said, the underlying root cause is the possibility of late callers, but I am unsure how to prevent this without a major refactor.
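
A rough sketch of what that could look like, purely as an illustration of the idea and not a proposal for the actual type. The variant names, the `keep_until_slot` field, and the helper methods are all hypothetical, and `std::sync::mpsc::Sender` stands in for whatever sender the manager really holds.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical registry entry: either a live instance or its cached result.
pub enum ManagedInstance<M, D> {
    /// Instance is still running; messages go through the sender.
    Running(Sender<M>),
    /// Instance finished; the value is kept for late callers.
    Finished { value: D, keep_until_slot: u64 },
}

impl<M, D: Clone> ManagedInstance<M, D> {
    /// Replaces the entry with the decided value. If it was `Running`,
    /// the old sender is dropped here, so the instance task can shut down.
    pub fn finish(&mut self, value: D, keep_until_slot: u64) {
        *self = ManagedInstance::Finished { value, keep_until_slot };
    }

    /// Returns the cached decision for a late caller, if there is one.
    pub fn decided(&self) -> Option<D> {
        match self {
            ManagedInstance::Running(_) => None,
            ManagedInstance::Finished { value, .. } => Some(value.clone()),
        }
    }

    /// True once a finished entry is past its retention slot.
    pub fn is_expired(&self, current_slot: u64) -> bool {
        matches!(
            self,
            ManagedInstance::Finished { keep_until_slot, .. }
                if *keep_until_slot < current_slot
        )
    }
}
```

The open question from above stays open in this sketch: `keep_until_slot` is just a parameter, so whether it should be one slot or something longer is not decided here.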

Member Author

Good suggestion. What we can do in this PR is remove the immediate cleanup on completion (Branch 1 in the cleaner) and the break after Decided in the instance task, so the instance stays alive until its beacon chain deadline expires. The Decided state already handles late Initialize messages by returning the cached result (see instance.rs lines 79-83), so late callers would get the decided value immediately.

The pre-existing late-caller race (where callers arrive after the entry is removed entirely) is a separate concern — I'll create an issue to track that for a follow-up PR.

diegomrsantos and others added 4 commits March 9, 2026 23:35
Adds test to verify QBFT Committee instances can reach late rounds
(9+) as configured with max_round=12.

The test creates a Committee instance, forces round changes by
keeping operators offline, then advances through multiple slots
while verifying the instance survives to reach round 10.

Currently fails - instance is cleaned up after 2 slots, reaching
round 9 but unable to complete it (needs 120s, gets 8s).

Replace slot-based cleanup with duty-specific beacon chain inclusion
deadlines. This allows QBFT instances to progress through all configured
rounds without premature removal.

Key changes:
- Separate instance identity from manager metadata using ManagedInstance wrapper
- Calculate duty-specific deadlines per EIP-7045 (attestations valid until end of epoch E+1)
- Add slots_per_epoch configuration parameter
- Implement dual-trigger cleaner (completion notification + deadline timeout)

Fixes instances being cleaned after 2 slots, now properly respecting
beacon chain inclusion windows (32-63 slots for attestations).

Improve test readability by applying Setup/Execute/Assert structure:

- Replace magic numbers with named constants (SINGLE_INSTANCE, TWO_INSTANCES, etc.)
- Add mandatory section comments (// SETUP, // EXECUTE, // ASSERT) to all new tests
- Split oversized test_role_based_deadline_calculations into 6 focused tests (one per role)
- Add descriptive assertion messages explaining what must be true
- Named all literals in new tests (OLD_CLEANUP_SLOT, BEACON_DEADLINE_SLOT, etc.)

All 23 tests pass (up from 18 due to role deadline test split).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Add detailed explanation for Committee/Aggregator deadline calculation:
- Document the calculation formula: (E+2) * slots_per_epoch - 1
- Explain that this represents the last slot for on-chain inclusion
- Reference EIP-7045 specification

Enhance ManagedInstance documentation:
- Convert to doc comment for better API documentation
- Clarify that it tracks both channel and beacon chain deadline
- Explain its role in the cleanup task

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@diegomrsantos diegomrsantos force-pushed the test/qbft-late-rounds branch from 739902c to 8347f06 March 9, 2026 23:19
@diegomrsantos diegomrsantos marked this pull request as draft March 10, 2026 21:47

Resolve conflicts in qbft_manager/src/lib.rs:
- Keep deadline calculation code from this branch
- Use upstream's fixed comment typo ("Generate" not "General")
- Drop redundant with_quorum_size (default already matches)
- Use pre-extracted role variable for with_max_rounds

@diegomrsantos
Member Author

@claude do a final review

@claude-code-actions-sigp

This comment was marked as outdated.

@diegomrsantos
Member Author

@claude It looks like the latest versions of both branches weren't used in your review. Do it again

@claude-code-actions-sigp

claude-code-actions-sigp bot commented Mar 26, 2026

Claude finished @diegomrsantos's task in 2m 19s


Final Review of PR #719

  • Read the full diff against origin/unstable
  • Review qbft_manager/src/lib.rs changes
  • Review qbft_manager/src/instance.rs changes
  • Review qbft_manager/src/tests.rs changes
  • Review tester-subagent.md changes
  • Post final review

Summary

The PR replaces fixed slot-based cleanup (QBFT_RETAIN_SLOTS = 1) with role-specific beacon-chain deadlines computed by calculate_deadline. Decided instances now stay alive in the registry until the cleaner removes them at deadline expiry, serving late Initialize messages via the existing Decided handler (instance.rs:79-83).

Changes are well-scoped to qbft_manager with good test coverage across deadline calculation, retention semantics, epoch boundaries, and late-round survival.


Issues Found

1. pass_to_instance now spawns instances on inbound network messages (behavioral change)

lib.rs:378-402 — In the base code, pass_to_instance already called get_or_spawn_instance, so the spawning behavior on network messages is not new — it existed before this PR. The only change here is adding the deadline parameter to the existing call. I retract the concern from the prior review; this is not a behavioral change introduced by this PR.

2. Proposer / SyncCommittee same-slot deadline: confirm this is intentional

lib.rs:61-63: Proposer and SyncCommittee get deadline = slot. The cleaner runs at the start of each new slot and checks managed.deadline >= current_slot. When current_slot becomes slot + 1, the instance is removed.

For Proposer this is correct — proposals are strictly slot-bound.

For SyncCommittee, sync committee contributions are aggregated at slot_start + 2/3 * slot_duration, and the aggregate is included in the next slot's block. The same-slot deadline means the instance will be cleaned at the start of slot + 1, which should be fine since the QBFT consensus for a sync committee contribution should complete well within the slot. However, if there's any risk of the contribution consensus spanning past the slot boundary (e.g., due to round changes), the instance would be cleaned before completion. This seems acceptable for now given the PR scope, but worth monitoring.

Verdict: non-blocking — the comment in the code ("Must be in the same slot") correctly documents the intent.
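
For reference, the retention check described in this item can be sketched as follows. This is a simplified stand-in, not the code under review: the `ManagedInstance` struct here carries only a deadline, the `u64` key and the `clean_expired` name are assumptions, and the real cleaner is driven by slot ticks.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the registry entry. The real ManagedInstance
/// also tracks the channel to the running instance.
pub struct ManagedInstance {
    pub deadline: u64,
}

/// Keeps entries whose deadline has not passed (deadline >= current_slot)
/// and returns how many entries were removed.
pub fn clean_expired(
    instances: &mut HashMap<u64, ManagedInstance>,
    current_slot: u64,
) -> usize {
    let before = instances.len();
    instances.retain(|_, managed| managed.deadline >= current_slot);
    before - instances.len()
}
```

Under this check an entry with `deadline = slot` survives the tick for `slot` and is removed on the tick for `slot + 1`, which is the boundary behavior discussed for Proposer and SyncCommittee.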


Observations (non-blocking)

  • calculate_deadline is correct for EIP-7045: (epoch + 2) * slots_per_epoch - 1 correctly yields the last slot of epoch E+1 for attestations in epoch E. The comment is clear and well-documented.

  • VoluntaryExit / ValidatorRegistration deadline (slot + slots_per_epoch): This is a reasonable heuristic. These aren't time-critical operations, and one epoch provides a generous window.

  • instance.rs change is minimal and correct — the only change is adding a comment on the RecvResult::Closed branch (instance.rs:360). The break after Closed already existed in the base. The comment clarifies that no notification is needed because either the Decided transition already sent it, or the cleaner dropped the registry entry (which closes rx).

  • Test quality is good — tests use start_paused = true with tokio::time::advance() + yield_now() for deterministic timer control. The deadline calculation unit tests cover all roles. The integration tests (test_committee_can_reach_late_rounds, test_cleanup_removes_only_expired_instances, test_cleanup_across_epoch_boundary) verify the actual retention behavior end-to-end.

  • Clone bound addition on QbftDecidable::Id (lib.rs:460): Required because get_or_spawn_instance now needs to clone the id for map.entry(id.clone()) since it's used after the entry call. This is a minor trait signature change that correctly supports the new ManagedInstance insertion pattern.

  • Late-caller race acknowledged as pre-existing and out of scope per the review discussion thread and PR description, tracked in #914 (fix: late-caller race in QBFT instance lifecycle).
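
As a worked check of the EIP-7045 arithmetic in the first observation, the sketch below computes how many slots separate a duty slot from its deadline across one epoch. The function names are hypothetical; only the formula `(epoch + 2) * slots_per_epoch - 1` comes from the PR.

```rust
/// Last includable slot for an attestation made at `slot`, using the
/// formula quoted in the review: (epoch + 2) * slots_per_epoch - 1.
pub fn attestation_deadline(slot: u64, slots_per_epoch: u64) -> u64 {
    (slot / slots_per_epoch + 2) * slots_per_epoch - 1
}

/// Returns the (min, max) distance in slots between a duty slot and its
/// deadline, taken over every slot of `epoch`.
pub fn retention_range(epoch: u64, slots_per_epoch: u64) -> (u64, u64) {
    let first = epoch * slots_per_epoch;
    let mut min = u64::MAX;
    let mut max = 0;
    for slot in first..first + slots_per_epoch {
        let gap = attestation_deadline(slot, slots_per_epoch) - slot;
        min = min.min(gap);
        max = max.max(gap);
    }
    (min, max)
}
```

With 32 slots per epoch, a duty in the first slot of an epoch is 63 slots from its deadline and a duty in the last slot is 32 slots away, which lines up with the 32-63 slot window mentioned in the commit message.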


Verdict

No blocking issues. The core changes are correct and well-tested. The only item to keep an eye on is the SyncCommittee same-slot deadline if round changes push sync committee consensus past the slot boundary — but this is acceptable for the current scope.


@diegomrsantos
Member Author

Follow-up to Claude's note about the SyncCommittee same-slot deadline: this is now fixed on the PR branch.

  • 0968f4950 changes SyncCommittee cleanup to keep instances alive through the next-slot inclusion window while leaving Proposer same-slot.
  • 0968f4950 also adds an end-to-end paused-time test that uses the real late-in-slot start offset, crosses the first slot boundary without quorum, then proves the instance still reaches consensus and is cleaned at the following deadline boundary.
  • 1035f58bd fixes the resulting CI clippy::clone_on_copy failure in the new test helper.
  • 2681708e8 rewrites the new sync-committee test into a clearer Arrange / Act / Assert structure.

Validation after the follow-ups:

  • make lint
  • cargo test -p qbft_manager
