feat: partial sig with fallback #890

Open
shane-moore wants to merge 15 commits into sigp:unstable from shane-moore:feat/partial-sig-with-fallback
@shane-moore shane-moore commented Mar 17, 2026

Problem, Evidence, and Context

Change Overview

  • After Lagrange reconstruction, verify the combined signature against the validator's master BLS pubkey
  • On verification failure, fall back to per-share verification using operator share pubkeys from DB, evict invalid shares, and continue waiting for replacement shares to restore quorum
  • Add resolveDuplicateSignature handling when conflicting partial sigs arrive from the same operator
  • Add metrics for signature verification failures and operator evictions
  • ~300 lines of production code, ~700 lines of tests
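
The first two bullets can be illustrated with a toy model. The sketch below is not the PR's code: it replaces BLS with plain arithmetic modulo a small prime, where a "public key" is just the secret scalar and `verify` is a multiplication check, so it has no security at all. The modulus `P`, the helper names, and `demo_fallback` are assumptions made for illustration. It only shows why a single bad share corrupts the Lagrange-combined result and how per-share checks locate the offender.

```rust
use std::collections::BTreeMap;

/// Toy field modulus (assumption: stands in for the BLS scalar field).
const P: u64 = 1_000_000_007;

/// Modular exponentiation by squaring.
fn mod_pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc: u64 = 1;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

/// Evaluate the sharing polynomial at `x` (Horner's rule).
fn eval(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0u64, |acc, &c| (acc * x + c) % P)
}

/// Lagrange interpolation at x = 0 over all (operator_id, partial) pairs.
fn combine(shares: &BTreeMap<u64, u64>) -> u64 {
    let mut total: u64 = 0;
    for (&i, &s) in shares {
        let (mut num, mut den) = (1u64, 1u64);
        for &j in shares.keys().filter(|&&j| j != i) {
            num = num * j % P;
            den = den * ((j + P - i) % P) % P;
        }
        // Division is multiplication by the inverse (Fermat's little theorem).
        let coeff = num * mod_pow(den, P - 2) % P;
        total = (total + s * coeff) % P;
    }
    total
}

/// Toy verification: a "signature" is key * message mod P.
fn verify(sig: u64, pubkey: u64, msg: u64) -> bool {
    sig == pubkey * msg % P
}

/// Per-share fallback: operators whose partial fails its own check.
/// A missing share pubkey is treated as invalid.
fn find_invalid_shares(
    shares: &BTreeMap<u64, u64>,
    msg: u64,
    share_pubkeys: &BTreeMap<u64, u64>,
) -> Vec<u64> {
    shares
        .iter()
        .filter(|&(op, &sig)| !share_pubkeys.get(op).is_some_and(|&pk| verify(sig, pk, msg)))
        .map(|(&op, _)| op)
        .collect()
}

/// Returns (first attempt verified?, evicted operators, retry verified?).
fn demo_fallback() -> (bool, Vec<u64>, bool) {
    // Degree-2 polynomial, so threshold is 3; coeffs[0] plays the master key.
    let coeffs: [u64; 3] = [42, 7, 13];
    let msg: u64 = 12_345;
    let master_pk = coeffs[0];
    let share_pubkeys: BTreeMap<u64, u64> =
        (1..=4u64).map(|op| (op, eval(&coeffs, op))).collect();
    let mut shares: BTreeMap<u64, u64> =
        share_pubkeys.iter().map(|(&op, &sk)| (op, sk * msg % P)).collect();
    // Operator 2 submits a corrupted partial.
    *shares.get_mut(&2).unwrap() += 1;

    let first_ok = verify(combine(&shares), master_pk, msg);
    let evicted = find_invalid_shares(&shares, msg, &share_pubkeys);
    for op in &evicted {
        shares.remove(op);
    }
    let retry_ok = verify(combine(&shares), master_pk, msg);
    (first_ok, evicted, retry_ok)
}
```

In this toy setup, calling `demo_fallback()` combines four partials with one corrupted, fails the master-key check, evicts the operator found by the per-share pass, and then succeeds with the remaining three, which still meets the threshold.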

Risks, Trade-offs, and Mitigations

  • The fallback path queries the DB via spawn_blocking. This only fires when reconstruction verification fails, which should be rare in normal operation
  • Duplicate resolution also hits the DB, but the message validator's per-operator rate limit (MAX_MESSAGES_PER_ROUND = 1) bounds this to at most one query per operator per collector instance
  • No changes to the happy path: reconstruction success follows the same code path as before

Validation

  • 8 unit tests covering verify_reconstructed_signature, verify_partial_signature, find_invalid_shares, resolve_duplicate_signature, and try_combine_and_verify
  • 4 integration tests: happy path, fallback eviction + recovery, duplicate resolution, and evicted operator rejection

Rollback

  • Clean revert, no data/schema changes. Reverting removes verification and returns to the prior behavior of trusting all partial signatures.

Additional Info / Next Steps

  • Tests can be split to a separate PR if preferred; kept here because they validate the core logic.

@shane-moore shane-moore changed the title Feat/partial sig with fallback feat: partial sig with fallback Mar 17, 2026
Comment on lines +593 to +604
    warn!(?operator_id, "Conflicting signature from operator");
    match fetch_share_pubkeys_by_validator_index(&database, validator_index).await {
        Ok(pubkeys) => {
            resolve_duplicate_signature(
                &mut signature_share,
                operator_id,
                &signature,
                signing_root,
                &pubkeys,
            );
        }
        Err(err) => {

Performance / DoS concern: Every conflicting signature from the same operator triggers a spawn_blocking + DB query. A malicious operator flooding distinct signatures for the same operator_id will hit this path on every message, causing unbounded DB queries.

Consider caching the share_pubkeys lookup result (e.g., in a local variable that persists across loop iterations), so the DB is only queried once per operator conflict:

Suggested change (replacing the lines quoted above)
    if received_different_share {
        warn!(?operator_id, "Conflicting signature from operator");
        // TODO: consider caching share_pubkeys across loop iterations to avoid
        // repeated DB queries from a malicious operator sending many distinct sigs
        match fetch_share_pubkeys_by_validator_index(&database, validator_index).await {
            Ok(pubkeys) => {
                resolve_duplicate_signature(
                    &mut signature_share,
                    operator_id,
                    &signature,
                    signing_root,
                    &pubkeys,
                );
            }
            Err(err) => {
                error!(
                    ?err,
                    ?validator_index,
                    "DB lookup failed for duplicate resolution"
                );
            }
        }

Member Author

@shane-moore shane-moore Mar 17, 2026

The message validator enforces MAX_MESSAGES_PER_ROUND = 1 per (operator, slot, kind), so at most ~2 messages from the same operator can reach the collector per slot (accounting for a small race window in the validation queue). The duplicate resolution path therefore fires at most once per operator per collector instance, so the number of DB queries is bounded by committee size (4-13), not by attacker volume. A cache would save a handful of DB calls on an already-rare path, so it is not worth the added complexity.
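
The body of resolve_duplicate_signature is not shown in this thread, so the following is only one plausible policy, not the PR's implementation: when an operator already has a stored partial and a different one arrives, check both against that operator's share pubkey and keep whichever is valid. The `Resolution` enum, the generic signature, and the `verify` closure (standing in for a BLS check) are hypothetical.

```rust
use std::collections::HashMap;

type OperatorId = u64;

/// Outcome of resolving two conflicting partials from one operator
/// (hypothetical type, not taken from the PR).
#[derive(Debug, PartialEq)]
enum Resolution {
    KeptExisting,
    ReplacedWithIncoming,
    EvictedBoth,
}

/// Assumed policy: prefer the stored partial if it verifies, otherwise the
/// incoming one, otherwise drop the operator's entry entirely.
/// `verify` stands in for a BLS check against the operator's share pubkey.
fn resolve_duplicate<S: Clone>(
    shares: &mut HashMap<OperatorId, S>,
    operator_id: OperatorId,
    incoming: &S,
    verify: impl Fn(OperatorId, &S) -> bool,
) -> Resolution {
    let existing_ok = shares
        .get(&operator_id)
        .is_some_and(|stored| verify(operator_id, stored));
    if existing_ok {
        Resolution::KeptExisting
    } else if verify(operator_id, incoming) {
        shares.insert(operator_id, incoming.clone());
        Resolution::ReplacedWithIncoming
    } else {
        // Neither partial verifies: the operator contributes nothing.
        shares.remove(&operator_id);
        Resolution::EvictedBoth
    }
}
```

Under this policy each conflict costs at most two verifications, which fits the bound described above: with roughly one conflict per operator, the total work scales with committee size.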

@diegomrsantos
Member

Could we please use the current PR template for the PRs?

@shane-moore shane-moore marked this pull request as draft March 17, 2026 00:44
@shane-moore shane-moore marked this pull request as ready for review March 17, 2026 00:44
@diegomrsantos
Member

Could we please use the current PR template for the PRs?

It seems we also need to get #870 merged into stable

Comment on lines +616 to +647
  if let Some(threshold) = threshold
      && signature_share.len() as u64 >= threshold
  {
-     let signature = match combine_signatures(mem::take(&mut signature_share)) {
-         Ok(signature) => Arc::new(signature),
-         Err(err) => {
+     let Some(validator_pk) = &validator_pubkey else {
+         error!("No validator pubkey available for verification");
+         return;
+     };
+
+     match try_combine_and_verify(&signature_share, validator_pk, signing_root) {
+         CombineOutcome::Success(signature) => {
+             trace!(?signature, "Successfully recovered signature");
+             for notifier in mem::take(&mut notifiers) {
+                 if notifier.send(Arc::clone(&signature)).is_err() {
+                     warn!("Callback dropped since signature is no longer relevant");
+                 }
+             }
+             full_signature = Some(signature);
+         }
+         CombineOutcome::CombineFailed(err) => {
              error!(?err, "Failed to recover signature");
              return;
          }
-     };
-
-     trace!(?signature, "Successfully recovered signature");
-
-     for notifier in mem::take(&mut notifiers) {
-         if notifier.send(Arc::clone(&signature)).is_err() {
-             warn!("Callback dropped - signature is no longer relevant");
+         CombineOutcome::VerificationFailed => {
+             metrics::inc_counter(&metrics::SIGNATURE_VERIFICATION_FAILURES_TOTAL);
+             warn!("Reconstructed signature failed verification so run fallback");
+             let share_pubkeys = match fetch_share_pubkeys(&database, validator_pk).await {
+                 Ok(pubkeys) => pubkeys,
+                 Err(err) => {
+                     error!(?err, "Failed to look up share pubkeys");
+                     return;
+                 }
+             };

Bug: fallback can leave valid shares stranded when count remains >= threshold

After find_invalid_shares removes bad shares, if signature_share.len() >= threshold still holds (e.g. threshold=3, had 5 shares, 1 bad removed → 4 remain), the code falls through and blocks on rx.recv().await without re-attempting combination. If no further messages arrive, the collector is stuck forever holding enough valid shares to succeed.

This can happen in practice when more than threshold partial signatures arrive before the first combine attempt, and only a minority are invalid.

Fix: wrap the threshold check in a loop so it re-attempts after eviction:

Suggested change (replacing the lines quoted above)
    // Re-check threshold after potential eviction of invalid shares.
    while let Some(threshold) = threshold {
        if (signature_share.len() as u64) < threshold {
            break;
        }
        let Some(validator_pk) = &validator_pubkey else {
            error!("No validator pubkey available for verification");
            return;
        };
        match try_combine_and_verify(&signature_share, validator_pk, signing_root) {
            CombineOutcome::Success(signature) => {
                trace!(?signature, "Successfully recovered signature");
                for notifier in mem::take(&mut notifiers) {
                    if notifier.send(Arc::clone(&signature)).is_err() {
                        warn!("Callback dropped since signature is no longer relevant");
                    }
                }
                full_signature = Some(signature);
                break;
            }
            CombineOutcome::CombineFailed(err) => {
                error!(?err, "Failed to recover signature");
                return;
            }
            CombineOutcome::VerificationFailed => {
                metrics::inc_counter(&metrics::SIGNATURE_VERIFICATION_FAILURES_TOTAL);
                warn!("Reconstructed signature failed verification so run fallback");
                let share_pubkeys = match fetch_share_pubkeys(&database, validator_pk).await {
                    Ok(pubkeys) => pubkeys,
                    Err(err) => {
                        error!(?err, "Failed to look up share pubkeys");
                        return;
                    }
                };
                let invalid_operators =
                    find_invalid_shares(&signature_share, signing_root, &share_pubkeys);
                if invalid_operators.is_empty() {
                    let operators: Vec<_> = signature_share.keys().copied().collect();
                    error!(
                        ?signing_root,
                        ?operators,
                        "Verification failed but no individual share was invalid"
                    );
                    return;
                }
                warn!(?invalid_operators, "Removing invalid shares");
                for op in &invalid_operators {
                    signature_share.remove(op);
                }
                // Loop back to re-check threshold with remaining shares
            }
        }
    }

Member Author

Valid point. When network partials arrive before RegisterNotifier sets the threshold, they accumulate unchecked. The first combine then runs with len >> threshold, and after eviction len can still be >= threshold with only valid shares remaining. The if let didn't re-check after eviction, so it fell through to rx.recv(), waiting for a message that might never come.

Changed if let to while let so the threshold check re-runs immediately after eviction. Each iteration either succeeds, drops below threshold, or returns on a fatal error, so there is no unnecessary wait.
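
The control flow described here can be sketched in isolation. This is a simplified, synchronous model and not the collector's code: signatures are plain integers, the validity rule (`op * 10`) is an invented stand-in for BLS verification, and `LoopResult` and `combine_with_retry` are hypothetical names. It only demonstrates the three exits of each iteration: success, dropping below threshold, or a fatal error.

```rust
use std::collections::BTreeMap;

type OperatorId = u64;

#[derive(Debug, PartialEq)]
enum CombineOutcome {
    Success(u64),
    VerificationFailed,
}

#[derive(Debug, PartialEq)]
enum LoopResult {
    Done(u64),
    NeedMoreShares,
    Fatal,
}

/// Toy validity rule (assumption): operator `op`'s correct partial is `op * 10`.
fn share_is_valid(op: OperatorId, sig: u64) -> bool {
    sig == op * 10
}

/// Toy stand-in for combine + verify: the "combined" value is the sum of the
/// partials, and it only verifies if every partial is valid.
fn try_combine_and_verify(shares: &BTreeMap<OperatorId, u64>) -> CombineOutcome {
    if shares.iter().all(|(&op, &sig)| share_is_valid(op, sig)) {
        CombineOutcome::Success(shares.values().sum())
    } else {
        CombineOutcome::VerificationFailed
    }
}

/// Per-share pass that names the operators to evict.
fn find_invalid_shares(shares: &BTreeMap<OperatorId, u64>) -> Vec<OperatorId> {
    shares
        .iter()
        .filter(|&(&op, &sig)| !share_is_valid(op, sig))
        .map(|(&op, _)| op)
        .collect()
}

/// Re-check the threshold after every eviction instead of waiting for input.
fn combine_with_retry(shares: &mut BTreeMap<OperatorId, u64>, threshold: usize) -> LoopResult {
    while shares.len() >= threshold {
        match try_combine_and_verify(shares) {
            CombineOutcome::Success(sig) => return LoopResult::Done(sig),
            CombineOutcome::VerificationFailed => {
                let invalid = find_invalid_shares(shares);
                if invalid.is_empty() {
                    // Nothing to evict, so retrying would loop forever.
                    return LoopResult::Fatal;
                }
                for op in &invalid {
                    shares.remove(op);
                }
            }
        }
    }
    // Below threshold: the caller goes back to waiting for more partials.
    LoopResult::NeedMoreShares
}
```

With threshold 3 and five shares of which one is invalid, the first pass fails, one share is evicted, and the second pass succeeds with four shares without waiting for another message. Because every failed iteration removes at least one share or returns, the loop terminates.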

@shane-moore
Member Author

Could we please use the current PR template for the PRs?

It seems we also need to get #870 merged into stable

true true! I can make a pr for this tomorrow

@shane-moore shane-moore added the claude-recheck triggers claude review workflow to re-run label Mar 17, 2026
@github-actions github-actions bot removed the claude-recheck triggers claude review workflow to re-run label Mar 17, 2026
@diegomrsantos
Member

PR Split Suggestion

This PR implements three spec functions, DB queries, plumbing, metrics, and tests all at once — +995 LOC across 4 crates. Per our project guidelines (>200 LOC or >3 crates → propose a PR stack), here's a suggested split:

Proposed 3-PR Stack

PR 1: refactor: add share pubkey DB queries (~75 LOC, 1 crate)

File                               Change
database/src/share_operations.rs   get_share_pubkeys_for_validator, get_share_pubkeys_for_validator_index
database/src/sql_operations.rs     Two SQL constants

Pure data layer, zero behavior change, can merge immediately.


PR 2: feat: post-reconstruction signature verification with fallback (~250 LOC prod + ~400 LOC tests, 3 crates) — depends on PR 1

File                                 Change
client/src/lib.rs                    Pass database.clone() (1 line)
validator_store/src/lib.rs           Add validator_pubkey to ValidatorSigningData (1 line)
signature_collector/Cargo.toml       Add metrics + database deps
signature_collector/src/lib.rs       Wire DB through manager; try_combine_and_verify, verify_reconstructed_signature, verify_partial_signature, find_invalid_shares, fetch_share_pubkeys; CombineOutcome enum; while loop retry; metrics
signature_collector/src/metrics.rs   Counter metric
signature_collector/src/tests.rs     Unit tests for find_invalid_shares, try_combine_and_verify; integration tests for happy path + fallback eviction/recovery

Core spec compliance: ReconstructSignature + FallBackAndVerifyEachSignature. The 1-line changes in client/validator_store are mechanical wiring.


PR 3: feat: duplicate signature resolution (~60 LOC prod + ~265 LOC tests, 1 crate) — depends on PR 2

File                               Change
signature_collector/src/lib.rs     resolve_duplicate_signature, fetch_share_pubkeys_by_validator_index, duplicate detection in message loop
signature_collector/src/tests.rs   Unit tests for resolve_duplicate_*; integration tests for duplicate resolution

Spec function: resolveDuplicateSignature. Logically independent code path from the fallback (triggered by conflicting sigs from same operator, not by reconstruction failure). Also isolates the review comment about the while loop bug when eviction leaves count >= threshold.

Why this split

  • Each PR has a single purpose and stays under ~200 LOC of production code
  • Tests ship with the behavior they validate
  • The flagged bug (fallback + count >= threshold) can be addressed in PR 2 without tangling with duplicate resolution
  • If 2-PR split is preferred, merge PRs 2+3 — production code would be ~255 LOC, borderline but reasonable

Resolve merge conflicts:
- sql_operations.rs: keep both PR queries and upstream GET_OWN_SHARE
- signature_collector/lib.rs: keep both HashSet (PR) and Future (upstream) imports

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>