Q&A: FederationHealthMonitor configuration — poll interval, score threshold, circuit breaker, SSE stream, Grafana monitoring #317
web3guru888
asked this question in
Q&A
Q&A: FederationHealthMonitor Configuration
Answers to the most common configuration and design questions for `FederationHealthMonitor` (Issue #315).

Q1: How do I choose `poll_interval_ms`?

A: The right value depends on your SLA for detecting failures:
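
One rough way to relate the interval to an SLA: the breaker only trips after `trip_count` consecutive low polls (see Q2), so a sustained failure is flagged after at most about `poll_interval_ms * trip_count` milliseconds. The sketch below illustrates that arithmetic. `BreakerSketch`, `polls_until_open`, and `worst_case_detection_ms` are hypothetical names for illustration, not part of `FederationHealthMonitor`, and the latching behaviour is an assumption based on the manual reset described in Q4.

```python
from dataclasses import dataclass


@dataclass
class BreakerSketch:
    """Hypothetical stand-in for the trip rule described in Q2."""
    score_threshold: float = 0.5  # default quoted in Q2
    trip_count: int = 3           # default quoted in Q2
    consecutive_low: int = 0

    def observe(self, overall_score: float) -> bool:
        """Record one poll and return True once the circuit is open."""
        if self.consecutive_low >= self.trip_count:
            return True  # assumed to stay open until reset manually (Q4)
        if overall_score < self.score_threshold:
            self.consecutive_low += 1
        else:
            self.consecutive_low = 0  # a healthy poll breaks the streak
        return self.consecutive_low >= self.trip_count


def polls_until_open(scores, breaker):
    """Return the 1-based poll index at which the breaker opens, or None."""
    for index, score in enumerate(scores, start=1):
        if breaker.observe(score):
            return index
    return None


def worst_case_detection_ms(poll_interval_ms: int, trip_count: int) -> int:
    """Upper bound, reached when the failure starts just after a poll."""
    return poll_interval_ms * trip_count


# Sustained failure from the first poll onwards
opened_at = polls_until_open([0.2] * 10, BreakerSketch())
# A single healthy poll in the middle restarts the count
opened_after_blip = polls_until_open([0.2, 0.2, 0.9, 0.2, 0.2, 0.2], BreakerSketch())
latency_1s = worst_case_detection_ms(1_000, 3)
latency_5s = worst_case_detection_ms(5_000, 3)
```

With the default `trip_count` of 3, a 1 s interval bounds detection at about 3 s and a 5 s interval at about 15 s, so pick the interval by dividing your detection SLA by `trip_count`.
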
The poll loop calls `snapshot()` on all four components. Each `snapshot()` should be an O(1) in-memory read with no network calls, so even 1 s intervals are fine for local federations.

Q2: How does `score_threshold` interact with `trip_count`?

A: The circuit breaker trips when `overall_score < score_threshold` for `trip_count` consecutive polls:

Recommendation: Start with the defaults (`score_threshold` = 0.5, `trip_count` = 3). Raise `trip_count` if you see false positives during transient network blips.

Q3: What happens when the circuit breaker opens?
A: Three things happen simultaneously:
1. `FederationHealthEvent.circuit_open = True` is emitted on the SSE stream
2. `CognitiveCycle._phase_federation()` writes `federation.circuit_open = True` to the Blackboard
3. Consumers that read `federation.circuit_open` can pause federation operations

What does NOT happen automatically: `FederatedTaskRouter.route()` does not internally query the health monitor. You must wire the circuit breaker check into your routing middleware. Example:

Q4: How do I reset the circuit breaker in production?
A: Call `reset_circuit()`. It resets `consecutive_low = 0`, which immediately closes the circuit:

The decision to require manual reset is intentional: in a BFT system, automatic recovery from a cluster-wide failure risks masking an ongoing fault. Add an `auto_reset_after_s` parameter if your use case requires automatic recovery.

Q5: How are the component score weights calibrated?
A: The default weights reflect failure criticality:
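
To make the idea of criticality-based weighting concrete, here is a minimal sketch of a weighted overall score. The component names and weight values are placeholders chosen for illustration, not the monitor's actual defaults, and `overall_score` is a hypothetical helper rather than a method of the real class.

```python
# Placeholder names and relative weights; the real defaults may differ.
EXAMPLE_WEIGHTS = {
    "component_a": 4,  # most critical
    "component_b": 3,
    "component_c": 2,
    "component_d": 1,  # least critical
}


def overall_score(component_scores, weights):
    """Weighted average of per-component scores, each in [0.0, 1.0]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * component_scores[name] for name in weights)
    return weighted / total_weight


# The two least critical components fail
minor_failure = overall_score(
    {"component_a": 1.0, "component_b": 1.0, "component_c": 0.0, "component_d": 0.0},
    EXAMPLE_WEIGHTS,
)
# The two most critical components fail
major_failure = overall_score(
    {"component_a": 0.0, "component_b": 0.0, "component_c": 1.0, "component_d": 1.0},
    EXAMPLE_WEIGHTS,
)
```

With these placeholder weights, losing the two low-weight components leaves the score above the default `score_threshold` of 0.5, while losing the two high-weight components pushes it below. That asymmetry is what weighting by criticality is meant to achieve.
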
To override for your deployment, subclass `InMemoryFederationHealthMonitor` and override `_collect()`:

Q6: How do I subscribe to the SSE stream from a FastAPI endpoint?
A: Use `StreamingResponse` with `text/event-stream`:

Q7: How do I build a Grafana dashboard for federation health?
A: Four panels cover the essentials:
- `asi_federation_health_score{component="overall"}`
- `asi_federation_health_score`
- `asi_federation_circuit_open`
- `asi_federation_component_health{health="failed"}`

Alert rule (Prometheus format):
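
A minimal sketch of such a rule, assuming `asi_federation_circuit_open` is a gauge that reads 1 while the circuit is open. The group name, alert name, `for` duration, and severity label are illustrative choices, not values from the original answer.

```yaml
groups:
  - name: federation-health          # hypothetical group name
    rules:
      - alert: FederationCircuitOpen # hypothetical alert name
        # Assumes the gauge is 1 while the circuit is open, 0 otherwise
        expr: asi_federation_circuit_open == 1
        for: 1m                      # illustrative; tune to your paging policy
        labels:
          severity: critical
        annotations:
          summary: "Federation circuit breaker is open"
```

Because the breaker requires a manual reset, this alert keeps firing until someone calls `reset_circuit()`, which suits a page-worthy condition.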