Show & Tell: Phase 11.2 — AlignmentMonitor (rolling window scores, harmonic mean, 5 alignment dimensions) #341
web3guru888
started this conversation in
Show and tell
Issue: #340 | Phase: 11.2 — Safety & Alignment
What is AlignmentMonitor?
AlignmentMonitor is the continuous complement to SafetyFilter (#337). While SafetyFilter gates individual goals and sub-tasks at the point of entry, AlignmentMonitor watches the running system over time and asks: is the overall pattern of behaviour remaining aligned with our constitutional specification?
It does this by tracking five alignment dimensions, computing rolling window scores, and emitting alerts or pausing the autonomy loop when scores drift below configurable thresholds.
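
As a rough sketch of the alert-or-pause behaviour described above, the snippet below maps a score to an action. The Action enum, the classify function, and both threshold parameters are hypothetical names chosen for illustration. The 0.8 and 0.7 defaults are borrowed from the Grafana colouring later in this post purely as placeholders, and the real AlignmentMonitor configuration may differ.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"
    ALERT = "alert"  # emit an alert, keep the loop running
    PAUSE = "pause"  # pause the autonomy loop


def classify(score: float, alert_below: float = 0.8, pause_below: float = 0.7) -> Action:
    """Map a score in [0.0, 1.0] to an action (thresholds are illustrative)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0.0, 1.0]")
    if score < pause_below:
        return Action.PAUSE
    if score < alert_below:
        return Action.ALERT
    return Action.OK
```

Keeping the two thresholds as parameters mirrors the "configurable thresholds" wording: a deployment could tighten or relax either one without touching the classification logic.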
Architecture
The 5 Alignment Dimensions
- CONSTITUTIONAL
- CAPABILITY_SCOPE
- GOAL_PRIORITY
- RESOURCE_USAGE
- FEDERATION_TRUST

Score computation
AlignmentSample.score is always in [0.0, 1.0]. Component code normalises raw measurements before calling record().

Rolling window: _compute_window()
An empty window scores 1.0. This is an intentional benefit-of-the-doubt default.
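
A minimal sketch of how normalised samples and the empty-window default could fit together. The RollingScores class, the count-based window, and the use of an arithmetic mean inside the window are all assumptions for illustration. Only record(), _compute_window(), the [0.0, 1.0] range, and the 1.0 default come from the description above. The CONSTITUTIONAL normalisation shown (1 minus the fraction of failed checks) is likewise a hypothetical example.

```python
from collections import defaultdict, deque


class RollingScores:
    """Hypothetical stand-in for the monitor's per-dimension windows."""

    def __init__(self, window_size: int = 100):
        # One bounded deque per dimension; the oldest sample drops off
        # automatically once the window is full.
        self._windows = defaultdict(lambda: deque(maxlen=window_size))

    def record(self, dimension: str, score: float) -> None:
        # Callers are expected to normalise before recording.
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score {score} outside [0.0, 1.0]")
        self._windows[dimension].append(score)

    def _compute_window(self, dimension: str) -> float:
        window = self._windows.get(dimension)
        if not window:
            return 1.0  # empty window: benefit-of-the-doubt default
        return sum(window) / len(window)  # assumed aggregation: plain mean


scores = RollingScores(window_size=3)
# Hypothetical normalisation: 2 failed constitutional checks out of 10.
scores.record("CONSTITUTIONAL", 1.0 - 2 / 10)
constitutional = scores._compute_window("CONSTITUTIONAL")
untouched = scores._compute_window("RESOURCE_USAGE")  # no samples yet
```

Here untouched comes back as 1.0 because nothing has been recorded for that dimension, so a freshly started agent does not trip an alarm before any evidence exists.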
SafetyFilter is the hard gate; AlignmentMonitor is a trend sensor: it should only raise alarms when there is positive evidence of drift.

Harmonic mean: why it matters
overall_score() uses the harmonic mean across the 5 dimensions.
Example: one dimension drops to 0.1 (capability scope violations) while the others stay at 1.0. An arithmetic mean would still report 4.1 / 5 = 0.82, whereas the harmonic mean gives 5 / (10 + 4) ≈ 0.357.
Harmonic mean correctly surfaces the weak dimension, which is exactly what you want for safety-critical systems.
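
The comparison can be reproduced in a few lines of Python. This is a sketch, not the project's overall_score() implementation: the harmonic_mean helper is a hypothetical name, and returning 0.0 when any dimension is exactly 0.0 is an assumption made here to avoid a division by zero.

```python
def harmonic_mean(scores):
    """Harmonic mean of scores in [0.0, 1.0]; any zero score yields 0.0."""
    if not scores:
        raise ValueError("need at least one score")
    if any(s == 0.0 for s in scores):
        return 0.0  # assumed convention: a zeroed dimension zeroes the total
    return len(scores) / sum(1.0 / s for s in scores)


# CAPABILITY_SCOPE has dropped to 0.1; the other four dimensions are at 1.0.
scores = [1.0, 0.1, 1.0, 1.0, 1.0]
arithmetic = sum(scores) / len(scores)
harmonic = harmonic_mean(scores)
print(f"arithmetic={arithmetic:.3f} harmonic={harmonic:.3f}")
# prints: arithmetic=0.820 harmonic=0.357
```

The arithmetic mean stays above 0.8 despite the failing dimension, while the harmonic mean falls well below 0.7, which is the behaviour the post is arguing for.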
Prometheus metrics
- asi_alignment_samples_total
- asi_alignment_score
- asi_alignment_alerts_total
- asi_alignment_critical_total
- asi_alignment_overall_score

Grafana panels:
- asi_alignment_score{agent_id="$agent"}: 5-line chart, one per dimension
- asi_alignment_overall_score{agent_id="$agent"}: single stat with threshold colouring (green ≥ 0.8, amber ≥ 0.7, red < 0.7)
- rate(asi_alignment_alerts_total[5m]): alert rate sparkline
- rate(asi_alignment_critical_total[5m]): critical breach rate alert rule

Open questions for the community
- Should the CONSTITUTIONAL score carry a higher weight (e.g. 3×)?
- How should overall_score() behave when agent_id="*" (fleet-wide score)?
- Should AlignmentSnapshot be written to FederatedBlackboard so the federation can observe alignment health of peers?