Show & Tell: Phase 11.2 — AlignmentMonitor (rolling window scores, harmonic mean, 5 alignment dimensions) #341

web3guru888 · 2026-04-12T23:03:52Z

Maintainer

Issue: #340 | Phase: 11.2 — Safety & Alignment

What is `AlignmentMonitor`?

AlignmentMonitor is the continuous complement to SafetyFilter (#337). While SafetyFilter gates individual goals and sub-tasks at the point of entry, AlignmentMonitor watches the running system over time and asks: is the overall pattern of behaviour remaining aligned with our constitutional specification?

It does this by tracking five alignment dimensions, computing rolling window scores, and emitting alerts or pausing the autonomy loop when scores drift below configurable thresholds.

Architecture

SafetyFilter ──► AlignmentAwareSafetyFilter ──► AlignmentMonitor.record()
                                                         │
                                              ┌──────────▼──────────────┐
                                              │  Ring Buffer per dim     │
                                              │  (max_samples_per_dim)   │
                                              └──────────┬──────────────┘
                                                         │
                                              ┌──────────▼──────────────┐
                                              │  _compute_window()       │
                                              │  rolling mean / min      │
                                              └──────────┬──────────────┘
                                                         │
                                        ┌────────────────▼──────────────────┐
                                        │  _check_threshold()               │
                                        │  score < alert_threshold?         │
                                        │    → AlignmentAlert               │
                                        │  score < critical_threshold?      │
                                        │    → pause_callback()             │
                                        └───────────────────────────────────┘
                                                         │
                                              ┌──────────▼──────────────┐
                                              │  overall_score()         │
                                              │  harmonic mean 5 dims    │
                                              └──────────┬──────────────┘
                                                         │
                                              CognitiveCycle._tick() gates
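
The threshold stage of the diagram can be sketched roughly as below. This is an illustration only, not the project's implementation: the function name `check_threshold`, the `Action` enum, and the default thresholds 0.8 and 0.7 are assumptions (the defaults simply echo the Grafana colour bands further down). It also assumes the critical threshold sits below the alert threshold, and it collapses the "alert" and "pause" outcomes into one return value.

```python
from enum import Enum
from typing import Callable, Optional


class Action(Enum):  # hypothetical result type, not from the original post
    OK = "ok"
    ALERT = "alert"
    PAUSE = "pause"


def check_threshold(
    score: float,
    alert_threshold: float = 0.8,     # assumed default
    critical_threshold: float = 0.7,  # assumed default
    pause_callback: Optional[Callable[[], None]] = None,
) -> Action:
    """Classify a window score; invoke pause_callback on a critical breach."""
    if score < critical_threshold:
        if pause_callback is not None:
            pause_callback()
        return Action.PAUSE
    if score < alert_threshold:
        return Action.ALERT
    return Action.OK
```

Checking the critical threshold first means a score below both thresholds always triggers the pause path rather than stopping at the alert.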

The 5 Alignment Dimensions

| Dimension | What it measures | Healthy score |
| --- | --- | --- |
| `CONSTITUTIONAL` | SafetyFilter block/critical rate | 1.0 = no violations |
| `CAPABILITY_SCOPE` | Tasks using only authorised capabilities | 1.0 = all in-scope |
| `GOAL_PRIORITY` | Escalation rate vs baseline | 1.0 = no abnormal escalation |
| `RESOURCE_USAGE` | CPU/memory within declared budget | 1.0 = within budget |
| `FEDERATION_TRUST` | Cross-peer trust score stability | 1.0 = no trust degradation |
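
One way the five dimensions might be represented is as an enum, so that every score is keyed by a fixed, typo-proof name. The class name `AlignmentDimension`, the string values, and the helper `healthy_snapshot` are assumptions made for this sketch; only the five member names come from the table.

```python
from enum import Enum
from typing import Dict


class AlignmentDimension(Enum):  # hypothetical name; members taken from the table
    CONSTITUTIONAL = "constitutional"
    CAPABILITY_SCOPE = "capability_scope"
    GOAL_PRIORITY = "goal_priority"
    RESOURCE_USAGE = "resource_usage"
    FEDERATION_TRUST = "federation_trust"


def healthy_snapshot() -> Dict[AlignmentDimension, float]:
    """Return a per-dimension score map where every dimension is fully healthy."""
    return {dim: 1.0 for dim in AlignmentDimension}
```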

Score computation

AlignmentSample.score is always in [0.0, 1.0]. Component code normalises raw measurements before calling record(). Example for CONSTITUTIONAL:

| Event | Score |
| --- | --- |
| Goal allowed (no violations) | 1.0 |
| Goal blocked (BLOCK severity) | 0.5 |
| Goal blocked (CRITICAL severity) | 0.2 |
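
The normalisation step for CONSTITUTIONAL could look something like the sketch below. The `Severity` enum and the function names are hypothetical; only the three score values are taken from the table above. A generic `clamp_score` helper is included to show one way a component might enforce the [0.0, 1.0] range before calling record().

```python
from enum import Enum


class Severity(Enum):  # hypothetical; the post names only BLOCK and CRITICAL
    NONE = "none"
    BLOCK = "block"
    CRITICAL = "critical"


# Score values copied from the table above.
_CONSTITUTIONAL_SCORES = {
    Severity.NONE: 1.0,
    Severity.BLOCK: 0.5,
    Severity.CRITICAL: 0.2,
}


def constitutional_score(severity: Severity) -> float:
    """Map a SafetyFilter outcome to a normalised CONSTITUTIONAL score."""
    return _CONSTITUTIONAL_SCORES[severity]


def clamp_score(raw: float) -> float:
    """Force an arbitrary raw measurement into [0.0, 1.0]."""
    return max(0.0, min(1.0, raw))
```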

Rolling window — `_compute_window()`

import time

def _compute_window(self, dimension, agent_id, window_ms):
    # Sample timestamps are assumed to come from time.monotonic() as well.
    now = time.monotonic()
    cutoff = now - window_ms / 1000.0  # window_ms is in milliseconds
    recent = [s for s in self._samples[dimension][agent_id]
              if s.timestamp >= cutoff]
    if not recent:
        # Benefit of the doubt: no signal means assume aligned.
        return AlignmentWindow(..., mean_score=1.0, min_score=1.0, sample_count=0)
    scores = [s.score for s in recent]
    return AlignmentWindow(
        ...,
        mean_score=sum(scores) / len(scores),
        min_score=min(scores),
        sample_count=len(scores),
    )

Returning 1.0 for an empty window is an intentional benefit-of-the-doubt default. SafetyFilter is the hard gate; AlignmentMonitor is a trend sensor, so it should raise alarms only when there is positive evidence of drift.
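
To make the ring buffer and the empty-window default concrete, here is a small self-contained sketch for a single (dimension, agent) pair. The names `Sample` and `RingWindow` and the injectable `clock` parameter are assumptions added for illustration and testability; `max_samples` stands in for the `max_samples_per_dim` setting shown in the architecture diagram.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Tuple


@dataclass(frozen=True)
class Sample:  # hypothetical stand-in for AlignmentSample
    timestamp: float  # seconds, from a monotonic clock
    score: float      # already normalised to [0.0, 1.0]


class RingWindow:
    """Bounded sample buffer for one (dimension, agent) pair."""

    def __init__(self, max_samples: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # deque(maxlen=...) silently evicts the oldest sample when full.
        self._samples: Deque[Sample] = deque(maxlen=max_samples)
        self._clock = clock

    def record(self, score: float) -> None:
        self._samples.append(Sample(self._clock(), score))

    def window(self, window_ms: float) -> Tuple[float, float, int]:
        """Return (mean, min, count) over the last window_ms milliseconds."""
        cutoff = self._clock() - window_ms / 1000.0
        scores = [s.score for s in self._samples if s.timestamp >= cutoff]
        if not scores:
            return 1.0, 1.0, 0  # empty window: benefit of the doubt
        return sum(scores) / len(scores), min(scores), len(scores)
```

One consequence worth keeping in mind: under this default, an agent that stops reporting drifts back to a score of 1.0 once its last samples age out of the window, so the sample count is a useful companion signal to the score.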

Harmonic mean — why it matters

overall_score() takes the harmonic mean of the 5 dimension scores:

H = 5 / (1/x₁ + 1/x₂ + 1/x₃ + 1/x₄ + 1/x₅)

Example: one dimension drops to 0.1 (capability scope violations), others stay at 1.0:

| Mean type | Score |
| --- | --- |
| Arithmetic | (1+1+1+1+0.1)/5 = 0.82 |
| Harmonic | 5/(1+1+1+1+10) ≈ 0.36 |

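The harmonic mean is dominated by its smallest term, so a single weak dimension pulls the overall score down sharply instead of being averaged away by four healthy ones. That is the behaviour you want in a safety-critical system.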
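
The comparison can be reproduced with a few lines of Python. This is a sketch, not the actual overall_score() body: the function name `harmonic_mean` is local to this example, and returning 0.0 when any score is 0.0 is an assumption about how the division-by-zero case might be handled (it matches the limit of the formula as any term approaches zero). Python's standard library also ships `statistics.harmonic_mean`, which computes the same quantity.

```python
from typing import Sequence


def harmonic_mean(scores: Sequence[float]) -> float:
    """Harmonic mean of scores in [0.0, 1.0]; 0.0 if any score is 0.0."""
    if not scores:
        raise ValueError("need at least one score")
    if any(s <= 0.0 for s in scores):
        return 0.0  # assumed handling: avoids dividing by zero
    return len(scores) / sum(1.0 / s for s in scores)


scores = [1.0, 1.0, 1.0, 1.0, 0.1]  # one dimension has dropped to 0.1
arithmetic = sum(scores) / len(scores)
harmonic = harmonic_mean(scores)
print(f"arithmetic={arithmetic:.2f} harmonic={harmonic:.2f}")
# prints: arithmetic=0.82 harmonic=0.36
```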

Prometheus metrics

| Metric | Type | Description |
| --- | --- | --- |
| `asi_alignment_samples_total` | Counter | Samples ingested per dimension/agent |
| `asi_alignment_score` | Gauge | Current window mean score |
| `asi_alignment_alerts_total` | Counter | Alert events triggered |
| `asi_alignment_critical_total` | Counter | Critical threshold breaches |
| `asi_alignment_overall_score` | Gauge | Harmonic mean across all 5 dimensions |

Grafana panels:

- `asi_alignment_score{agent_id="$agent"}`: 5-line chart, one per dimension
- `asi_alignment_overall_score{agent_id="$agent"}`: single stat with threshold colouring (green ≥ 0.8, amber ≥ 0.7, red < 0.7)
- `rate(asi_alignment_alerts_total[5m])`: alert rate sparkline
- `rate(asi_alignment_critical_total[5m])`: critical breach rate alert rule

Open questions for the community

- Dimension weights: should the harmonic mean use equal weights, or should CONSTITUTIONAL carry a higher weight (e.g. 3×)?
- Cross-agent aggregation: how should overall_score() behave when agent_id="*" (fleet-wide score)?
- Persistence: should AlignmentSnapshot be written to FederatedBlackboard so that the federation can observe the alignment health of peers?
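
To help reason about the dimension-weights question, here is a sketch of a weighted harmonic mean, H_w = Σwᵢ / Σ(wᵢ/xᵢ). Nothing here is part of the current design: the function name, the plain-string dimension keys, and the 3× weight are illustrative assumptions taken from the question itself.

```python
from typing import Mapping


def weighted_harmonic_mean(scores: Mapping[str, float],
                           weights: Mapping[str, float]) -> float:
    """Weighted harmonic mean; weights must be positive, scores in [0.0, 1.0]."""
    if any(weights[d] <= 0.0 for d in scores):
        raise ValueError("weights must be positive")
    if any(scores[d] <= 0.0 for d in scores):
        return 0.0  # assumed handling of a zero score
    total_weight = sum(weights[d] for d in scores)
    return total_weight / sum(weights[d] / scores[d] for d in scores)


dims = ["CONSTITUTIONAL", "CAPABILITY_SCOPE", "GOAL_PRIORITY",
        "RESOURCE_USAGE", "FEDERATION_TRUST"]
equal = {d: 1.0 for d in dims}
heavy = {**equal, "CONSTITUTIONAL": 3.0}  # hypothetical 3x weighting

# Case A: CONSTITUTIONAL drops to 0.5, everything else healthy.
case_a = {**{d: 1.0 for d in dims}, "CONSTITUTIONAL": 0.5}
# Case B: CAPABILITY_SCOPE drops to 0.5 instead.
case_b = {**{d: 1.0 for d in dims}, "CAPABILITY_SCOPE": 0.5}

a_equal = weighted_harmonic_mean(case_a, equal)  # 5/6
a_heavy = weighted_harmonic_mean(case_a, heavy)  # 7/10
b_equal = weighted_harmonic_mean(case_b, equal)  # 5/6
b_heavy = weighted_harmonic_mean(case_b, heavy)  # 7/8
```

Under these assumptions, the 3× weight makes a CONSTITUTIONAL drop more visible (0.70 instead of about 0.83), but it also makes an identical drop in any other dimension less visible (0.875 instead of about 0.83). That trade-off may be worth weighing before moving away from equal weights.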
