Overview
ValueAligner implements human value alignment through preference learning, RLHF-style feedback loops, and constitutional AI principles. It learns human preferences from demonstrated behavior, explicit feedback, and constitutional self-critique, then monitors the system's goals and actions so they remain aligned with those preferences over time.
Motivation
Value alignment is the central challenge of AI safety. Even with correct ethical frameworks (31.1) and norm tracking (31.3), the system must continuously learn and adapt to human values that may be implicit, evolving, or difficult to formalize. ValueAligner implements the core alignment mechanisms described by Russell (2019), Christiano et al. (2017), and Bai et al. (2022).
Core Data Structures
from __future__ import annotations

import enum
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


class AlignmentMethod(enum.Enum):
    RLHF = "rlhf"                        # Reinforcement Learning from Human Feedback
    CONSTITUTIONAL = "constitutional"    # Constitutional AI (Bai et al.)
    INVERSE_REWARD = "inverse_reward"    # Inverse Reward Design (Hadfield-Menell et al.)
    COOPERATIVE_IRL = "cooperative_irl"  # Cooperative Inverse RL
    DEBATE = "debate"                    # AI Safety via Debate (Irving et al.)
    AMPLIFICATION = "amplification"      # Iterated Amplification (Christiano)


class PreferenceType(enum.Enum):
    PAIRWISE = "pairwise"              # A > B comparison
    RATING = "rating"                  # absolute score
    RANKING = "ranking"                # ordered list
    CONSTITUTIONAL = "constitutional"  # principle-based


class AlignmentStatus(enum.Enum):
    ALIGNED = "aligned"
    UNCERTAIN = "uncertain"
    MISALIGNED = "misaligned"
    DRIFTING = "drifting"  # was aligned, trending away


@dataclass(frozen=True)
class HumanPreference:
    """A single human preference signal."""
    preference_id: str
    preference_type: PreferenceType
    context: dict[str, object]
    chosen: str                  # preferred action/outcome
    rejected: str | None = None  # for pairwise comparisons
    rating: float | None = None  # for rating type
    confidence: float = 1.0
    source: str = "human"        # human | constitutional | inferred


@dataclass(frozen=True)
class ConstitutionalPrinciple:
    """A constitutional AI principle for self-supervision."""
    principle_id: str
    description: str
    category: str  # harmlessness | helpfulness | honesty
    priority: int
    examples: tuple[str, ...] = ()


@dataclass(frozen=True)
class AlignmentReport:
    """Report on current alignment status."""
    timestamp: str
    overall_status: AlignmentStatus
    alignment_score: float  # 0.0 to 1.0
    drift_rate: float       # change per evaluation cycle
    top_misalignments: tuple[str, ...]
    preference_coverage: float  # fraction of value space covered
    recommendations: tuple[str, ...]


@dataclass(frozen=True)
class RewardModel:
    """Learned reward model from human preferences."""
    model_id: str
    method: AlignmentMethod
    accuracy: float
    num_preferences: int
    last_updated: str
    constitutional_principles: tuple[str, ...] = ()


@runtime_checkable
class ValueAlignerProtocol(Protocol):
    async def learn_preference(
        self, preference: HumanPreference
    ) -> None: ...

    async def evaluate_alignment(
        self, action_id: str, context: dict[str, object]
    ) -> AlignmentReport: ...

    async def constitutional_review(
        self, action_id: str, principles: list[ConstitutionalPrinciple]
    ) -> list[str]: ...

    async def detect_drift(
        self, recent_actions: list[dict[str, object]]
    ) -> AlignmentStatus: ...

    async def get_reward_model(self) -> RewardModel: ...
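For orientation, a minimal construction sketch; every field value below is illustrative rather than prescribed by the spec:

pref = HumanPreference(
    preference_id="pref-001",
    preference_type=PreferenceType.PAIRWISE,
    context={"task": "summarization"},
    chosen="concise, factual summary",
    rejected="verbose summary with speculation",
    confidence=0.9,
)

principle = ConstitutionalPrinciple(
    principle_id="cp-honesty-01",
    description="Prefer responses that acknowledge uncertainty over confident fabrication.",
    category="honesty",
    priority=1,
)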
Algorithm — RLHF + Constitutional AI Pipeline
FUNCTION learn_preferences(demonstrations, pairwise_preferences, constitutional_principles):
    # Phase 1: Supervised Fine-Tuning (SFT) on demonstrations
    FOR demo IN demonstrations:
        update_policy(demo, supervised=True)

    # Phase 2: Reward Model Training
    FOR pref IN pairwise_preferences:
        reward_model.train(pref.chosen, pref.rejected, pref.context)

    # Phase 3: Constitutional Self-Critique
    FOR principle IN constitutional_principles:
        critique = self_critique(current_policy, principle)
        IF critique.violation_detected:
            revision = self_revise(critique, principle)
            reward_model.add_constitutional_preference(revision)

    # Phase 4: Policy Optimization (PPO-style)
    WHILE NOT converged:
        action = policy.sample(context)
        reward = reward_model.score(action, context)
        policy.update(action, reward, kl_penalty=beta)
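Phase 2's pairwise update is typically a Bradley-Terry objective (the reward model should score the chosen outcome above the rejected one), and Phase 4 shapes the reward with a KL penalty so the policy stays near its reference. A minimal Python sketch; the function names and the beta default are our assumptions, not part of the spec:

import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected)."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))

def kl_penalized_reward(rm_score: float, kl_to_reference: float, beta: float = 0.1) -> float:
    """PPO-style shaped reward: r = r_RM - beta * KL(pi || pi_ref)."""
    return rm_score - beta * kl_to_reference

# The loss falls as the model learns to rank the chosen outcome higher:
assert pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 2.0)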
FUNCTION detect_drift(recent_actions):
    window_scores = []
    FOR action IN recent_actions:
        score = reward_model.score(action)
        window_scores.append(score)
    trend = linear_regression(window_scores)
    IF trend.slope < -DRIFT_THRESHOLD:
        RETURN AlignmentStatus.DRIFTING
    IF mean(window_scores) < ALIGNMENT_THRESHOLD:
        RETURN AlignmentStatus.MISALIGNED
    IF std(window_scores) > UNCERTAINTY_THRESHOLD:
        RETURN AlignmentStatus.UNCERTAIN
    RETURN AlignmentStatus.ALIGNED
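For concreteness, a Python sketch of the window classifier above, using an ordinary least-squares slope over the score window; the three threshold constants are illustrative tunables, not values fixed by this spec:

import statistics

DRIFT_THRESHOLD = 0.01  # illustrative tunables
ALIGNMENT_THRESHOLD = 0.6
UNCERTAINTY_THRESHOLD = 0.2

def classify_window(window_scores: list[float]) -> AlignmentStatus:
    n = len(window_scores)
    if n < 2:
        return AlignmentStatus.UNCERTAIN  # too little evidence for a trend
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(window_scores)
    # ordinary least-squares slope of score against evaluation index
    slope = sum(
        (x - x_mean) * (y - y_mean) for x, y in enumerate(window_scores)
    ) / sum((x - x_mean) ** 2 for x in range(n))
    if slope < -DRIFT_THRESHOLD:
        return AlignmentStatus.DRIFTING
    if y_mean < ALIGNMENT_THRESHOLD:
        return AlignmentStatus.MISALIGNED
    if statistics.pstdev(window_scores) > UNCERTAINTY_THRESHOLD:
        return AlignmentStatus.UNCERTAIN
    return AlignmentStatus.ALIGNED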
Prometheus Metrics
| Metric | Type | Description |
|--------|------|-------------|
| value_aligner_preferences_learned_total | Counter | Total preferences ingested (by type, source) |
| value_aligner_alignment_score | Gauge | Current alignment score (0-1) |
| value_aligner_drift_rate | Gauge | Value drift rate per evaluation cycle |
| value_aligner_constitutional_reviews_total | Counter | Constitutional principle reviews performed |
| value_aligner_reward_model_accuracy | Gauge | Current reward model accuracy |
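Wiring these up with the standard prometheus_client package might look like the sketch below; the label names and sample values are illustrative assumptions:

from prometheus_client import Counter, Gauge

PREFERENCES_LEARNED = Counter(
    "value_aligner_preferences_learned_total",
    "Total preferences ingested",
    ["type", "source"],
)
ALIGNMENT_SCORE = Gauge(
    "value_aligner_alignment_score",
    "Current alignment score (0-1)",
)

# e.g. after a successful learn_preference() call:
PREFERENCES_LEARNED.labels(type="pairwise", source="human").inc()
ALIGNMENT_SCORE.set(0.92)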
Test Targets (12)
- Learn pairwise preference — reward model updates correctly
- Learn rating preference — scalar reward calibration
- Constitutional review detects principle violation
- Constitutional self-revision improves alignment score
- Drift detection — declining scores trigger DRIFTING status
- No drift — stable scores return ALIGNED
- High variance — returns UNCERTAIN status
- Preference coverage tracking — measures value space coverage
- RLHF pipeline end-to-end — preference → reward → policy update
- Multiple alignment methods produce comparable results
- Concurrent preference learning — thread safety
- Serialization round-trip for AlignmentReport and RewardModel (see the sketch after this list)
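A sketch of that round-trip test for AlignmentReport, assuming a plain-JSON wire format; the helper names and field values are illustrative:

import dataclasses
import json

def report_to_json(report: AlignmentReport) -> str:
    payload = dataclasses.asdict(report)
    payload["overall_status"] = report.overall_status.value  # enums are not JSON-native
    return json.dumps(payload)

def report_from_json(raw: str) -> AlignmentReport:
    payload = json.loads(raw)
    payload["overall_status"] = AlignmentStatus(payload["overall_status"])
    # JSON arrays deserialize as lists; the dataclass declares tuples
    payload["top_misalignments"] = tuple(payload["top_misalignments"])
    payload["recommendations"] = tuple(payload["recommendations"])
    return AlignmentReport(**payload)

def test_alignment_report_round_trip() -> None:
    report = AlignmentReport(
        timestamp="2025-01-01T00:00:00Z",
        overall_status=AlignmentStatus.ALIGNED,
        alignment_score=0.92,
        drift_rate=-0.001,
        top_misalignments=(),
        preference_coverage=0.75,
        recommendations=("collect more pairwise preferences",),
    )
    assert report_from_json(report_to_json(report)) == report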
References
- Russell (2019) — Human Compatible (value alignment, cooperative inverse RL)
- Christiano et al. (2017) — Deep Reinforcement Learning from Human Preferences (RLHF methodology)
- Bai et al. (2022) — Constitutional AI: Harmlessness from AI Feedback
- Gabriel (2020) — Artificial Intelligence, Values, and Alignment (alignment taxonomy)
- Irving et al. (2018) — AI Safety via Debate (debate as alignment mechanism)
- Hadfield-Menell et al. (2017) — Inverse Reward Design (reward misspecification)
- Leike et al. (2018) — Scalable Agent Alignment via Reward Modeling (recursive reward modeling)