Overview
ValueAligner implements human value alignment through preference learning, RLHF-style feedback loops, and constitutional AI principles. It learns human preferences from demonstrated behavior, explicit feedback, and constitutional self-critique, then monitors the system's goals and actions so they remain aligned with those preferences over time.
Motivation
Value alignment is the central challenge of AI safety. Even with correct ethical frameworks (31.1) and norm tracking (31.3), the system must continuously learn and adapt to human values that may be implicit, evolving, or difficult to formalize. ValueAligner implements the core alignment mechanisms described by Russell (2019), Christiano et al. (2017), and Bai et al. (2022).
Core Data Structures
from __future__ import annotations

import enum
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


class AlignmentMethod(enum.Enum):
    RLHF = "rlhf"                        # Reinforcement Learning from Human Feedback
    CONSTITUTIONAL = "constitutional"    # Constitutional AI (Bai et al.)
    INVERSE_REWARD = "inverse_reward"    # Inverse Reward Design (Hadfield-Menell et al.)
    COOPERATIVE_IRL = "cooperative_irl"  # Cooperative Inverse RL
    DEBATE = "debate"                    # AI Safety via Debate (Irving et al.)
    AMPLIFICATION = "amplification"      # Iterated Amplification (Christiano)


class PreferenceType(enum.Enum):
    PAIRWISE = "pairwise"              # A > B comparison
    RATING = "rating"                  # absolute score
    RANKING = "ranking"                # ordered list
    CONSTITUTIONAL = "constitutional"  # principle-based


class AlignmentStatus(enum.Enum):
    ALIGNED = "aligned"
    UNCERTAIN = "uncertain"
    MISALIGNED = "misaligned"
    DRIFTING = "drifting"  # was aligned, trending away


@dataclass(frozen=True)
class HumanPreference:
    """A single human preference signal."""
    preference_id: str
    preference_type: PreferenceType
    context: dict[str, object]
    chosen: str                  # preferred action/outcome
    rejected: str | None = None  # for pairwise comparisons
    rating: float | None = None  # for rating type
    confidence: float = 1.0
    source: str = "human"        # human | constitutional | inferred


@dataclass(frozen=True)
class ConstitutionalPrinciple:
    """A constitutional AI principle for self-supervision."""
    principle_id: str
    description: str
    category: str  # harmlessness | helpfulness | honesty
    priority: int
    examples: tuple[str, ...] = ()


@dataclass(frozen=True)
class AlignmentReport:
    """Report on current alignment status."""
    timestamp: str
    overall_status: AlignmentStatus
    alignment_score: float  # 0.0 to 1.0
    drift_rate: float       # change per evaluation cycle
    top_misalignments: tuple[str, ...]
    preference_coverage: float  # fraction of value space covered
    recommendations: tuple[str, ...]


@dataclass(frozen=True)
class RewardModel:
    """Learned reward model from human preferences."""
    model_id: str
    method: AlignmentMethod
    accuracy: float
    num_preferences: int
    last_updated: str
    constitutional_principles: tuple[str, ...] = ()


@runtime_checkable
class ValueAlignerProtocol(Protocol):
    async def learn_preference(
        self, preference: HumanPreference
    ) -> None: ...

    async def evaluate_alignment(
        self, action_id: str, context: dict[str, object]
    ) -> AlignmentReport: ...

    async def constitutional_review(
        self, action_id: str, principles: list[ConstitutionalPrinciple]
    ) -> list[str]: ...

    async def detect_drift(
        self, recent_actions: list[dict[str, object]]
    ) -> AlignmentStatus: ...

    async def get_reward_model(self) -> RewardModel: ...
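For orientation, a minimal construction sketch; every field value below is illustrative rather than prescribed by the spec:

pref = HumanPreference(
    preference_id="pref-001",
    preference_type=PreferenceType.PAIRWISE,
    context={"task": "summarization"},
    chosen="concise, factual summary",
    rejected="verbose summary with speculation",
    confidence=0.9,
)

principle = ConstitutionalPrinciple(
    principle_id="cp-honesty-01",
    description="Prefer responses that acknowledge uncertainty over confident fabrication.",
    category="honesty",
    priority=1,
)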
Algorithm — RLHF + Constitutional AI Pipeline
FUNCTION learn_preferences(demonstrations, pairwise_preferences, constitutional_principles):
    # Phase 1: Supervised Fine-Tuning (SFT) on demonstrations
    FOR demo IN demonstrations:
        update_policy(demo, supervised=True)

    # Phase 2: Reward Model Training
    FOR pref IN pairwise_preferences:
        reward_model.train(pref.chosen, pref.rejected, pref.context)

    # Phase 3: Constitutional Self-Critique
    FOR principle IN constitutional_principles:
        critique = self_critique(current_policy, principle)
        IF critique.violation_detected:
            revision = self_revise(critique, principle)
            reward_model.add_constitutional_preference(revision)

    # Phase 4: Policy Optimization (PPO-style)
    WHILE NOT converged:
        action = policy.sample(context)
        reward = reward_model.score(action, context)
        policy.update(action, reward, kl_penalty=beta)
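Phase 2's pairwise update is typically a Bradley-Terry objective (the reward model should score the chosen outcome above the rejected one), and Phase 4 shapes the reward with a KL penalty so the policy stays near its reference. A minimal Python sketch; the function names and the beta default are our assumptions, not part of the spec:

import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected)."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))

def kl_penalized_reward(rm_score: float, kl_to_reference: float, beta: float = 0.1) -> float:
    """PPO-style shaped reward: r = r_RM - beta * KL(pi || pi_ref)."""
    return rm_score - beta * kl_to_reference

# The loss falls as the model learns to rank the chosen outcome higher:
assert pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 2.0)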
FUNCTION detect_drift(recent_actions):
    window_scores = []
    FOR action IN recent_actions:
        score = reward_model.score(action)
        window_scores.append(score)
    trend = linear_regression(window_scores)
    IF trend.slope < -DRIFT_THRESHOLD:
        RETURN AlignmentStatus.DRIFTING
    IF mean(window_scores) < ALIGNMENT_THRESHOLD:
        RETURN AlignmentStatus.MISALIGNED
    IF std(window_scores) > UNCERTAINTY_THRESHOLD:
        RETURN AlignmentStatus.UNCERTAIN
    RETURN AlignmentStatus.ALIGNED
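For concreteness, a Python sketch of the window classifier above, using an ordinary least-squares slope over the score window; the three threshold constants are illustrative tunables, not values fixed by this spec:

import statistics

DRIFT_THRESHOLD = 0.01  # illustrative tunables
ALIGNMENT_THRESHOLD = 0.6
UNCERTAINTY_THRESHOLD = 0.2

def classify_window(window_scores: list[float]) -> AlignmentStatus:
    n = len(window_scores)
    if n < 2:
        return AlignmentStatus.UNCERTAIN  # too little evidence for a trend
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(window_scores)
    # ordinary least-squares slope of score against evaluation index
    slope = sum(
        (x - x_mean) * (y - y_mean) for x, y in enumerate(window_scores)
    ) / sum((x - x_mean) ** 2 for x in range(n))
    if slope < -DRIFT_THRESHOLD:
        return AlignmentStatus.DRIFTING
    if y_mean < ALIGNMENT_THRESHOLD:
        return AlignmentStatus.MISALIGNED
    if statistics.pstdev(window_scores) > UNCERTAINTY_THRESHOLD:
        return AlignmentStatus.UNCERTAIN
    return AlignmentStatus.ALIGNED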
Prometheus Metrics
| Metric | Type | Description |
|--------|------|-------------|
| value_aligner_preferences_learned_total | Counter | Total preferences ingested (by type, source) |
| value_aligner_alignment_score | Gauge | Current alignment score (0-1) |
| value_aligner_drift_rate | Gauge | Value drift rate per evaluation cycle |
| value_aligner_constitutional_reviews_total | Counter | Constitutional principle reviews performed |
| value_aligner_reward_model_accuracy | Gauge | Current reward model accuracy |
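Wiring these up with the standard prometheus_client package might look like the sketch below; the label names and sample values are illustrative assumptions:

from prometheus_client import Counter, Gauge

PREFERENCES_LEARNED = Counter(
    "value_aligner_preferences_learned_total",
    "Total preferences ingested",
    ["type", "source"],
)
ALIGNMENT_SCORE = Gauge(
    "value_aligner_alignment_score",
    "Current alignment score (0-1)",
)

# e.g. after a successful learn_preference() call:
PREFERENCES_LEARNED.labels(type="pairwise", source="human").inc()
ALIGNMENT_SCORE.set(0.92)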
Test Targets (12)
- Learn pairwise preference — reward model updates correctly
- Learn rating preference — scalar reward calibration
- Constitutional review detects principle violation
- Constitutional self-revision improves alignment score
- Drift detection — declining scores trigger DRIFTING status
- No drift — stable scores return ALIGNED
- High variance — returns UNCERTAIN status
- Preference coverage tracking — measures value space coverage
- RLHF pipeline end-to-end — preference → reward → policy update
- Multiple alignment methods produce comparable results
- Concurrent preference learning — thread safety
- Serialization round-trip for AlignmentReport and RewardModel (see the sketch after this list)
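A sketch of that round-trip test for AlignmentReport, assuming a plain-JSON wire format; the helper names and field values are illustrative:

import dataclasses
import json

def report_to_json(report: AlignmentReport) -> str:
    payload = dataclasses.asdict(report)
    payload["overall_status"] = report.overall_status.value  # enums are not JSON-native
    return json.dumps(payload)

def report_from_json(raw: str) -> AlignmentReport:
    payload = json.loads(raw)
    payload["overall_status"] = AlignmentStatus(payload["overall_status"])
    # JSON arrays deserialize as lists; the dataclass declares tuples
    payload["top_misalignments"] = tuple(payload["top_misalignments"])
    payload["recommendations"] = tuple(payload["recommendations"])
    return AlignmentReport(**payload)

def test_alignment_report_round_trip() -> None:
    report = AlignmentReport(
        timestamp="2025-01-01T00:00:00Z",
        overall_status=AlignmentStatus.ALIGNED,
        alignment_score=0.92,
        drift_rate=-0.001,
        top_misalignments=(),
        preference_coverage=0.75,
        recommendations=("collect more pairwise preferences",),
    )
    assert report_from_json(report_to_json(report)) == report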
References
- Russell (2019) — Human Compatible (value alignment, cooperative inverse RL)
- Christiano et al. (2017) — Deep Reinforcement Learning from Human Preferences (RLHF methodology)
- Bai et al. (2022) — Constitutional AI: Harmlessness from AI Feedback
- Gabriel (2020) — Artificial Intelligence, Values, and Alignment (alignment taxonomy)
- Irving et al. (2018) — AI Safety via Debate (debate as alignment mechanism)
- Hadfield-Menell et al. (2017) — Inverse Reward Design (reward misspecification)
- Leike et al. (2018) — Scalable Agent Alignment via Reward Modeling (recursive reward modeling)