How to wire a cheap-tier LLM as a mechanical verifier of a higher-tier agent's adherence to a behavioral rule. The "stupid" framing is intentional: the verifier's literal, instruction-following nature is a feature — it runs the test script exactly without creative reinterpretation. The cost asymmetry (cheap verifier auditing expensive subject) is what makes A/B-testing rule efficacy affordable.
This is the recipe behind the framework's ../tests/ fixtures and the methodology in ../docs/self-test.md.
| Pattern | Tier choice | Check type | Goal |
|---|---|---|---|
| Stupid-agent review (this recipe) | Cheap-tier verifier (Haiku, gpt-4o-mini) auditing higher-tier subject | Structural / boolean — heading present, quote verbatim, link resolves, vendor-neutral terms | Measure whether a behavioral rule moves metrics |
| LLM-as-judge | Same-tier or higher-tier judge | Quality / preference rating | Score output quality |
| Linter / static analysis | No LLM | AST or regex pattern match | Catch deterministic violations |
| Capability eval (AgentBench, τ²-bench) | Variable | Task success rate | Measure what an agent can do |
| Constitutional AI critique-revise | Same-tier self-critique | Quality + values alignment | Improve a single output |
| Pre-commit shell hook | No LLM | Deterministic shell test | Block bad commits |
The stupid-agent verifier is the only one that combines cheap-tier-as-feature + structural-not-quality checks + rule-efficacy as the goal.
The pattern composes from three published primitives, none of which alone names the full shape:
- Multi-Agent Verification (MAV) — arxiv 2502.20379, Feb 2025 — establishes that off-the-shelf cheap-tier LLMs (Gemini-1.5-Flash, GPT-4o-mini) can produce reliable binary True/False approvals at low cost. The closest theoretical analogue.
- Rule Based Rewards — Anthropic, arxiv 2411.01111 — decomposes safety policy into binary propositions graded by a separate LLM. Validates that rule-shift behavior is measurable, in an RL-training context. Stupid-agent review applies the same decomposition at runtime.
- SycEval — arxiv 2502.08177 — empirically measured rule-shift impact (sycophancy rates 56–62% across model families). Demonstrates the methodology is real; the framework's
../engines/calibration.mdis its purpose-built primitive.
Use the stupid-agent gate when all four hold:
- The check is structural — heading present, link target exists, quote matches verbatim, term banned, count correct.
- The check is boolean — pass/fail, not graded.
- The artifact under review is text — Markdown, JSON, code, prompt body. (Stupid agents are bad at runtime behavior; that's what fixtures and capability evals are for.)
- The cost of running it must be cheap enough to run on every change — otherwise it's a special-case audit, not a gate.
Skip the stupid-agent gate when:
- A regex or AST-based linter would suffice (cheaper, more deterministic — use the linter).
- The judgment is qualitative ("does this prose flow well?") — that's LLM-as-judge territory, use a smarter tier.
- The runtime cost matters more than the verification cost (production agent paths) — use deterministic hooks per
../conduct/hooks.md.
A stupid-agent review wires three roles. Adopters tend to confuse them.
| Role | Tier | Job | Example |
|---|---|---|---|
| Subject | Top-tier or mid-tier | Performs the task under test | Sonnet writing the diff for a bug fix |
| Verifier | Low-tier | Runs the boolean test against the subject's output | Haiku checking "does this output contain a ## Signature heading?" |
| Orchestrator | Mid-tier or top-tier | Dispatches both, scores the A/B delta, logs the fixture result | The CI runner or /converge-style skill |
The cost asymmetry matters. The subject is expensive because the work is hard. The verifier is cheap because the test is mechanical. Mixing the tiers — using top-tier as the verifier — wastes budget; using cheap-tier as the subject misses the failure modes the rule is supposed to prevent.
Stupid-agent review is most useful as the runtime for an A/B fixture per ../docs/self-test.md. The contract:
- Two subject runs — identical task, identical model, identical temperature. One with the conduct module loaded, one without.
- One verifier pass — the stupid agent applies the fixture's pass criterion to both outputs, returning a structured per-check result.
- Delta computation — the orchestrator compares the two verifier results and records the rule-attributable shift.
Without the baseline run, you have not measured rule efficacy — you have measured the rule-loaded model. The baseline is the contract.
Promptfoo supports cheap-tier graders out of the box. Override the default grader to a low-tier model and write structural assertions. Sketch:
# promptfooconfig.yaml
prompts:
- file://prompts/baseline.txt # task only
- file://prompts/treatment.txt # task + conduct/discipline.md prepended
providers:
- id: anthropic:messages:claude-sonnet-4-6
config: { temperature: 0 }
defaultTest:
options:
provider: anthropic:messages:claude-haiku-4-5 # the stupid agent
assert:
- type: llm-rubric
value: |
Does the output contain `(page - 1) * perPage` exactly?
Answer ONLY "yes" or "no".
- type: llm-rubric
value: |
Does the output add a JSDoc block (lines starting with `/**`)?
Answer ONLY "yes" or "no".
tests:
- vars: { task: "fix the off-by-one in this pagination function: ..." }Run two prompt variants, score with a Haiku grader, compare the deltas. Promptfoo handles the dispatch; you write the structural assertions.
Inspect-AI supports custom scorers including model-graded checks. The same A/B shape, with a custom scorer that runs the cheap-tier verifier:
from inspect_ai import Task, eval
from inspect_ai.scorer import scorer, Score
from inspect_ai.solver import generate, system_message
@scorer
def stupid_agent_check(criterion: str):
async def _scorer(state, target):
# Dispatch a Haiku-tier check on state.output.completion
# Return Score(value=1.0 if pass else 0.0, explanation=...)
...
return _scorer
baseline = Task(
dataset=[...],
solver=generate(),
scorer=stupid_agent_check("contains `(page - 1) * perPage`"),
)
treatment = Task(
dataset=[...],
solver=[system_message(open("conduct/discipline.md").read()), generate()],
scorer=stupid_agent_check("contains `(page - 1) * perPage`"),
)
eval([baseline, treatment], model="anthropic/claude-sonnet-4-6")Inspect-AI handles the eval harness; the cheap-tier scorer is your responsibility. The framework's contribution is the fixture format and the rule-efficacy A/B discipline, not the runner.
- Top-tier verifier. Burning Opus on "does this string contain
(page - 1)" wastes budget without improving accuracy. Cheap-tier is the design choice, not a fallback. - Quality-judgment in the verifier prompt. "Is this code clean?" is LLM-as-judge, not stupid-agent review. Mechanical = boolean = no adverbs.
- Treating one fixture run as a measurement. A single A/B is a smoke test, not a calibration. The honest claim is "this module showed delta on this fixture / model / prompt." Generalization requires more runs.
- Skipping the baseline. "I loaded the module and the model behaved well" is not evidence of rule efficacy. The baseline is the contract.
- Adjusting the criterion after seeing results. The criterion is a prediction. If the baseline passes, the criterion was wrong or the fixture invited drift insufficiently — not a reason to weaken the criterion.
- Confusing capability evals with rule-efficacy evals. AgentBench measures whether the agent succeeds at the task. Stupid-agent review measures whether the rule moves metrics. Different goals, different infrastructure.
| Module | Role |
|---|---|
../docs/self-test.md |
Methodology: A/B fixture format, pass-criterion discipline, per-module inventory |
../tests/ |
Per-module fixtures applying this recipe |
../engines/calibration.md |
The session-level analogue for the doubt-engine module specifically |
./eval-harnesses.md |
External benchmark suites (τ²-bench, SYCON-Bench, AgentDojo) — capability evals, not rule-efficacy |
../conduct/hooks.md |
Runtime gates for production paths — not the same as offline A/B fixtures |
Stupid-agent review fits between hooks (runtime, deterministic, cheap-per-call) and eval-harnesses (offline, capability-focused, expensive-per-batch). It is the offline, rule-focused, cheap-per-fixture gate the framework needs to move from "operational hypothesis" to "verified module."
This pattern is partial-prior-art, build-new as of 2026-05-05. The implementation primitives exist (Promptfoo cheap graders, Inspect-AI custom scorers, MAV's binary aspect verifiers). The named pattern — cheap-tier-as-feature + structural-not-quality + rule-efficacy as the goal — is not published as a unified architecture. This recipe is the framework's first attempt at naming and shipping it.
The first fixture using this recipe is ../tests/discipline.fixture.md. It honestly logs both a failed run (over-constrained prompt, no delta) and a passing run (drift-inviting prompt, clear delta). One fixture is not a verified framework — it is the start of one.