This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
CL-bench (Context Learning Benchmark) is a benchmark for evaluating LLMs' ability to learn new knowledge from provided context. It contains 1,899 expert-annotated tasks across 4 context categories and 18 sub-categories. The dataset is hosted on HuggingFace (tencent/CL-bench).
```bash
pip install openai tqdm
```

The dataset file CL-bench.jsonl must be downloaded separately from HuggingFace before running scripts.
IMPORTANT: Always use --workers 5 or higher (minimum 5). Never run with fewer than 5 workers.
IMPORTANT: Always start serve2.py with UVICORN_WORKERS=5 (minimum 5).
```bash
# Requires OPENAI_API_KEY env var or --api-key flag
python infer.py --model gpt-5.1 --input CL-bench.jsonl --output outputs/gpt5-1.jsonl --workers 5

# Custom OpenAI-compatible API
python infer.py --model deepseek-chat --base-url https://api.deepseek.com/v1 --api-key KEY --workers 5

# Concurrent inference
python infer.py --model gpt-5.1 --workers 20
```

The default judge uses Claude Opus 4.6 via Anthropic's OpenAI-compatible endpoint:
```bash
# Default eval command (Claude Opus 4.6 judge)
python eval.py --input outputs/MODEL.jsonl \
  --judge-model claude-opus-4-6 \
  --base-url "https://api.anthropic.com/v1/" \
  --api-key "$ANTHROPIC_API_KEY"

# To re-run eval, delete the _graded.jsonl first (checkpoint resumption skips existing)
rm -f outputs/MODEL_graded.jsonl

# Concurrent evaluation
python eval.py --input outputs/MODEL.jsonl \
  --judge-model claude-opus-4-6 \
  --base-url "https://api.anthropic.com/v1/" \
  --api-key "$ANTHROPIC_API_KEY" \
  --workers 5
```

Each line of the output file must have:
```json
{"idx": "<task_id>", "messages": [...], "model_output": "<answer text>", "rubrics": [...], "metadata": {"task_id": "...", "context_category": "...", ...}}
```

The codebase consists of two standalone Python scripts forming a two-stage pipeline:
- `infer.py` — Sends dataset tasks to an OpenAI-compatible API and collects model responses. Reads `CL-bench.jsonl` (messages + rubrics + metadata per task), calls the model, writes results to `outputs/`. Supports checkpoint resumption: if the output file already exists, completed task IDs are skipped.
- `eval.py` — Uses an LLM judge to grade model outputs with binary scores (0/1). Constructs a detailed grading prompt with rubrics, parses the judge's JSON response to extract scores. Also supports checkpoint resumption. After grading, `calculate_statistics()` prints solving rates overall and per `context_category`.
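The per-line output schema shown above can be checked with a small validator. This is an illustrative sketch, not code from the repository; the required field names come from the schema example:

```python
import json

# Top-level fields eval.py expects on every output line (per the schema above)
REQUIRED_KEYS = {"idx", "messages", "model_output", "rubrics", "metadata"}

def validate_line(line: str) -> bool:
    """Return True if one output JSONL line carries the expected fields."""
    record = json.loads(line)
    if not REQUIRED_KEYS.issubset(record):
        return False
    # metadata.task_id is the stable identifier used for checkpoint resumption
    return "task_id" in record.get("metadata", {})

sample = '{"idx": "t1", "messages": [], "model_output": "hi", "rubrics": [], "metadata": {"task_id": "t1", "context_category": "c"}}'
print(validate_line(sample))  # → True
```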
Both scripts use ThreadPoolExecutor for concurrency (controlled by --workers) and share the same JSONL read/write/append patterns. Both use metadata.task_id as the stable unique identifier for checkpoint resumption.
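The shared concurrency and resumption pattern looks roughly like this. A simplified sketch, not the actual script code; function names are illustrative:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def completed_ids(output_path: str) -> set:
    """Collect task_ids already written, so re-runs skip finished work."""
    done = set()
    if Path(output_path).exists():
        with open(output_path) as f:
            for line in f:
                done.add(json.loads(line)["metadata"]["task_id"])
    return done

def run(tasks, output_path, worker_fn, workers=5):
    """Process pending tasks concurrently, appending results as JSONL."""
    done = completed_ids(output_path)
    pending = [t for t in tasks if t["metadata"]["task_id"] not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool, open(output_path, "a") as out:
        for result in pool.map(worker_fn, pending):
            out.write(json.dumps(result) + "\n")  # append-only JSONL
```

Because results are appended as they complete, an interrupted run can simply be restarted with the same command and only the unfinished tasks are re-sent.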
CL-bench.jsonl → infer.py → outputs/{model}.jsonl → eval.py → outputs/{model}_graded.jsonl
Binary all-or-nothing: score 1 only if ALL rubric requirements are satisfied; otherwise 0. Empty model outputs automatically score 0. Final metric is Solving Rate = score_1 / total.
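The metric computation can be sketched as follows. Field names follow the output schema; the `score` key on graded records is an assumption about the `_graded.jsonl` format:

```python
from collections import defaultdict

def solving_rates(graded):
    """Solving Rate = score_1 / total, overall and per context_category."""
    overall = [0, 0]                      # [sum of scores, count]
    per_cat = defaultdict(lambda: [0, 0])
    for rec in graded:
        # empty model outputs automatically score 0
        score = rec["score"] if rec.get("model_output") else 0
        cat = rec["metadata"]["context_category"]
        overall[0] += score
        overall[1] += 1
        per_cat[cat][0] += score
        per_cat[cat][1] += 1
    rates = {cat: s / n for cat, (s, n) in per_cat.items()}
    return overall[0] / overall[1], rates
```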
See reasoner_2/CLAUDE.md for the authoritative template authoring standards. The standards are enforced when Claude Code is launched from reasoner_2/.
This project is governed by reasoner_2/.claude/ hooks. Launch Claude Code from reasoner_2/ for full enforcement including:
- Plan validation (plan mode + /critical-gaps + /validate-framework)
- Architecture documentation (context bootstrap, self-update reminders)
- Commit gates (complexity assessment, AI attribution blocking)
- Notable changes log maintenance
Plan Mode → write plan → /critical-gaps → ExitPlanMode → [User Approves] → Implementation → /validate-framework → Complete
NEVER skip /critical-gaps before ExitPlanMode. NEVER skip /validate-framework after implementation.
Before running any infer.py inference command, you MUST complete these steps in order:
- Syntax validation — Verify all Python files compile:
  - `~/Documents/Work/reasoner/reasoner_2/serve2.py`
  - `~/Documents/Work/reasoner/reasoner_2/query/gemini_query.py`
  - `~/Documents/Work/reasoner/reasoner_2/query/stage_config.py`
- Template validation — Verify all Jinja2 templates parse without errors:
  - All `*.jinja2` files in `~/Documents/Work/reasoner/reasoner_2/query/templates/stages/`
- serve2.py health check — Confirm serve2.py is running and responding on the expected port
- Pinpointer health check — Confirm Pinpointer is running at
http://localhost:3000(start withcd ~/Documents/Work/reasoner/pinpointer && npm run dev &if needed). serve2.py must be started withPINPOINTER_URL=http://localhost:3000. - 4-sample smoke test — Run inference on 4-sample dataset and evaluate results:
```bash
# Run 4-sample inference (minimum 5 workers)
python3 infer.py --model reasoner-2 --base-url http://localhost:8001/v1 --input CL-bench-4sample.jsonl --output outputs/test-4sample.jsonl --workers 5

# Evaluate results
python3 eval.py --input outputs/test-4sample.jsonl --judge-model claude-opus-4-6 --base-url "https://api.anthropic.com/v1/" --api-key "$ANTHROPIC_API_KEY" --workers 5
```
- Analyze results — Check for regressions in previously passing tasks
- Address issues — Fix any failures before proceeding with full inference
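The syntax-validation step can be run manually with the stdlib `py_compile` module. A generic sketch; pass it the file paths listed above:

```python
import py_compile

def check_compiles(paths):
    """Byte-compile each file; return (path, error) pairs for failures."""
    failures = []
    for path in paths:
        try:
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError as err:
            failures.append((path, str(err)))
    return failures
```

Equivalently, `python3 -m py_compile <file>...` from the shell exits non-zero if any file fails to compile.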
A PreToolUse hook (.claude/hooks/pre-inference-check.sh) automatically validates steps 1-3 when any infer.py command is run. If it fails, fix the reported issues before retrying.
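The template-validation step can be scripted with a sketch like this (not the hook's actual implementation; assumes the `jinja2` package is installed):

```python
from pathlib import Path
from jinja2 import Environment, TemplateSyntaxError

def check_templates(template_dir):
    """Parse each *.jinja2 file; return (path, error) pairs for failures."""
    env = Environment()
    failures = []
    for path in sorted(Path(template_dir).glob("*.jinja2")):
        try:
            env.parse(path.read_text())
        except TemplateSyntaxError as err:
            failures.append((str(path), err.message))
    return failures
```

`Environment.parse()` only builds the AST, so this catches syntax errors without needing any template variables defined.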
NEVER run full inference without completing the 4-sample smoke test first.
- NEVER include "Co-Authored-By" lines or any mention of Claude/AI in commit messages.