This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
CL-bench (Context Learning Benchmark) is a benchmark for evaluating LLMs' ability to learn new knowledge from provided context. It contains 1,899 expert-annotated tasks across 4 context categories and 18 sub-categories. The dataset is hosted on HuggingFace (tencent/CL-bench).
```bash
pip install openai tqdm
```

The dataset file CL-bench.jsonl must be downloaded separately from HuggingFace before running scripts.
IMPORTANT: Always use --workers 5 or higher (minimum 5). Never run with fewer than 5 workers.
IMPORTANT: Always start serve2.py with UVICORN_WORKERS=5 (minimum 5).
```bash
# Requires OPENAI_API_KEY env var or --api-key flag
python infer.py --model gpt-5.1 --input CL-bench.jsonl --output outputs/gpt5-1.jsonl --workers 5

# Custom OpenAI-compatible API
python infer.py --model deepseek-chat --base-url https://api.deepseek.com/v1 --api-key KEY --workers 5

# Concurrent inference
python infer.py --model gpt-5.1 --workers 20
```

The default judge uses Claude Opus 4.6 via Anthropic's OpenAI-compatible endpoint:
```bash
# Default eval command (Claude Opus 4.6 judge)
python eval.py --input outputs/MODEL.jsonl \
  --judge-model claude-opus-4-6 \
  --base-url "https://api.anthropic.com/v1/" \
  --api-key "$ANTHROPIC_API_KEY"

# To re-run eval, delete the _graded.jsonl first (checkpoint resumption skips existing)
rm -f outputs/MODEL_graded.jsonl

# Concurrent evaluation
python eval.py --input outputs/MODEL.jsonl \
  --judge-model claude-opus-4-6 \
  --base-url "https://api.anthropic.com/v1/" \
  --api-key "$ANTHROPIC_API_KEY" \
  --workers 5
```

Each line of the output file must have:
```json
{"idx": "<task_id>", "messages": [...], "model_output": "<answer text>", "rubrics": [...], "metadata": {"task_id": "...", "context_category": "...", ...}}
```

The codebase consists of two standalone Python scripts forming a two-stage pipeline:
- `infer.py` — Sends dataset tasks to an OpenAI-compatible API and collects model responses. Reads `CL-bench.jsonl` (messages + rubrics + metadata per task), calls the model, writes results to `outputs/`. Supports checkpoint resumption: if the output file already exists, completed task IDs are skipped.
- `eval.py` — Uses an LLM judge to grade model outputs with binary scores (0/1). Constructs a detailed grading prompt with rubrics, parses the judge's JSON response to extract scores. Also supports checkpoint resumption. After grading, `calculate_statistics()` prints solving rates overall and per `context_category`.
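The per-line output schema shown above can be checked with a small validator. This is an illustrative sketch, not code from the repository; the required field names come from the schema example:

```python
import json

# Top-level fields eval.py expects on every output line (per the schema above)
REQUIRED_KEYS = {"idx", "messages", "model_output", "rubrics", "metadata"}

def validate_line(line: str) -> bool:
    """Return True if one output JSONL line carries the expected fields."""
    record = json.loads(line)
    if not REQUIRED_KEYS.issubset(record):
        return False
    # metadata.task_id is the stable identifier used for checkpoint resumption
    return "task_id" in record.get("metadata", {})

sample = '{"idx": "t1", "messages": [], "model_output": "hi", "rubrics": [], "metadata": {"task_id": "t1", "context_category": "c"}}'
print(validate_line(sample))  # → True
```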
Both scripts use ThreadPoolExecutor for concurrency (controlled by --workers) and share the same JSONL read/write/append patterns. Both use metadata.task_id as the stable unique identifier for checkpoint resumption.
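The shared concurrency and resumption pattern looks roughly like this. A simplified sketch, not the actual script code; function names are illustrative:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def completed_ids(output_path: str) -> set:
    """Collect task_ids already written, so re-runs skip finished work."""
    done = set()
    if Path(output_path).exists():
        with open(output_path) as f:
            for line in f:
                done.add(json.loads(line)["metadata"]["task_id"])
    return done

def run(tasks, output_path, worker_fn, workers=5):
    """Process pending tasks concurrently, appending results as JSONL."""
    done = completed_ids(output_path)
    pending = [t for t in tasks if t["metadata"]["task_id"] not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool, open(output_path, "a") as out:
        for result in pool.map(worker_fn, pending):
            out.write(json.dumps(result) + "\n")  # append-only JSONL
```

Because results are appended as they complete, an interrupted run can simply be restarted with the same command and only the unfinished tasks are re-sent.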
CL-bench.jsonl → infer.py → outputs/{model}.jsonl → eval.py → outputs/{model}_graded.jsonl
Binary all-or-nothing: score 1 only if ALL rubric requirements are satisfied; otherwise 0. Empty model outputs automatically score 0. Final metric is Solving Rate = score_1 / total.
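The metric computation can be sketched as follows. Field names follow the output schema; the `score` key on graded records is an assumption about the `_graded.jsonl` format:

```python
from collections import defaultdict

def solving_rates(graded):
    """Solving Rate = score_1 / total, overall and per context_category."""
    overall = [0, 0]                      # [sum of scores, count]
    per_cat = defaultdict(lambda: [0, 0])
    for rec in graded:
        # empty model outputs automatically score 0
        score = rec["score"] if rec.get("model_output") else 0
        cat = rec["metadata"]["context_category"]
        overall[0] += score
        overall[1] += 1
        per_cat[cat][0] += score
        per_cat[cat][1] += 1
    rates = {cat: s / n for cat, (s, n) in per_cat.items()}
    return overall[0] / overall[1], rates
```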
See reasoner_2/CLAUDE.md for the authoritative template authoring standards. The standards are enforced when Claude Code is launched from reasoner_2/.
This project is governed by reasoner_2/.claude/ hooks. Launch Claude Code from reasoner_2/ for full enforcement including:
- Plan validation (plan mode + /critical-gaps + /validate-framework)
- Architecture documentation (context bootstrap, self-update reminders)
- Commit gates (complexity assessment, AI attribution blocking)
- Notable changes log maintenance
Plan Mode → write plan → /critical-gaps → ExitPlanMode → [User Approves] → Implementation → /validate-framework → Complete
NEVER skip /critical-gaps before ExitPlanMode. NEVER skip /validate-framework after implementation.
Before running any infer.py inference command, you MUST complete these steps in order:
- Syntax validation — Verify all Python files compile:
  - `~/Documents/Work/reasoner/reasoner_2/serve2.py`
  - `~/Documents/Work/reasoner/reasoner_2/query/gemini_query.py`
  - `~/Documents/Work/reasoner/reasoner_2/query/stage_config.py`
- Template validation — Verify all Jinja2 templates parse without errors:
  - All `*.jinja2` files in `~/Documents/Work/reasoner/reasoner_2/query/templates/stages/`
- serve2.py health check — Confirm serve2.py is running and responding on the expected port
- Pinpointer health check — Confirm Pinpointer is running at
http://localhost:3000(start withcd ~/Documents/Work/reasoner/pinpointer && npm run dev &if needed). serve2.py must be started withPINPOINTER_URL=http://localhost:3000. - 4-sample smoke test — Run inference on 4-sample dataset and evaluate results:
```bash
# Run 4-sample inference (minimum 5 workers)
python3 infer.py --model reasoner-2 --base-url http://localhost:8001/v1 --input CL-bench-4sample.jsonl --output outputs/test-4sample.jsonl --workers 5

# Evaluate results
python3 eval.py --input outputs/test-4sample.jsonl --judge-model claude-opus-4-6 --base-url "https://api.anthropic.com/v1/" --api-key "$ANTHROPIC_API_KEY" --workers 5
```
- Analyze results — Check for regressions in previously passing tasks
- Address issues — Fix any failures before proceeding with full inference
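The syntax-validation step can be run manually with the stdlib `py_compile` module. A generic sketch; pass it the file paths listed above:

```python
import py_compile

def check_compiles(paths):
    """Byte-compile each file; return (path, error) pairs for failures."""
    failures = []
    for path in paths:
        try:
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError as err:
            failures.append((path, str(err)))
    return failures
```

Equivalently, `python3 -m py_compile <file>...` from the shell exits non-zero if any file fails to compile.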
A PreToolUse hook (.claude/hooks/pre-inference-check.sh) automatically validates steps 1-3 when any infer.py command is run. If it fails, fix the reported issues before retrying.
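The template-validation step can be scripted with a sketch like this (not the hook's actual implementation; assumes the `jinja2` package is installed):

```python
from pathlib import Path
from jinja2 import Environment, TemplateSyntaxError

def check_templates(template_dir):
    """Parse each *.jinja2 file; return (path, error) pairs for failures."""
    env = Environment()
    failures = []
    for path in sorted(Path(template_dir).glob("*.jinja2")):
        try:
            env.parse(path.read_text())
        except TemplateSyntaxError as err:
            failures.append((str(path), err.message))
    return failures
```

`Environment.parse()` only builds the AST, so this catches syntax errors without needing any template variables defined.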
NEVER run full inference without completing the 4-sample smoke test first.
- NEVER include "Co-Authored-By" lines or any mention of Claude/AI in commit messages.