CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

CL-bench (Context Learning Benchmark) is a benchmark for evaluating LLMs' ability to learn new knowledge from provided context. It contains 1,899 expert-annotated tasks across 4 context categories and 18 sub-categories. The dataset is hosted on HuggingFace (tencent/CL-bench).

Setup

pip install openai tqdm

The dataset file CL-bench.jsonl must be downloaded separately from HuggingFace before running scripts.
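
Before running the scripts, it can help to sanity-check the downloaded file. A minimal sketch (the required field names match the eval input format documented below; the helper name is illustrative):

```python
# Sketch: sanity-check a downloaded CL-bench.jsonl before running the scripts.
import json

def check_jsonl(path, required=("messages", "rubrics", "metadata")):
    """Return the number of tasks; raise if a line is not valid JSON
    or is missing one of the expected top-level fields."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue  # tolerate blank lines
            task = json.loads(line)  # raises on malformed JSON
            missing = [k for k in required if k not in task]
            if missing:
                raise ValueError(f"line {lineno}: missing fields {missing}")
            count += 1
    return count
```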

Commands

Inference (generate model outputs)

IMPORTANT: Always run infer.py with --workers 5 or higher; never use fewer than 5 workers.

IMPORTANT: Always start serve2.py with UVICORN_WORKERS set to 5 or higher.

# Requires OPENAI_API_KEY env var or --api-key flag
python infer.py --model gpt-5.1 --input CL-bench.jsonl --output outputs/gpt5-1.jsonl --workers 5

# Custom OpenAI-compatible API
python infer.py --model deepseek-chat --base-url https://api.deepseek.com/v1 --api-key KEY --workers 5

# Concurrent inference
python infer.py --model gpt-5.1 --workers 20

Evaluation (grade model outputs)

The default judge uses Claude Opus 4.6 via Anthropic's OpenAI-compatible endpoint:

# Default eval command (Claude Opus 4.6 judge)
python eval.py --input outputs/MODEL.jsonl \
  --judge-model claude-opus-4-6 \
  --base-url "https://api.anthropic.com/v1/" \
  --api-key "$ANTHROPIC_API_KEY"

# To re-run eval, delete the _graded.jsonl first (checkpoint resumption skips existing)
rm -f outputs/MODEL_graded.jsonl

# Concurrent evaluation
python eval.py --input outputs/MODEL.jsonl \
  --judge-model claude-opus-4-6 \
  --base-url "https://api.anthropic.com/v1/" \
  --api-key "$ANTHROPIC_API_KEY" \
  --workers 5

Eval input JSONL format

Each line must have:

{"idx": "<task_id>", "messages": [...], "model_output": "<answer text>", "rubrics": [...], "metadata": {"task_id": "...", "context_category": "...", ...}}
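
A record in this shape can be assembled and serialized like so (all field values below are illustrative, not real dataset content):

```python
import json

# Illustrative eval-input record; field names follow the format above.
record = {
    "idx": "task_0001",                       # task identifier
    "messages": [                             # prompt originally sent to the model
        {"role": "user", "content": "..."},
    ],
    "model_output": "The model's answer text",
    "rubrics": ["Mentions X", "Cites the provided context"],
    "metadata": {"task_id": "task_0001", "context_category": "..."},
}
line = json.dumps(record, ensure_ascii=False)  # one JSONL line
```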

Architecture

The codebase consists of two standalone Python scripts forming a two-stage pipeline:

  1. infer.py — Sends dataset tasks to an OpenAI-compatible API and collects model responses. Reads CL-bench.jsonl (messages + rubrics + metadata per task), calls the model, writes results to outputs/. Supports checkpoint resumption: if the output file already exists, completed task IDs are skipped.

  2. eval.py — Uses an LLM judge to grade model outputs with binary scores (0/1). Constructs a detailed grading prompt with rubrics, parses the judge's JSON response to extract scores. Also supports checkpoint resumption. After grading, calculate_statistics() prints solving rates overall and per context_category.

Both scripts use ThreadPoolExecutor for concurrency (controlled by --workers) and share the same JSONL read/write/append patterns. Both use metadata.task_id as the stable unique identifier for checkpoint resumption.
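
The shared checkpoint-resumption pattern can be sketched roughly as follows (function names are illustrative, not the scripts' actual implementation; concurrency is omitted for clarity):

```python
import json
import os

def load_completed_ids(output_path):
    """Collect metadata.task_id values already present in the output file."""
    done = set()
    if os.path.exists(output_path):
        with open(output_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    done.add(json.loads(line)["metadata"]["task_id"])
    return done

def run_pending(tasks, output_path, process):
    """Append results for tasks whose task_id is not already in the output."""
    done = load_completed_ids(output_path)
    with open(output_path, "a", encoding="utf-8") as out:
        for task in tasks:
            if task["metadata"]["task_id"] in done:
                continue  # completed in a previous run; skip
            result = process(task)
            out.write(json.dumps(result, ensure_ascii=False) + "\n")
```

Appending one line per completed task is what makes a re-run after an interruption pick up exactly where the previous run stopped.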

Data Flow

CL-bench.jsonl → infer.py → outputs/{model}.jsonl → eval.py → outputs/{model}_graded.jsonl

Scoring

Binary all-or-nothing: a task scores 1 only if ALL rubric requirements are satisfied; otherwise it scores 0. Empty model outputs automatically score 0. The final metric is Solving Rate = (number of tasks scoring 1) / (total number of tasks).
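
The metric is a simple mean over binary scores; a sketch of the overall and per-category breakdown (assuming each graded record carries a score field and metadata.context_category, as the graded output format suggests):

```python
from collections import defaultdict

def solving_rates(graded):
    """Overall and per-category solving rate over graded records
    of the form {"score": 0 or 1, "metadata": {"context_category": ...}}."""
    total, solved = 0, 0
    by_cat = defaultdict(lambda: [0, 0])  # category -> [solved, total]
    for rec in graded:
        score = rec["score"]
        cat = rec["metadata"]["context_category"]
        total += 1
        solved += score
        by_cat[cat][0] += score
        by_cat[cat][1] += 1
    overall = solved / total if total else 0.0
    per_cat = {c: s / n for c, (s, n) in by_cat.items()}
    return overall, per_cat
```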

Template Authoring Standards

See reasoner_2/CLAUDE.md for the authoritative template authoring standards. The standards are enforced when Claude Code is launched from reasoner_2/.

Governance

This project is governed by reasoner_2/.claude/ hooks. Launch Claude Code from reasoner_2/ for full enforcement including:

  • Plan validation (plan mode + /critical-gaps + /validate-framework)
  • Architecture documentation (context bootstrap, self-update reminders)
  • Commit gates (complexity assessment, AI attribution blocking)
  • Notable changes log maintenance

Mandatory Workflow

Plan Mode → write plan → /critical-gaps → ExitPlanMode → [User Approves] → Implementation → /validate-framework → Complete

NEVER skip /critical-gaps before ExitPlanMode. NEVER skip /validate-framework after implementation.

Pre-Inference Testing (MUST run before infer.py)

Before running any infer.py inference command, you MUST complete these steps in order:

  1. Syntax validation — Verify all Python files compile:
    • ~/Documents/Work/reasoner/reasoner_2/serve2.py
    • ~/Documents/Work/reasoner/reasoner_2/query/gemini_query.py
    • ~/Documents/Work/reasoner/reasoner_2/query/stage_config.py
  2. Template validation — Verify all Jinja2 templates parse without errors:
    • All *.jinja2 files in ~/Documents/Work/reasoner/reasoner_2/query/templates/stages/
  3. serve2.py health check — Confirm serve2.py is running and responding on the expected port
  4. Pinpointer health check — Confirm Pinpointer is running at http://localhost:3000; if it is not, start it with cd ~/Documents/Work/reasoner/pinpointer && npm run dev &. Note that serve2.py must be started with PINPOINTER_URL=http://localhost:3000.
  5. 4-sample smoke test — Run inference on 4-sample dataset and evaluate results:
    # Run 4-sample inference (minimum 5 workers)
    python3 infer.py --model reasoner-2 --base-url http://localhost:8001/v1 --input CL-bench-4sample.jsonl --output outputs/test-4sample.jsonl --workers 5
    # Evaluate results
    python3 eval.py --input outputs/test-4sample.jsonl --judge-model claude-opus-4-6 --base-url "https://api.anthropic.com/v1/" --api-key "$ANTHROPIC_API_KEY" --workers 5
  6. Analyze results — Check for regressions in previously passing tasks
  7. Address issues — Fix any failures before proceeding with full inference
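
Steps 1–4 above can be sketched as a small pre-flight script (helper names are illustrative; the template check assumes jinja2 is installed):

```python
# Sketch of pre-inference checks 1-4 (paths as listed above; jinja2 optional).
import glob
import os
import py_compile
import urllib.request

def check_python(files):
    """Step 1: verify each Python file compiles."""
    for path in files:
        py_compile.compile(path, doraise=True)  # raises PyCompileError on syntax errors

def check_templates(template_dir):
    """Step 2: verify each *.jinja2 template parses."""
    from jinja2 import Environment  # assumes `pip install jinja2`
    env = Environment()
    for path in glob.glob(os.path.join(template_dir, "*.jinja2")):
        with open(path, encoding="utf-8") as f:
            env.parse(f.read())  # raises TemplateSyntaxError on bad templates

def check_health(url, timeout=5):
    """Steps 3-4: confirm a service responds at the given URL."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status < 500
```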

A PreToolUse hook (.claude/hooks/pre-inference-check.sh) automatically validates steps 1-3 when any infer.py command is run. If it fails, fix the reported issues before retrying.

NEVER run full inference without completing the 4-sample smoke test first.

Git Commit Rules

  • NEVER include "Co-Authored-By" lines or any mention of Claude/AI in commit messages.