A practical implementation of LLM evaluation skills built using insights from Eugene Yan's LLM-Evaluators research and Vercel AI SDK 6.
This repository demonstrates how to build production-ready LLM evaluation skills as part of the Agent Skills for Context Engineering project. It serves as a practical example of:
- Skill Development: How to transform research insights into executable agent skills
- Tool Design: Best practices for building AI tools with proper schemas and error handling
- Evaluation Patterns: Implementation of LLM-as-a-Judge patterns for quality assessment
This project is an example implementation intended for inclusion in the Agent Skills for Context Engineering repository.
It builds upon the foundational skills from:
- 📚 `skills/context-fundamentals` - Context engineering principles
- 🔧 `skills/tool-design` - Tool design best practices
Evaluating AI-generated content is challenging. Traditional metrics (BLEU, ROUGE) often miss nuances that matter. Eugene Yan's research on LLM-Evaluators identifies practical patterns for using LLMs to judge LLM outputs.
Key insights we implemented:
| Insight | Implementation |
|---|---|
| Direct scoring works best for objective criteria | directScore tool with rubric support |
| Pairwise comparison is more reliable for preferences | pairwiseCompare tool with position swapping |
| Position bias affects pairwise judgments | Automatic position swapping in comparisons |
| Chain-of-thought improves reliability | All evaluations require justification with evidence |
| Clear rubrics reduce variance | generateRubric tool for consistent standards |
We leveraged AI SDK 6's new patterns:
- Agent Abstraction: Reusable `EvaluatorAgent` class with multiple capabilities
- Type-safe Tools: Zod schemas for all inputs/outputs
- Structured Output: JSON responses parsed and validated
- Error Handling: Graceful degradation when API calls fail
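To illustrate the last two points, here is a minimal sketch (hypothetical names, not the repository's actual code) of validating a model's JSON reply against a Zod schema and degrading gracefully when the call or parsing fails:

```typescript
import { z } from 'zod';

// Hypothetical reply schema; the real tools define richer shapes
// (see the input/output interfaces later in this README).
const ScoreReplySchema = z.object({
  score: z.number().min(1).max(5),
  justification: z.string(),
});

type ScoreReply = z.infer<typeof ScoreReplySchema>;
type ToolResult =
  | { success: true; data: ScoreReply }
  | { success: false; error: string };

// callJudge stands in for whatever model call the tool makes.
async function safeEvaluate(callJudge: () => Promise<string>): Promise<ToolResult> {
  try {
    const parsed = ScoreReplySchema.safeParse(JSON.parse(await callJudge()));
    if (!parsed.success) {
      return { success: false, error: 'Reply did not match the expected schema' };
    }
    return { success: true, data: parsed.data };
  } catch (err) {
    // Graceful degradation: report the failure instead of throwing.
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```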
┌─────────────────────────────────────────────────────────────────────┐
│ LLM-as-a-Judge Skills │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Skills │ │ Prompts │ │ Tools │ │
│ │ (MD docs) │───▶│ (templates)│───▶│ (TypeScript impl) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌─────────────────────────┐ │
│ └─────────────────────────────▶│ EvaluatorAgent │ │
│ │ ├── score() │ │
│ │ ├── compare() │ │
│ │ ├── generateRubric() │ │
│ │ └── chat() │ │
│ └─────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ OpenAI GPT-5.2 API │ │
│ └─────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
llm-as-judge-skills/
├── skills/ # Foundational knowledge (MD docs)
│ ├── llm-evaluator/ # LLM-as-a-Judge patterns
│ │ └── llm-evaluator.md # Evaluation methods, metrics, bias mitigation
│ ├── context-fundamentals/ # Context engineering principles
│ │ └── context-fundamentals.md # Managing context effectively
│ └── tool-design/ # Tool design best practices
│ └── tool-design.md # Schema design, error handling
│
├── prompts/ # Prompt templates
│ ├── evaluation/
│ │ ├── direct-scoring-prompt.md # Scoring prompt template
│ │ └── pairwise-comparison-prompt.md # Comparison prompt template
│ ├── research/
│ │ └── research-synthesis-prompt.md
│ └── agent-system/
│ └── orchestrator-prompt.md
│
├── tools/ # Tool documentation (MD)
│ ├── evaluation/
│ │ ├── direct-score.md # Direct scoring tool spec
│ │ ├── pairwise-compare.md # Pairwise comparison spec
│ │ └── generate-rubric.md # Rubric generation spec
│ ├── research/
│ │ ├── web-search.md
│ │ └── read-url.md
│ └── orchestration/
│ └── delegate-to-agent.md
│
├── agents/ # Agent documentation (MD)
│ ├── evaluator-agent/
│ │ └── evaluator-agent.md
│ ├── research-agent/
│ │ └── research-agent.md
│ └── orchestrator-agent/
│ └── orchestrator-agent.md
│
├── src/ # TypeScript implementation
│ ├── tools/evaluation/
│ │ ├── direct-score.ts # 165 lines - Direct scoring implementation
│ │ ├── pairwise-compare.ts # 255 lines - Pairwise with bias mitigation
│ │ └── generate-rubric.ts # 162 lines - Rubric generation
│ ├── agents/
│ │ └── evaluator.ts # 112 lines - EvaluatorAgent class
│ ├── config/
│ │ └── index.ts # Configuration and validation
│ └── index.ts # Main exports
│
├── tests/ # Test suite
│ ├── evaluation.test.ts # 9 tests for tools
│ ├── skills.test.ts # 10 tests for skills
│ └── setup.ts # Test configuration
│
└── examples/ # Usage examples
├── basic-evaluation.ts
├── pairwise-comparison.ts
├── generate-rubric.ts
└── full-evaluation-workflow.ts
Purpose: Evaluate a single response against defined criteria with numerical scores.
When to Use:
- Factual accuracy checks
- Instruction following assessment
- Content quality grading
- Compliance verification
Implementation Highlights:
// From src/tools/evaluation/direct-score.ts
const systemPrompt = `You are an expert evaluator. Assess the response against each criterion.
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-5 scale)
3. Justify your score
4. Suggest one improvement
Be objective and consistent. Base scores on explicit evidence.`;

Key Features:
- Weighted criteria support
- Chain-of-thought justification required
- Evidence extraction from response
- Improvement suggestions per criterion
- Configurable rubrics (1-3, 1-5, 1-10 scales)
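To make the weighted-criteria support concrete, here is a small sketch (an assumed helper, not the repository's exact code) of how an unweighted overall score and a weighted score can diverge, using the scores and weights from the example below:

```typescript
interface CriterionScore {
  criterion: string;
  score: number;
  weight: number; // relative importance; weights assumed to sum to 1
}

// Unweighted mean vs. weight-adjusted mean, mirroring the
// overallScore / weightedScore distinction in the example output.
function summarizeScores(scores: CriterionScore[]) {
  const overallScore = scores.reduce((sum, s) => sum + s.score, 0) / scores.length;
  const weightedScore = scores.reduce((sum, s) => sum + s.score * s.weight, 0);
  return { overallScore, weightedScore };
}

summarizeScores([
  { criterion: 'Accuracy', score: 4, weight: 0.4 },
  { criterion: 'Clarity', score: 5, weight: 0.3 },
  { criterion: 'Engagement', score: 4, weight: 0.3 },
]);
// => { overallScore: 4.33..., weightedScore: 4.3 }
```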
Example Usage:
const result = await executeDirectScore({
response: 'Quantum entanglement is like having two magical coins...',
prompt: 'Explain quantum entanglement to a high school student',
criteria: [
{ name: 'Accuracy', description: 'Scientific correctness', weight: 0.4 },
{ name: 'Clarity', description: 'Understandable for audience', weight: 0.3 },
{ name: 'Engagement', description: 'Interesting and memorable', weight: 0.3 }
],
rubric: { scale: '1-5' }
});
// Output:
// {
// success: true,
// scores: [
// { criterion: 'Accuracy', score: 4, justification: '...', evidence: [...] },
// { criterion: 'Clarity', score: 5, justification: '...', evidence: [...] },
// { criterion: 'Engagement', score: 4, justification: '...', evidence: [...] }
// ],
// overallScore: 4.33,
// weightedScore: 4.3,
// summary: { assessment: '...', strengths: [...], weaknesses: [...] }
// }

Purpose: Compare two responses and determine which is better, with position bias mitigation.
When to Use:
- A/B testing responses
- Preference evaluation
- Style and tone assessment
- Ranking quality differences
Implementation Highlights:
// Position bias mitigation: evaluate twice with swapped positions
if (input.swapPositions) {
// First pass: A first, B second
const pass1 = await evaluatePair(input.responseA, input.responseB, ...);
// Second pass: B first, A second
const pass2 = await evaluatePair(input.responseB, input.responseA, ...);
// Map pass2 result back and check consistency
const pass2WinnerMapped = pass2.winner === 'A' ? 'B' : pass2.winner === 'B' ? 'A' : 'TIE';
const consistent = pass1.winner === pass2WinnerMapped;
// If inconsistent, return TIE with lower confidence
if (!consistent) {
finalWinner = 'TIE';
finalConfidence = 0.5;
}
}

Key Features:
- Position Swapping: Automatically runs evaluation twice with swapped positions
- Consistency Check: Detects when position affects judgment
- Confidence Scoring: 0-1 confidence based on consistency
- Per-criterion Comparison: Detailed breakdown for each aspect
- Bias-aware Prompting: Explicit instructions to ignore length and position
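The repository's actual prompt template lives in prompts/evaluation/pairwise-comparison-prompt.md (see the project layout above). Purely as an illustration, bias-aware judge instructions might look like:

```typescript
// Illustrative sketch only; not the repository's verbatim prompt.
const pairwiseSystemPrompt = `You are an impartial evaluator comparing two responses.

Rules:
1. Judge only against the stated criteria.
2. Do NOT favor a response because it is longer or more verbose.
3. Do NOT favor a response because of the order in which it appears.
4. For each criterion, cite specific evidence before naming a winner.
5. Declare a TIE when the responses are genuinely comparable.`;
```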
Example Usage:
const result = await executePairwiseCompare({
responseA: GOOD_RESPONSE,
responseB: POOR_RESPONSE,
prompt: 'Explain quantum entanglement',
criteria: ['accuracy', 'clarity', 'completeness', 'engagement'],
allowTie: true,
swapPositions: true // Enable position bias mitigation
});
// Output:
// {
// success: true,
// winner: 'A',
// confidence: 0.85,
// positionConsistency: { consistent: true, firstPassWinner: 'A', secondPassWinner: 'A' },
// comparison: [
// { criterion: 'accuracy', winner: 'A', reasoning: '...' },
// { criterion: 'clarity', winner: 'A', reasoning: '...' },
// ...
// ]
// }

Purpose: Create detailed scoring rubrics for consistent evaluation standards.
When to Use:
- Establishing evaluation criteria
- Training human evaluators
- Ensuring consistency across evaluations
- Documenting quality standards
Implementation Highlights:
// Strictness affects the generated rubric:
// - lenient: Lower bar for passing scores
// - balanced: Fair, typical expectations
// - strict: High standards, critical evaluation
const userPrompt = `Create a scoring rubric for:
**Criterion**: ${input.criterionName}
**Description**: ${input.criterionDescription}
**Scale**: ${input.scale}
**Domain**: ${input.domain}
Generate:
1. Clear descriptions for each score level
2. Specific characteristics that define each level
3. Brief example text for each level
4. General scoring guidelines
5. Edge cases with guidance`;

Key Features:
- Domain-specific terminology
- Configurable strictness levels
- Example generation for each level
- Edge case guidance
- Scoring guidelines
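The output structure shown in the example below could be validated with a Zod schema roughly like the following (a sketch with field names taken from the example, not the repository's exact schema):

```typescript
import { z } from 'zod';

// Sketch of an output schema matching the example result shown below.
const RubricLevelSchema = z.object({
  score: z.number().int(),
  label: z.string(),              // e.g. 'Poor', 'Good', 'Excellent'
  description: z.string(),
  characteristics: z.array(z.string()),
  example: z.string().optional(), // only present when includeExamples is true
});

const GenerateRubricOutputSchema = z.object({
  success: z.boolean(),
  levels: z.array(RubricLevelSchema),
  scoringGuidelines: z.array(z.string()),
  edgeCases: z.array(z.object({ situation: z.string(), guidance: z.string() })),
});
```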
Example Usage:
const result = await executeGenerateRubric({
criterionName: 'Code Readability',
criterionDescription: 'How easy the code is to understand and maintain',
scale: '1-5',
domain: 'software engineering',
includeExamples: true,
strictness: 'balanced'
});
// Output:
// {
// success: true,
// levels: [
// { score: 1, label: 'Poor', description: '...', characteristics: [...], example: '...' },
// { score: 2, label: 'Below Average', ... },
// { score: 3, label: 'Average', ... },
// { score: 4, label: 'Good', ... },
// { score: 5, label: 'Excellent', ... }
// ],
// scoringGuidelines: [...],
// edgeCases: [{ situation: '...', guidance: '...' }]
// }

Purpose: High-level agent that combines all evaluation tools with conversational capability.
Implementation:
export class EvaluatorAgent {
private model: string;
private temperature: number;
constructor(config?: EvaluatorAgentConfig) {
this.model = config?.model ?? 'gpt-5.2';
this.temperature = config?.temperature ?? 0.3;
}
// Score a response
async score(input: DirectScoreInput) { ... }
// Compare two responses
async compare(input: PairwiseCompareInput) { ... }
// Generate a rubric
async generateRubric(input: GenerateRubricInput) { ... }
// Full workflow: generate rubric then score
async evaluateWithGeneratedRubric(response, prompt, criteria) { ... }
// Chat-based evaluation
async chat(userMessage: string) { ... }
}
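The evaluateWithGeneratedRubric method is the one capability not demonstrated elsewhere in this README; a usage sketch (the exact return shape is an assumption) might look like:

```typescript
import { EvaluatorAgent } from './src/agents/evaluator';

const agent = new EvaluatorAgent({ model: 'gpt-5.2', temperature: 0.3 });

// Full workflow: the agent first generates a rubric for the criteria,
// then scores the response against it.
const result = await agent.evaluateWithGeneratedRubric(
  'Your AI-generated response',
  'The original prompt',
  [{ name: 'Accuracy', description: 'Factual correctness', weight: 1 }]
);

console.log(result);
```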
> [email protected] test
> vitest run --testTimeout=120000
RUN v2.1.9 /Users/muratcankoylan/app_readwren
✓ tests/skills.test.ts (10 tests) 159317ms
✓ LLM Evaluator Skill Tests > Direct Scoring Skill > should use chain-of-thought in scoring 4439ms
✓ LLM Evaluator Skill Tests > Direct Scoring Skill > should handle multiple weighted criteria 7218ms
✓ LLM Evaluator Skill Tests > Pairwise Comparison Skill > should mitigate position bias with swap 13002ms
✓ LLM Evaluator Skill Tests > Pairwise Comparison Skill > should identify clear winner for quality difference 25914ms
✓ LLM Evaluator Skill Tests > Rubric Generation Skill > should generate domain-specific rubrics 37165ms
✓ LLM Evaluator Skill Tests > Rubric Generation Skill > should provide edge case guidance 29088ms
✓ LLM Evaluator Skill Tests > Context Fundamentals Skill Application > should utilize provided context in evaluation 11133ms
✓ Skill Input/Output Validation > should validate DirectScore input schema 4733ms
✓ Skill Input/Output Validation > should validate PairwiseCompare output structure 4123ms
✓ Skill Input/Output Validation > should validate GenerateRubric output structure 22500ms
✓ tests/evaluation.test.ts (9 tests) 216353ms
✓ Direct Score Tool > should score a response against criteria 13219ms
✓ Direct Score Tool > should provide lower scores for poor responses 14834ms
✓ Pairwise Compare Tool > should correctly identify the better response 29254ms
✓ Pairwise Compare Tool > should handle similar responses appropriately 14418ms
✓ Pairwise Compare Tool > should provide comparison details for each criterion 9931ms
✓ Generate Rubric Tool > should generate a complete rubric 24106ms
✓ Generate Rubric Tool > should respect strictness setting 57919ms
✓ Evaluator Agent > should provide integrated evaluation workflow 48112ms
✓ Evaluator Agent > should support chat-based evaluation 4558ms
Test Files 2 passed (2)
Tests 19 passed (19)
Start at 00:25:16
Duration 216.66s (transform 68ms, setup 32ms, collect 148ms, tests 375.67s, environment 0ms, prepare 105ms)
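The long per-test durations reflect real API calls, which is why the run uses the 120-second per-test timeout visible in the command at the top of the log. The same settings could alternatively live in a config file; a minimal sketch, assuming vitest and the tests/setup.ts file from the project layout:

```typescript
// vitest.config.ts (sketch): mirrors the --testTimeout=120000 CLI flag
// and wires up the shared test setup file.
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    testTimeout: 120_000, // LLM calls routinely take tens of seconds
    setupFiles: ['./tests/setup.ts'],
  },
});
```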
| Test Category | Tests | Pass Rate | Avg Duration |
|---|---|---|---|
| Direct Scoring | 4 | 100% | 9.9s |
| Pairwise Comparison | 4 | 100% | 17.9s |
| Rubric Generation | 4 | 100% | 33.2s |
| Context Integration | 1 | 100% | 11.1s |
| Agent Integration | 2 | 100% | 26.3s |
| Schema Validation | 4 | 100% | 8.8s |
During testing, we confirmed Eugene Yan's research findings:
Test: "should mitigate position bias with swap" - 13002ms
Result: Position consistency check correctly detected and mitigated bias
When comparing identical responses, the system correctly returns TIE. When comparing clearly different quality responses, the winner is consistent across position swaps.
Tests confirm that requiring justification produces more reliable evaluations:
Test: "should use chain-of-thought in scoring" - 4439ms
Result: All scores include justifications >20 characters with specific evidence
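A sketch of the kind of assertion behind this finding (the real test in tests/skills.test.ts may be structured differently):

```typescript
import { expect, it } from 'vitest';
import { executeDirectScore } from '../src/tools/evaluation/direct-score';

it('should use chain-of-thought in scoring', async () => {
  const result = await executeDirectScore({
    response: 'Quantum entanglement is like having two magical coins...',
    prompt: 'Explain quantum entanglement to a high school student',
    criteria: [{ name: 'Accuracy', description: 'Scientific correctness', weight: 1 }],
  });

  expect(result.success).toBe(true);
  for (const s of result.scores) {
    expect(s.justification.length).toBeGreaterThan(20); // non-trivial reasoning
    expect(s.evidence.length).toBeGreaterThan(0);       // grounded in the response
  }
});
```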
The rubric generator adapts to the specified domain:
Test: "should generate domain-specific rubrics" - 37165ms
Result: Software engineering rubric included terms like "variable", "function", "comment"
Test: "should handle multiple weighted criteria" - 7218ms
Result: weightedScore differs from overallScore when weights are unequal
The context fundamentals skill proves valuable:
Test: "should utilize provided context in evaluation" - 11133ms
Result: Medical context allowed technical terminology to score well
git clone https://github.com/muratcankoylan/llm-as-judge-skills.git
cd llm-as-judge-skills
npm install

Create a .env file:
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-5.2

Run the tests:

npm test

Quick usage:

import { EvaluatorAgent } from './src/agents/evaluator';
const agent = new EvaluatorAgent();
// Score a response
const scoreResult = await agent.score({
response: 'Your AI-generated response',
prompt: 'The original prompt',
criteria: [
{ name: 'Accuracy', description: 'Factual correctness', weight: 1 }
]
});
console.log(`Score: ${scoreResult.overallScore}/5`);
// Compare two responses
const compareResult = await agent.compare({
responseA: 'First response',
responseB: 'Second response',
prompt: 'The prompt',
criteria: ['quality', 'completeness'],
allowTie: true,
swapPositions: true
});
console.log(`Winner: ${compareResult.winner} (confidence: ${compareResult.confidence})`);

This project is designed to be added to the examples section of the main repository:
Agent-Skills-for-Context-Engineering/
├── skills/
│ ├── context-fundamentals/ # Foundation (referenced by this project)
│ └── tool-design/ # Foundation (referenced by this project)
├── examples/
│ └── llm-as-judge-skills/ # ← This project
│ ├── README.md
│ ├── skills/
│ ├── tools/
│ ├── agents/
│ └── src/
- Skills → Prompts → Tools: Shows the progression from knowledge (MD files) to executable code
- Context Engineering: Applies context fundamentals in evaluation prompts
- Tool Design Patterns: Implements Zod schemas, error handling, and clear interfaces
- Agent Architecture: Uses AI SDK patterns for agent abstraction
interface DirectScoreInput {
response: string; // The response to evaluate
prompt: string; // Original prompt
context?: string; // Additional context
criteria: Array<{
name: string; // Criterion name
description: string; // What it measures
weight: number; // Relative importance (0-1)
}>;
rubric?: {
scale: '1-3' | '1-5' | '1-10';
levelDescriptions?: Record<string, string>;
};
}

interface PairwiseCompareInput {
responseA: string; // First response
responseB: string; // Second response
prompt: string; // Original prompt
context?: string; // Additional context
criteria: string[]; // Comparison aspects
allowTie?: boolean; // Allow tie verdict (default: true)
swapPositions?: boolean; // Mitigate position bias (default: true)
}

interface GenerateRubricInput {
criterionName: string; // Name of criterion
criterionDescription: string; // What it measures
scale?: '1-3' | '1-5' | '1-10';
domain?: string; // Domain for terminology
includeExamples?: boolean; // Generate examples
strictness?: 'lenient' | 'balanced' | 'strict';
}

npm run build # Compile TypeScript
npm run dev # Watch mode
npm test # Run tests
npm run lint # ESLint
npm run format # Prettier
npm run typecheck # Type check

To add a new tool (see the skeleton sketch after this list):
- Create `src/tools/<category>/<tool-name>.ts`
- Define input/output Zod schemas
- Implement the execute function
- Export from `src/tools/<category>/index.ts`
- Add documentation in `tools/<category>/<tool-name>.md`
- Write tests
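As a concrete illustration of these steps, a new tool module might start out like this (hypothetical names and shapes, not part of the repository):

```typescript
// src/tools/<category>/<tool-name>.ts (skeleton sketch with placeholder names)
import { z } from 'zod';

// 1. Define input/output Zod schemas
export const MyToolInputSchema = z.object({
  response: z.string(),
  prompt: z.string(),
});
export type MyToolInput = z.infer<typeof MyToolInputSchema>;

export const MyToolOutputSchema = z.object({
  success: z.boolean(),
  error: z.string().optional(),
});
export type MyToolOutput = z.infer<typeof MyToolOutputSchema>;

// 2. Implement the execute function
export async function executeMyTool(rawInput: MyToolInput): Promise<MyToolOutput> {
  const input = MyToolInputSchema.parse(rawInput);
  try {
    // ...call the judge model with `input` and build the result here...
    return { success: true };
  } catch (err) {
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// 3. Re-export from src/tools/<category>/index.ts, document it in
//    tools/<category>/<tool-name>.md, and add tests under tests/.
```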
MIT License - see LICENSE for details.
- Eugene Yan - LLM-as-a-Judge research
- Vercel AI SDK - Agent patterns and tooling
- Agent Skills for Context Engineering - Foundation framework