This guide explains how to use waza for evaluating skills in the microsoft/skills repository.
Waza is a Go CLI tool for running evaluations on AI agent skills. It's designed to integrate seamlessly with the microsoft/skills CI pipeline, allowing skill authors to validate their work before contributing.
- Go 1.26+: Required only if building waza from source
- Git: For cloning repositories
- GitHub Actions (for CI): Standard ubuntu-latest runner
This is the recommended approach for CI/CD pipelines:
# Install latest version
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
# Verify installation
waza --version
Benefits:
- No Docker required
- No Go toolchain required on the runner
- Downloads the latest release binary
- Works on all platforms
If you prefer containerized environments:
# Clone the waza repository
git clone https://github.com/microsoft/waza.git
cd waza
# Build the Docker image
docker build -t waza:local .
# Run waza in a container
docker run -v $(pwd):/workspace waza:local run eval.yaml
Benefits:
- Isolated environment
- Reproducible builds
- No local Go installation needed
For development or local testing:
# Clone the repository
git clone https://github.com/microsoft/waza.git
cd waza
# Build the binary
make build
# Run the binary
./waza --version
Your skill repository should follow this structure:
your-skill/
├── SKILL.md              # Skill definition with frontmatter
├── eval/                 # Evaluation suite (optional but recommended)
│   ├── eval.yaml         # Main benchmark specification
│   ├── tasks/            # Individual task definitions
│   │   ├── task-1.yaml
│   │   └── task-2.yaml
│   └── fixtures/         # Context files for tasks
│       ├── file1.txt
│       └── file2.py
└── .github/
    └── workflows/
        └── eval.yml      # CI workflow for running evals
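The layout above can be checked before running anything. The following is a minimal sketch, not part of waza: the function name `check_layout` and the split into required and recommended paths are assumptions made for illustration, based only on the note that the eval suite is optional but recommended.

```python
from pathlib import Path

# Paths come from the directory layout above. Treating only SKILL.md as
# required is an assumption; the guide marks eval/ as optional.
REQUIRED = ["SKILL.md"]
RECOMMENDED = [
    "eval/eval.yaml",
    "eval/tasks",
    "eval/fixtures",
    ".github/workflows/eval.yml",
]


def check_layout(root):
    """Return (missing_required, missing_recommended) for a skill directory."""
    root = Path(root)
    missing_required = [p for p in REQUIRED if not (root / p).exists()]
    missing_recommended = [p for p in RECOMMENDED if not (root / p).exists()]
    return missing_required, missing_recommended
```

Calling `check_layout("your-skill")` returns two lists of relative paths; an empty first list means the required file is present.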
# Navigate to your skill directory
cd your-skill
# Run the interactive wizard
waza init eval
# Follow the prompts to configure your evaluation
If you have a skill, create the eval suite using waza new:
waza new skill my-skill --output-dir eval
Create eval/eval.yaml:
name: my-skill-eval
skill: my-skill
version: "1.0"
config:
  trials_per_task: 1
  timeout_seconds: 300
  executor: mock # Use mock for CI (no API keys)
  parallel: false
graders:
  - type: text
    name: output_check
    config:
      regex_match: ["expected pattern"]
tasks:
  - "tasks/*.yaml"
Create task files in eval/tasks/:
# eval/tasks/example-task.yaml
id: example-task
name: Example Task
description: Demonstrate the skill
stimulus:
message: "Explain what this code does"
context_files:
- "example.py"
graders:
- output_check
Add context files to eval/fixtures/:
# eval/fixtures/example.py
def hello():
    print("Hello, world!")

# Basic run
waza run eval/eval.yaml --verbose
# Save results to JSON
waza run eval/eval.yaml --output results.json
# Run specific tasks only
waza run eval/eval.yaml --task "example-*"
# Run with parallel execution
waza run eval/eval.yaml --parallel --workers 4
Copy the template workflow to your skill repository:
# From the waza repository
cp .github/workflows/skills-ci-example.yml \
  /path/to/your-skill/.github/workflows/eval.yml
Or download directly:
curl -o .github/workflows/eval.yml \
  https://raw.githubusercontent.com/microsoft/waza/main/.github/workflows/skills-ci-example.yml
Edit .github/workflows/eval.yml:
on:
  pull_request:
    branches: [ main ]
    paths:
      - 'SKILL.md'
      - 'eval/**'
  push:
    branches: [ main ]
jobs:
  evaluate-skill:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Waza
        run: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
      - name: Run Evaluation
        run: waza run eval/eval.yaml --verbose --output results.json
      - name: Upload Results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-results
          path: results.json
For CI, use the mock executor (no API keys needed):
# eval/eval.yaml
config:
  executor: mock # Simulates agent behavior for testing
For production testing with real AI models, use the copilot-sdk executor:
# eval/eval.yaml
config:
  executor: copilot-sdk
  model: claude-sonnet-4-20250514 # or gpt-4o, etc.
For the default Copilot route, set the GITHUB_TOKEN environment variable in your workflow. If your Copilot SDK setup uses a custom provider, configure the provider environment variables instead.
- name: Run Evaluation with Copilot
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: waza run eval/eval.yaml --verbose
Waza uses exit codes to indicate success or failure in CI:
| Exit Code | Meaning | CI Behavior |
|---|---|---|
| 0 | All tests passed | ✅ Workflow succeeds |
| 1 | One or more tests failed | ❌ Workflow fails |
| 2 | Configuration error (invalid YAML, missing files) | ❌ Workflow fails |
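For scripted pipelines, the exit codes in the table can be mapped to messages in a small wrapper. This is a sketch only: `describe_exit_code` and `run_eval` are hypothetical helper names, and the wrapper relies on nothing beyond the three exit codes listed above.

```python
import subprocess

# Meanings copied from the exit code table above.
EXIT_MEANINGS = {
    0: "All tests passed",
    1: "One or more tests failed",
    2: "Configuration error (invalid YAML, missing files)",
}


def describe_exit_code(code):
    """Translate an exit code into the meaning given in the table."""
    return EXIT_MEANINGS.get(code, f"Unexpected exit code {code}")


def run_eval(command):
    """Run a command (list of arguments) and return (exit_code, description)."""
    completed = subprocess.run(command)
    return completed.returncode, describe_exit_code(completed.returncode)
```

A typical call would be `run_eval(["waza", "run", "eval/eval.yaml"])`, which requires waza to be on the PATH.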
Example usage in CI:
# Run evaluation and fail the build if tests fail
waza run eval/eval.yaml || exit $?
# Capture exit code for custom handling
# (GitHub Actions runs bash with -e, so a bare failing command would end the step
# before the code could be read; "||" keeps the step running)
EXIT_CODE=0
waza run eval/eval.yaml || EXIT_CODE=$?
if [ "$EXIT_CODE" -eq 1 ]; then
  echo "Tests failed - review results"
elif [ "$EXIT_CODE" -eq 2 ]; then
  echo "Configuration error - check eval.yaml"
fi
Save results in JSON format for programmatic analysis:
waza run eval/eval.yaml --output results.json
Output structure:
{
  "benchmark": {
    "name": "my-skill-eval",
    "skill": "my-skill",
    "version": "1.0"
  },
  "config": {
    "executor": "mock",
    "model": "mock-model",
    "trials_per_task": 1
  },
  "outcomes": [
    {
      "task_id": "example-task",
      "status": "passed",
      "score": 1.0,
      "grader_results": [...]
    }
  ],
  "summary": {
    "total_tasks": 1,
    "passed": 1,
    "failed": 0,
    "pass_rate": 1.0
  }
}
Capture detailed execution logs:
waza run eval/eval.yaml --transcript-dir transcripts/
Creates one JSON file per task execution in transcripts/.
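Since each task execution produces its own JSON file, the transcripts can be loaded in bulk for inspection. The sketch below makes two assumptions not stated in this guide: that the files carry a .json extension and that they sit directly inside the transcript directory. It does not assume anything about the transcript schema, and `load_transcripts` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_transcripts(directory):
    """Return {file name: parsed JSON} for every .json file in the directory."""
    transcripts = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            transcripts[path.name] = json.load(handle)
    return transcripts
```

`load_transcripts("transcripts")` returns a dictionary keyed by file name, which is a convenient starting point for exploring what each transcript contains.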
Waza supports multiple grader types for validating agent output:
| Grader | Purpose | Example Use Case |
|---|---|---|
| code | Python/JavaScript assertions | Validate data structures |
| regex | Pattern matching | Check output format |
| file | File existence/content | Verify generated files |
| behavior | Agent behavior constraints | Limit tool calls, duration |
| action_sequence | Tool call sequence validation | Verify workflow steps |
See docs/GRADERS.md for complete documentation.
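To make the pattern-matching idea concrete, the sketch below shows what such a check amounts to conceptually. It is not waza's implementation: the helper name `matches_all` is hypothetical, and the rule that every pattern must be found is an assumption. docs/GRADERS.md remains the authority on actual grader behavior.

```python
import re


def matches_all(output, patterns):
    """Return True if every regex pattern is found somewhere in the output."""
    return all(re.search(pattern, output) for pattern in patterns)


# Example: checking the text printed by the fixture shown earlier.
print(matches_all("Hello, world!", [r"Hello", r"world!$"]))  # True
print(matches_all("Hello, world!", [r"Goodbye"]))            # False
```

Trying patterns this way against a sample of expected output can help catch a regex that is too strict before it is placed in eval.yaml.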
The mock executor runs instantly without API calls:
config:
  executor: mock
Use this for:
- Pull request validation
- Quick local testing
- Grader validation
Always run locally first:
waza run eval/eval.yaml --verbose
This catches configuration errors before pushing.
Include version in your eval.yaml:
version: "1.0"
Update the version when making significant changes.
id: fix-authentication-bug # Good
id: task-1 # Bad
Add clear descriptions:
description: |
  The agent should identify the authentication bug in auth.py
  and provide a fix that preserves backward compatibility.
Use minimal context files to reduce token usage:
- Include only relevant code
- Remove comments and boilerplate
- Use snippets instead of full files
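One way to act on these points is to flag fixtures that have grown large. The following is a rough sketch: `large_fixtures` is a hypothetical helper, and the 4096-byte default threshold is an arbitrary assumption. File size is only a proxy for token usage.

```python
from pathlib import Path


def large_fixtures(fixtures_dir, max_bytes=4096):
    """Return [(relative path, size in bytes)] for files above max_bytes, largest first."""
    root = Path(fixtures_dir)
    sizes = [
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    ]
    oversized = [item for item in sizes if item[1] > max_bytes]
    return sorted(oversized, key=lambda item: item[1], reverse=True)
```

Running `large_fixtures("eval/fixtures")` lists candidates for trimming down to a snippet.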
Ensure your eval file is at eval/eval.yaml or update the workflow:
- run: waza run path/to/your/eval.yaml
Check your YAML syntax:
# Validate YAML
waza run eval/eval.yaml --verbose
Common issues:
- Incorrect indentation
- Missing required fields
- Invalid task references
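Invalid task references in particular can be caught with a quick check that each task pattern matches at least one file. This sketch assumes patterns such as "tasks/*.yaml" are resolved relative to the directory containing eval.yaml, which this guide implies but does not state. The patterns are passed in by hand to avoid depending on a YAML parser, and `unmatched_task_patterns` is a hypothetical helper name.

```python
import glob
import os


def unmatched_task_patterns(eval_dir, patterns):
    """Return the patterns that match no file when resolved against eval_dir."""
    return [
        pattern
        for pattern in patterns
        if not glob.glob(os.path.join(eval_dir, pattern))
    ]
```

For the example configuration, `unmatched_task_patterns("eval", ["tasks/*.yaml"])` returns an empty list when at least one task file exists under eval/tasks/.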
Review the results:
waza run eval/eval.yaml --output results.json
cat results.json | jq '.outcomes[] | select(.status == "failed")'
Check:
- Grader expectations match actual output
- Task descriptions are clear
- Fixtures contain necessary context
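Where jq is not available, the same filter over failed outcomes can be written in a few lines of Python. This sketch relies only on the outcomes, task_id, status, and score fields shown in the output structure earlier. The helper names are hypothetical, and the second entry in the sample data (other-task) is invented for illustration.

```python
import json


def load_results(path):
    """Read a results file written with --output."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def failed_outcomes(results):
    """Return the outcome entries whose status is "failed"."""
    return [o for o in results.get("outcomes", []) if o.get("status") == "failed"]


# In-memory sample shaped like the documented output; "other-task" is made up.
sample = {
    "outcomes": [
        {"task_id": "example-task", "status": "passed", "score": 1.0},
        {"task_id": "other-task", "status": "failed", "score": 0.0},
    ]
}
for outcome in failed_outcomes(sample):
    print(outcome["task_id"], outcome["score"])  # other-task 0.0
```

Against a real run, `failed_outcomes(load_results("results.json"))` gives the entries to compare with the grader expectations.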
Check that the runner can reach GitHub releases and that curl, bash, and tar are available. If you choose to build from source instead, install Go 1.26+ and Git LFS before running go build.
name: Evaluate Skill
on:
  pull_request:
    branches: [ main ]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
      - run: waza run eval/eval.yaml --verbose

jobs:
  matrix-eval:
    strategy:
      matrix:
        model: [gpt-4o, claude-sonnet-4-20250514]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
      - run: |
          sed -i "s/model: .*/model: ${{ matrix.model }}/" eval/eval.yaml
          waza run eval/eval.yaml --output results-${{ matrix.model }}.json

on:
  schedule:
    - cron: '0 0 * * *' # Daily at midnight
jobs:
  nightly:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
      - run: waza run eval/eval.yaml --verbose --output nightly-results.json
      - uses: actions/upload-artifact@v4
        with:
          name: nightly-results
          path: nightly-results.json

- Main Documentation: README.md
- Grader Reference: docs/GRADERS.md
- Example Evaluations: examples/
- CI Examples: examples/ci/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Repository: github.com/microsoft/waza