# PromptCheck

Pytest for your LLM prompts. Catch regressions before production.

PromptCheck lets you write test suites for LLM prompts in YAML, run them against any model, and get deterministic pass/fail results with rich terminal output. Think of it as unit testing for AI — define expected behaviors, assert on outputs, and catch regressions before they ship.

## Features
- YAML test definitions — readable, version-controllable test files
- Multi-provider support — OpenAI, Anthropic, and Ollama out of the box
- Rich assertion library — contains, equals, regex, length, JSON schema, JSON path, semantic similarity, LLM-as-judge, latency, and cost assertions
- Jinja2 prompt templates — parameterize prompts with `{{variables}}`
- Detailed failure reports — see exactly what failed and why
- Multiple output formats — terminal (Rich), JSON, and HTML reports
- Cost and latency tracking — built-in performance assertions
- Tag-based filtering — run specific test subsets with `--tag`
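Prompt templates are rendered before each test case runs. As a rough illustration of what `{{variables}}` substitution does — a toy re-implementation for clarity, not PromptCheck's internals, which use Jinja2 and therefore also support filters, loops, and conditionals:

```python
import re

def render_prompt(template: str, variables: dict) -> str:
    """Toy stand-in for Jinja2 {{variable}} substitution.

    Handles only plain {{name}} placeholders, with optional
    surrounding whitespace inside the braces.
    """
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables[m.group(1)]),
        template,
    )

prompt = render_prompt(
    "Classify the sentiment.\nText: {{input}}",
    {"input": "I love this product!"},
)
print(prompt)
```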
## Installation

```bash
# From source
git clone https://github.com/aymenhmaidiwastaken/promptcheck.git
cd promptcheck
pip install -e .

# With semantic similarity support
pip install -e ".[semantic]"
```

## Quick Start

Create a prompt template:

```text
# prompts/sentiment.txt
Classify the sentiment of the following text as positive, negative, or neutral.
Respond with a single word.

Text: {{input}}
```
Write a test file:

```yaml
# tests/sentiment.test.yaml
name: Sentiment Analysis
prompt: prompts/sentiment.txt
model: openai:gpt-4o-mini

tests:
  - name: Positive sentiment
    input: "I love this product! Best purchase ever!"
    assert:
      - type: contains
        value: "positive"

  - name: Negative sentiment
    input: "Terrible experience. Complete waste of money."
    assert:
      - type: contains
        value: "negative"

  - name: Response is concise
    input: "Pretty good overall, would recommend."
    assert:
      - type: length
        max: 20
```

Run the suite:

```bash
promptcheck run tests/
```

## Assertion Types

| Type | Description | Config |
|---|---|---|
| `contains` | Output contains substring | `value`, `case_insensitive` |
| `not_contains` | Output does not contain substring | `value` |
| `equals` | Output exactly matches | `value`, `strip` |
| `regex` | Output matches pattern | `value` |
| `length` | Output length within bounds | `min`, `max` |
| `json_schema` | Output validates against schema | `schema` |
| `json_path` | JSON path returns expected value | `path`, `value` |
| `semantic` | Semantic similarity above threshold | `value`, `threshold` |
| `llm_judge` | LLM evaluates output quality | `criteria`, `model` |
| `latency` | Response time under limit | `max_ms` |
| `cost` | API cost under limit | `max_cost` |
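Multiple assertions can be attached to one test case, and all of them must pass. A hypothetical test case combining structural, content, and performance checks (the config keys come from the table above; the field values are purely illustrative):

```yaml
- name: Returns valid JSON quickly
  input: "Summarize: the product arrived late but works well."
  assert:
    - type: json_schema
      schema:
        type: object
        required: [sentiment]
    - type: json_path
      path: "$.sentiment"
      value: "neutral"
    - type: latency
      max_ms: 2000
```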
## Models

Configure which LLM to test against using the `model` field:

```yaml
model: openai:gpt-4o-mini                  # OpenAI
model: anthropic:claude-sonnet-4-20250514  # Anthropic
model: ollama:llama3                       # Ollama (local)
```

Set API keys as environment variables:

```bash
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
```

## CLI Usage

```bash
# Run all tests in a directory
promptcheck run tests/

# Run with a different model
promptcheck run tests/ --model openai:gpt-4o

# Filter by tags
promptcheck run tests/ --tag sentiment

# Output as JSON
promptcheck run tests/ --output json

# Generate HTML report
promptcheck run tests/ --output html

# Initialize a new test file
promptcheck init

# Show version
promptcheck version
```

## Project Structure

```text
promptcheck/
  cli.py                 # Typer CLI commands
  config.py              # Configuration loading
  core/
    loader.py            # YAML test file parser
    executor.py          # Test case execution engine
    runner.py            # Test suite orchestrator
    result.py            # Result data structures
  assertions/
    registry.py          # Assertion type registry
    string.py            # contains, equals, regex, length
    json_assertions.py   # json_schema, json_path
    semantic.py          # Semantic similarity
    llm_judge.py         # LLM-as-judge evaluation
    performance.py       # latency, cost
  providers/
    registry.py          # Provider registry
    openai.py            # OpenAI provider
    anthropic.py         # Anthropic provider
    ollama.py            # Ollama provider
  reporters/
    terminal.py          # Rich terminal output
    json_reporter.py     # JSON file output
    html.py              # HTML report generation
```
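Given the `assertions/registry.py` layout, the assertion registry is presumably a mapping from assertion type names (as written in the YAML `type` field) to checker functions. A hypothetical sketch of that pattern — names and signatures are illustrative, not PromptCheck's actual API:

```python
from typing import Callable

# Hypothetical registry sketch; PromptCheck's real code may differ.
ASSERTIONS: dict[str, Callable[..., bool]] = {}

def register(name: str) -> Callable:
    """Decorator that files a checker under its assertion type name."""
    def decorator(fn: Callable) -> Callable:
        ASSERTIONS[name] = fn
        return fn
    return decorator

@register("contains")
def check_contains(output: str, value: str, case_insensitive: bool = False) -> bool:
    if case_insensitive:
        return value.lower() in output.lower()
    return value in output

@register("length")
def check_length(output: str, min: int = 0, max: int = 10**9) -> bool:
    return min <= len(output) <= max

# Dispatch an assertion from a parsed YAML config dict.
cfg = {"type": "contains", "value": "positive"}
passed = ASSERTIONS[cfg["type"]]("The sentiment is positive.", value=cfg["value"])
```

A registry like this keeps the executor decoupled from individual assertion types: adding a new assertion is just another `@register(...)` function, with no changes to the dispatch code.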
## License

MIT
