A hands-on workshop for product managers who want to understand how to evaluate AI features before shipping them.
You'll evaluate a customer support chatbot by testing two different personalities — polite and concise — against a dataset of real customer complaints. You'll start in the Braintrust UI (no code), then move to a Python eval script, and finally build a multi-turn chat app with production logging.
- Playground eval — Build an eval entirely in the Braintrust UI using the assets in `playground/`
- Code eval — Run the same eval programmatically with `eval_customer_support.py`
- Nondeterminism + trials — See why single-run scores aren't reliable, and how `trial_count` fixes that
- Multi-turn chat — Run `chat_app.py` to generate real conversations and see production traces
- Sign up at braintrust.dev (free tier works)
- Go to Settings → Secrets and add your OpenAI API key
```
pip install -r requirements.txt
export BRAINTRUST_API_KEY="your-api-key"
export OPENAI_API_KEY="your-openai-api-key"
```

Everything you need is in the `playground/` directory.
- Create a new project in Braintrust: "Customer Support Chatbot"
- Go to Datasets → import `playground/customer_complaints.csv`
- Open the Playground, connect the dataset, set the user message to `{{input}}`
- Paste the system prompt from `playground/prompt_a_polite.txt`
- Add a scorer using the prompt from `playground/scorer.txt`
  - Name it "Brand Alignment"
  - Set choice scores: A = 1.0, B = 0.5, C = 0.0
  - Turn on chain-of-thought
- Run, review scores, save as experiment: "Polite Personality"
- Swap the system prompt to `playground/prompt_b_concise.txt`
- Run again, save as experiment: "Concise Personality"
- Compare experiments side by side
- Use Loop to analyze the results:
  - "What is concise failing on? Why is polite scoring better?"
  - "Are there any specific inputs that tend to confuse the scorer more?"
```
python3 eval_customer_support.py
```

This runs the same polite vs. concise comparison as Part 1, but in code. Open the Braintrust UI to see the results.
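Conceptually, a code eval like this loops over the dataset once per personality, generates a reply, and scores it. A network-free toy sketch (hypothetical stand-ins; the real script calls OpenAI and logs to Braintrust):

```python
# Toy sketch of the eval loop, with stand-ins instead of real model calls.
def fake_reply(personality: str, complaint: str) -> str:
    # Stand-in for a chat completion: the polite persona always apologizes.
    prefix = "So sorry about that! " if personality == "polite" else ""
    return prefix + "We'll fix it."

def fake_score(reply: str) -> float:
    # Stand-in for the Brand Alignment judge: reward an apology.
    return 1.0 if "sorry" in reply.lower() else 0.5

complaints = ["My order arrived broken.", "I was double charged."]
for personality in ("polite", "concise"):
    scores = [fake_score(fake_reply(personality, c)) for c in complaints]
    print(personality, sum(scores) / len(scores))
```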
The scores from Part 2 probably won't exactly match Part 1 — that's LLM nondeterminism at work.
Uncomment the `trial_count=3` section at the bottom of `eval_customer_support.py` and run again. Trial averaging gives you more stable, trustworthy scores.
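You can see why trial averaging works with a quick simulation (plain Python, no Braintrust): averaging three draws of a noisy score shrinks the spread compared with a single draw.

```python
import random
import statistics

random.seed(0)
judge_outputs = [1.0, 0.5, 1.0, 1.0, 0.0]  # plausible scores for one input

# Single-run scores: one draw per simulated "experiment".
singles = [random.choice(judge_outputs) for _ in range(1000)]

# Trial-averaged scores: mean of 3 draws per experiment (like trial_count=3).
trios = [statistics.mean(random.choice(judge_outputs) for _ in range(3))
         for _ in range(1000)]

# Averaging over trials reduces the score's spread.
print(statistics.stdev(singles) > statistics.stdev(trios))
```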
```
python3 chat_app.py
```

Have a conversation, then check the production logs in the Braintrust UI. Each conversation is logged as a single trace with nested turn spans.
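The multi-turn structure behind those traces is just an accumulating message list: every user turn and assistant reply is appended to one history, which the app logs as one trace with a span per turn. A toy sketch (hypothetical helper, no Braintrust or OpenAI calls):

```python
# Toy sketch of multi-turn context accumulation. In the real app each
# add_turn would wrap a model call and a logged span.
def add_turn(messages: list, user_text: str, assistant_text: str) -> list:
    """Append one user/assistant exchange to the shared history."""
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": assistant_text})
    return messages

history = [{"role": "system", "content": "You are a polite support agent."}]
add_turn(history, "My order is late.", "I'm so sorry, let me check on that.")
add_turn(history, "It's order #123.", "Thanks! It ships today.")
print(len(history))  # system message + two user/assistant pairs = 5
```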
```
playground/
  customer_complaints.csv    # Dataset (16 customer messages)
  prompt_a_polite.txt        # Polite persona system prompt
  prompt_b_concise.txt       # Concise persona system prompt
  scorer.txt                 # Brand Alignment scorer (A/B/C)
eval_customer_support.py     # Code-based eval with trial_count demo
chat_app.py                  # Multi-turn chat app with Braintrust logging
requirements.txt             # Python dependencies
```
Reach out to Jess!