|
| 1 | +# Analysis |
| 2 | + |
| 3 | +Generate benchmarks, analyze performance metrics, and visualize the results with BalatroBench. |
| 4 | + |
| 5 | +## Benchmark Generation |
| 6 | + |
| 7 | +### Basic Benchmarking |
| 8 | + |
| 9 | +Run benchmarks to evaluate model performance: |
| 10 | + |
| 11 | +```bash |
| 12 | +# Benchmark current configuration |
| 13 | +balatrollm benchmark |
| 14 | + |
| 15 | +# Benchmark specific model across strategies |
| 16 | +balatrollm --model openai/gpt-oss-120b benchmark |
| 17 | + |
| 18 | +# Benchmark with multiple runs for statistical significance |
| 19 | +balatrollm --runs 20 benchmark |
| 20 | +``` |
| 21 | + |
| 22 | +### Comprehensive Benchmarking |
| 23 | + |
| 24 | +Generate benchmarks across multiple dimensions: |
| 25 | + |
| 26 | +```bash |
| 27 | +# Benchmark all models with default strategy |
| 28 | +make balatrobench |
| 29 | + |
| 30 | +# Benchmark specific strategy across models |
| 31 | +balatrollm --strategy aggressive --runs 15 benchmark |
| 32 | + |
| 33 | +# Benchmark multiple strategies and models |
| 34 | +for strategy in default aggressive; do |
| 35 | +  for model in openai/gpt-oss-20b openai/gpt-oss-120b qwen/qwen3-235b-a22b-2507; do |
| 36 | +    balatrollm --strategy "$strategy" --model "$model" --runs 10 benchmark |
| 37 | +  done |
| 38 | +done |
| 39 | +``` |
| 40 | + |
| 41 | +## Benchmark Results |
| 42 | + |
| 43 | +### Result Structure |
| 44 | + |
| 45 | +Benchmarks are organized hierarchically: |
| 46 | + |
| 47 | +``` |
| 48 | +benchmarks/ |
| 49 | +├── v0.10.0/                      # Version |
| 50 | +│   ├── default/                  # Strategy |
| 51 | +│   │   ├── openrouter/           # Provider |
| 52 | +│   │   │   ├── gpt-oss-20b.json |
| 53 | +│   │   │   └── gpt-oss-120b.json |
| 54 | +│   │   └── leaderboard.json      # Strategy summary |
| 55 | +│   └── aggressive/ |
| 56 | +│       ├── openrouter/ |
| 57 | +│       └── leaderboard.json |
| 58 | +``` |
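Because the layout encodes version, strategy, provider, and model in the path, a result file can be identified without opening it. The following is a minimal sketch, assuming paths are relative and begin with `benchmarks/`; the helper name `parse_result_path` is hypothetical and not part of `balatrollm`.

```bash
# Split a result path into version, strategy, provider, and model.
# Assumes the layout benchmarks/<version>/<strategy>/<provider>/<model>.json.
parse_result_path() {
  printf '%s\n' "$1" | {
    # Split on "/" and discard the leading "benchmarks" component.
    IFS=/ read -r _root version strategy provider file
    # Strip the .json suffix to recover the model name.
    printf '%s\t%s\t%s\t%s\n' "$version" "$strategy" "$provider" "${file%.json}"
  }
}

# Example:
#   parse_result_path benchmarks/v0.10.0/default/openrouter/gpt-oss-20b.json
# prints four tab-separated fields: v0.10.0, default, openrouter, gpt-oss-20b
```

The tab-separated output is convenient to pipe into `sort`, `cut`, or `awk` when collecting results from several strategies.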
| 59 | + |
| 60 | +### Understanding Metrics |
| 61 | + |
| 62 | +Key performance indicators in benchmark results: |
| 63 | + |
| 64 | +- **Win Rate**: Percentage of games won |
| 65 | +- **Average Score**: Mean final score across runs |
| 66 | +- **Consistency**: Standard deviation of scores (lower means more consistent) |
| 67 | +- **Efficiency**: Score per ante progression |
| 68 | +- **Strategy Adherence**: How well the bot follows strategy guidelines |
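As a rough illustration of how the first three indicators relate to raw run data, the sketch below computes win rate, average score, and standard deviation with `awk`. The input format (one run per line: a 0/1 win flag followed by the final score) and the function name `summarize_runs` are assumptions made for this example, not the format `balatrollm` writes.

```bash
# Summarize runs read from stdin, one run per line: "<won 0|1> <final score>".
# Hypothetical input format, chosen only for illustration.
summarize_runs() {
  awk '
    { wins += $1; sum += $2; sumsq += $2 * $2 }
    END {
      if (NR == 0) { print "no runs"; exit 1 }
      mean = sum / NR
      var = sumsq / NR - mean * mean      # population variance
      if (var < 0) var = 0                # guard against rounding error
      printf "runs=%d win_rate=%.1f%% avg_score=%.1f stddev=%.1f\n",
             NR, 100 * wins / NR, mean, sqrt(var)
    }'
}

# Example with eight made-up runs:
printf '%s\n' '0 2000' '0 4000' '0 4000' '0 4000' \
              '1 5000' '1 5000' '1 7000' '1 9000' | summarize_runs
# prints: runs=8 win_rate=50.0% avg_score=5000.0 stddev=2000.0
```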
| 69 | + |
| 70 | +## BalatroBench Integration |
| 71 | + |
| 72 | +### Overview |
| 73 | + |
| 74 | +[BalatroBench](https://s1m0n38.github.io/balatrobench/) is a web-based dashboard for visualizing and comparing LLM performance in Balatro. It provides interactive charts, leaderboards, and detailed analytics. |
| 75 | + |
| 76 | +*[Screenshot placeholder: BalatroBench dashboard showing model comparison]* |
| 77 | + |
| 78 | +### Uploading Results |
| 79 | + |
| 80 | +Integrate your local benchmark results with BalatroBench: |
| 81 | + |
| 82 | +```bash |
| 83 | +# Generate benchmarks locally |
| 84 | +balatrollm --runs 20 benchmark |
| 85 | + |
| 86 | +# Upload to BalatroBench (coming soon) |
| 87 | +balatrollm benchmark --upload |
| 88 | + |
| 89 | +# Or manually copy results to BalatroBench format |
| 90 | +cp benchmarks/v0.10.0/default/leaderboard.json /path/to/balatrobench/data/ |
| 91 | +``` |
| 92 | + |
| 93 | +### Viewing Results |
| 94 | + |
| 95 | +Access comprehensive analytics through the web interface: |
| 96 | + |
| 97 | +1. **Model Comparison**: Side-by-side performance metrics |
| 98 | +2. **Strategy Analysis**: How different strategies perform across models |
| 99 | +3. **Trend Analysis**: Performance changes over time |
| 100 | +4. **Detailed Breakdowns**: Ante-by-ante progression analysis |
| 101 | + |
| 102 | +*[Screenshot placeholder: Model comparison view in BalatroBench]* |
| 103 | + |
| 104 | +## Local Analysis |
| 105 | + |
| 106 | +### Command-Line Analysis |
| 107 | + |
| 108 | +Analyze results directly from the command line: |
| 109 | + |
| 110 | +```bash |
| 111 | +# View latest benchmark summary |
| 112 | +jq . benchmarks/v0.10.0/default/leaderboard.json |
| 113 | + |
| 114 | +# Compare models |
| 115 | +jq '.models[] | {name: .model, win_rate: .metrics.win_rate, avg_score: .metrics.avg_score}' \ |
| 116 | + benchmarks/v0.10.0/default/leaderboard.json |
| 117 | + |
| 118 | +# Find top performer |
| 119 | +jq '.models | sort_by(.metrics.avg_score) | reverse | .[0]' \ |
| 120 | + benchmarks/v0.10.0/default/leaderboard.json |
| 121 | +``` |
| 122 | + |
| 123 | +### Custom Analysis Scripts |
| 124 | + |
| 125 | +Create custom analysis for specific insights: |
| 126 | + |
| 127 | +```bash |
| 128 | +# Calculate average score per run for each model |
| 129 | +find benchmarks -name "*.json" -not -name "leaderboard.json" | \ |
| 130 | + xargs jq -r '[.model, (.total_score / .total_runs)] | @csv' |
| 131 | + |
| 132 | +# Compare strategies for same model |
| 133 | +diff <(jq '.models[] | select(.model=="gpt-oss-20b") | .metrics' \ |
| 134 | + benchmarks/v0.10.0/default/leaderboard.json) \ |
| 135 | + <(jq '.models[] | select(.model=="gpt-oss-20b") | .metrics' \ |
| 136 | + benchmarks/v0.10.0/aggressive/leaderboard.json) |
| 137 | +``` |
| 138 | + |
| 139 | +## Performance Tracking |
| 140 | + |
| 141 | +### Continuous Monitoring |
| 142 | + |
| 143 | +Set up automated benchmarking: |
| 144 | + |
| 145 | +```bash |
| 146 | +#!/bin/bash |
| 147 | +# Daily benchmark script |
| 148 | +DATE=$(date +%Y%m%d) |
| 149 | +balatrollm --runs 5 --runs-dir "daily_benchmarks/$DATE" benchmark |
| 150 | +``` |
| 151 | + |
| 152 | +### Regression Testing |
| 153 | + |
| 154 | +Monitor performance across versions: |
| 155 | + |
| 156 | +```bash |
| 157 | +# Compare current version to previous |
| 158 | +jq '.models[] | {model, current: .metrics.avg_score}' \ |
| 159 | + benchmarks/v0.10.0/default/leaderboard.json > current.json |
| 160 | + |
| 161 | +jq '.models[] | {model, previous: .metrics.avg_score}' \ |
| 162 | + benchmarks/v0.9.0/default/leaderboard.json > previous.json |
| 163 | + |
| 164 | +# Join and compare |
| 165 | +jq -s 'group_by(.model) | map(add)' current.json previous.json |
| 166 | +``` |
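To turn a version comparison into a pass/fail check, the percentage change per model can be computed and compared against a threshold. The sketch below assumes two tab-separated files with one `model<TAB>avg_score` row per model; the function name `compare_scores` and the default 5% threshold are hypothetical choices for illustration.

```bash
# Usage: compare_scores PREVIOUS.tsv CURRENT.tsv [THRESHOLD_PERCENT]
# Each file holds "model<TAB>avg_score" rows. Such files could be produced with,
# for example: jq -r '.models[] | [.model, .metrics.avg_score] | @tsv'
compare_scores() {
  awk -F'\t' -v thr="${3:-5}" '
    # First file: remember the previous score of each model.
    NR == FNR { prev[$1] = $2; next }
    # Second file: report the percent change for models present in both.
    ($1 in prev) && prev[$1] > 0 {
      change = 100 * ($2 - prev[$1]) / prev[$1]
      flag = (change < -thr) ? "REGRESSION" : "ok"
      printf "%s\t%.1f\t%s\n", $1, change, flag
    }' "$1" "$2"
}
```

Models that appear in only one of the two files are skipped silently, so newly added or removed models need a separate check. With few runs per model, a drop beyond the threshold may still be noise rather than a real regression.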
| 167 | + |
| 168 | +## Interpreting Results |
| 169 | + |
| 170 | +### Statistical Significance |
| 171 | + |
| 172 | +Ensure reliable results: |
| 173 | + |
| 174 | +```bash |
| 175 | +# Run sufficient samples for confidence |
| 176 | +balatrollm --runs 30 benchmark # Minimum recommended |
| 177 | + |
| 178 | +# Check variance in results |
| 179 | +jq '.detailed_runs[] | .final_score' benchmarks/latest/model.json | \ |
| 180 | + awk '{sum+=$1; sumsq+=$1*$1} END {print "Mean:", sum/NR, "StdDev:", sqrt((sumsq-sum*sum/NR)/NR)}' |
| 181 | +``` |
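The mean and standard deviation alone do not show how much the average would move if the benchmark were repeated. One common way to express that is a confidence interval for the mean. The sketch below uses the sample standard deviation and the normal approximation (1.96 standard errors), which is an assumption of this example and is only a rough guide for small run counts; the function name `mean_ci95` is hypothetical.

```bash
# Read one score per line from stdin and print the mean with an approximate
# 95% confidence interval (normal approximation, sample variance).
mean_ci95() {
  awk '
    { sum += $1; sumsq += $1 * $1 }
    END {
      if (NR < 2) { print "need at least 2 runs"; exit 1 }
      mean = sum / NR
      var = (sumsq - sum * sum / NR) / (NR - 1)   # sample variance
      if (var < 0) var = 0                        # guard against rounding error
      half = 1.96 * sqrt(var / NR)                # 1.96 standard errors
      printf "mean=%.2f ci95=[%.2f, %.2f]\n", mean, mean - half, mean + half
    }'
}

# Example with four made-up scores:
printf '%s\n' 700 1100 1100 1100 | mean_ci95
# prints: mean=1000.00 ci95=[804.00, 1196.00]
```

If the intervals of two models overlap heavily, more runs are likely needed before the difference between them can be trusted.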
| 182 | + |
| 183 | +### Model Selection Criteria |
| 184 | + |
| 185 | +Choose models based on your priorities: |
| 186 | + |
| 187 | +- **Consistency**: Low standard deviation in scores |
| 188 | +- **Peak Performance**: Highest maximum scores achieved |
| 189 | +- **Win Rate**: Reliability in completing games successfully |
| 190 | +- **Speed**: Faster response times for real-time applications |
| 191 | + |
| 192 | +### Strategy Optimization |
| 193 | + |
| 194 | +Use results to refine strategies: |
| 195 | + |
| 196 | +```bash |
| 197 | +# Identify successful patterns |
| 198 | +jq '.detailed_runs[] | select(.final_score > 8000) | .strategy_decisions' \ |
| 199 | + benchmarks/v0.10.0/aggressive/openrouter/gpt-oss-120b.json |
| 200 | + |
| 201 | +# Find failure modes |
| 202 | +jq '.detailed_runs[] | select(.final_score < 2000) | .failure_reason' \ |
| 203 | + benchmarks/v0.10.0/default/openrouter/gpt-oss-20b.json |
| 204 | +``` |