|
1 | 1 | # Analysis |
2 | 2 |
|
3 | | -Generate comprehensive benchmarks, analyze performance metrics, and integrate with BalatroBench for detailed statistics visualization. |
| 3 | +Analyze run data, generate performance benchmarks, and visualize results with BalatroBench. |
4 | 4 |
|
5 | | -## Benchmark Generation |
| 5 | +## Run Data Collection |
6 | 6 |
|
7 | | -Run benchmarks to evaluate model performance: |
8 | | - |
9 | | -```bash |
10 | | -balatrollm benchmark |
11 | | -``` |
12 | | - |
13 | | -## Benchmark Results |
14 | | - |
15 | | -Benchmarks are organized hierarchically: |
| 7 | +When you run BalatroLLM, all game data is automatically collected and organized in the `runs` directory. Each run is stored in a hierarchical structure that makes it easy to compare different models, strategies, and game sessions. |
16 | 8 |
|
17 | 9 | ``` |
18 | | -benchmarks/ |
19 | | -├── v0.10.0/ # Version |
20 | | -│ ├── default/ # Strategy |
21 | | -│ │ ├── leaderboard.json # Strategy summary with aggregated stats |
22 | | -│ │ └── openai/ # Provider |
23 | | -│ │ ├── gpt-oss-20b.json # Model performance summary |
24 | | -│ │ ├── gpt-oss-120b.json # Model performance summary |
25 | | -│ │ ├── gpt-oss-20b/ # Individual runs for model |
26 | | -│ │ │ ├── 20250922_124308_887_RedDeck_s1__OOOO155/ |
27 | | -│ │ │ │ ├── request-00001/ # Individual LLM request |
28 | | -│ │ │ │ │ ├── reasoning.md # LLM reasoning process |
29 | | -│ │ │ │ │ ├── request.md # Full request sent to LLM |
30 | | -│ │ │ │ │ ├── screenshot.png # Game state screenshot |
31 | | -│ │ │ │ │ └── tool_call.json # Function call details |
32 | | -│ │ │ │ ├── request-00002/ |
33 | | -│ │ │ │ └── ... |
34 | | -│ │ │ └── [other runs] |
35 | | -│ │ └── gpt-oss-120b/ # Individual runs for model |
36 | | -│ │ └── [similar structure] |
37 | | -│ └── aggressive/ # Other strategies |
38 | | -│ └── [similar structure] |
| 10 | +runs/ |
| 11 | +└── v0.13.2/ # Version |
| 12 | + └── default/ # Strategy |
| 13 | + └── openai/ # Vendor |
| 14 | + └── gpt-oss-20b/ # Model |
| 15 | + └── 20251024_120206_331_RedDeck_s1__AAAAAAA/ # Run directory |
| 16 | + ├── config.json # Model configuration and API settings |
| 17 | + ├── strategy.json # Strategy template used |
| 18 | + ├── stats.json # Aggregated performance metrics |
| 19 | + ├── gamestates.jsonl # Game state at each decision point |
| 20 | + ├── requests.jsonl # Prompts sent to the LLM |
| 21 | + ├── responses.jsonl # Model responses and actions |
| 22 | + ├── run.log # Complete text log |
| 23 | + └── screenshots/ # PNG images of game states |
39 | 24 | ``` |
40 | 25 |
|
41 | | -## BalatroBench Integration |
| 26 | +Each run directory contains several files that capture different aspects of the game session. The configuration and strategy files record the setup used for the run. The stats file contains aggregated performance metrics like total rounds completed, token usage, and costs. The three JSONL files log every step of the game, recording game states, LLM prompts, and model responses. The run log provides a complete text record, and the screenshots directory contains PNG images of the game state at each step (when screenshot mode is enabled). |
42 | 27 |
|
43 | | -### Overview |
| 28 | +## Benchmark Analysis |
44 | 29 |
|
45 | | -[BalatroBench](https://coder.github.io/balatrobench/) is a web-based dashboard for visualizing and comparing LLM performance in Balatro. It provides interactive charts, leaderboards, and detailed analytics. |
| 30 | +The `balatrobench` CLI tool processes run data to generate comprehensive benchmark statistics and leaderboards. Benchmarks can be generated in two different modes depending on what you want to analyze. |
46 | 31 |
|
47 | | -### Integrating with BalatroBench |
| 32 | +### Models |
48 | 33 |
|
49 | | -To use BalatroBench as a local dashboard for visualizing your benchmark results: |
| 34 | +Use the models mode when you want to compare how different models perform within the same strategy. This mode is useful for answering questions like "Which model plays the default strategy best?" or "How do different vendors' models compare on the aggressive strategy?" |
50 | 35 |
|
51 | 36 | ```bash |
52 | | -# Step 1: Generate runs with custom output directory |
53 | | -balatrollm --runs-dir example-runs --runs 20 |
54 | | - |
55 | | -# Step 2: Generate benchmark analysis |
56 | | -balatrollm benchmark --runs-dir example-runs --output-dir example-benchmark |
57 | | - |
58 | | -# Step 3: Clone BalatroBench repository |
59 | | -git clone https://github.com/coder/balatrobench.git /path/to/balatrobench |
60 | | - |
61 | | -# Step 4: Move benchmark data to BalatroBench (or create symbolic link) |
62 | | -mv example-benchmark/benchmarks /path/to/balatrobench/data/benchmarks |
63 | | -# OR create a symbolic link: |
64 | | -# ln -s $(pwd)/example-benchmark/benchmarks /path/to/balatrobench/data/benchmarks |
65 | | - |
66 | | -# Step 5: Host BalatroBench locally |
67 | | -cd /path/to/balatrobench |
68 | | -python3 -m http.server 8001 |
69 | | -# Then visit http://localhost:8001 |
| 37 | +balatrobench --models |
70 | 38 | ``` |
71 | 39 |
|
72 | | -## BalatroBench Dashboard Views |
73 | | - |
74 | | -BalatroBench provides three primary views for analyzing LLM performance in Balatro, each offering different levels of detail and insights into model behavior and strategic decision-making. |
75 | | - |
76 | | -### 1. Main Leaderboard and Performance Overview |
77 | | - |
78 | | -The main dashboard presents a comprehensive comparison of all evaluated models, combining visual and tabular representations of performance metrics. |
79 | | - |
80 | | - |
81 | | - |
82 | | - |
83 | | -**Visual Performance Comparison** |
84 | | - |
85 | | -The top section features an interactive bar chart displaying average performance across models, with each bar representing the mean number of rounds achieved. Error bars indicate the standard deviation, providing insight into performance consistency. Models are color-coded for easy identification, with higher-performing models typically shown in more prominent colors (purple for top performers, progressing through gray, green, and blue for lower performers). |
86 | | - |
87 | | -**Detailed Leaderboard Table** |
88 | | - |
89 | | -Below the chart, a comprehensive table provides granular performance metrics for each model: |
90 | | - |
91 | | -- **Model & Vendor**: Lists the specific model name and its provider (x-ai, openai, google, deepseek) |
92 | | - |
93 | | -- **Round Performance**: Shows average rounds achieved with standard deviation, indicating both performance level and consistency |
94 | | - |
95 | | -- **Success Rate Indicators**: Three color-coded percentage columns: |
96 | | - |
97 | | - - **Green (✓)**: Successful round completions |
98 | | - - **Yellow (⚠)**: Partial completions or warnings |
99 | | - - **Red (✗)**: Failed attempts or errors |
100 | | - |
101 | | -- **Financial Metrics**: |
102 | | - |
103 | | - - **In $/¥**: Input token costs with standard deviation |
104 | | - - **Out $/¥**: Output token costs with standard deviation |
105 | | - |
106 | | -- **Performance Timing**: |
107 | | - |
108 | | - - **Duration**: Average time per decision with variability measures |
109 | | - - **Total Cost**: Comprehensive cost analysis including standard deviations |
110 | | - |
111 | | -This view enables researchers to quickly identify top-performing models, understand cost-performance trade-offs, and assess model reliability through consistency metrics. |
112 | | - |
113 | | -### 2. Model Details and Analytics |
114 | | - |
115 | | -Clicking on any model in the leaderboard expands to reveal detailed analytics and run breakdowns for that specific model. |
| 40 | +The results are organized with leaderboards for each strategy, making it easy to identify the top-performing models: |
116 | 41 |
|
117 | | - |
118 | | - |
119 | | - |
120 | | -**Performance Distribution Analysis** |
121 | | - |
122 | | -The expanded view includes three key analytical components: |
123 | | - |
124 | | -**Rounds Distribution Histogram**: A bar chart showing the frequency distribution of rounds achieved across all runs for the selected model. This reveals whether the model's performance is consistent or highly variable, with patterns indicating: |
125 | | - |
126 | | -- Consistent performers show narrow, tall distributions |
127 | | -- Variable performers show wide, scattered distributions |
128 | | -- Multi-modal distributions may indicate different strategic approaches |
129 | | - |
130 | | -**Provider Breakdown**: A donut chart visualizing the proportion of runs from different API providers, useful for understanding data source diversity and potential provider-specific performance variations. |
131 | | - |
132 | | -**Aggregated Statistics Panel**: A summary box displaying key totals: |
133 | | - |
134 | | -- **Input/Output Token Counts**: Total tokens processed across all runs |
135 | | -- **Financial Totals**: Cumulative costs in dollars |
136 | | -- **Total Processing Time**: Aggregate time spent on decision-making |
137 | | - |
138 | | -**Individual Run Analysis Table** |
139 | | - |
140 | | -The bottom section provides a detailed breakdown of individual game runs, with each row representing a complete Balatro session: |
141 | | - |
142 | | -- **Round-by-Round Results**: Shows progression through different game rounds with success/warning/failure indicators |
143 | | -- **Performance Metrics**: Input/output costs and timing for each individual run |
144 | | -- **Success Patterns**: Color-coded indicators help identify at which stages models typically succeed or fail |
145 | | - |
146 | | -This view helps researchers understand model behavior patterns, identify optimal performance conditions, and analyze the relationship between different performance factors. |
147 | | - |
148 | | -### 3. Individual Run Analysis |
149 | | - |
150 | | -The most detailed view opens when clicking on a specific run, providing a step-by-step analysis of the LLM's decision-making process throughout a complete Balatro game session. |
| 42 | +``` |
| 43 | +benchmarks/models/ |
| 44 | +├── manifest.json |
| 45 | +└── v0.13.2/ # Version |
| 46 | + └── default/ # Strategy |
| 47 | + ├── leaderboard.json # Models ranked for this strategy |
| 48 | + └── openai/ # Vendor |
| 49 | + ├── gpt-oss-20b/ # Model |
| 50 | + │ └── 20251024_120206_331_RedDeck_s1__AAAAAAA/ # Run |
| 51 | + │ └── request-00001/ # Individual request |
| 52 | + │ ├── request.md # Full LLM prompt |
| 53 | + │ ├── reasoning.md # Model reasoning |
| 54 | + │ ├── tool_call.json # Action taken |
| 55 | + │ └── screenshot.png # Game state |
| 56 | + └── gpt-oss-20b.json # Aggregated model statistics |
| 57 | +``` |
151 | 58 |
|
152 | | - |
153 | | - |
| 59 | +### Strategies |
154 | 60 |
|
155 | | -**Game State Visualization** |
| 61 | +Use the strategies mode when you want to compare how different strategies perform for the same model. This mode helps answer questions like "Does the aggressive strategy work better than the default for GPT-4?" or "Which strategy should I use with Claude?" |
156 | 62 |
|
157 | | -The left panel displays an actual screenshot of the Balatro game state at the moment of decision, showing: |
| 63 | +```bash |
| 64 | +balatrobench --strategies |
| 65 | +``` |
158 | 66 |
|
159 | | -- **Current Game Phase**: Whether in shop, hand selection, or other game modes |
160 | | -- **Available Options**: Cards, jokers, and other game elements visible to the LLM |
161 | | -- **Resource Status**: Money, joker slots, and other strategic resources |
162 | | -- **Visual Context**: The exact visual information the LLM uses for decision-making |
| 67 | +The strategies mode generates leaderboards organized by model, with statistics for each strategy: |
163 | 68 |
|
164 | | -**Strategic Analysis Panel** |
| 69 | +``` |
| 70 | +benchmarks/strategies/ |
| 71 | +├── manifest.json |
| 72 | +└── v0.13.2/ # Version |
| 73 | + └── openai/ # Vendor |
| 74 | + └── gpt-oss-20b/ # Model |
| 75 | + ├── leaderboard.json # Strategies ranked for this model |
| 76 | + ├── default/ # Strategy |
| 77 | + │ ├── stats.json # Aggregated statistics |
| 78 | + │ └── gpt-oss-20b/ # Run details |
| 79 | + │ └── 20251024_120206_331_RedDeck_s1__AAAAAAA/ # Run |
| 80 | + │ └── request-00001/ # Individual request |
| 81 | + │ ├── request.md # Full LLM prompt |
| 82 | + │ ├── reasoning.md # Model reasoning |
| 83 | + │ ├── tool_call.json # Action taken |
| 84 | + │ └── screenshot.png # Game state |
| 85 | + └── aggressive/ # Other strategies |
| 86 | + └── [similar structure] |
| 87 | +``` |
165 | 88 |
|
166 | | -The right panel provides comprehensive insight into the LLM's reasoning process: |
| 89 | +Both modes preserve detailed request-level data including the full LLM prompts, reasoning output, tool calls, and screenshots for in-depth analysis. |
167 | 90 |
|
168 | | -**Contextual Situation Analysis**: A detailed text description explaining: |
| 91 | +## BalatroBench Integration |
169 | 92 |
|
170 | | -- Current game state and available options |
171 | | -- Strategic considerations and constraints |
172 | | -- Resource management situation |
173 | | -- Previous game history and context |
| 93 | +[BalatroBench](https://coder.github.io/balatrobench/) is a web-based dashboard for visualizing benchmark results. You can run it locally to explore your data through interactive charts and leaderboards. |
174 | 94 |
|
175 | | -**LLM Reasoning Process**: The model's internal reasoning, showing: |
| 95 | +First, clone the BalatroBench repository: |
176 | 96 |
|
177 | | -- Strategic analysis of available options |
178 | | -- Cost-benefit calculations |
179 | | -- Risk assessment considerations |
180 | | -- Long-term planning thoughts |
181 | | -- Decision rationale and justification |
| 97 | +```bash |
| 98 | +git clone https://github.com/coder/balatrobench.git |
| 99 | +``` |
182 | 100 |
|
183 | | -**Function Call Details**: Technical information about the executed action: |
| 101 | +Next, move or symlink your benchmark data into the BalatroBench data directory. You can move the benchmarks directly:
184 | 102 |
|
185 | | -- **Function Name**: The specific game action taken (e.g., "shop", "select_hand") |
186 | | -- **Parameters**: Exact arguments passed to the game engine |
187 | | -- **Action Description**: Human-readable explanation of what the model decided to do |
188 | | -- **Strategic Reasoning**: Why this particular action was chosen |
| 103 | +```bash |
| 104 | +mv benchmarks /path/to/balatrobench/data/benchmarks |
| 105 | +``` |
189 | 106 |
|
190 | | -**Navigation and Analysis Tools** |
| 107 | +Or create a symbolic link to keep the data in your BalatroLLM directory: |
191 | 108 |
|
192 | | -- **Step Navigation**: Arrow controls allow researchers to move through the chronological sequence of decisions within a single game run |
193 | | -- **Request Numbering**: Clear labeling of each decision point for reference and analysis |
194 | | -- **Modal Interface**: Clean overlay design that allows easy comparison with the main dashboard |
| 109 | +```bash |
| 110 | +ln -s "$(pwd)/benchmarks" /path/to/balatrobench/data/benchmarks
| 111 | +``` |
195 | 112 |
|
196 | | -This detailed view enables researchers to: |
| 113 | +Finally, start a local web server to view the dashboard: |
197 | 114 |
|
198 | | -- Understand exactly how LLMs interpret visual game states |
199 | | -- Analyze the quality and depth of strategic reasoning |
200 | | -- Identify decision-making patterns and potential improvements |
201 | | -- Debug specific failure modes or suboptimal choices |
202 | | -- Study the relationship between reasoning quality and performance outcomes |
| 115 | +```bash |
| 116 | +cd /path/to/balatrobench |
| 117 | +python3 -m http.server 8001 |
| 118 | +``` |
203 | 119 |
|
204 | | -The combination of visual game state, strategic reasoning, and technical execution details provides a complete picture of LLM behavior in complex, multi-step decision-making scenarios. |
| 120 | +Open your browser to `http://localhost:8001` to explore the interactive visualization of your benchmark results. |
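
The following is a minimal sketch, not part of this change, of how the `runs/` layout documented in the new "Run Data Collection" section could be read programmatically. It assumes only what the diff states: the `runs/<version>/<strategy>/<vendor>/<model>/<run>/` nesting, a `stats.json` file per run, and a `gamestates.jsonl` file with one record per decision point. The function name `collect_run_stats` and the record keys are hypothetical, and the contents of `stats.json` are passed through untouched because the diff does not list its field names.

```python
import json
from pathlib import Path


def collect_run_stats(runs_dir):
    """Return one record per run directory that contains a stats.json file."""
    root = Path(runs_dir)
    records = []
    # Five directory levels below the root: version/strategy/vendor/model/run
    for stats_path in sorted(root.glob("*/*/*/*/*/stats.json")):
        run_dir = stats_path.parent
        version, strategy, vendor, model, run = run_dir.relative_to(root).parts

        with stats_path.open(encoding="utf-8") as fh:
            stats = json.load(fh)

        # Count non-empty lines in gamestates.jsonl, if the file is present
        decisions = None
        gamestates = run_dir / "gamestates.jsonl"
        if gamestates.is_file():
            with gamestates.open(encoding="utf-8") as fh:
                decisions = sum(1 for line in fh if line.strip())

        records.append(
            {
                "version": version,
                "strategy": strategy,
                "vendor": vendor,
                "model": model,
                "run": run,
                "decisions": decisions,
                "stats": stats,  # raw stats.json contents, no field names assumed
            }
        )
    return records


if __name__ == "__main__":
    for record in collect_run_stats("runs"):
        print(record["model"], record["strategy"], record["run"], record["decisions"])
```

Run from the directory that contains `runs/`, this prints one line per run. Grouping the returned records by `model` or `strategy` gives a rough, hand-rolled counterpart to the two `balatrobench` modes, although the official leaderboards may be computed differently.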