Commit a98e400

Deployed bbc4359 with MkDocs version: 1.6.1
0 parents  commit a98e400

57 files changed

Lines changed: 10820 additions & 0 deletions

.nojekyll

Whitespace-only changes.

404.html

Lines changed: 402 additions & 0 deletions

analysis/index.html

Lines changed: 665 additions & 0 deletions

analysis/index.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
# Analysis

Analyze run data, generate performance benchmarks, and visualize results with BalatroBench.

## Run Data Collection

When you run BalatroLLM, all game data is automatically collected and organized in the `runs` directory. Each run is stored in a hierarchical structure that makes it easy to compare different models, strategies, and game sessions.

```text
runs/
└── v0.13.2/                                                  # Version
    └── default/                                              # Strategy
        └── openai/                                           # Vendor
            └── gpt-oss-20b/                                  # Model
                └── 20251024_120206_331_RedDeck_s1__AAAAAAA/  # Run directory
                    ├── config.json                           # Model configuration and API settings
                    ├── strategy.json                         # Strategy template used
                    ├── stats.json                            # Aggregated performance metrics
                    ├── gamestates.jsonl                      # Game state at each decision point
                    ├── requests.jsonl                        # Prompts sent to the LLM
                    ├── responses.jsonl                       # Model responses and actions
                    ├── run.log                               # Complete text log
                    └── screenshots/                          # PNG images of game states
```

Each run directory contains several files that capture different aspects of the game session. The configuration and strategy files record the setup used for the run. The stats file contains aggregated performance metrics like total rounds completed, token usage, and costs. The three JSONL files log every step of the game, recording game states, LLM prompts, and model responses. The run log provides a complete text record, and the screenshots directory contains PNG images of the game state at each step (when screenshot mode is enabled).
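
The layout above can be read back with a few lines of Python. The following is a minimal sketch, not part of BalatroLLM: `read_jsonl` and `load_run` are hypothetical helper names, and because the schemas of the JSON and JSONL files are not documented here, the files are loaded as generic JSON without assuming any field names. The version, strategy, vendor, and model are taken from the directory hierarchy shown in the tree.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSONL file into a list, one object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def load_run(run_dir):
    """Load the documented files of one run directory into a dict (hypothetical helper)."""
    run_dir = Path(run_dir)
    # Hierarchy from the tree: .../<version>/<strategy>/<vendor>/<model>/<run>
    version, strategy, vendor, model = run_dir.parts[-5:-1]
    run = {
        "version": version,
        "strategy": strategy,
        "vendor": vendor,
        "model": model,
        "run": run_dir.name,
    }
    # Plain JSON documents: loaded as-is, no field names assumed.
    for stem in ("config", "strategy", "stats"):
        path = run_dir / f"{stem}.json"
        run[f"{stem}_json"] = (
            json.loads(path.read_text(encoding="utf-8")) if path.is_file() else None
        )
    # JSONL logs: one record per line; a missing file yields an empty list.
    for stem in ("gamestates", "requests", "responses"):
        path = run_dir / f"{stem}.jsonl"
        run[stem] = read_jsonl(path) if path.is_file() else []
    return run
```

Calling `load_run("runs/v0.13.2/default/openai/gpt-oss-20b/20251024_120206_331_RedDeck_s1__AAAAAAA")` would return the path components together with the parsed files. The path must contain at least the four parent levels shown in the tree, otherwise the unpacking raises a `ValueError`.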

## Benchmark Analysis

The `balatrobench` CLI tool processes run data to generate comprehensive benchmark statistics and leaderboards. Benchmarks can be generated in two different modes depending on what you want to analyze.

### Models

Use the models mode when you want to compare how different models perform within the same strategy. This mode is useful for answering questions like "Which model plays the default strategy best?" or "How do different vendors' models compare on the aggressive strategy?"

```bash
balatrobench --models
```

The results are organized with leaderboards for each strategy, making it easy to identify the top-performing models:

```text
benchmarks/models/
├── manifest.json
└── v0.13.2/                                                  # Version
    └── default/                                              # Strategy
        ├── leaderboard.json                                  # Models ranked for this strategy
        └── openai/                                           # Vendor
            ├── gpt-oss-20b/                                  # Model
            │   └── 20251024_120206_331_RedDeck_s1__AAAAAAA/  # Run
            │       └── request-00001/                        # Individual request
            │           ├── request.md                        # Full LLM prompt
            │           ├── reasoning.md                      # Model reasoning
            │           ├── tool_call.json                    # Action taken
            │           └── screenshot.png                    # Game state
            └── gpt-oss-20b.json                              # Aggregated model statistics
```
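
To inspect the generated leaderboards outside the dashboard, the tree above can be walked directly. This is a minimal sketch under stated assumptions: `find_model_leaderboards` is a hypothetical helper, the default root assumes the command was run from the directory that contains `benchmarks/`, and the content of `leaderboard.json` is returned as parsed JSON because its schema is not described here.

```python
import json
from pathlib import Path


def find_model_leaderboards(root="benchmarks/models"):
    """Map (version, strategy) to the parsed leaderboard.json found there."""
    boards = {}
    # Layout from the tree: <root>/<version>/<strategy>/leaderboard.json
    for path in sorted(Path(root).glob("*/*/leaderboard.json")):
        version, strategy = path.parts[-3:-1]
        boards[(version, strategy)] = json.loads(path.read_text(encoding="utf-8"))
    return boards
```

Iterating over `find_model_leaderboards().items()` gives one entry per version and strategy. If the root directory does not exist, the result is simply an empty dictionary.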

### Strategies

Use the strategies mode when you want to compare how different strategies perform for the same model. This mode helps answer questions like "Does the aggressive strategy work better than the default for GPT-4?" or "Which strategy should I use with Claude?"

```bash
balatrobench --strategies
```

The strategies mode generates leaderboards organized by model, with statistics for each strategy:

```text
benchmarks/strategies/
├── manifest.json
└── v0.13.2/                                                      # Version
    └── openai/                                                   # Vendor
        └── gpt-oss-20b/                                          # Model
            ├── leaderboard.json                                  # Strategies ranked for this model
            ├── default/                                          # Strategy
            │   ├── stats.json                                    # Aggregated statistics
            │   └── gpt-oss-20b/                                  # Run details
            │       └── 20251024_120206_331_RedDeck_s1__AAAAAAA/  # Run
            │           └── request-00001/                        # Individual request
            │               ├── request.md                        # Full LLM prompt
            │               ├── reasoning.md                      # Model reasoning
            │               ├── tool_call.json                    # Action taken
            │               └── screenshot.png                    # Game state
            └── aggressive/                                       # Other strategies
                └── [similar structure]
```

Both modes preserve detailed request-level data including the full LLM prompts, reasoning output, tool calls, and screenshots for in-depth analysis.
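
The request-level data can be enumerated per run in either mode. The sketch below is illustrative only: `iter_requests` and `REQUEST_FILES` are hypothetical names, and the only assumptions are the `request-NNNNN` directory naming and the four file names shown in the trees.

```python
from pathlib import Path

# File names taken from the directory trees in this section.
REQUEST_FILES = ("request.md", "reasoning.md", "tool_call.json", "screenshot.png")


def iter_requests(run_dir):
    """Yield (request_name, {file_name: path}) for each request directory of a run."""
    for req_dir in sorted(Path(run_dir).glob("request-*")):
        if not req_dir.is_dir():
            continue
        # Keep only the files that actually exist; screenshots may be absent.
        present = {
            name: req_dir / name
            for name in REQUEST_FILES
            if (req_dir / name).is_file()
        }
        yield req_dir.name, present
```

Because the request directories are zero-padded, sorting by name also yields them in request order.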

## BalatroBench Integration

[BalatroBench](https://coder.github.io/balatrobench/) is a web-based dashboard for visualizing benchmark results. You can run it locally to explore your data through interactive charts and leaderboards.

First, clone the BalatroBench repository:

```bash
git clone https://github.com/coder/balatrobench.git
```

Next, copy or symlink your benchmark data into the BalatroBench data directory. You can move the benchmarks directly:

```bash
mv benchmarks /path/to/balatrobench/data/benchmarks
```

Or create a symbolic link to keep the data in your BalatroLLM directory:

```bash
ln -s "$(pwd)/benchmarks" /path/to/balatrobench/data/benchmarks
```

Finally, start a local web server to view the dashboard:

```bash
cd /path/to/balatrobench
python3 -m http.server 8001
```

Open your browser to `http://localhost:8001` to explore the interactive visualization of your benchmark results.
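
Before starting the server, it can help to confirm that the moved or symlinked data is where the steps above put it. This is a small sketch, assuming only the `data/benchmarks` location used in the commands above and the `manifest.json` files shown in the benchmark trees; `check_dashboard_data` is a hypothetical helper, and whether the dashboard itself requires these manifests is an assumption.

```python
from pathlib import Path


def check_dashboard_data(balatrobench_dir):
    """Return the benchmark modes that have a manifest.json under data/benchmarks."""
    data_dir = Path(balatrobench_dir) / "data" / "benchmarks"
    # is_file() follows symlinks, so both the mv and the ln -s setups are covered.
    return [
        mode
        for mode in ("models", "strategies")
        if (data_dir / mode / "manifest.json").is_file()
    ]
```

An empty list from `check_dashboard_data("/path/to/balatrobench")` suggests that the benchmarks were not generated yet or that the link points to the wrong directory.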

assets/images/favicon.png

1.83 KB

assets/javascripts/bundle.f55a23d4.min.js

Lines changed: 16 additions & 0 deletions

assets/javascripts/bundle.f55a23d4.min.js.map

Lines changed: 7 additions & 0 deletions

assets/javascripts/lunr/min/lunr.ar.min.js

Lines changed: 1 addition & 0 deletions

assets/javascripts/lunr/min/lunr.da.min.js

Lines changed: 18 additions & 0 deletions

assets/javascripts/lunr/min/lunr.de.min.js

Lines changed: 18 additions & 0 deletions
