Commit 8c78a9b

docs: rewrite docs from scratch
1 parent 336816e commit 8c78a9b

4 files changed

Lines changed: 192 additions & 465 deletions

docs/analysis.md

Lines changed: 86 additions & 170 deletions
@@ -1,204 +1,120 @@
11
# Analysis
22

3-
Generate comprehensive benchmarks, analyze performance metrics, and integrate with BalatroBench for detailed statistics visualization.
3+
Analyze run data, generate performance benchmarks, and visualize results with BalatroBench.
44

5-
## Benchmark Generation
5+
## Run Data Collection
66

7-
Run benchmarks to evaluate model performance:
8-
9-
```bash
10-
balatrollm benchmark
11-
```
12-
13-
## Benchmark Results
14-
15-
Benchmarks are organized hierarchically:
7+
When you run BalatroLLM, all game data is automatically collected and organized in the `runs` directory. Each run is stored in a hierarchical structure that makes it easy to compare different models, strategies, and game sessions.
168

179
```
18-
benchmarks/
19-
├── v0.10.0/ # Version
20-
│   ├── default/ # Strategy
21-
│   │   ├── leaderboard.json # Strategy summary with aggregated stats
22-
│   │   └── openai/ # Provider
23-
│   │       ├── gpt-oss-20b.json # Model performance summary
24-
│   │       ├── gpt-oss-120b.json # Model performance summary
25-
│   │       ├── gpt-oss-20b/ # Individual runs for model
26-
│   │       │   ├── 20250922_124308_887_RedDeck_s1__OOOO155/
27-
│   │       │   │   ├── request-00001/ # Individual LLM request
28-
│   │       │   │   │   ├── reasoning.md # LLM reasoning process
29-
│   │       │   │   │   ├── request.md # Full request sent to LLM
30-
│   │       │   │   │   ├── screenshot.png # Game state screenshot
31-
│   │       │   │   │   └── tool_call.json # Function call details
32-
│   │       │   │   ├── request-00002/
33-
│   │       │   │   └── ...
34-
│   │       │   └── [other runs]
35-
│   │       └── gpt-oss-120b/ # Individual runs for model
36-
│   │           └── [similar structure]
37-
│   └── aggressive/ # Other strategies
38-
│       └── [similar structure]
10+
runs/
11+
└── v0.13.2/ # Version
12+
    └── default/ # Strategy
13+
        └── openai/ # Vendor
14+
            └── gpt-oss-20b/ # Model
15+
                └── 20251024_120206_331_RedDeck_s1__AAAAAAA/ # Run directory
16+
                    ├── config.json # Model configuration and API settings
17+
                    ├── strategy.json # Strategy template used
18+
                    ├── stats.json # Aggregated performance metrics
19+
                    ├── gamestates.jsonl # Game state at each decision point
20+
                    ├── requests.jsonl # Prompts sent to the LLM
21+
                    ├── responses.jsonl # Model responses and actions
22+
                    ├── run.log # Complete text log
23+
                    └── screenshots/ # PNG images of game states
3924
```
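
As a rough illustration of how this layout can be consumed, the sketch below lists every run under a `runs` directory by reading the five path levels shown in the tree (version, strategy, vendor, model, run). The `RunInfo` and `list_runs` names are hypothetical helpers written for this example; they are not part of BalatroLLM.

```python
from pathlib import Path
from typing import NamedTuple


class RunInfo(NamedTuple):
    """Hypothetical record holding the five path levels of one run."""
    version: str
    strategy: str
    vendor: str
    model: str
    run_id: str


def list_runs(runs_dir: Path) -> list[RunInfo]:
    """Return one RunInfo per run directory found under runs_dir."""
    runs = []
    # Five levels below the root: version/strategy/vendor/model/run
    for run_path in sorted(runs_dir.glob("*/*/*/*/*")):
        if run_path.is_dir():
            runs.append(RunInfo(*run_path.relative_to(runs_dir).parts))
    return runs
```

Calling `list_runs(Path("runs"))` would then give a flat list that can be grouped by `model` or `strategy` before comparing sessions.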
4025

41-
## BalatroBench Integration
26+
Each run directory contains several files that capture different aspects of the game session. The configuration and strategy files record the setup used for the run. The stats file contains aggregated performance metrics like total rounds completed, token usage, and costs. The three JSONL files log every step of the game, recording game states, LLM prompts, and model responses. The run log provides a complete text record, and the screenshots directory contains PNG images of the game state at each step (when screenshot mode is enabled).
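
A minimal sketch of inspecting a single run directory, assuming only what is described above: `stats.json` is a JSON document and the three `.jsonl` files hold one JSON value per line. The field names inside these files are not documented here, so the example loads them without interpreting any keys. `read_jsonl` and `summarize_run` are hypothetical helper names.

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list:
    """Parse a JSON Lines file: one JSON value per non-empty line."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def summarize_run(run_dir: Path) -> dict:
    """Load stats.json and count the records in each per-step log."""
    stats = json.loads((run_dir / "stats.json").read_text(encoding="utf-8"))
    counts = {}
    for name in ("gamestates", "requests", "responses"):
        path = run_dir / f"{name}.jsonl"
        # A missing log is reported as zero records rather than an error.
        counts[name] = len(read_jsonl(path)) if path.exists() else 0
    return {"stats": stats, "counts": counts}
```

Comparing the three counts is a quick sanity check that a run logged a response for every request.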
4227

43-
### Overview
28+
## Benchmark Analysis
4429

45-
[BalatroBench](https://coder.github.io/balatrobench/) is a web-based dashboard for visualizing and comparing LLM performance in Balatro. It provides interactive charts, leaderboards, and detailed analytics.
30+
The `balatrobench` CLI processes run data to generate benchmark statistics and leaderboards. It has two modes, `--models` and `--strategies`, depending on what you want to compare.
4631

47-
### Integrating with BalatroBench
32+
### Models
4833

49-
To use BalatroBench as a local dashboard for visualizing your benchmark results:
34+
Use the models mode when you want to compare how different models perform within the same strategy. This mode is useful for answering questions like "Which model plays the default strategy best?" or "How do different vendors' models compare on the aggressive strategy?"
5035

5136
```bash
52-
# Step 1: Generate runs with custom output directory
53-
balatrollm --runs-dir example-runs --runs 20
54-
55-
# Step 2: Generate benchmark analysis
56-
balatrollm benchmark --runs-dir example-runs --output-dir example-benchmark
57-
58-
# Step 3: Clone BalatroBench repository
59-
git clone https://github.com/coder/balatrobench.git /path/to/balatrobench
60-
61-
# Step 4: Move benchmark data to BalatroBench (or create symbolic link)
62-
mv example-benchmark/benchmarks /path/to/balatrobench/data/benchmarks
63-
# OR create a symbolic link:
64-
# ln -s $(pwd)/example-benchmark/benchmarks /path/to/balatrobench/data/benchmarks
65-
66-
# Step 5: Host BalatroBench locally
67-
cd /path/to/balatrobench
68-
python3 -m http.server 8001
69-
# Then visit http://localhost:8001
37+
balatrobench --models
7038
```
7139

72-
## BalatroBench Dashboard Views
73-
74-
BalatroBench provides three primary views for analyzing LLM performance in Balatro, each offering different levels of detail and insights into model behavior and strategic decision-making.
75-
76-
### 1. Main Leaderboard and Performance Overview
77-
78-
The main dashboard presents a comprehensive comparison of all evaluated models, combining visual and tabular representations of performance metrics.
79-
80-
![Main Leaderboard - Light Theme](assets/balatrobench-light-1.png#only-light)
81-
![Main Leaderboard - Dark Theme](assets/balatrobench-dark-1.png#only-dark)
82-
83-
**Visual Performance Comparison**
84-
85-
The top section features an interactive bar chart displaying average performance across models, with each bar representing the mean number of rounds achieved. Error bars indicate the standard deviation, providing insight into performance consistency. Models are color-coded for easy identification, with higher-performing models typically shown in more prominent colors (purple for top performers, progressing through gray, green, and blue for lower performers).
86-
87-
**Detailed Leaderboard Table**
88-
89-
Below the chart, a comprehensive table provides granular performance metrics for each model:
90-
91-
- **Model & Vendor**: Lists the specific model name and its provider (x-ai, openai, google, deepseek)
92-
93-
- **Round Performance**: Shows average rounds achieved with standard deviation, indicating both performance level and consistency
94-
95-
- **Success Rate Indicators**: Three color-coded percentage columns:
96-
97-
    - **Green (✓)**: Successful round completions
98-
    - **Yellow (⚠)**: Partial completions or warnings
99-
    - **Red (✗)**: Failed attempts or errors
100-
101-
- **Financial Metrics**:
102-
103-
    - **In $/¥**: Input token costs with standard deviation
104-
    - **Out $/¥**: Output token costs with standard deviation
105-
106-
- **Performance Timing**:
107-
108-
    - **Duration**: Average time per decision with variability measures
109-
    - **Total Cost**: Comprehensive cost analysis including standard deviations
110-
111-
This view enables researchers to quickly identify top-performing models, understand cost-performance trade-offs, and assess model reliability through consistency metrics.
112-
113-
### 2. Model Details and Analytics
114-
115-
Clicking on any model in the leaderboard expands to reveal detailed analytics and run breakdowns for that specific model.
40+
The results are organized with leaderboards for each strategy, making it easy to identify the top-performing models:
11641

117-
![Model Details - Light Theme](assets/balatrobench-light-2.png#only-light)
118-
![Model Details - Dark Theme](assets/balatrobench-dark-2.png#only-dark)
119-
120-
**Performance Distribution Analysis**
121-
122-
The expanded view includes three key analytical components:
123-
124-
**Rounds Distribution Histogram**: A bar chart showing the frequency distribution of rounds achieved across all runs for the selected model. This reveals whether the model's performance is consistent or highly variable, with patterns indicating:
125-
126-
- Consistent performers show narrow, tall distributions
127-
- Variable performers show wide, scattered distributions
128-
- Multi-modal distributions may indicate different strategic approaches
129-
130-
**Provider Breakdown**: A donut chart visualizing the proportion of runs from different API providers, useful for understanding data source diversity and potential provider-specific performance variations.
131-
132-
**Aggregated Statistics Panel**: A summary box displaying key totals:
133-
134-
- **Input/Output Token Counts**: Total tokens processed across all runs
135-
- **Financial Totals**: Cumulative costs in dollars
136-
- **Total Processing Time**: Aggregate time spent on decision-making
137-
138-
**Individual Run Analysis Table**
139-
140-
The bottom section provides a detailed breakdown of individual game runs, with each row representing a complete Balatro session:
141-
142-
- **Round-by-Round Results**: Shows progression through different game rounds with success/warning/failure indicators
143-
- **Performance Metrics**: Input/output costs and timing for each individual run
144-
- **Success Patterns**: Color-coded indicators help identify at which stages models typically succeed or fail
145-
146-
This view helps researchers understand model behavior patterns, identify optimal performance conditions, and analyze the relationship between different performance factors.
147-
148-
### 3. Individual Run Analysis
149-
150-
The most detailed view opens when clicking on a specific run, providing a step-by-step analysis of the LLM's decision-making process throughout a complete Balatro game session.
42+
```
43+
benchmarks/models/
44+
├── manifest.json
45+
└── v0.13.2/ # Version
46+
    └── default/ # Strategy
47+
        ├── leaderboard.json # Models ranked for this strategy
48+
        └── openai/ # Vendor
49+
            ├── gpt-oss-20b/ # Model
50+
            │   └── 20251024_120206_331_RedDeck_s1__AAAAAAA/ # Run
51+
            │       └── request-00001/ # Individual request
52+
            │           ├── request.md # Full LLM prompt
53+
            │           ├── reasoning.md # Model reasoning
54+
            │           ├── tool_call.json # Action taken
55+
            │           └── screenshot.png # Game state
56+
            └── gpt-oss-20b.json # Aggregated model statistics
57+
```
15158

152-
![Run Details - Light Theme](assets/balatrobench-light-3.png#only-light)
153-
![Run Details - Dark Theme](assets/balatrobench-dark-3.png#only-dark)
59+
### Strategies
15460

155-
**Game State Visualization**
61+
Use the strategies mode when you want to compare how different strategies perform for the same model. This mode helps answer questions like "Does the aggressive strategy work better than the default for GPT-4?" or "Which strategy should I use with Claude?"
15662

157-
The left panel displays an actual screenshot of the Balatro game state at the moment of decision, showing:
63+
```bash
64+
balatrobench --strategies
65+
```
15866

159-
- **Current Game Phase**: Whether in shop, hand selection, or other game modes
160-
- **Available Options**: Cards, jokers, and other game elements visible to the LLM
161-
- **Resource Status**: Money, joker slots, and other strategic resources
162-
- **Visual Context**: The exact visual information the LLM uses for decision-making
67+
The strategies mode generates leaderboards organized by model, with statistics for each strategy:
16368

164-
**Strategic Analysis Panel**
69+
```
70+
benchmarks/strategies/
70+
├── manifest.json
71+
└── v0.13.2/ # Version
72+
    └── openai/ # Vendor
73+
        └── gpt-oss-20b/ # Model
74+
            ├── leaderboard.json # Strategies ranked for this model
75+
            ├── default/ # Strategy
76+
            │   ├── stats.json # Aggregated statistics
77+
            │   └── gpt-oss-20b/ # Run details
78+
            │       └── 20251024_120206_331_RedDeck_s1__AAAAAAA/ # Run
79+
            │           └── request-00001/ # Individual request
80+
            │               ├── request.md # Full LLM prompt
81+
            │               ├── reasoning.md # Model reasoning
82+
            │               ├── tool_call.json # Action taken
83+
            │               └── screenshot.png # Game state
84+
            └── aggressive/ # Other strategies
85+
                └── [similar structure]
87+
```
16588

166-
The right panel provides comprehensive insight into the LLM's reasoning process:
89+
Both modes preserve detailed request-level data including the full LLM prompts, reasoning output, tool calls, and screenshots for in-depth analysis.
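
To show how this request-level data might be read back, here is a small sketch that walks the `request-NNNNN` directories of one run in order and loads the reasoning text and tool call for each. It assumes only the file names shown in the trees above; the contents of `tool_call.json` are returned as parsed JSON without assuming any schema. `iter_requests` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_requests(run_dir: Path) -> Iterator[tuple[str, str, object]]:
    """Yield (request name, reasoning text, parsed tool call) in request order."""
    # Zero-padded names such as request-00001 sort correctly as plain strings.
    for req_dir in sorted(run_dir.glob("request-*")):
        reasoning_path = req_dir / "reasoning.md"
        tool_call_path = req_dir / "tool_call.json"
        reasoning = (
            reasoning_path.read_text(encoding="utf-8")
            if reasoning_path.exists()
            else ""
        )
        tool_call = (
            json.loads(tool_call_path.read_text(encoding="utf-8"))
            if tool_call_path.exists()
            else None
        )
        yield req_dir.name, reasoning, tool_call
```

Iterating over the result gives a chronological view of one game session, which is useful when tracing where a run went wrong.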
16790

168-
**Contextual Situation Analysis**: A detailed text description explaining:
91+
## BalatroBench Integration
16992

170-
- Current game state and available options
171-
- Strategic considerations and constraints
172-
- Resource management situation
173-
- Previous game history and context
93+
[BalatroBench](https://coder.github.io/balatrobench/) is a web-based dashboard for visualizing benchmark results. You can run it locally to explore your data through interactive charts and leaderboards.
17494

175-
**LLM Reasoning Process**: The model's internal reasoning, showing:
95+
First, clone the BalatroBench repository:
17696

177-
- Strategic analysis of available options
178-
- Cost-benefit calculations
179-
- Risk assessment considerations
180-
- Long-term planning thoughts
181-
- Decision rationale and justification
97+
```bash
98+
git clone https://github.com/coder/balatrobench.git
99+
```
182100

183-
**Function Call Details**: Technical information about the executed action:
101+
Next, move or symlink your benchmark data into the BalatroBench data directory. To move the benchmarks directly:
184102

185-
- **Function Name**: The specific game action taken (e.g., "shop", "select_hand")
186-
- **Parameters**: Exact arguments passed to the game engine
187-
- **Action Description**: Human-readable explanation of what the model decided to do
188-
- **Strategic Reasoning**: Why this particular action was chosen
103+
```bash
104+
mv benchmarks /path/to/balatrobench/data/benchmarks
105+
```
189106

190-
**Navigation and Analysis Tools**
107+
Or create a symbolic link to keep the data in your BalatroLLM directory:
191108

192-
- **Step Navigation**: Arrow controls allow researchers to move through the chronological sequence of decisions within a single game run
193-
- **Request Numbering**: Clear labeling of each decision point for reference and analysis
194-
- **Modal Interface**: Clean overlay design that allows easy comparison with the main dashboard
109+
```bash
110+
ln -s "$(pwd)/benchmarks" /path/to/balatrobench/data/benchmarks
111+
```
195112

196-
This detailed view enables researchers to:
113+
Finally, start a local web server to view the dashboard:
197114

198-
- Understand exactly how LLMs interpret visual game states
199-
- Analyze the quality and depth of strategic reasoning
200-
- Identify decision-making patterns and potential improvements
201-
- Debug specific failure modes or suboptimal choices
202-
- Study the relationship between reasoning quality and performance outcomes
115+
```bash
116+
cd /path/to/balatrobench
117+
python3 -m http.server 8001
118+
```
203119

204-
The combination of visual game state, strategic reasoning, and technical execution details provides a complete picture of LLM behavior in complex, multi-step decision-making scenarios.
120+
Open your browser to `http://localhost:8001` to explore the interactive visualization of your benchmark results.

docs/index.md

Lines changed: 6 additions & 6 deletions
@@ -1,14 +1,14 @@
11
# BalatroLLM
22

3-
**LLM-powered bot that plays Balatro using strategic decision making**
3+
<!-- <figure markdown="span"> -->
44

5-
---
5+
<!-- <figcaption>A Balatro bot powered by LLM</figcaption> -->
66

7-
!!! warning "Pre-1.0 Development Notice"
7+
<!-- </figure> -->
88

9-
This project is currently in pre-1.0 development phase. According to [Semantic Versioning](https://semver.org/) specification, minor version updates (0.x.y → 0.(x+1).0) may introduce breaking changes. Please review release notes carefully before upgrading.
9+
---
1010

11-
BalatroLLM is an intelligent bot that leverages Large Language Models to play Balatro, the popular roguelike poker deck-building game. The bot uses OpenAI-compatible APIs to communicate with various LLM providers and makes strategic decisions based on comprehensive game state analysis. Whether you're running benchmarks across different models or exploring AI gaming strategies, BalatroLLM provides a robust framework for automated Balatro gameplay.
11+
BalatroLLM is a bot that uses Large Language Models (LLMs) to play [Balatro](https://www.playbalatro.com/), the popular roguelike poker deck-building game. The bot analyzes game states, makes strategic decisions, and executes actions through the [BalatroBot](https://github.com/coder/balatrobot) client.
1212

1313
<div class="grid cards" markdown>
1414

@@ -36,7 +36,7 @@ BalatroLLM is an intelligent bot that leverages Large Language Models to play Ba
3636

3737
[:octicons-arrow-right-24: Analysis](analysis.md)
3838

39-
- :octicons-sparkle-fill-16:{ .lg .middle } __Documentation for LLM__
39+
- :octicons-sparkle-fill-16:{ .lg .middle } __Docs for LLM__
4040

4141
---
4242
