Commit 2c8efb7

Deployed 1657865 with MkDocs version: 1.6.1
0 parents  commit 2c8efb7

63 files changed

Lines changed: 12340 additions & 0 deletions

.nojekyll

Whitespace-only changes.

404.html

Lines changed: 402 additions & 0 deletions
Large diffs are not rendered by default.

analysis/index.html

Lines changed: 840 additions & 0 deletions
Large diffs are not rendered by default.

analysis/index.md

Lines changed: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
# Analysis

Generate comprehensive benchmarks, analyze performance metrics, and integrate with BalatroBench for detailed statistics visualization.

## Benchmark Generation

Run benchmarks to evaluate model performance:

```bash
balatrollm benchmark
```

## Benchmark Results

Benchmarks are organized hierarchically:

```text
benchmarks/
├── v0.10.0/                                  # Version
│   ├── default/                              # Strategy
│   │   ├── leaderboard.json                  # Strategy summary with aggregated stats
│   │   └── openai/                           # Provider
│   │       ├── gpt-oss-20b.json              # Model performance summary
│   │       ├── gpt-oss-120b.json             # Model performance summary
│   │       ├── gpt-oss-20b/                  # Individual runs for model
│   │       │   ├── 20250922_124308_887_RedDeck_s1__OOOO155/
│   │       │   │   ├── request-00001/        # Individual LLM request
│   │       │   │   │   ├── reasoning.md      # LLM reasoning process
│   │       │   │   │   ├── request.md        # Full request sent to LLM
│   │       │   │   │   ├── screenshot.png    # Game state screenshot
│   │       │   │   │   └── tool_call.json    # Function call details
│   │       │   │   ├── request-00002/
│   │       │   │   └── ...
│   │       │   └── [other runs]
│   │       └── gpt-oss-120b/                 # Individual runs for model
│   │           └── [similar structure]
│   └── aggressive/                           # Other strategies
│       └── [similar structure]
```
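
The hierarchy above can also be walked programmatically. The following is a minimal sketch, not part of `balatrollm`: the helper name `list_runs` is hypothetical, and it assumes only the directory nesting shown in the tree (version, strategy, provider, model, run, then `request-*` directories).

```python
from pathlib import Path


def list_runs(benchmarks_dir):
    """Yield (version, strategy, provider, model, run, n_requests) per run.

    Assumes the layout shown above:
    benchmarks/<version>/<strategy>/<provider>/<model>/<run>/request-NNNNN/
    """
    root = Path(benchmarks_dir)
    # Five directory levels below the root lead to a single run directory.
    # Summary files such as leaderboard.json sit higher up and never match.
    for run_dir in sorted(root.glob("*/*/*/*/*")):
        if not run_dir.is_dir():
            continue
        version, strategy, provider, model = run_dir.relative_to(root).parts[:4]
        n_requests = sum(1 for p in run_dir.glob("request-*") if p.is_dir())
        yield version, strategy, provider, model, run_dir.name, n_requests
```

Calling `list(list_runs("benchmarks"))` from the directory that contains `benchmarks/` would give one tuple per run, which is a quick way to check how many LLM requests each run recorded.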

## BalatroBench Integration

### Overview

[BalatroBench](https://coder.github.io/balatrobench/) is a web-based dashboard for visualizing and comparing LLM performance in Balatro. It provides interactive charts, leaderboards, and detailed analytics.

### Integrating with BalatroBench

To use BalatroBench as a local dashboard for visualizing your benchmark results:

```bash
# Step 1: Generate runs with custom output directory
balatrollm --runs-dir example-runs --runs 20

# Step 2: Generate benchmark analysis
balatrollm benchmark --runs-dir example-runs --output-dir example-benchmark

# Step 3: Clone BalatroBench repository
git clone https://github.com/coder/balatrobench.git /path/to/balatrobench

# Step 4: Move benchmark data to BalatroBench (or create symbolic link)
mv example-benchmark/benchmarks /path/to/balatrobench/data/benchmarks
# OR create a symbolic link (quoted so paths containing spaces still work):
# ln -s "$(pwd)/example-benchmark/benchmarks" /path/to/balatrobench/data/benchmarks

# Step 5: Host BalatroBench locally
cd /path/to/balatrobench
python3 -m http.server 8001
# Then visit http://localhost:8001
```

## BalatroBench Dashboard Views

BalatroBench provides three primary views for analyzing LLM performance in Balatro, each offering a different level of detail and insight into model behavior and strategic decision-making.

### 1. Main Leaderboard and Performance Overview

The main dashboard compares all evaluated models, combining visual and tabular representations of performance metrics.

**Visual Performance Comparison**

The top section features an interactive bar chart displaying average performance across models, with each bar representing the mean number of rounds achieved. Error bars indicate the standard deviation, providing insight into performance consistency. Models are color-coded for easy identification, with higher-performing models typically shown in more prominent colors (purple for top performers, progressing through gray, green, and blue for lower performers).
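
As an illustration of what each bar and its error bar summarize, the sketch below computes a mean and standard deviation over rounds reached. It is an assumption-laden example: the function name `summarize_rounds` and the data are made up, and the sample standard deviation is used here although the dashboard may use the population form instead.

```python
from statistics import mean, stdev


def summarize_rounds(rounds):
    """Return (mean, standard deviation) of rounds reached across runs."""
    if not rounds:
        raise ValueError("need at least one run")
    # stdev() needs two or more values; a single run has no spread to report.
    spread = stdev(rounds) if len(rounds) > 1 else 0.0
    return mean(rounds), spread


# Made-up rounds reached in five runs of one hypothetical model.
avg, spread = summarize_rounds([8, 10, 12, 10, 10])
print(f"{avg:.2f} +/- {spread:.2f}")  # prints "10.00 +/- 1.41"
```

A small spread relative to the mean corresponds to a short error bar, meaning the model reaches a similar round in most runs.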

**Detailed Leaderboard Table**

Below the chart, a comprehensive table provides granular performance metrics for each model:

- **Model & Vendor**: Lists the specific model name and its provider (x-ai, openai, google, deepseek)

- **Round Performance**: Shows average rounds achieved with standard deviation, indicating both performance level and consistency

- **Success Rate Indicators**: Three color-coded percentage columns:

    - **Green (✓)**: Successful round completions
    - **Yellow (⚠)**: Partial completions or warnings
    - **Red (✗)**: Failed attempts or errors

- **Financial Metrics**:

    - **In $/¥**: Input token costs with standard deviation
    - **Out $/¥**: Output token costs with standard deviation

- **Performance Timing**:

    - **Duration**: Average time per decision with variability measures
    - **Total Cost**: Comprehensive cost analysis including standard deviations

This view enables researchers to quickly identify top-performing models, understand cost-performance trade-offs, and assess model reliability through consistency metrics.

### 2. Model Details and Analytics

Clicking on any model in the leaderboard expands its row to reveal detailed analytics and run breakdowns for that model.

**Performance Distribution Analysis**

The expanded view includes three key analytical components:

**Rounds Distribution Histogram**: A bar chart showing the frequency distribution of rounds achieved across all runs for the selected model. The shape of the distribution reveals whether the model's performance is consistent or highly variable:

- Consistent performers show narrow, tall distributions
- Variable performers show wide, scattered distributions
- Multi-modal distributions may indicate different strategic approaches
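
The frequency distribution behind such a histogram can be sketched in a few lines. This is an illustrative example only: `rounds_histogram` and `render` are hypothetical helpers and the input data is invented, not taken from any benchmark.

```python
from collections import Counter


def rounds_histogram(rounds):
    """Map each round count to the number of runs that ended there."""
    counts = Counter(rounds)
    return dict(sorted(counts.items()))


def render(histogram):
    """Render the histogram as text, one row per round count."""
    return "\n".join(f"{r:>3} | {'#' * n}" for r, n in histogram.items())


# Made-up data: a cluster at rounds 3-4 plus a second cluster at round 12.
print(render(rounds_histogram([3, 4, 4, 3, 12, 4, 12, 3, 4])))
```

In this invented sample the two separate clusters are the kind of multi-modal shape described above, whereas a consistent model would concentrate its runs in one or two adjacent rows.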

**Provider Breakdown**: A donut chart visualizing the proportion of runs from different API providers, useful for understanding data source diversity and potential provider-specific performance variations.

**Aggregated Statistics Panel**: A summary box displaying key totals:

- **Input/Output Token Counts**: Total tokens processed across all runs
- **Financial Totals**: Cumulative costs in dollars
- **Total Processing Time**: Aggregate time spent on decision-making

**Individual Run Analysis Table**

The bottom section provides a detailed breakdown of individual game runs, with each row representing a complete Balatro session:

- **Round-by-Round Results**: Shows progression through different game rounds with success/warning/failure indicators
- **Performance Metrics**: Input/output costs and timing for each individual run
- **Success Patterns**: Color-coded indicators help identify at which stages models typically succeed or fail

This view helps researchers understand model behavior patterns, identify optimal performance conditions, and analyze the relationship between different performance factors.

### 3. Individual Run Analysis

The most detailed view opens when clicking on a specific run, providing a step-by-step analysis of the LLM's decision-making process throughout a complete Balatro game session.

**Game State Visualization**

The left panel displays an actual screenshot of the Balatro game state at the moment of decision, showing:

- **Current Game Phase**: Whether in shop, hand selection, or other game modes
- **Available Options**: Cards, jokers, and other game elements visible to the LLM
- **Resource Status**: Money, joker slots, and other strategic resources
- **Visual Context**: The exact visual information the LLM uses for decision-making

**Strategic Analysis Panel**

The right panel provides comprehensive insight into the LLM's reasoning process:

**Contextual Situation Analysis**: A detailed text description explaining:

- Current game state and available options
- Strategic considerations and constraints
- Resource management situation
- Previous game history and context

**LLM Reasoning Process**: The model's internal reasoning, showing:

- Strategic analysis of available options
- Cost-benefit calculations
- Risk assessment considerations
- Long-term planning thoughts
- Decision rationale and justification

**Function Call Details**: Technical information about the executed action:

- **Function Name**: The specific game action taken (e.g., "shop", "select_hand")
- **Parameters**: Exact arguments passed to the game engine
- **Action Description**: Human-readable explanation of what the model decided to do
- **Strategic Reasoning**: Why this particular action was chosen

**Navigation and Analysis Tools**

- **Step Navigation**: Arrow controls allow researchers to move through the chronological sequence of decisions within a single game run
- **Request Numbering**: Clear labeling of each decision point for reference and analysis
- **Modal Interface**: Clean overlay design that allows easy comparison with the main dashboard

This detailed view enables researchers to:

- Understand exactly how LLMs interpret visual game states
- Analyze the quality and depth of strategic reasoning
- Identify decision-making patterns and potential improvements
- Debug specific failure modes or suboptimal choices
- Study the relationship between reasoning quality and performance outcomes

The combination of visual game state, strategic reasoning, and technical execution details provides a complete picture of LLM behavior in complex, multi-step decision-making scenarios.

assets/balatrobench-dark-1.png

889 KB

assets/balatrobench-dark-2.png

954 KB

assets/balatrobench-dark-3.png

1.49 MB

assets/balatrobench-light-1.png

846 KB

assets/balatrobench-light-2.png

889 KB

assets/balatrobench-light-3.png

1.47 MB
