1 | 1 | # BalatroBench |
2 | 2 |
3 | | -A community-driven benchmark platform for evaluating Large Language Models' strategic performance in Balatro through intelligent tool-calling and decision-making. |
4 | | - |
5 | | -## 🎯 What is BalatroBench? |
6 | | - |
7 | | -BalatroBench provides a standardized way to evaluate how well different AI models can play Balatro, the popular poker-inspired roguelike card game. The benchmark tests strategic thinking, decision-making, and tool-calling capabilities across different LLM models. |
8 | | - |
9 | | -## 🚀 Quick Start |
10 | | - |
11 | | -This is a **static website** that works with any web server or GitHub Pages. No build process required! |
12 | | - |
13 | | -### Local Development |
14 | | - |
15 | | -1. Clone the repository: |
16 | | -```bash |
17 | | -git clone <YOUR_GIT_URL> |
18 | | -cd balatrobench |
19 | | -``` |
20 | | - |
21 | | -2. Serve the files locally: |
22 | | -```bash |
23 | | -# Using Python (recommended) |
24 | | -python -m http.server 8000 |
25 | | - |
26 | | -# Using Node.js (if you have it) |
27 | | -npx serve . |
28 | | - |
29 | | -# Using any other static file server |
30 | | -``` |
31 | | - |
32 | | -3. Open http://localhost:8000 in your browser |
33 | | - |
34 | | -### GitHub Pages Deployment |
35 | | - |
36 | | -1. Push your changes to the `main` branch |
37 | | -2. Go to repository Settings > Pages |
38 | | -3. Set source to "Deploy from a branch" |
39 | | -4. Select `main` branch and `/ (root)` folder |
40 | | -5. Your site will be available at `https://yourusername.github.io/balatrobench` |
41 | | - |
42 | | -## 📁 Project Structure |
43 | | - |
44 | | -``` |
45 | | -├── index.html # Main page (Official Benchmark) |
46 | | -├── community.html # Community submissions page |
47 | | -├── about.html # About page with detailed information |
48 | | -├── submit.html # Redirects to CONTRIBUTING.md |
49 | | -├── CONTRIBUTING.md # Detailed submission guidelines |
50 | | -├── js/ |
51 | | -│ └── app.js # JavaScript for data loading and UI |
52 | | -├── scripts/ |
53 | | -│ ├── process-benchmarks.js # Benchmark data processor |
54 | | -│ └── analyze-benchmarks.js # Benchmark analysis tool |
55 | | -├── data/ |
56 | | -│ ├── benchmarks/ # Raw benchmark data by version |
57 | | -│ │ └── v0.2.0/ |
58 | | -│ │ └── default/ |
59 | | -│ │ ├── leaderboard.json # Processed leaderboard |
60 | | -│ │ ├── cerebras-*.json # Individual benchmark runs |
61 | | -│ │ └── ... |
62 | | -│ ├── strategies/ # Strategy templates and tools |
63 | | -│ │ └── default/ |
64 | | -│ │ ├── TOOLS.json # Available game tools |
65 | | -│ │ ├── STRATEGY.md.jinja # Strategy template |
66 | | -│ │ ├── MEMORY.md.jinja # Memory template |
67 | | -│ │ └── GAMESTATE.md.jinja # Game state template |
68 | | -│ └── leaderboard.json # Main leaderboard (auto-generated) |
69 | | -├── public/ # Static assets |
70 | | -└── README.md |
71 | | -``` |
72 | | - |
73 | | -## 🏆 Official Benchmark |
74 | | - |
75 | | -The official leaderboard tracks performance across standardized seeds and configurations: |
76 | | - |
77 | | -- **Balatro Version**: v1.0.1n |
78 | | -- **Framework**: BalatroLLM v0.2.0+ |
79 | | -- **Strategy**: Template-based strategic prompting system |
80 | | -- **Seeds**: Consistent seeds for reproducibility |
81 | | -- **Metrics**: Average ante reached, win rate, token efficiency, completion rate |
82 | | -- **Models**: Cerebras GPT-OSS-120B, Cerebras Qwen3-235B, and more |
83 | | - |
84 | | -## 👥 Community Contributions |
85 | | - |
86 | | -### Submitting Your Results |
87 | | - |
88 | | -There are two ways to contribute: |
89 | | - |
90 | | -#### Option 1: Submit Raw Benchmark Data (Recommended) |
91 | | - |
92 | | -1. **Run benchmarks** using the BalatroLLM framework |
93 | | -2. **Add your results** to `data/benchmarks/v{version}/{strategy}/` |
94 | | - - Include the complete `{model}_benchmark.json` files |
95 | | - - These contain full game progression, LLM interactions, and tool calls |
96 | | -3. **Process the data** using `node scripts/process-benchmarks.js` |
97 | | -4. **Submit a Pull Request** with both raw data and updated leaderboard |
98 | | - |
99 | | -#### Option 2: Submit Strategy Documentation |
100 | | - |
101 | | -1. **Fork this repository** |
102 | | -2. **Create strategy templates** in `data/strategies/{your-strategy}/` |
103 | | - - Copy the structure from `data/strategies/default/` |
104 | | - - Customize the Jinja2 templates for your approach |
105 | | -3. **Document your methodology** with clear explanations |
106 | | -4. **Submit a Pull Request** with title: "Strategy Contribution: [Your Strategy Name]" |
107 | | - |
108 | | -### Submission Requirements |
109 | | - |
110 | | -- ✅ Valid benchmark results from BalatroLLM v0.2.0+ |
111 | | -- ✅ Complete game progression data (not just summary statistics) |
112 | | -- ✅ Clear strategy description and methodology |
113 | | -- ✅ Reproducible results with seed consistency |
114 | | -- ✅ No offensive or inappropriate content |
115 | | - |
116 | | -## 🛠️ Technologies Used |
117 | | - |
118 | | -- **HTML5** - Semantic markup |
119 | | -- **Tailwind CSS** - Styling (via CDN) |
120 | | -- **Vanilla JavaScript** - Dynamic content loading |
121 | | -- **Node.js** - Benchmark processing scripts |
122 | | -- **Jinja2 Templates** - Strategy prompt templating |
123 | | -- **Font Awesome** - Icons (via CDN) |
124 | | -- **JSON** - Data storage and interchange |
125 | | - |
126 | | -## 📊 Data Management |
127 | | - |
128 | | -BalatroBench uses a sophisticated data management system: |
129 | | - |
130 | | -### Raw Benchmark Data |
131 | | -- `data/benchmarks/v{version}/{strategy}/` - Versioned benchmark results |
132 | | -- Individual files: `{model}_benchmark.json` - Complete run data with game progression, LLM interactions, and tool calls |
133 | | -- Structured by version and strategy for historical tracking |
134 | | - |
135 | | -### Processed Data |
136 | | -- `data/leaderboard.json` - Auto-generated leaderboard from all benchmark data |
137 | | -- Aggregated statistics: performance scores, win rates, token efficiency |
138 | | -- Generated by `scripts/process-benchmarks.js` |
139 | | - |
140 | | -### Strategy System |
141 | | -- `data/strategies/default/` - Template-based strategy system |
142 | | -- Jinja2 templates for consistent prompting across models |
143 | | -- Tool definitions and game state templates |
144 | | - |
145 | | -### Processing Pipeline |
146 | | -```bash |
147 | | -# Process raw benchmark data into leaderboard |
148 | | -node scripts/process-benchmarks.js |
149 | | - |
150 | | -# Alternative analysis tool |
151 | | -node scripts/analyze-benchmarks.js |
152 | | -``` |
153 | | - |
154 | | -This approach provides: |
155 | | -- Version control of all data and results |
156 | | -- Automated leaderboard generation |
157 | | -- Historical benchmark tracking |
158 | | -- Reproducible evaluation methodology |
159 | | -- No database setup required |
160 | | - |
161 | | -## 🤝 Contributing |
162 | | - |
163 | | -We welcome contributions! You can: |
164 | | - |
165 | | -1. **Submit strategies** via pull requests |
166 | | -2. **Report issues** or suggest improvements |
167 | | -3. **Improve the website** (design, features, documentation) |
168 | | - |
169 | | -## 📈 Adding New Official Results |
170 | | - |
171 | | -To add new benchmark results: |
172 | | - |
173 | | -### Adding Raw Benchmark Data |
174 | | - |
175 | | -1. **Add benchmark files** to `data/benchmarks/v{version}/{strategy}/` |
176 | | - - Use format: `{model}_benchmark.json` |
177 | | - - Include complete run data from BalatroLLM framework |
178 | | - |
179 | | -2. **Process the data** to update leaderboards: |
180 | | - ```bash |
181 | | - node scripts/process-benchmarks.js |
182 | | - ``` |
183 | | - |
184 | | -3. **Submit a pull request** with both raw data and updated leaderboard |
185 | | - |
186 | | -### Data Processing Tools |
187 | | - |
188 | | -The project includes two processing scripts: |
189 | | - |
190 | | -- **`process-benchmarks.js`** - Primary tool for generating leaderboards from benchmark data |
191 | | -- **`analyze-benchmarks.js`** - Alternative analysis tool with different aggregation methods |
192 | | - |
193 | | -Both scripts automatically scan the `data/benchmarks/` directory and process all available benchmark files. |
194 | | - |
195 | | -## 🔧 Development & Customization |
196 | | - |
197 | | -### Local Development Setup |
198 | | - |
199 | | -```bash |
200 | | -# Clone the repository |
201 | | -git clone <YOUR_GIT_URL> |
202 | | -cd balatrobench |
203 | | - |
204 | | -# Install Node.js dependencies (for processing scripts) |
205 | | -# No package.json yet - scripts use built-in Node.js modules |
206 | | - |
207 | | -# Serve the website locally |
208 | | -python -m http.server 8000 |
209 | | -# or: npx serve . |
210 | | -``` |
211 | | - |
212 | | -### Customization Options |
213 | | - |
214 | | -- **Styling**: Modify Tailwind classes in HTML files |
215 | | -- **Functionality**: Edit `js/app.js` for frontend behavior |
216 | | -- **Data Processing**: Customize `scripts/process-benchmarks.js` |
217 | | -- **Strategy Templates**: Add new templates in `data/strategies/` |
218 | | -- **Pages**: Create new HTML files following the existing pattern |
219 | | - |
220 | | -### Processing Scripts |
221 | | - |
222 | | -- **Process benchmarks**: `node scripts/process-benchmarks.js` |
223 | | -- **Alternative analysis**: `node scripts/analyze-benchmarks.js` |
224 | | -- Both scripts output to `data/leaderboard.json` |
225 | | - |
226 | | -## 📜 License |
227 | | - |
228 | | -This project is open source. Feel free to use, modify, and distribute. |
229 | | - |
230 | | -## 🙋‍♀️ Support |
231 | | - |
232 | | -- Open an issue on GitHub |
233 | | -- Join our Discord community |
234 | | -- Email: community@balatrobench.dev |
| 3 | +A benchmark platform for evaluating Large Language Models' strategic performance in Balatro through intelligent tool-calling and decision-making. |
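For reference, the removed README describes `scripts/process-benchmarks.js` as aggregating raw benchmark runs into `data/leaderboard.json`, but the diff shows neither the script nor the run schema. The following is only a minimal sketch of that kind of aggregation, assuming hypothetical per-run fields (`model`, `finalAnte`, `won`, `totalTokens`) and a hypothetical `buildLeaderboard` helper. None of these names come from the repository.

```javascript
// Sketch only: the field names below are assumptions, not the BalatroLLM schema.
function buildLeaderboard(runs) {
  const byModel = new Map();

  // Accumulate per-model totals across all runs.
  for (const run of runs) {
    const stats =
      byModel.get(run.model) ?? { runs: 0, anteSum: 0, wins: 0, tokenSum: 0 };
    stats.runs += 1;
    stats.anteSum += run.finalAnte;
    stats.wins += run.won ? 1 : 0;
    stats.tokenSum += run.totalTokens;
    byModel.set(run.model, stats);
  }

  // Turn totals into averages, then rank by average ante, breaking ties on win rate.
  return [...byModel.entries()]
    .map(([model, s]) => ({
      model,
      runs: s.runs,
      avgAnte: s.anteSum / s.runs,
      winRate: s.wins / s.runs,
      avgTokens: s.tokenSum / s.runs,
    }))
    .sort((a, b) => b.avgAnte - a.avgAnte || b.winRate - a.winRate);
}

// Made-up sample data for illustration; these are not real benchmark results.
const sampleRuns = [
  { model: "model-a", finalAnte: 4, won: false, totalTokens: 1000 },
  { model: "model-a", finalAnte: 8, won: true, totalTokens: 3000 },
  { model: "model-b", finalAnte: 3, won: false, totalTokens: 500 },
];

const leaderboard = buildLeaderboard(sampleRuns);
console.log(JSON.stringify(leaderboard, null, 2));
```

Keeping the aggregation as a pure function over an array of run records would let the same logic serve both a file-scanning script and the browser-side `js/app.js`, though whether the project is structured that way is not visible in this diff.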