@@ -15,7 +15,7 @@ This is a **static website** that works with any web server or GitHub Pages. No
 1. Clone the repository:
 ```bash
 git clone <YOUR_GIT_URL>
-cd balatrobench-site
+cd balatrobench
 ```

 2. Serve the files locally:
@@ -37,22 +37,36 @@ npx serve .
 2. Go to repository Settings > Pages
 3. Set source to "Deploy from a branch"
 4. Select `main` branch and `/ (root)` folder
-5. Your site will be available at `https://yourusername.github.io/balatrobench-site`
+5. Your site will be available at `https://yourusername.github.io/balatrobench`

 ## 📁 Project Structure

 ```
 ├── index.html  # Main page (Official Benchmark)
 ├── community.html  # Community submissions page
+├── about.html  # About page with detailed information
 ├── submit.html  # Redirects to CONTRIBUTING.md
 ├── CONTRIBUTING.md  # Detailed submission guidelines
 ├── js/
-│   └── app.js  # JavaScript for data loading
+│   └── app.js  # JavaScript for data loading and UI
+├── scripts/
+│   ├── process-benchmarks.js  # Benchmark data processor
+│   └── analyze-benchmarks.js  # Benchmark analysis tool
 ├── data/
-│   ├── leaderboard.json  # Official benchmark results
-│   └── strategies/  # Community submissions
-│       ├── strategy1.json
-│       └── strategy2.json
+│   ├── benchmarks/  # Raw benchmark data by version
+│   │   └── v0.2.0/
+│   │       └── default/
+│   │           ├── leaderboard.json  # Processed leaderboard
+│   │           ├── cerebras-*.json  # Individual benchmark runs
+│   │           └── ...
+│   ├── strategies/  # Strategy templates and tools
+│   │   └── default/
+│   │       ├── TOOLS.json  # Available game tools
+│   │       ├── STRATEGY.md.jinja  # Strategy template
+│   │       ├── MEMORY.md.jinja  # Memory template
+│   │       └── GAMESTATE.md.jinja  # Game state template
+│   └── leaderboard.json  # Main leaderboard (auto-generated)
+├── public/  # Static assets
 └── README.md
 ```

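The new tree nests benchmark files as `data/benchmarks/v{version}/{strategy}/{file}.json`. A minimal sketch of splitting such a path into its parts follows. The helper name `parseBenchmarkPath` and the `isLeaderboard` flag are hypothetical and are not taken from the repository's scripts.

```javascript
// Hypothetical helper (not from the repository): split a benchmark file path
// into the parts implied by the layout data/benchmarks/v{version}/{strategy}/{file}.json.
function parseBenchmarkPath(filePath) {
  // Accept both "/" and "\" so Windows-style paths are handled too.
  const parts = filePath.split(/[\\/]+/);
  const root = parts.indexOf("benchmarks");
  // Expect exactly three segments after "benchmarks": version, strategy, file.
  if (root === -1 || parts.length - root !== 4) {
    return null;
  }
  const [version, strategy, file] = parts.slice(root + 1);
  // Assumption: version directories look like "v0.2.0".
  if (!/^v\d+\.\d+\.\d+$/.test(version) || !file.endsWith(".json")) {
    return null;
  }
  return { version, strategy, file, isLeaderboard: file === "leaderboard.json" };
}
```

For example, `parseBenchmarkPath("data/benchmarks/v0.2.0/default/leaderboard.json")` yields the version `v0.2.0` and the strategy `default`, while a path outside `benchmarks/` yields `null`.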
@@ -61,67 +75,88 @@ npx serve .
 The official leaderboard tracks performance across standardized seeds and configurations:

 - **Balatro Version**: v1.0.1n
-- **Seeds**: 100 consistent seeds for reproducibility
-- **Metrics**: Average ante reached, win rate, token efficiency
-- **Models**: GPT-4o, Claude-3.5-Sonnet, Gemini-Pro, and more
+- **Framework**: BalatroLLM v0.2.0+
+- **Strategy**: Template-based strategic prompting system
+- **Seeds**: Consistent seeds for reproducibility
+- **Metrics**: Average ante reached, win rate, token efficiency, completion rate
+- **Models**: Cerebras GPT-OSS-120B, Cerebras Qwen3-235B, and more

 ## 👥 Community Contributions

-### Submitting Your Strategy
+### Submitting Your Results

-1. **Fork this repository**
-2. **Create a strategy file** in `data/strategies/` following this format:
-
-```json
-{
-  "title": "Your Strategy Name",
-  "author": "YourUsername",
-  "model": "GPT-4o",
-  "score": "8.5",
-  "winRate": "75%",
-  "avgTokens": "15000",
-  "date": "2024-01-20",
-  "description": "Brief description of your approach",
-  "prompt": "Your full system prompt...",
-  "methodology": "Detailed explanation...",
-  "results": {
-    "seeds": [1, 2, 3],
-    "scores": [8.2, 8.8, 8.1]
-  },
-  "tags": ["tag1", "tag2"]
-}
-```
+There are two ways to contribute:
+
+#### Option 1: Submit Raw Benchmark Data (Recommended)
+
+1. **Run benchmarks** using the BalatroLLM framework
+2. **Add your results** to `data/benchmarks/v{version}/{strategy}/`
+   - Include the complete `{model}_benchmark.json` files
+   - These contain full game progression, LLM interactions, and tool calls
+3. **Process the data** using `node scripts/process-benchmarks.js`
+4. **Submit a Pull Request** with both raw data and updated leaderboard

-3. **Submit a Pull Request** with title: "Community Submission: [Your Strategy Name]"
+#### Option 2: Submit Strategy Documentation

-### Strategy Requirements
+1. **Fork this repository**
+2. **Create strategy templates** in `data/strategies/{your-strategy}/`
+   - Copy the structure from `data/strategies/default/`
+   - Customize the Jinja2 templates for your approach
+3. **Document your methodology** with clear explanations
+4. **Submit a Pull Request** with title: "Strategy Contribution: [Your Strategy Name]"
+
+### Submission Requirements

-- ✅ Valid benchmark results on standard seeds
+- ✅ Valid benchmark results from BalatroLLM v0.2.0+
+- ✅ Complete game progression data (not just summary statistics)
 - ✅ Clear strategy description and methodology
-- ✅ Reproducible results
-- ✅ Follows JSON schema format
+- ✅ Reproducible results with seed consistency
 - ✅ No offensive or inappropriate content

 ## 🛠️ Technologies Used

 - **HTML5** - Semantic markup
 - **Tailwind CSS** - Styling (via CDN)
 - **Vanilla JavaScript** - Dynamic content loading
+- **Node.js** - Benchmark processing scripts
+- **Jinja2 Templates** - Strategy prompt templating
 - **Font Awesome** - Icons (via CDN)
-- **JSON** - Data storage
+- **JSON** - Data storage and interchange

 ## 📊 Data Management

-All data is stored in JSON files for simplicity:
+BalatroBench uses a sophisticated data management system:
+
+### Raw Benchmark Data
+- `data/benchmarks/v{version}/{strategy}/` - Versioned benchmark results
+- Individual files: `{model}_benchmark.json` - Complete run data with game progression, LLM interactions, and tool calls
+- Structured by version and strategy for historical tracking
+
+### Processed Data
+- `data/leaderboard.json` - Auto-generated leaderboard from all benchmark data
+- Aggregated statistics: performance scores, win rates, token efficiency
+- Generated by `scripts/process-benchmarks.js`

-- `data/leaderboard.json` - Official benchmark results
-- `data/strategies/*.json` - Community submissions
+### Strategy System
+- `data/strategies/default/` - Template-based strategy system
+- Jinja2 templates for consistent prompting across models
+- Tool definitions and game state templates

-This approach allows for:
-- Version control of all data
-- Easy community contributions via PRs
+### Processing Pipeline
+```bash
+# Process raw benchmark data into leaderboard
+node scripts/process-benchmarks.js
+
+# Alternative analysis tool
+node scripts/analyze-benchmarks.js
+```
+
+This approach provides:
+- Version control of all data and results
+- Automated leaderboard generation
+- Historical benchmark tracking
+- Reproducible evaluation methodology
 - No database setup required
-- GitHub Pages compatibility

 ## 🤝 Contributing

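The "Processed Data" bullets in this hunk describe a leaderboard of aggregated statistics (win rates, token efficiency) generated from raw runs. The sketch below shows one way such an aggregation could look. It is not the logic of `scripts/process-benchmarks.js`, which this diff does not show. The run fields `model`, `won`, `finalAnte`, and `totalTokens` are assumed names, as is the sort order.

```javascript
// Illustrative only: the real schema of {model}_benchmark.json is not shown in
// this diff. Each run is ASSUMED to look like
//   { model: string, won: boolean, finalAnte: number, totalTokens: number }.
function buildLeaderboard(runs) {
  const byModel = new Map();
  for (const run of runs) {
    // Start a fresh accumulator the first time a model is seen.
    const stats = byModel.get(run.model) ?? { runs: 0, wins: 0, anteSum: 0, tokenSum: 0 };
    stats.runs += 1;
    stats.wins += run.won ? 1 : 0;
    stats.anteSum += run.finalAnte;
    stats.tokenSum += run.totalTokens;
    byModel.set(run.model, stats);
  }
  return [...byModel.entries()]
    .map(([model, s]) => ({
      model,
      runs: s.runs,
      avgAnte: s.anteSum / s.runs,
      winRate: s.wins / s.runs,
      avgTokens: s.tokenSum / s.runs,
    }))
    // Assumed ranking: higher average ante first, win rate as tie-breaker.
    .sort((a, b) => b.avgAnte - a.avgAnte || b.winRate - a.winRate);
}
```

Keeping the aggregation a pure function of the run list means the generated leaderboard can be rebuilt from the versioned raw data at any time, which fits the "auto-generated" note on `data/leaderboard.json`.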
@@ -133,21 +168,61 @@ We welcome contributions! You can:
133168
 ## 📈 Adding New Official Results

-To update the official leaderboard:
+To add new benchmark results:
+
+### Adding Raw Benchmark Data
+
+1. **Add benchmark files** to `data/benchmarks/v{version}/{strategy}/`
+   - Use format: `{model}_benchmark.json`
+   - Include complete run data from BalatroLLM framework
+
+2. **Process the data** to update leaderboards:
+   ```bash
+   node scripts/process-benchmarks.js
+   ```

-1. Edit `data/leaderboard.json`
-2. Follow the existing schema
-3. Submit a pull request
+3. **Submit a pull request** with both raw data and updated leaderboard

-## 🔧 Customization
+### Data Processing Tools

-Want to customize the site?
+The project includes two processing scripts:
+
+- **`process-benchmarks.js`** - Primary tool for generating leaderboards from benchmark data
+- **`analyze-benchmarks.js`** - Alternative analysis tool with different aggregation methods
+
+Both scripts automatically scan the `data/benchmarks/` directory and process all available benchmark files.
+
+## 🔧 Development & Customization
+
+### Local Development Setup
+
+```bash
+# Clone the repository
+git clone <YOUR_GIT_URL>
+cd balatrobench
+
+# Install Node.js dependencies (for processing scripts)
+# No package.json yet - scripts use built-in Node.js modules
+
+# Serve the website locally
+python -m http.server 8000
+# or: npx serve .
+```
+
+### Customization Options

 - **Styling**: Modify Tailwind classes in HTML files
-- **Functionality**: Edit `js/app.js`
-- **Data**: Add/modify JSON files in `data/`
+- **Functionality**: Edit `js/app.js` for frontend behavior
+- **Data Processing**: Customize `scripts/process-benchmarks.js`
+- **Strategy Templates**: Add new templates in `data/strategies/`
 - **Pages**: Create new HTML files following the existing pattern

+### Processing Scripts
+
+- **Process benchmarks**: `node scripts/process-benchmarks.js`
+- **Alternative analysis**: `node scripts/analyze-benchmarks.js`
+- Both scripts output to `data/leaderboard.json`
+
 ## 📜 License

 This project is open source. Feel free to use, modify, and distribute.
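The updated README says both processing scripts scan `data/benchmarks/` using only built-in Node.js modules. A rough sketch of such a scan follows. The function name `findBenchmarkFiles` is hypothetical, and skipping files named `leaderboard.json` is an assumption about how generated output might be kept apart from raw runs. The actual scripts may behave differently.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical sketch: recursively collect raw benchmark .json files under a
// directory using only built-in modules. Files named leaderboard.json are
// skipped on the ASSUMPTION that they are generated output, not raw runs.
function findBenchmarkFiles(dir) {
  const found = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      // Descend into v{version}/ and {strategy}/ subdirectories.
      found.push(...findBenchmarkFiles(full));
    } else if (entry.name.endsWith(".json") && entry.name !== "leaderboard.json") {
      found.push(full);
    }
  }
  // Sort for a stable order regardless of filesystem listing order.
  return found.sort();
}
```

Called as `findBenchmarkFiles("data/benchmarks")` from the repository root, this would return the run files for every version and strategy directory.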