|
| 1 | +# Analysis |
| 2 | + |
| 3 | +Generate benchmarks from recorded runs, analyze performance metrics, and visualize the results in BalatroBench. |
| 4 | + |
| 5 | +## Benchmark Generation |
| 6 | + |
| 7 | +Run benchmarks to evaluate model performance: |
| 8 | + |
| 9 | +```bash |
| 10 | +balatrollm benchmark |
| 11 | +``` |
| 12 | + |
| 13 | +## Benchmark Results |
| 14 | + |
| 15 | +Benchmarks are organized hierarchically: |
| 16 | + |
| 17 | +```text |
| 18 | +benchmarks/ |
| 19 | +├── v0.10.0/ # Version |
| 20 | +│ ├── default/ # Strategy |
| 21 | +│ │ ├── leaderboard.json # Strategy summary with aggregated stats |
| 22 | +│ │ └── openai/ # Provider |
| 23 | +│ │ ├── gpt-oss-20b.json # Model performance summary |
| 24 | +│ │ ├── gpt-oss-120b.json # Model performance summary |
| 25 | +│ │ ├── gpt-oss-20b/ # Individual runs for model |
| 26 | +│ │ │ ├── 20250922_124308_887_RedDeck_s1__OOOO155/ |
| 27 | +│ │ │ │ ├── request-00001/ # Individual LLM request |
| 28 | +│ │ │ │ │ ├── reasoning.md # LLM reasoning process |
| 29 | +│ │ │ │ │ ├── request.md # Full request sent to LLM |
| 30 | +│ │ │ │ │ ├── screenshot.png # Game state screenshot |
| 31 | +│ │ │ │ │ └── tool_call.json # Function call details |
| 32 | +│ │ │ │ ├── request-00002/ |
| 33 | +│ │ │ │ └── ... |
| 34 | +│ │ │ └── [other runs] |
| 35 | +│ │ └── gpt-oss-120b/ # Individual runs for model |
| 36 | +│ │ └── [similar structure] |
| 37 | +│ └── aggressive/ # Other strategies |
| 38 | +│ └── [similar structure] |
| 39 | +``` |
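The hierarchy above can be traversed programmatically. The following is a minimal sketch, assuming the layout is exactly `version/strategy/provider/model/run/request-*` as shown in the tree; the names `RunInfo` and `iter_runs` are hypothetical helpers for illustration and are not part of `balatrollm`.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class RunInfo(NamedTuple):
    """One run directory, identified by its position in the tree (hypothetical type)."""

    version: str
    strategy: str
    provider: str
    model: str
    run: str
    requests: int


def iter_runs(root: Path) -> Iterator[RunInfo]:
    """Yield one RunInfo per run directory found under the benchmarks root."""
    # Assumed layout: root/<version>/<strategy>/<provider>/<model>/<run>/request-*/
    for run_dir in sorted(root.glob("*/*/*/*/*")):
        if not run_dir.is_dir():
            continue  # skip stray files at run depth
        version, strategy, provider, model, run = run_dir.relative_to(root).parts
        # Count the per-request subdirectories (request-00001, request-00002, ...).
        requests = sum(1 for p in run_dir.glob("request-*") if p.is_dir())
        yield RunInfo(version, strategy, provider, model, run, requests)
```

Calling `list(iter_runs(Path("benchmarks")))` would then give one entry per run, which can be grouped by model or strategy as needed. Summary files such as `leaderboard.json` sit at shallower depths and are not visited by this sketch.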
| 40 | + |
| 41 | +## BalatroBench Integration |
| 42 | + |
| 43 | +### Overview |
| 44 | + |
| 45 | +[BalatroBench](https://s1m0n38.github.io/balatrobench/) is a web-based dashboard for visualizing and comparing LLM performance in Balatro. It provides interactive charts, leaderboards, and detailed analytics. |
| 46 | + |
| 47 | +### Integrating with BalatroBench |
| 48 | + |
| 49 | +To use BalatroBench as a local dashboard for visualizing your benchmark results: |
| 50 | + |
| 51 | +```bash |
| 52 | +# Step 1: Generate runs with custom output directory |
| 53 | +balatrollm --runs-dir example-runs --runs 20 |
| 54 | + |
| 55 | +# Step 2: Generate benchmark analysis |
| 56 | +balatrollm benchmark --runs-dir example-runs --output-dir example-benchmark |
| 57 | + |
| 58 | +# Step 3: Clone BalatroBench repository |
| 59 | +git clone https://github.com/S1M0N38/balatrobench.git /path/to/balatrobench |
| 60 | + |
| 61 | +# Step 4: Move benchmark data to BalatroBench (or create symbolic link) |
| 62 | +mv example-benchmark/benchmarks /path/to/balatrobench/data/benchmarks |
| 63 | +# OR create a symbolic link: |
| 64 | +# ln -s "$(pwd)/example-benchmark/benchmarks" /path/to/balatrobench/data/benchmarks |
| 65 | + |
| 66 | +# Step 5: Host BalatroBench locally |
| 67 | +cd /path/to/balatrobench |
| 68 | +python3 -m http.server 8001 |
| 69 | +# Then visit http://localhost:8001 |
| 70 | +``` |
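For scripted setups, the symbolic-link variant of Step 4 can also be done from Python. This is a sketch only: `link_benchmarks` is a hypothetical helper, and the `data/benchmarks` destination is taken from the commands above rather than from any BalatroBench API.

```python
from pathlib import Path


def link_benchmarks(source: Path, checkout: Path) -> Path:
    """Symlink `source` to `<checkout>/data/benchmarks` and return the link path."""
    source = source.resolve()
    if not source.is_dir():
        raise FileNotFoundError(f"benchmark directory not found: {source}")
    target = checkout / "data" / "benchmarks"
    # Refuse to clobber existing data or a stale link; the caller decides what to do.
    if target.exists() or target.is_symlink():
        raise FileExistsError(f"refusing to overwrite existing path: {target}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.symlink_to(source, target_is_directory=True)
    return target
```

A call such as `link_benchmarks(Path("example-benchmark/benchmarks"), Path("/path/to/balatrobench"))` mirrors the `ln -s` command. Linking rather than moving keeps the generated data in place, so rerunning the benchmark step updates what the dashboard serves.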
| 71 | + |
| 72 | +## BalatroBench Dashboard Views |
| 73 | + |
| 74 | +BalatroBench provides three views for analyzing LLM performance in Balatro, ordered from the aggregate leaderboard down to the individual decisions within a single run. |
| 75 | + |
| 76 | +### 1. Main Leaderboard and Performance Overview |
| 77 | + |
| 78 | +The main dashboard presents a comprehensive comparison of all evaluated models, combining visual and tabular representations of performance metrics. |
| 79 | + |
| 80 | +**Visual Performance Comparison** |
| 81 | + |
| 82 | +The top section features an interactive bar chart of average performance: each bar shows the mean number of rounds a model reached, and error bars show the standard deviation as a measure of consistency. Models are color-coded for identification, with purple for top performers, progressing through gray, green, and blue for lower performers. |
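As an illustration of what each bar and error bar represents, the sketch below computes the mean and standard deviation of rounds reached over a set of runs. It assumes the sample standard deviation; whether BalatroBench uses the sample or population form is not stated here, and `summarize_rounds` is a hypothetical name.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_rounds(rounds: Sequence[int]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of rounds reached per run."""
    if not rounds:
        raise ValueError("at least one run is required")
    average = mean(rounds)
    # stdev() needs two or more samples; a single run has no measurable spread.
    spread = stdev(rounds) if len(rounds) > 1 else 0.0
    return float(average), float(spread)
```

A small standard deviation relative to the mean indicates that the model reaches a similar round in most runs, while a large one means its results vary widely between runs.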
| 83 | + |
| 84 | +**Detailed Leaderboard Table** |
| 85 | + |
| 86 | +Below the chart, a comprehensive table provides granular performance metrics for each model: |
| 87 | + |
| 88 | +- **Model & Vendor**: Lists the specific model name and its provider (x-ai, openai, google, deepseek) |
| 89 | + |
| 90 | +- **Round Performance**: Shows average rounds achieved with standard deviation, indicating both performance level and consistency |
| 91 | + |
| 92 | +- **Success Rate Indicators**: Three color-coded percentage columns: |
| 93 | + |
| 94 | + - **Green (✓)**: Successful round completions |
| 95 | + - **Yellow (⚠)**: Partial completions or warnings |
| 96 | + - **Red (✗)**: Failed attempts or errors |
| 97 | + |
| 98 | +- **Financial Metrics**: |
| 99 | + |
| 100 | + - **In $/¥**: Input token costs with standard deviation |
| 101 | + - **Out $/¥**: Output token costs with standard deviation |
| 102 | + |
| 103 | +- **Performance Timing**: |
| 104 | + |
| 105 | + - **Duration**: Average time per decision with variability measures |
| 106 | + - **Total Cost**: Comprehensive cost analysis including standard deviations |
| 107 | + |
| 108 | +This view enables researchers to quickly identify top-performing models, understand cost-performance trade-offs, and assess model reliability through consistency metrics. |
| 109 | + |
| 110 | +### 2. Model Details and Analytics |
| 111 | + |
| 112 | +Clicking any model in the leaderboard expands it to reveal detailed analytics and a per-run breakdown for that model. |
| 113 | + |
| 114 | +**Performance Distribution Analysis** |
| 115 | + |
| 116 | +The expanded view includes three key analytical components: |
| 117 | + |
| 118 | +**Rounds Distribution Histogram**: A bar chart showing the frequency distribution of rounds achieved across all runs for the selected model. This reveals whether the model's performance is consistent or highly variable, with patterns indicating: |
| 119 | + |
| 120 | +- Consistent performers show narrow, tall distributions |
| 121 | +- Variable performers show wide, scattered distributions |
| 122 | +- Multi-modal distributions may indicate different strategic approaches |
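The distribution shapes described above can be reproduced from raw per-run results. The following sketch, with the hypothetical helper `rounds_histogram`, counts how many runs ended at each round and fills in empty bins so that gaps between modes remain visible.

```python
from collections import Counter
from typing import Dict, Sequence


def rounds_histogram(rounds: Sequence[int]) -> Dict[int, int]:
    """Map each round number in the observed range to the number of runs ending there."""
    if not rounds:
        return {}
    counts = Counter(rounds)
    # Include zero-count bins between the minimum and maximum observed round.
    return {r: counts.get(r, 0) for r in range(min(rounds), max(rounds) + 1)}


def render_histogram(histogram: Dict[int, int]) -> str:
    """Render the histogram as one text bar per round, for quick inspection."""
    return "\n".join(f"{r:>3} | {'#' * n}" for r, n in histogram.items())
```

A narrow histogram concentrated on a few adjacent rounds corresponds to a consistent performer; two separated peaks with empty bins between them correspond to the multi-modal case.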
| 123 | + |
| 124 | +**Provider Breakdown**: A donut chart visualizing the proportion of runs from different API providers, useful for understanding data source diversity and potential provider-specific performance variations. |
| 125 | + |
| 126 | +**Aggregated Statistics Panel**: A summary box displaying key totals: |
| 127 | + |
| 128 | +- **Input/Output Token Counts**: Total tokens processed across all runs |
| 129 | +- **Financial Totals**: Cumulative costs in dollars |
| 130 | +- **Total Processing Time**: Aggregate time spent on decision-making |
| 131 | + |
| 132 | +**Individual Run Analysis Table** |
| 133 | + |
| 134 | +The bottom section provides a detailed breakdown of individual game runs, with each row representing a complete Balatro session: |
| 135 | + |
| 136 | +- **Round-by-Round Results**: Shows progression through different game rounds with success/warning/failure indicators |
| 137 | +- **Performance Metrics**: Input/output costs and timing for each individual run |
| 138 | +- **Success Patterns**: Color-coded indicators help identify at which stages models typically succeed or fail |
| 139 | + |
| 140 | +This view helps researchers see how consistent a model is across runs, at which stages it tends to succeed or fail, and how cost and timing vary from run to run. |
| 141 | + |
| 142 | +### 3. Individual Run Analysis |
| 143 | + |
| 144 | +The most detailed view opens when you click a specific run. It provides a step-by-step analysis of the LLM's decision-making process throughout a complete Balatro game session. |
| 145 | + |
| 146 | +**Game State Visualization** |
| 147 | + |
| 148 | +The left panel displays an actual screenshot of the Balatro game state at the moment of decision, showing: |
| 149 | + |
| 150 | +- **Current Game Phase**: Whether in shop, hand selection, or other game modes |
| 151 | +- **Available Options**: Cards, jokers, and other game elements visible to the LLM |
| 152 | +- **Resource Status**: Money, joker slots, and other strategic resources |
| 153 | +- **Visual Context**: The exact visual information the LLM uses for decision-making |
| 154 | + |
| 155 | +**Strategic Analysis Panel** |
| 156 | + |
| 157 | +The right panel provides comprehensive insight into the LLM's reasoning process: |
| 158 | + |
| 159 | +**Contextual Situation Analysis**: A detailed text description explaining: |
| 160 | + |
| 161 | +- Current game state and available options |
| 162 | +- Strategic considerations and constraints |
| 163 | +- Resource management situation |
| 164 | +- Previous game history and context |
| 165 | + |
| 166 | +**LLM Reasoning Process**: The model's internal reasoning, showing: |
| 167 | + |
| 168 | +- Strategic analysis of available options |
| 169 | +- Cost-benefit calculations |
| 170 | +- Risk assessment considerations |
| 171 | +- Long-term planning thoughts |
| 172 | +- Decision rationale and justification |
| 173 | + |
| 174 | +**Function Call Details**: Technical information about the executed action: |
| 175 | + |
| 176 | +- **Function Name**: The specific game action taken (e.g., "shop", "select_hand") |
| 177 | +- **Parameters**: Exact arguments passed to the game engine |
| 178 | +- **Action Description**: Human-readable explanation of what the model decided to do |
| 179 | +- **Strategic Reasoning**: Why this particular action was chosen |
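The same per-decision data can be read directly from a run directory on disk. The sketch below assumes each `request-*` directory contains the four files listed in the benchmark layout (`reasoning.md`, `request.md`, `screenshot.png`, `tool_call.json`). The helper names are hypothetical, and the JSON is loaded without assuming any particular schema.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, Tuple


def load_request(request_dir: Path) -> Dict[str, Any]:
    """Load the artifacts recorded for a single LLM request."""
    return {
        "reasoning": (request_dir / "reasoning.md").read_text(encoding="utf-8"),
        "request": (request_dir / "request.md").read_text(encoding="utf-8"),
        # The screenshot is returned as a path rather than decoded image data.
        "screenshot": request_dir / "screenshot.png",
        # Schema is not assumed here: the parsed JSON is returned unchanged.
        "tool_call": json.loads(
            (request_dir / "tool_call.json").read_text(encoding="utf-8")
        ),
    }


def iter_requests(run_dir: Path) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (directory name, artifacts) for each request in chronological order."""
    # Zero-padded names (request-00001, ...) sort lexicographically in order.
    for request_dir in sorted(run_dir.glob("request-*")):
        if request_dir.is_dir():
            yield request_dir.name, load_request(request_dir)
```

Iterating with `iter_requests` gives the same chronological sequence of decisions that the dashboard steps through, which is convenient for searching reasoning text or tallying the actions taken in a run.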
| 180 | + |
| 181 | +**Navigation and Analysis Tools** |
| 182 | + |
| 183 | +- **Step Navigation**: Arrow controls allow researchers to move through the chronological sequence of decisions within a single game run |
| 184 | +- **Request Numbering**: Clear labeling of each decision point for reference and analysis |
| 185 | +- **Modal Interface**: Clean overlay design that allows easy comparison with the main dashboard |
| 186 | + |
| 187 | +This detailed view enables researchers to: |
| 188 | + |
| 189 | +- Understand exactly how LLMs interpret visual game states |
| 190 | +- Analyze the quality and depth of strategic reasoning |
| 191 | +- Identify decision-making patterns and potential improvements |
| 192 | +- Debug specific failure modes or suboptimal choices |
| 193 | +- Study the relationship between reasoning quality and performance outcomes |
| 194 | + |
| 195 | +The combination of visual game state, strategic reasoning, and technical execution details provides a complete picture of LLM behavior in complex, multi-step decision-making scenarios. |