| 1 | +# SPECTRE Agent System — Complete Setup Guide |
| 2 | + |
| 3 | +This guide covers setting up the autonomous Python agent system with Discord bot integration on the Spectre (Franklin) cluster. |
| 4 | + |
| 5 | +## Prerequisites |
| 6 | + |
| 7 | +- Python 3.11+ on the cluster login/utility node |
| 8 | +- `uv` package manager installed |
| 9 | +- Access to SLURM commands (`sbatch`, `sacct`, `squeue`) |
| 10 | +- BeeGFS mounted at `/mnt/beegfs/` |
| 11 | +- An Anthropic API key with access to Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5 |
| 12 | +- A Discord account with permission to create bots |
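The command-line prerequisites above can be checked before going further. The following is a minimal sketch, not part of the `spectre_agents` package: `check_prerequisites` is a hypothetical helper, and it only confirms that the `/mnt/beegfs` directory exists, not that BeeGFS is actually mounted there.

```python
import shutil
import sys
from pathlib import Path

# Commands and path taken from the prerequisites list above.
REQUIRED_COMMANDS = ("uv", "sbatch", "sacct", "squeue")
BEEGFS_DIR = Path("/mnt/beegfs")


def check_prerequisites(commands=REQUIRED_COMMANDS, mount=BEEGFS_DIR,
                        min_python=(3, 11)):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    for cmd in commands:
        # shutil.which returns None when the command is not on PATH.
        if shutil.which(cmd) is None:
            problems.append(f"command not found on PATH: {cmd}")
    # Existence check only; this does not verify the filesystem type.
    if not mount.is_dir():
        problems.append(f"directory not found: {mount}")
    return problems


for issue in check_prerequisites():
    print(f"MISSING: {issue}")
```

Running it on the login/utility node prints one `MISSING:` line per failed check and nothing when all checks pass.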
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +## Step 1: Create a Discord Bot |
| 17 | + |
| 18 | +1. Go to https://discord.com/developers/applications |
| 19 | +2. Click **New Application** and name it `SPECTRE Bot` |
| 20 | +3. Go to the **Bot** tab: |
| 21 | + - Click **Add Bot** (if not already created) |
| 22 | + - Copy the **Token** and store it securely; it becomes `DISCORD_BOT_TOKEN` in Step 3 |
| 23 | + - Enable **Message Content Intent** (required for reading messages) |
| 24 | + - Enable **Server Members Intent** |
| 25 | +4. Go to the **OAuth2 → URL Generator** tab: |
| 26 | + - **Scopes**: select `bot` and `applications.commands` |
| 27 | + - **Bot Permissions**: select: |
| 28 | + - Send Messages |
| 29 | + - Embed Links |
| 30 | + - Attach Files |
| 31 | + - Use Slash Commands |
| 32 | + - Read Message History |
| 33 | + - Create Public Threads |
| 34 | + - Copy the generated URL |
| 35 | +5. Open the URL in your browser and add the bot to your Discord server |
| 36 | + |
| 37 | +## Step 2: Set Up Discord Server Channels |
| 38 | + |
| 39 | +Create these channels in your Discord server: |
| 40 | + |
| 41 | +| Channel | Purpose | |
| 42 | +|---------|---------| |
| 43 | +| `#simulation-status` | Automated status updates, milestones | |
| 44 | +| `#decisions` | Interactive decision requests with buttons | |
| 45 | +| `#alerts` | Failure alerts and critical warnings | |
| 46 | +| `#plots` | Surface field PNGs, convergence plots | |
| 47 | +| `#logs` | Verbose agent activity (optional) | |
| 48 | + |
| 49 | +**Get your Guild (Server) ID:** |
| 50 | +- Enable Developer Mode in Discord (Settings → Advanced → Developer Mode) |
| 51 | +- Right-click your server name → Copy Server ID (this value becomes `DISCORD_GUILD_ID` in Step 3) |
| 52 | + |
| 53 | +## Step 3: Configure Secrets |
| 54 | + |
| 55 | +Create the secrets file on the cluster: |
| 56 | + |
| 57 | +```bash |
| 58 | +sudo mkdir -p /etc/spectre-agents |
| 59 | +sudo tee /etc/spectre-agents/env << 'EOF' |
| 60 | +ANTHROPIC_API_KEY=sk-ant-your-key-here |
| 61 | +DISCORD_BOT_TOKEN=your-bot-token-here |
| 62 | +DISCORD_GUILD_ID=your-guild-id-here |
| 63 | +EOF |
| 64 | +sudo chmod 600 /etc/spectre-agents/env |
| 65 | +sudo chown joe:joe /etc/spectre-agents/env |
| 66 | +``` |
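A common mistake at this step is leaving one of the `...-here` placeholders in place. The sketch below is one way to catch that, assuming the file uses plain `KEY=value` lines as shown above; `parse_env_text` and `missing_or_placeholder` are hypothetical helpers, not functions from `spectre_agents`.

```python
import os
from pathlib import Path

# The three variables written to /etc/spectre-agents/env above.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "DISCORD_BOT_TOKEN", "DISCORD_GUILD_ID")


def parse_env_text(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_or_placeholder(values, required=REQUIRED_KEYS):
    """Return required keys that are absent, empty, or still end in '-here'."""
    return [k for k in required
            if not values.get(k) or values[k].endswith("-here")]


env_path = Path("/etc/spectre-agents/env")
# Only read the file when it exists and the current user may read it.
if env_path.is_file() and os.access(env_path, os.R_OK):
    print(missing_or_placeholder(parse_env_text(env_path.read_text())))
```

An empty list means all three variables are set to something other than the template placeholders; it does not confirm that the key or token is valid.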
| 67 | + |
| 68 | +## Step 4: Install the Agent System |
| 69 | + |
| 70 | +```bash |
| 71 | +cd /mnt/beegfs/spectre-150-ensembles |
| 72 | + |
| 73 | +# Create virtual environment |
| 74 | +uv venv .venv |
| 75 | + |
| 76 | +# Install dependencies (includes spectre_agents package) |
| 77 | +uv sync |
| 78 | + |
| 79 | +# Verify the package loads |
| 80 | +.venv/bin/python -c "from spectre_agents.config import load_config; print('OK')" |
| 81 | +``` |
| 82 | + |
| 83 | +## Step 5: Test the Bot Locally |
| 84 | + |
| 85 | +Before installing as a service, test interactively: |
| 86 | + |
| 87 | +```bash |
| 88 | +cd /mnt/beegfs/spectre-150-ensembles |
| 89 | + |
| 90 | +# Source the secrets |
| 91 | +source /etc/spectre-agents/env |
| 92 | +export ANTHROPIC_API_KEY DISCORD_BOT_TOKEN DISCORD_GUILD_ID |
| 93 | + |
| 94 | +# Run the agent system |
| 95 | +.venv/bin/python -m spectre_agents --config spectre_agents_config.yaml |
| 96 | +``` |
| 97 | + |
| 98 | +You should see: |
| 99 | +``` |
| 100 | +SPECTRE Agent System starting... |
| 101 | +Bot connected as SPECTRE Bot#1234 (ID: ...) |
| 102 | +Synced commands to guild ... |
| 103 | +``` |
| 104 | + |
| 105 | +In Discord, the bot should post "SPECTRE Agent System online" in `#simulation-status`. |
| 106 | + |
| 107 | +Test slash commands: |
| 108 | +- `/run status` — should show current (idle) status |
| 109 | +- `/validate` — should run namelist validation |
| 110 | +- `/dashboard status` — should check dashboard health |
| 111 | + |
| 112 | +Press `Ctrl+C` to stop. |
| 113 | + |
| 114 | +## Step 6: Install as a Systemd Service |
| 115 | + |
| 116 | +```bash |
| 117 | +# Copy the service file |
| 118 | +sudo cp systemd/spectre-agents.service /etc/systemd/system/ |
| 119 | + |
| 120 | +# Reload systemd |
| 121 | +sudo systemctl daemon-reload |
| 122 | + |
| 123 | +# Enable and start the service |
| 124 | +sudo systemctl enable spectre-agents |
| 125 | +sudo systemctl start spectre-agents |
| 126 | + |
| 127 | +# Check status |
| 128 | +sudo systemctl status spectre-agents |
| 129 | + |
| 130 | +# View logs |
| 131 | +journalctl -u spectre-agents -f |
| 132 | +``` |
| 133 | + |
| 134 | +## Step 7: Verify Everything Works |
| 135 | + |
| 136 | +1. In Discord, run `/run status` — bot should respond with a status embed |
| 137 | +2. Run `/validate` — should trigger namelist validation and return results |
| 138 | +3. Run `/dashboard status` — should report dashboard component health |
| 139 | +4. Run `/run start` — should validate, submit a SLURM job, and start monitoring |
| 140 | + |
| 141 | +--- |
| 142 | + |
| 143 | +## Architecture Overview |
| 144 | + |
| 145 | +``` |
| 146 | +┌─────────────────────────────────────────────┐ |
| 147 | +│ Spectre Cluster Node │ |
| 148 | +│ │ |
| 149 | +│ systemd: spectre-agents.service │ |
| 150 | +│ ┌───────────────────────────────────────┐ │ |
| 151 | +│ │ python -m spectre_agents │ │ |
| 152 | +│ │ │ │ |
| 153 | +│ │ Discord Bot (asyncio event loop) │ │ |
| 154 | +│ │ ├── Slash commands → Agent runner │ │ |
| 155 | +│ │ ├── Decision queue ← Orchestrator │ │ |
| 156 | +│ │ └── Status embeds → Discord │ │ |
| 157 | +│ │ │ │ |
| 158 | +│ │ Agent Runner (ThreadPoolExecutor) │ │ |
| 159 | +│ │ ├── Orchestrator (Opus) │ │ |
| 160 | +│ │ │ delegates to: │ │ |
| 161 | +│ │ ├── WorkflowRunner (Haiku) │ │ |
| 162 | +│ │ ├── StdoutDiagnostics (Sonnet) │ │ |
| 163 | +│ │ ├── ModelOutputReview (Sonnet) │ │ |
| 164 | +│ │ ├── NamelistValidator (Sonnet) │ │ |
| 165 | +│ │ ├── ForcingDataQC (Sonnet) │ │ |
| 166 | +│ │ ├── DashboardManager (Haiku) │ │ |
| 167 | +│ │ ├── DiscordNotifier (Haiku) │ │ |
| 168 | +│ │ └── WebResearch (Sonnet) │ │ |
| 169 | +│ └───────────────────────────────────────┘ │ |
| 170 | +│ │ |
| 171 | +│ SLURM ←→ sbatch/sacct/squeue │ |
| 172 | +│ BeeGFS ←→ /mnt/beegfs/spectre-* │ |
| 173 | +│ Tailscale ←→ Dashboard proxy │ |
| 174 | +└─────────────────────────────────────────────┘ |
| 175 | +``` |
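The diagram pairs an asyncio event loop (the Discord bot) with a `ThreadPoolExecutor` (the agent runner). A minimal standard-library sketch of that pattern follows. It is an illustration under assumptions, not the project's code: `run_agent_blocking` is a hypothetical stand-in for a slow agent call, and `handle_command` is a hypothetical handler.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def run_agent_blocking(agent_name, task):
    """Hypothetical stand-in for a blocking agent invocation."""
    time.sleep(0.05)  # simulate slow work such as an API round trip
    return f"{agent_name} finished: {task}"


async def handle_command(executor, agent_name, task):
    """Run the blocking call in a worker thread so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, run_agent_blocking, agent_name, task)


async def main():
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Both calls run concurrently in worker threads;
        # gather returns results in the order the awaitables were passed.
        return await asyncio.gather(
            handle_command(executor, "NamelistValidator", "validate"),
            handle_command(executor, "DashboardManager", "status"),
        )


results = asyncio.run(main())
print(results)
```

Keeping blocking work off the event loop is what lets a bot keep answering other commands while a long agent task is still running.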
| 176 | + |
| 177 | +## Discord Commands Reference |
| 178 | + |
| 179 | +| Command | Description | |
| 180 | +|---------|-------------| |
| 181 | +| `/run start` | Validate config, submit simulation, start monitoring | |
| 182 | +| `/run status` | Show job state, model days, CFL, throughput | |
| 183 | +| `/run stop` | Cancel SLURM job, stop monitoring | |
| 184 | +| `/run resubmit` | Clear run dir, resubmit from pickup | |
| 185 | +| `/diagnose [job_id]` | Run STDOUT failure diagnostics | |
| 186 | +| `/review` | Model output physical plausibility check | |
| 187 | +| `/validate` | Pre-flight namelist validation | |
| 188 | +| `/qc forcing` | EXF forcing data QC | |
| 189 | +| `/qc obc` | OBC boundary data QC | |
| 190 | +| `/dashboard start` | Start monitoring stack | |
| 191 | +| `/dashboard status` | Health-check all components | |
| 192 | +| `/dashboard restart [component]` | Restart dashboard/converter/plotter | |
| 193 | +| `/ensemble start` | Begin bred vector generation | |
| 194 | +| `/ensemble status` | Show ensemble convergence | |
| 195 | +| `/config [param]` | Show simulation configuration | |
| 196 | + |
| 197 | +## Agent Autonomy Levels |
| 198 | + |
| 199 | +The system operates with **high autonomy**: |
| 200 | + |
| 201 | +**Autonomous actions (no Discord approval needed):** |
| 202 | +- Resubmit after SLURM walltime exceeded |
| 203 | +- Restart dead dashboard/plotter/converter processes |
| 204 | +- Clear run directory before resubmit |
| 205 | +- Rebuild container image if not found |
| 206 | + |
| 207 | +**Requires Discord approval (posts interactive buttons):** |
| 208 | +- Timestep changes (CFL approaching 0.45) |
| 209 | +- Ambiguous failure with multiple fix options |
| 210 | +- Physics parameter changes (viscosity, diffusion) |
| 211 | +- First-time configuration submission |
| 212 | +- Bred vector cycle completion review |
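The two lists above amount to a lookup from action to approval requirement. The sketch below shows one way such a policy could be expressed. The action identifiers and the `needs_approval` helper are hypothetical labels for illustration, and the default for unknown actions is an assumption rather than documented behavior.

```python
from enum import Enum


class Approval(Enum):
    AUTONOMOUS = "autonomous"
    REQUIRES_DISCORD = "requires_discord"


# Hypothetical action identifiers, one per bullet in the lists above.
POLICY = {
    "resubmit_after_walltime": Approval.AUTONOMOUS,
    "restart_dashboard_process": Approval.AUTONOMOUS,
    "clear_run_directory": Approval.AUTONOMOUS,
    "rebuild_container_image": Approval.AUTONOMOUS,
    "change_timestep": Approval.REQUIRES_DISCORD,
    "resolve_ambiguous_failure": Approval.REQUIRES_DISCORD,
    "change_physics_parameter": Approval.REQUIRES_DISCORD,
    "first_time_config_submission": Approval.REQUIRES_DISCORD,
    "review_bred_vector_cycle": Approval.REQUIRES_DISCORD,
}


def needs_approval(action):
    """Assumption: actions missing from the table require approval."""
    return POLICY.get(action, Approval.REQUIRES_DISCORD) is Approval.REQUIRES_DISCORD


print(needs_approval("clear_run_directory"))
print(needs_approval("change_timestep"))
```

Defaulting unknown actions to approval keeps a newly added action from running unattended by accident.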
| 213 | + |
| 214 | +## Troubleshooting |
| 215 | + |
| 216 | +### Bot doesn't respond to commands |
| 217 | +- Check `journalctl -u spectre-agents -f` for errors |
| 218 | +- Verify `DISCORD_BOT_TOKEN` and `DISCORD_GUILD_ID` are correct |
| 219 | +- Ensure the bot has the required permissions in your server |
| 220 | +- Globally synced commands can take up to 1 hour to appear; guild-scoped sync (the "Synced commands to guild" line in the startup log) takes effect immediately |
| 221 | + |
| 222 | +### "Claude Agent SDK not found" error |
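| 223 | +- Ensure `claude-agent-sdk` is installed. Environments created with `uv venv` do not include `pip` by default, so run this from the project directory: `uv pip list | grep claude` |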
| 224 | +- The Claude Code CLI must be installed on the system: `which claude` |
| 225 | + |
| 226 | +### Agent times out |
| 227 | +- Check `ANTHROPIC_API_KEY` is valid and has quota |
| 228 | +- Increase `max_turns` in `spectre_agents_config.yaml` if agents need more steps |
| 229 | +- Check network connectivity from the cluster node |
| 230 | + |
| 231 | +### SLURM commands fail |
| 232 | +- Verify the service runs as the correct user (joe) |
| 233 | +- Check that SLURM is accessible from the node running the service |
| 234 | +- Ensure the working directory exists: `/mnt/beegfs/spectre-150-ensembles` |
| 235 | + |
| 236 | +## Cost Estimates |
| 237 | + |
| 238 | +| Agent | Model | Approx. cost per invocation | |
| 239 | +|-------|-------|---------------------------| |
| 240 | +| Orchestrator | Opus 4.6 | $0.10 – $0.50 | |
| 241 | +| Diagnostics/Review/Validator/QC | Sonnet 4.6 | $0.02 – $0.10 | |
| 242 | +| WorkflowRunner/Dashboard/Notify | Haiku 4.5 | $0.005 – $0.02 | |
| 243 | + |
| 244 | +A typical run-diagnose-fix-restart cycle costs approximately $0.50 – $1.00. |
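The per-invocation ranges in the table can be combined into a rough estimate for a whole cycle. The sketch below does that arithmetic. The invocation counts in `example` are an assumed mix chosen for illustration, not measured figures, so the result will not necessarily match the typical range quoted above.

```python
# Per-invocation cost ranges in USD, copied from the table above.
COST_RANGES = {
    "opus": (0.10, 0.50),
    "sonnet": (0.02, 0.10),
    "haiku": (0.005, 0.02),
}


def cycle_cost(invocations):
    """Return (low, high) total cost for a dict of model -> invocation count."""
    low = sum(COST_RANGES[model][0] * n for model, n in invocations.items())
    high = sum(COST_RANGES[model][1] * n for model, n in invocations.items())
    # Round to tenths of a cent to hide floating-point noise.
    return round(low, 3), round(high, 3)


# Assumed mix for one cycle: 1 Orchestrator call, 3 Sonnet calls, 4 Haiku calls.
example = {"opus": 1, "sonnet": 3, "haiku": 4}
low, high = cycle_cost(example)
print(f"${low:.2f} to ${high:.2f}")
```

Changing the counts in `example` shows how quickly the estimate moves with the number of Orchestrator invocations, which dominate the total.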