261 changes: 261 additions & 0 deletions docs/discord-setup.md
@@ -0,0 +1,261 @@
# SPECTRE Agent System — Complete Setup Guide

This guide covers setting up the autonomous Python agent system with Discord bot integration on the Spectre (Franklin) cluster.

## Prerequisites

- Python 3.11+ on the cluster login/utility node
- `uv` package manager installed
- Access to SLURM commands (`sbatch`, `sacct`, `squeue`)
- BeeGFS mounted at `/mnt/beegfs/`
- An Anthropic API key with access to Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5
- A Discord account with permission to create bots

---

## Step 1: Create a Discord Bot

1. Go to https://discord.com/developers/applications
2. Click **New Application** and name it `SPECTRE Bot`
3. Go to the **Bot** tab:
   - Click **Add Bot** (if not already created)
   - Copy the **Token** and store it securely; you'll need it in Step 3
   - Enable **Message Content Intent** (required for reading messages)
   - Enable **Server Members Intent**
4. Go to the **OAuth2 → URL Generator** tab:
   - **Scopes**: select `bot` and `applications.commands`
   - **Bot Permissions**: select:
     - Send Messages
     - Embed Links
     - Attach Files
     - Use Slash Commands
     - Read Message History
     - Create Public Threads
   - Copy the generated URL
5. Open the URL in your browser and add the bot to your Discord server

## Step 2: Set Up Discord Server Channels

Create these channels in your Discord server:

| Channel | Purpose |
|---------|---------|
| `#simulation-status` | Automated status updates, milestones |
| `#decisions` | Interactive decision requests with buttons |
| `#alerts` | Failure alerts and critical warnings |
| `#plots` | Surface field PNGs, convergence plots |
| `#logs` | Verbose agent activity (optional) |
| `#ask-mitgcm` | Knowledge Q&A — ask about MITgcm, ERA5, oceanography, or the codebase |
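
One way to picture the channel layout above is as a routing table from event kinds to channel names. The sketch below is illustrative only: the event names and the `channel_for` helper are hypothetical and are not taken from the `spectre_agents` package.

```python
from typing import Optional

# Hypothetical routing table: event kind -> channel name from the table above.
CHANNEL_ROUTES = {
    "status": "simulation-status",
    "milestone": "simulation-status",
    "decision": "decisions",
    "failure": "alerts",
    "plot": "plots",
    "debug": "logs",
}


def channel_for(event_kind: str, logs_enabled: bool = True) -> Optional[str]:
    """Return the channel for an event, or None when it should be dropped."""
    channel = CHANNEL_ROUTES.get(event_kind)
    if channel == "logs" and not logs_enabled:
        return None  # `#logs` is optional, so verbose events may have nowhere to go
    return channel
```

Unknown event kinds return `None` here rather than falling back to a default channel, so a typo in an event name cannot silently flood `#alerts`.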

**Get your Guild (Server) ID:**
- Enable Developer Mode in Discord (Settings → Advanced → Developer Mode)
- Right-click your server name → Copy Server ID

## Step 3: Configure Secrets

Create the secrets file on the cluster:

```bash
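sudo mkdir -p /etc/spectre-agents
# Create the file with restrictive permissions before any secret is written.
# Replace `joe` with the user the service runs as.
sudo install -m 600 -o joe -g joe /dev/null /etc/spectre-agents/env
# Redirect tee's stdout so the secrets are not echoed to the terminal.
sudo tee /etc/spectre-agents/env > /dev/null << 'EOF'
ANTHROPIC_API_KEY=sk-ant-your-key-here
DISCORD_BOT_TOKEN=your-bot-token-here
DISCORD_GUILD_ID=your-guild-id-here
EOF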
```
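
Before starting the bot, it can help to confirm that the secrets file actually contains usable values. The following is a minimal sketch, assuming the plain `KEY=VALUE` format shown above; `parse_env_text` and `find_problems` are hypothetical helpers, not part of `spectre_agents`.

```python
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "DISCORD_BOT_TOKEN", "DISCORD_GUILD_ID")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and `#` comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_problems(values: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the file looks usable."""
    problems = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        if not value:
            problems.append(f"{key} is missing or empty")
        elif value.endswith("-here"):
            # The template above uses placeholders that all end in "-here".
            problems.append(f"{key} still holds the placeholder value")
    guild_id = values.get("DISCORD_GUILD_ID", "")
    if guild_id and not guild_id.endswith("-here") and not guild_id.isdigit():
        problems.append("DISCORD_GUILD_ID should be a numeric server ID")
    return problems
```

To use it, read the file as the service user, for example `find_problems(parse_env_text(Path("/etc/spectre-agents/env").read_text()))`, and treat any returned message as a reason not to start the service yet.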

## Step 4: Install the Agent System

```bash
cd /mnt/beegfs/spectre-150-ensembles

# Create virtual environment
uv venv .venv

# Install dependencies (includes spectre_agents package)
uv sync

# Verify the package loads
.venv/bin/python -c "from spectre_agents.config import load_config; print('OK')"
```

## Step 5: Test the Bot Locally

Before installing as a service, test interactively:

```bash
cd /mnt/beegfs/spectre-150-ensembles

# Source the secrets
source /etc/spectre-agents/env
export ANTHROPIC_API_KEY DISCORD_BOT_TOKEN DISCORD_GUILD_ID

# Run the agent system
.venv/bin/python -m spectre_agents --config spectre_agents_config.yaml
```

You should see:
```
SPECTRE Agent System starting...
Bot connected as SPECTRE Bot#1234 (ID: ...)
Synced commands to guild ...
```

In Discord, the bot should post "SPECTRE Agent System online" in `#simulation-status`.

Test slash commands:
- `/run status` — should show current (idle) status
- `/validate` — should run namelist validation
- `/dashboard status` — should check dashboard health

Press `Ctrl+C` to stop.

## Step 6: Install as a Systemd Service

```bash
# Copy the service file
sudo cp systemd/spectre-agents.service /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Enable and start the service
sudo systemctl enable spectre-agents
sudo systemctl start spectre-agents

# Check status
sudo systemctl status spectre-agents

# View logs
journalctl -u spectre-agents -f
```

## Step 7: Verify Everything Works

1. In Discord, run `/run status` — bot should respond with a status embed
2. Run `/validate` — should trigger namelist validation and return results
3. Run `/dashboard status` — should report dashboard component health
4. Run `/run start` — should validate, submit a SLURM job, and start monitoring

---

## Architecture Overview

```
┌─────────────────────────────────────────────┐
│  Spectre Cluster Node                       │
│                                             │
│  systemd: spectre-agents.service            │
│  ┌───────────────────────────────────────┐  │
│  │  python -m spectre_agents             │  │
│  │                                       │  │
│  │  Discord Bot (asyncio event loop)     │  │
│  │  ├── Slash commands → Agent runner    │  │
│  │  ├── Decision queue ← Orchestrator    │  │
│  │  └── Status embeds → Discord          │  │
│  │                                       │  │
│  │  Agent Runner (ThreadPoolExecutor)    │  │
│  │  ├── Orchestrator (Opus)              │  │
│  │  │     delegates to:                  │  │
│  │  ├── WorkflowRunner (Haiku)           │  │
│  │  ├── StdoutDiagnostics (Sonnet)       │  │
│  │  ├── ModelOutputReview (Sonnet)       │  │
│  │  ├── NamelistValidator (Sonnet)       │  │
│  │  ├── ForcingDataQC (Sonnet)           │  │
│  │  ├── DashboardManager (Haiku)         │  │
│  │  ├── DiscordNotifier (Haiku)          │  │
│  │  └── WebResearch (Sonnet)             │  │
│  └───────────────────────────────────────┘  │
│                                             │
│  SLURM ←→ sbatch/sacct/squeue               │
│  BeeGFS ←→ /mnt/beegfs/spectre-*            │
│  Tailscale ←→ Dashboard proxy               │
└─────────────────────────────────────────────┘
```
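
The diagram pairs an asyncio event loop (the Discord bot) with a `ThreadPoolExecutor` (the agent runner). A minimal sketch of that pattern, using only the standard library, is shown here; `run_agent_blocking` and `heartbeat` are stand-ins invented for illustration, and the real runner in `spectre_agents` may be wired differently.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def run_agent_blocking(name: str, seconds: float) -> str:
    """Stand-in for a blocking agent call (hypothetical)."""
    time.sleep(seconds)
    return f"{name} finished"


async def heartbeat(ticks: list[int]) -> None:
    """Stand-in for event-loop work that must stay responsive."""
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0.01)


async def demo() -> tuple[str, list[int]]:
    loop = asyncio.get_running_loop()
    ticks: list[int] = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The blocking call runs in a worker thread, so the heartbeat keeps ticking.
        agent_future = loop.run_in_executor(pool, run_agent_blocking, "Orchestrator", 0.05)
        await heartbeat(ticks)
        result = await agent_future
    return result, ticks
```

Running `asyncio.run(demo())` shows the point of the split: the coroutine keeps making progress while the agent call sleeps in a worker thread, which is what keeps slash commands responsive during long agent runs.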

## Discord Commands Reference

### Slash commands (simulation ops)

| Command | Description |
|---------|-------------|
| `/run start` | Validate config, submit simulation, start monitoring |
| `/run status` | Show job state, model days, CFL, throughput |
| `/run stop` | Cancel SLURM job, stop monitoring |
| `/run resubmit` | Clear run dir, resubmit from pickup |
| `/diagnose [job_id]` | Run STDOUT failure diagnostics |
| `/review` | Model output physical plausibility check |
| `/validate` | Pre-flight namelist validation |
| `/qc forcing` | EXF forcing data QC |
| `/qc obc` | OBC boundary data QC |
| `/dashboard start` | Start monitoring stack |
| `/dashboard status` | Health-check all components |
| `/dashboard restart [component]` | Restart dashboard/converter/plotter |
| `/ensemble start` | Begin bred vector generation |
| `/ensemble status` | Show ensemble convergence |
| `/config [param]` | Show simulation configuration |

### Knowledge Q&A (`#ask-mitgcm`)

Just type a question in the `#ask-mitgcm` channel — no slash command needed.
The bot answers using Claude with full context about:

- **MITgcm**: parameters, packages, Fortran source, debugging
- **ERA5 / GLORYS**: variable definitions, units, accumulation conventions
- **This simulation**: grid, forcing, namelists, workflows, known gotchas
- **Oceanography**: North Atlantic circulation, air-sea fluxes, ensemble methods
- **HPC / SLURM**: job scheduling, containers, parallel I/O

Long answers automatically create a thread to keep the channel clean.
The bot can also search the web and read files in the repo for up-to-date answers.
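
Discord caps a single message at 2000 characters, so a long answer has to be split before it is posted. The sketch below shows one way to do that; the rule that any answer over the limit goes into a thread is an assumption, as the guide does not say what threshold the bot uses, and both function names are hypothetical.

```python
DISCORD_MESSAGE_LIMIT = 2000  # Discord's per-message character limit


def split_message(text: str, limit: int = DISCORD_MESSAGE_LIMIT) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring line breaks."""
    chunks = []
    remaining = text
    while len(remaining) > limit:
        cut = remaining.rfind("\n", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable newline: fall back to a hard split
        chunks.append(remaining[:cut])
        remaining = remaining[cut:].lstrip("\n")
    if remaining:
        chunks.append(remaining)
    return chunks


def needs_thread(text: str) -> bool:
    """Assumed policy: open a thread when the answer does not fit in one message."""
    return len(text) > DISCORD_MESSAGE_LIMIT
```

Splitting at newlines where possible keeps paragraphs and code lines intact across messages; the hard split is only a fallback for very long unbroken lines.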

## Agent Autonomy Levels

The system operates with **high autonomy**:

**Autonomous actions (no Discord approval needed):**
- Resubmit after SLURM walltime exceeded
- Restart dead dashboard/plotter/converter processes
- Clear run directory before resubmit
- Rebuild container image if not found

**Requires Discord approval (posts interactive buttons):**
- Timestep changes (CFL approaching 0.45)
- Ambiguous failure with multiple fix options
- Physics parameter changes (viscosity, diffusion)
- First-time configuration submission
- Bred vector cycle completion review
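
The two lists above amount to an allow-list: a small set of actions the agents may take alone, with everything else routed to Discord. A minimal sketch of that policy follows; the `Action` names are invented for illustration and do not come from the `spectre_agents` code.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical action names mirroring the two lists above."""

    RESUBMIT_AFTER_WALLTIME = "resubmit_after_walltime"
    RESTART_DASHBOARD_PROCESS = "restart_dashboard_process"
    CLEAR_RUN_DIR = "clear_run_dir"
    REBUILD_CONTAINER = "rebuild_container"
    CHANGE_TIMESTEP = "change_timestep"
    CHOOSE_AMBIGUOUS_FIX = "choose_ambiguous_fix"
    CHANGE_PHYSICS_PARAMETER = "change_physics_parameter"
    FIRST_SUBMISSION = "first_submission"
    REVIEW_BRED_VECTOR_CYCLE = "review_bred_vector_cycle"


# Only these actions may run without a human decision.
AUTONOMOUS = frozenset(
    {
        Action.RESUBMIT_AFTER_WALLTIME,
        Action.RESTART_DASHBOARD_PROCESS,
        Action.CLEAR_RUN_DIR,
        Action.REBUILD_CONTAINER,
    }
)


def requires_approval(action: Action) -> bool:
    """Anything not explicitly autonomous is sent to Discord for approval."""
    return action not in AUTONOMOUS
```

Checking membership in the autonomous set, rather than in an approval set, means a newly added action defaults to requiring approval until someone deliberately allows it.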

## Troubleshooting

### Bot doesn't respond to commands
- Check `journalctl -u spectre-agents -f` for errors
- Verify `DISCORD_BOT_TOKEN` and `DISCORD_GUILD_ID` are correct
- Ensure the bot has the required permissions in your server
- Commands may take up to 1 hour to sync globally; guild sync is instant

### "Claude Agent SDK not found" error
- Ensure `claude-agent-sdk` is installed: `uv pip list | grep claude` (environments created by `uv venv` do not include `pip` by default, so `.venv/bin/pip` may not exist)
- The Claude Code CLI must be installed on the system: `which claude`

### Agent times out
- Check `ANTHROPIC_API_KEY` is valid and has quota
- Increase `max_turns` in `spectre_agents_config.yaml` if agents need more steps
- Check network connectivity from the cluster node

### SLURM commands fail
- Verify the service runs as the correct user (joe)
- Check that SLURM is accessible from the node running the service
- Ensure the working directory exists: `/mnt/beegfs/spectre-150-ensembles`
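
The checks above can be scripted with the standard library. This is a sketch under the assumption that it is run as the same user and on the same node as the service, since `PATH` and mounts can differ otherwise; the helper names are hypothetical.

```python
import shutil
from pathlib import Path

SLURM_COMMANDS = ("sbatch", "sacct", "squeue")


def missing_commands(commands: tuple[str, ...] = SLURM_COMMANDS) -> list[str]:
    """Return the commands that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


def check_environment(workdir: str) -> list[str]:
    """Collect problems with SLURM availability and the working directory."""
    problems = [f"{cmd} not found on PATH" for cmd in missing_commands()]
    if not Path(workdir).is_dir():
        problems.append(f"working directory missing: {workdir}")
    return problems
```

Calling `check_environment("/mnt/beegfs/spectre-150-ensembles")` returns an empty list when all three SLURM commands resolve and the directory exists.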

## Cost Estimates

| Agent | Model | Approx. cost per invocation |
|-------|-------|---------------------------|
| Orchestrator | Opus 4.6 | $0.10 – $0.50 |
| Diagnostics/Review/Validator/QC | Sonnet 4.6 | $0.02 – $0.10 |
| WorkflowRunner/Dashboard/Notify | Haiku 4.5 | $0.005 – $0.02 |

A typical run-diagnose-fix-restart cycle costs approximately $0.50 – $1.00.
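
To see how the per-invocation ranges combine, the sketch below sums them for an assumed mix of calls. The mix (2 Orchestrator, 3 Sonnet, 4 Haiku invocations) is a guess made for illustration; the guide does not state how many invocations a cycle involves.

```python
# Per-invocation cost ranges in USD, copied from the table above.
COST_RANGES = {
    "opus": (0.10, 0.50),
    "sonnet": (0.02, 0.10),
    "haiku": (0.005, 0.02),
}

# Assumed mix for one run-diagnose-fix-restart cycle (not from the guide).
ASSUMED_CYCLE = {"opus": 2, "sonnet": 3, "haiku": 4}


def cycle_cost(invocations: dict[str, int]) -> tuple[float, float]:
    """Return (low, high) total cost in USD for the given invocation counts."""
    low = sum(COST_RANGES[model][0] * n for model, n in invocations.items())
    high = sum(COST_RANGES[model][1] * n for model, n in invocations.items())
    return round(low, 2), round(high, 2)
```

With this mix, `cycle_cost(ASSUMED_CYCLE)` gives bounds of $0.28 and $1.38, a range that contains the $0.50 to $1.00 estimate above. The Orchestrator dominates the upper bound, so the number of Opus calls is the figure to watch.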
10 changes: 8 additions & 2 deletions pyproject.toml
@@ -33,11 +33,17 @@ dependencies = [
     "xarray==2025.6.1",
     "xgcm==0.8.1",
     "zarr==3.0.8",
-    "MetPy==1.7.1"
+    "MetPy==1.7.1",
+    "claude-agent-sdk",
+    "discord.py>=2.3.0",
+    "anyio>=4.0.0",
 ]
 [project.urls]
 Homepage = "https://github.com/ocean-spectre/spectra-150-ensembles"
 Issues = "https://github.com/fluidnumerics/spectre_utils/issues"
+
+[project.scripts]
+spectre-agents = "spectre_agents.__main__:cli"
 
 [tool.setuptools]
-packages = ["spectre_utils"]
+packages = ["spectre_utils", "spectre_agents", "spectre_agents.tools", "spectre_agents.agents", "spectre_agents.discord_bot"]
7 changes: 7 additions & 0 deletions spectre_agents/__init__.py
@@ -0,0 +1,7 @@
"""SPECTRE Simulation Agent System.

Autonomous Python agents for MITgcm ocean simulation orchestration,
with Discord bot integration for bidirectional communication.
"""

__version__ = "0.1.0"
105 changes: 105 additions & 0 deletions spectre_agents/__main__.py
@@ -0,0 +1,105 @@
"""Entry point for the SPECTRE agent system.

Usage:
    python -m spectre_agents [--config PATH]

Starts the Discord bot and agent runner as concurrent asyncio tasks.
"""

from __future__ import annotations

import argparse
import asyncio
import logging
import signal
import sys
from pathlib import Path

from spectre_agents.config import load_config
from spectre_agents.context import AgentContext
from spectre_agents.discord_bot.bot import run_bot
from spectre_agents.tools.discord_notify import set_agent_context

logger = logging.getLogger("spectre_agents")


def setup_logging() -> None:
    """Configure logging to stderr."""
    fmt = logging.Formatter(
        "%(asctime)s [%(levelname)s] %(name)s: %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(fmt)

    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(handler)

    # Suppress noisy discord.py logs below WARNING
    logging.getLogger("discord").setLevel(logging.WARNING)
    logging.getLogger("discord.http").setLevel(logging.WARNING)


async def main(config_path: str | None = None) -> None:
    """Main async entry point."""
    setup_logging()

    config = load_config(config_path)
    logger.info("Loaded config: base_dir=%s, sim_dir=%s", config.base_dir, config.sim_dir)

    # Validate required secrets
    if not config.anthropic_api_key:
        logger.error("ANTHROPIC_API_KEY not set. Set it in /etc/spectre-agents/env or environment.")
        sys.exit(1)
    if not config.discord_bot_token:
        logger.error("DISCORD_BOT_TOKEN not set. Set it in /etc/spectre-agents/env or environment.")
        sys.exit(1)

    # Initialize shared context
    ctx = AgentContext(base_dir=config.base_dir)
    ctx.load_state()
    logger.info("Loaded state: status=%s, job=%s", ctx.simulation.status, ctx.simulation.active_job_id)

    # Wire up Discord tools with the context
    set_agent_context(ctx)

    # Handle shutdown signals. We are already inside a running loop here,
    # so use get_running_loop() rather than the deprecated get_event_loop().
    loop = asyncio.get_running_loop()
    stop_event = asyncio.Event()

    def signal_handler(sig):
        logger.info("Received signal %s, shutting down...", sig)
        stop_event.set()

    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, signal_handler, sig)

    # Run the bot until it exits on its own or a shutdown signal arrives.
    # Without waiting on stop_event, the handlers above would swallow
    # SIGINT/SIGTERM and the process would never stop.
    logger.info("Starting SPECTRE Agent System...")
    bot_task = asyncio.create_task(run_bot(config, ctx))
    stop_task = asyncio.create_task(stop_event.wait())
    try:
        await asyncio.wait({bot_task, stop_task}, return_when=asyncio.FIRST_COMPLETED)
        if bot_task.done():
            bot_task.result()  # re-raise any exception from the bot
    except asyncio.CancelledError:
        pass
    finally:
        for task in (bot_task, stop_task):
            task.cancel()
        await asyncio.gather(bot_task, stop_task, return_exceptions=True)
        ctx.save_state()
        logger.info("SPECTRE Agent System stopped.")


def cli() -> None:
    """CLI entry point."""
    parser = argparse.ArgumentParser(
        description="SPECTRE Simulation Agent System with Discord bot"
    )
    parser.add_argument(
        "--config",
        type=str,
        default=None,
        help="Path to spectre_agents_config.yaml",
    )
    args = parser.parse_args()
    asyncio.run(main(args.config))


if __name__ == "__main__":
    cli()