test: add stormtest concurrency/reload stress harness

dsarno · claude · dsarno · commit 86906a4d7515 · 2026-05-31T20:19:47.000-07:00
script/stormtest.py opens many concurrent MCP clients and fires rapid,
randomized tool calls across every domain at a live editor, with periodic
editor_reload_plugin churn mixed in. It's a robustness harness (not a
correctness test): it answers whether the editor + plugin + WebSocket
dispatcher + server survive sustained concurrent abuse and reload cycles
without crashing, and surfaces per-tool latency/error hot-spots.
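
As a rough illustration of the per-tool latency/error hot-spot summary
described above (a sketch only, not the code in script/stormtest.py):
`summarize`, `percentile`, the sample tuple shape, and the op names
`op_a`/`op_b` are all hypothetical, and nearest-rank percentiles are an
assumed choice.

```python
from collections import defaultdict


def percentile(sorted_vals, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is an int 1..100)."""
    rank = max(1, (pct * len(sorted_vals) + 99) // 100)  # integer ceil(pct * n / 100)
    return sorted_vals[rank - 1]


def summarize(samples):
    """Group (op, latency_ms, error_code_or_None) samples into a per-op table."""
    by_op = defaultdict(lambda: {"lat": [], "errors": defaultdict(int)})
    for op, latency_ms, code in samples:
        by_op[op]["lat"].append(latency_ms)
        if code is not None:
            by_op[op]["errors"][code] += 1

    table = {}
    for op, data in by_op.items():
        lat = sorted(data["lat"])
        table[op] = {
            "calls": len(lat),
            "err": sum(data["errors"].values()),
            "p50": percentile(lat, 50),
            "p95": percentile(lat, 95),
            "max": lat[-1],
            "errors": dict(data["errors"]),
        }
    return table


if __name__ == "__main__":
    # Hypothetical samples: op name, latency in ms, error code (None = ok).
    demo = [
        ("op_a", 10, None),
        ("op_a", 30, "NODE_NOT_FOUND"),
        ("op_a", 20, None),
        ("op_b", 5, None),
    ]
    for op, row in summarize(demo).items():
        print(op, row)
```

An op whose `err` count or `p95` stands far apart from the rest of the
table is the kind of hot-spot the harness is meant to surface.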

- 8 workers x 5 waves x 25 calls/wave by default (~1000 calls); brutal mode
  via SS_WORKERS/SS_WAVES env (~9000). Reads-only smoke via SS_RELOAD=0.
- Workers route to the active session and follow the session-id rotation a
  reload causes; writes are namespaced per-worker under a scratch scene so
  they never touch the project's real scene.
- Per-call timeout so a reload that severs the response can't wedge the run;
  full JSON snapshot flushed every few seconds (survives crash/kill).
- Docs in docs/STRESS_TESTING.md; pointer added to AGENTS.md Testing section.
- _stormtest/ scratch added to .gitignore.
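
The per-call timeout can be sketched roughly as below. This is a minimal
illustration under assumptions, not the harness's actual code: it assumes
each tool call is an awaitable, and `call_with_timeout`, `hung_call`,
`quick_call`, and the "TIMEOUT" code are hypothetical names.

```python
import asyncio


async def call_with_timeout(make_call, timeout_s):
    """Await one call; turn a hang into an error record instead of blocking."""
    try:
        result = await asyncio.wait_for(make_call(), timeout=timeout_s)
        return {"ok": True, "result": result}
    except asyncio.TimeoutError:
        # A reload can sever the response; record it and let the worker go on.
        return {"ok": False, "code": "TIMEOUT"}


async def hung_call():
    # Stand-in for a tool call whose response never arrives.
    await asyncio.sleep(3600)


async def quick_call():
    # Stand-in for a tool call that answers promptly.
    return "pong"


async def demo():
    return [
        await call_with_timeout(quick_call, timeout_s=1.0),
        await call_with_timeout(hung_call, timeout_s=0.05),
    ]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because `asyncio.wait_for` cancels the pending call when the timeout
expires, a severed response costs one worker at most `timeout_s` seconds
rather than wedging the whole run.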

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
diff --git a/.gitignore b/.gitignore
@@ -62,3 +62,6 @@ test_project/tests/space_city_assets/
 
 # Test runtime artifacts (created by GDScript tests; ephemeral)
 test_project/tests/_mcp_test_*
+
+# stormtest scratch artifacts (script/stormtest.py)
+_stormtest/
diff --git a/AGENTS.md b/AGENTS.md
@@ -275,6 +275,19 @@ test_results_get             # review last results
 
 Test suites extend `McpTestSuite` (assertion methods: `assert_true`, `assert_eq`, `assert_has_key`, `assert_contains`, `assert_is_error`, etc.). Drop `test_*.gd` files in `res://tests/` and they're auto-discovered.
 
+### Stress / load testing — `script/stormtest.py`
+
+`stormtest` opens many concurrent MCP clients and fires rapid, randomized tool calls across **every** domain at a live editor, with periodic `editor_reload_plugin` churn mixed in. It's a robustness test, not a correctness test: it answers "does the editor + plugin + WebSocket dispatcher + server survive sustained concurrent abuse and reload cycles without crashing?" and surfaces per-tool latency/error hot-spots. Use it after changes to the dispatcher, transport, readiness gating, session routing, or the reload/handoff path. Full reference: `docs/STRESS_TESTING.md`.
+
+```bash
+.venv/bin/python script/stormtest.py                       # ≈ 1000 calls, with reload churn
+SS_WORKERS=12 SS_WAVES=30 .venv/bin/python script/stormtest.py   # brutal ≈ 9000 calls
+SS_RELOAD=0 .venv/bin/python script/stormtest.py           # reads-only smoke, no reloads
+SS_URL=http://127.0.0.1:8010/mcp .venv/bin/python script/stormtest.py  # target another stack
+```
+
+To stress a *branch's* code (plugin + server), point a Godot editor at that worktree's `test_project/` and serve its `src/` via `script/serve-this-worktree` (external server, so `editor_reload_plugin` exercises reload without killing the server), then run stormtest against it. A full JSON snapshot lands in `$TMPDIR/stormtest_report.json` (override with `SS_REPORT`), flushed every few seconds so a crash/kill still leaves data. A small `EDITOR_NOT_READY` / `NODE_NOT_FOUND` / `CONNECTION` error rate is expected noise under concurrency + reloads — watch instead for the process dying, a reload that never recovers, or one op with pathological error/latency.
+
 **Guardrails built into the test runner:**
 - **Zero-assertion detection**: Tests that complete with 0 assertions are flagged as failures ("Test completed with 0 assertions — likely skipped its logic"). This catches tests that silently `return` before asserting anything.
 - **Resilient discovery**: If a `.gd` file fails to load (parse error, duplicate method, wrong base class), the rest of the suites still run and the failing files are reported in `load_errors`.
diff --git a/docs/STRESS_TESTING.md b/docs/STRESS_TESTING.md
@@ -0,0 +1,114 @@
+# Stress testing — `script/stormtest.py`
+
+`stormtest` is a concurrency + reload stress harness. It opens many MCP client
+connections at once and fires rapid, randomized tool calls across **every**
+domain at a live Godot editor, periodically triggering `editor_reload_plugin`
+mid-run. It is not a correctness test — it answers two questions:
+
+1. **Does the stack survive sustained concurrent abuse + reload churn without
+   crashing?** (editor process, GDScript plugin, WebSocket dispatcher, server)
+2. **Where are the latency / error hot-spots per tool?**
+
+It complements the deterministic suites (`pytest`, `test_run`): those check
+that each tool is *correct*; stormtest checks that the whole stack is *robust*
+under load and across the disable→extract→enable reload window.
+
+## What it does
+
+- **N parallel workers**, each its own `fastmcp.Client` connection (default 8).
+- Workers route to the **active session** (empty `session_id`), so when a
+  reload rotates the session id they automatically follow the new one.
+- **Reads dominate** the op mix (like real traffic); **writes exercise every
+  domain**: node, scene, script, batch, material, theme, resource, camera,
+  particle, audio, animation, input_map, signal, and filesystem.
+- Each worker namespaces its writes under `<scene_root>/wN/...` so workers
+  hammer one shared edited scene without colliding on node paths.
+- **Worker 0 is the "chaos" worker**: every `SS_RELOAD_EVERY` waves it fires
+  `editor_reload_plugin` instead of a normal burst, then reconnects (and
+  reopens the scratch scene). The other workers keep hammering through the
+  reload window and reconnect on the connection drop.
+- All disk artifacts (scratch scripts/resources/scene) land under
+  `res://_stormtest/` in whatever project the target editor has open — scratch
+  material that's safe to delete afterward.
+
+## Safety
+
+- Operates in a throwaway scratch scene (`res://_stormtest/storm.tscn`), **not**
+  the project's real scene; restores the originally-open scene on teardown.
+- Never calls `project_run`, so it can't autosave-pollute the real scene.
+- A full JSON snapshot is flushed to `stormtest_report.json` (in `$TMPDIR`,
+  overridable via `SS_REPORT`) **every few seconds**, so a crash or a kill
+  mid-run still leaves analyzable data (this is deliberate: an earlier version
+  lost its metrics to a `SIGKILL`).
+- It does **not** clear logs (a diagnostic must not destroy its own evidence).
+
+## Running
+
+The target editor's MCP server must be reachable (default `:8000`). For a true
+test of a branch's code, point the editor at that branch's worktree and serve
+that worktree's `src/` (see `script/serve-this-worktree`), so both the GDScript
+plugin and the Python server are the code under test.
+
+```bash
+# default ≈ 1000 calls, with reload churn, against localhost:8000
+.venv/bin/python script/stormtest.py
+
+# brutal ≈ 9000 calls
+SS_WORKERS=12 SS_WAVES=30 .venv/bin/python script/stormtest.py
+
+# reads-only smoke, no reloads
+SS_RELOAD=0 SS_WORKERS=4 SS_WAVES=3 .venv/bin/python script/stormtest.py
+
+# target a server on another port / host
+SS_URL=http://127.0.0.1:8010/mcp .venv/bin/python script/stormtest.py
+```
+
+### Knobs (env)
+
+| Var | Default | Meaning |
+|---|---|---|
+| `SS_WORKERS` | 8 | parallel client connections |
+| `SS_WAVES` | 5 | waves per worker |
+| `SS_CALLS` | 25 | calls per worker per wave |
+| `SS_RELOAD` | 1 | include `editor_reload_plugin` churn (`0` to skip) |
+| `SS_RELOAD_EVERY` | 2 | chaos worker reloads every N waves |
+| `SS_RECONNECT_TIMEOUT` | 30 | seconds to wait for the server to return after a reload |
+| `SS_URL` | `http://127.0.0.1:8000/mcp` | target MCP endpoint |
+| `SS_REPORT` | `$TMPDIR/stormtest_report.json` | where to write the JSON snapshot |
+
+Total calls ≈ `WORKERS × WAVES × CALLS` minus the chaos worker's reload waves.
+
+## Reading the result
+
+On exit (or `Ctrl-C` / `SIGTERM` — it has a graceful handler) it prints:
+
+- **final verdict**: `EDITOR ALIVE` vs `EDITOR DEAD/UNREACHABLE`
+- throughput (calls/sec), ok/err totals
+- **reloads survived / attempted** and per-reload **recovery time** (wall-clock
+  to reconnect)
+- overall **latency** p50 / p95 / max
+- **error-code histogram** (e.g. `EDITOR_NOT_READY`, `NODE_NOT_FOUND`,
+  `INVALID_PARAMS`, `CONNECTION`)
+- **per-op table**: ok/err counts, p50/p95/max latency, and the error codes for
+  that op
+
+The same data, plus more, is in `stormtest_report.json`.
+
+### Expected (healthy) error noise
+
+A small error rate is normal and *not* a failure:
+
+- `EDITOR_NOT_READY` — transient, during reload windows or play-state changes.
+- `NODE_NOT_FOUND` — concurrent-delete races (one worker deletes a node another
+  was about to touch); expected under concurrency.
+- `CONNECTION` — during the reload disable→enable window before reconnect.
+
+What you're watching for instead: the editor **process dying** (verdict flips to
+`DEAD`, a flood of `CONNECTION` that never recovers), a reload that **never
+comes back** (a plugin-managed server was killed, so recovery time is unbounded),
+or one op with a **pathologically high error rate or latency** that signals a regression.
+
+> If the target server is plugin-managed (auto-spawned), a reload may kill it
+> and not return — run the server **externally** (e.g. `serve-this-worktree`,
+> which uses `--reload`) so `editor_reload_plugin` exercises the plugin reload
+> without taking the server down with it.
diff --git a/script/stormtest.py b/script/stormtest.py