Commit 86906a4

dsarno and claude committed
test: add stormtest concurrency/reload stress harness
script/stormtest.py opens many concurrent MCP clients and fires rapid, randomized tool calls across every domain at a live editor, with periodic editor_reload_plugin churn mixed in. It's a robustness harness (not a correctness test): it answers whether the editor + plugin + WebSocket dispatcher + server survive sustained concurrent abuse and reload cycles without crashing, and surfaces per-tool latency/error hot-spots.

- 8 workers x 5 waves x 25 calls/wave by default (~1000 calls); brutal mode via SS_WORKERS/SS_WAVES env (~9000). Reads-only smoke via SS_RELOAD=0.
- Workers route to the active session and follow the session-id rotation a reload causes; writes are namespaced per-worker under a scratch scene so they never touch the project's real scene.
- Per-call timeout so a reload that severs the response can't wedge the run; full JSON snapshot flushed every few seconds (survives crash/kill).
- Docs in docs/STRESS_TESTING.md; pointer added to AGENTS.md Testing section.
- _stormtest/ scratch added to .gitignore.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 359ad12 commit 86906a4

4 files changed

Lines changed: 1037 additions & 0 deletions


.gitignore

Lines changed: 3 additions & 0 deletions
@@ -62,3 +62,6 @@ test_project/tests/space_city_assets/

# Test runtime artifacts (created by GDScript tests; ephemeral)
test_project/tests/_mcp_test_*

# stormtest scratch artifacts (script/stormtest.py)
_stormtest/

AGENTS.md

Lines changed: 13 additions & 0 deletions
@@ -275,6 +275,19 @@ test_results_get # review last results

Test suites extend `McpTestSuite` (assertion methods: `assert_true`, `assert_eq`, `assert_has_key`, `assert_contains`, `assert_is_error`, etc.). Drop `test_*.gd` files in `res://tests/` and they're auto-discovered.

### Stress / load testing: `script/stormtest.py`

`stormtest` opens many concurrent MCP clients and fires rapid, randomized tool calls across **every** domain at a live editor, with periodic `editor_reload_plugin` churn mixed in. It's a robustness test, not a correctness test: it answers "does the editor + plugin + WebSocket dispatcher + server survive sustained concurrent abuse and reload cycles without crashing?" and surfaces per-tool latency/error hot-spots. Use it after changes to the dispatcher, transport, readiness gating, session routing, or the reload/handoff path. Full reference: `docs/STRESS_TESTING.md`.

```bash
.venv/bin/python script/stormtest.py # ≈ 1000 calls, with reload churn
SS_WORKERS=12 SS_WAVES=30 .venv/bin/python script/stormtest.py # brutal ≈ 9000 calls
SS_RELOAD=0 .venv/bin/python script/stormtest.py # reads-only smoke, no reloads
SS_URL=http://127.0.0.1:8010/mcp .venv/bin/python script/stormtest.py # target another stack
```

To stress a *branch's* code (plugin + server), point a Godot editor at that worktree's `test_project/` and serve its `src/` via `script/serve-this-worktree` (external server, so `editor_reload_plugin` exercises reload without killing the server), then run stormtest against it. A full JSON snapshot lands in `$TMPDIR/stormtest_report.json` (override with `SS_REPORT`), flushed every few seconds so a crash/kill still leaves data. A small `EDITOR_NOT_READY` / `NODE_NOT_FOUND` / `CONNECTION` error rate is expected noise under concurrency + reloads; watch instead for the process dying, a reload that never recovers, or one op with pathological error/latency.

**Guardrails built into the test runner:**
- **Zero-assertion detection**: Tests that complete with 0 assertions are flagged as failures ("Test completed with 0 assertions — likely skipped its logic"). This catches tests that silently `return` before asserting anything.
- **Resilient discovery**: If a `.gd` file fails to load (parse error, duplicate method, wrong base class), the rest of the suites still run and the failing files are reported in `load_errors`.

docs/STRESS_TESTING.md

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# Stress testing: `script/stormtest.py`

`stormtest` is a concurrency + reload stress harness. It opens many MCP client
connections at once and fires rapid, randomized tool calls across **every**
domain at a live Godot editor, periodically triggering `editor_reload_plugin`
mid-run. It is not a correctness test. It answers two questions:

1. **Does the stack survive sustained concurrent abuse + reload churn without
   crashing?** (editor process, GDScript plugin, WebSocket dispatcher, server)
2. **Where are the latency / error hot-spots per tool?**

It complements the deterministic suites (`pytest`, `test_run`): those check
that each tool is *correct*; stormtest checks that the whole stack is *robust*
under load and across the disable→extract→enable reload window.

## What it does

- **N parallel workers**, each its own `fastmcp.Client` connection (default 8).
- Workers route to the **active session** (empty `session_id`), so when a
  reload rotates the session id they automatically follow the new one.
- **Reads dominate** the op mix (like real traffic); **writes exercise every
  domain**: node/scene/script/batch/material/theme/resource/camera/particle/
  audio/animation/input_map/signal/filesystem.
- Each worker namespaces its writes under `<scene_root>/wN/...` so workers
  hammer one shared edited scene without colliding on node paths.
- **Worker 0 is the "chaos" worker**: every `SS_RELOAD_EVERY` waves it fires
  `editor_reload_plugin` instead of a normal burst, then reconnects (and
  reopens the scratch scene). The other workers keep hammering through the
  reload window and reconnect on the connection drop.
- All disk artifacts (scratch scripts/resources/scene) land under
  `res://_stormtest/` in whatever project the target editor has open. This is
  scratch material that's safe to delete afterward.
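
The worker model in the list above can be sketched with `asyncio`. This is an illustration only, not the code in `script/stormtest.py`: `FakeClient` stands in for a real MCP client connection, and `worker_path`, `pick_op`, the tool names, the `/root/Storm` scene root, and the 80/20 read/write weighting are all hypothetical choices made for the example.

```python
import asyncio
import random

READ_OPS = ["scene_get_tree", "node_get"]    # hypothetical tool names
WRITE_OPS = ["node_create", "node_delete"]   # hypothetical tool names
READ_WEIGHT, WRITE_WEIGHT = 0.8, 0.2         # assumed mix: reads dominate


def worker_path(scene_root, worker, name):
    """Namespace a node under <scene_root>/wN/ so workers never share a path."""
    return f"{scene_root}/w{worker}/{name}"


def pick_op(rng):
    """Choose a read most of the time and a write otherwise."""
    pool = rng.choices([READ_OPS, WRITE_OPS],
                       weights=[READ_WEIGHT, WRITE_WEIGHT])[0]
    return rng.choice(pool)


class FakeClient:
    """Stand-in for a real MCP client; records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    async def call_tool(self, name, args):
        await asyncio.sleep(0)  # yield to other workers, as network I/O would
        self.calls.append((name, args))


async def run_worker(worker, calls, scene_root, seed):
    rng = random.Random(seed + worker)  # per-worker RNG keeps runs reproducible
    client = FakeClient()               # one connection per worker
    for i in range(calls):
        path = worker_path(scene_root, worker, f"n{i}")
        await client.call_tool(pick_op(rng), {"path": path})
    return client


async def storm(workers=8, calls=25):
    tasks = [run_worker(w, calls, "/root/Storm", seed=1234)
             for w in range(workers)]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    clients = asyncio.run(storm())
    print(sum(len(c.calls) for c in clients), "calls issued")
```

Because every path carries the worker index, two workers can create or delete `n3` at the same moment without touching the same node.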

## Safety

- Operates in a throwaway scratch scene (`res://_stormtest/storm.tscn`), **not**
  the project's real scene; restores the originally-open scene on teardown.
- Never calls `project_run`, so it can't autosave-pollute the real scene.
- A full JSON snapshot is flushed to `stormtest_report.json` (in `$TMPDIR`,
  overridable via `SS_REPORT`) **every few seconds**, so a crash or a kill
  mid-run still leaves analyzable data. This is deliberate: an earlier version
  lost its metrics to a `SIGKILL`.
- It does **not** clear logs (a diagnostic must not destroy its own evidence).
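
One way to get a crash-safe periodic snapshot like the one described above is to rewrite the whole report on a timer and swap it into place atomically. The sketch below is an assumption about how this could be done, not a copy of the harness: `report_path`, `flush_snapshot`, `flush_forever`, the 3-second interval, and the write-then-rename technique are all illustrative choices.

```python
import asyncio
import json
import os
import tempfile
from pathlib import Path


def report_path():
    """Resolve the snapshot path: SS_REPORT if set, else the temp directory."""
    default = Path(tempfile.gettempdir()) / "stormtest_report.json"
    return Path(os.environ.get("SS_REPORT", default))


def flush_snapshot(metrics, path):
    """Write the full snapshot to a sibling temp file, then swap it into place."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(metrics, fh)
        fh.flush()
        os.fsync(fh.fileno())  # push the bytes to disk before renaming
    os.replace(tmp, path)      # readers see the old file or the new one


async def flush_forever(metrics, path, interval=3.0):
    """Background task: flush on a timer until the task is cancelled."""
    while True:
        flush_snapshot(metrics, path)
        await asyncio.sleep(interval)
```

A `SIGKILL` cannot be caught, so the timer is what protects the data: at worst the file on disk is one interval stale.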

## Running

The target editor's MCP server must be reachable (default `:8000`). For a true
test of a branch's code, point the editor at that branch's worktree and serve
that worktree's `src/` (see `script/serve-this-worktree`), so both the GDScript
plugin and the Python server are the code under test.

```bash
# default ≈ 1000 calls, with reload churn, against localhost:8000
.venv/bin/python script/stormtest.py

# brutal ≈ 9000 calls
SS_WORKERS=12 SS_WAVES=30 .venv/bin/python script/stormtest.py

# reads-only smoke, no reloads
SS_RELOAD=0 SS_WORKERS=4 SS_WAVES=3 .venv/bin/python script/stormtest.py

# target a server on another port / host
SS_URL=http://127.0.0.1:8010/mcp .venv/bin/python script/stormtest.py
```

### Knobs (env)

| Var | Default | Meaning |
|---|---|---|
| `SS_WORKERS` | 8 | parallel client connections |
| `SS_WAVES` | 5 | waves per worker |
| `SS_CALLS` | 25 | calls per worker per wave |
| `SS_RELOAD` | 1 | include `editor_reload_plugin` churn (`0` to skip) |
| `SS_RELOAD_EVERY` | 2 | chaos worker reloads every N waves |
| `SS_RECONNECT_TIMEOUT` | 30 | seconds to wait for the server to return after a reload |
| `SS_URL` | `http://127.0.0.1:8000/mcp` | target MCP endpoint |
| `SS_REPORT` | `$TMPDIR/stormtest_report.json` | where to write the JSON snapshot |

Total calls ≈ `WORKERS × WAVES × CALLS` minus the chaos worker's reload waves.
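
The arithmetic behind that estimate can be worked through in a few lines. The helpers below are hypothetical (`env_int` and `estimate_calls` are not names from the harness), and the reload schedule is an assumption: the chaos worker is taken to replace `WAVES // SS_RELOAD_EVERY` of its bursts with a reload. The real script may count waves differently.

```python
import os


def env_int(name, default):
    """Read an integer knob from the environment, falling back to the default."""
    return int(os.environ.get(name, default))


def estimate_calls(workers, waves, calls, reload=True, reload_every=2):
    """WORKERS x WAVES x CALLS, minus the bursts the chaos worker skips.

    Assumption: the chaos worker reloads on every `reload_every`-th wave,
    so it skips `waves // reload_every` bursts of `calls` calls each.
    """
    total = workers * waves * calls
    if reload and reload_every > 0:
        total -= (waves // reload_every) * calls
    return total


if __name__ == "__main__":
    n = estimate_calls(
        env_int("SS_WORKERS", 8),
        env_int("SS_WAVES", 5),
        env_int("SS_CALLS", 25),
        reload=env_int("SS_RELOAD", 1) != 0,
        reload_every=env_int("SS_RELOAD_EVERY", 2),
    )
    print(f"about {n} calls")
```

With the defaults the upper bound is 8 × 5 × 25 = 1000, and under the assumed schedule the chaos worker skips two bursts, giving 950. The brutal settings give 12 × 30 × 25 = 9000 before the same kind of deduction.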

## Reading the result

On exit (or `Ctrl-C` / `SIGTERM`, which it handles gracefully) it prints:

- **final verdict**: `EDITOR ALIVE` vs `EDITOR DEAD/UNREACHABLE`
- throughput (calls/sec), ok/err totals
- **reloads survived / attempted** and per-reload **recovery time** (wall-clock
  to reconnect)
- overall **latency** p50 / p95 / max
- **error-code histogram** (e.g. `EDITOR_NOT_READY`, `NODE_NOT_FOUND`,
  `INVALID_PARAMS`, `CONNECTION`)
- **per-op table**: ok/err counts, p50/p95/max latency, and the error codes for
  that op

The same data, plus more, is in `stormtest_report.json`.
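
For readers who want to recompute those figures from raw samples, the sketch below shows one common way to derive p50 / p95 / max and an error-code histogram. The record shape (`ms`, `error`) is a hypothetical example, since the report's JSON schema is not described here, and nearest-rank is only one of several percentile definitions; the harness may use another.

```python
import math
from collections import Counter


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(calls):
    """Summarize call records shaped like {"ms": float, "error": str or None}."""
    latencies = [c["ms"] for c in calls]
    errors = Counter(c["error"] for c in calls if c["error"])
    return {
        "ok": sum(1 for c in calls if not c["error"]),
        "err": sum(errors.values()),
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "max": max(latencies),
        "errors": dict(errors),
    }
```

Grouping the records by tool name before calling `summarize` yields the same figures per op.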

### Expected (healthy) error noise

A small error rate is normal and *not* a failure:

- `EDITOR_NOT_READY`: transient, during reload windows or play-state changes.
- `NODE_NOT_FOUND`: concurrent-delete races (one worker deletes a node another
  was about to touch); expected under concurrency.
- `CONNECTION`: during the reload disable→enable window before reconnect.

What you're watching for instead: the editor **process dying** (verdict flips to
`DEAD`, a flood of `CONNECTION` that never recovers), a reload that **never
comes back** (for example, a plugin-managed server killed by the reload, so
recovery time is unbounded), or one op with a **pathologically high error rate
or latency** that points at a real regression.
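
The distinction between healthy noise and a real problem can be turned into a small triage pass over per-op counts. Everything specific here is an assumption for illustration: the `per_op` shape, the `triage` helper, and the 25% threshold are hypothetical, and codes outside the three listed above are only flagged for a human to look at, not declared failures.

```python
EXPECTED_NOISE = {"EDITOR_NOT_READY", "NODE_NOT_FOUND", "CONNECTION"}
MAX_OP_ERROR_RATE = 0.25  # assumed threshold; tune it for your project


def triage(per_op):
    """Return warnings for ops whose errors do not look like expected noise.

    `per_op` maps an op name to {"ok": int, "err": int, "codes": {code: count}}.
    """
    warnings = []
    for op, stats in sorted(per_op.items()):
        total = stats["ok"] + stats["err"]
        if total == 0:
            continue  # op never ran; nothing to judge
        unexpected = sorted(set(stats["codes"]) - EXPECTED_NOISE)
        if unexpected:
            warnings.append(f"{op}: unexpected error codes {unexpected}")
        rate = stats["err"] / total
        if rate > MAX_OP_ERROR_RATE:
            warnings.append(
                f"{op}: error rate {rate:.0%} exceeds {MAX_OP_ERROR_RATE:.0%}"
            )
    return warnings
```

An op with a handful of `NODE_NOT_FOUND` errors passes silently, while an op that fails most of its calls with an unlisted code produces two warnings.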

> If the target server is plugin-managed (auto-spawned), a reload may kill it
> and not return. Run the server **externally** (e.g. `serve-this-worktree`,
> which uses `--reload`) so `editor_reload_plugin` exercises the plugin reload
> without taking the server down with it.
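
Whether a reload "comes back" can be checked with a bounded wait, in the spirit of `SS_RECONNECT_TIMEOUT`. The sketch below is a simplified stand-in, not the harness's reconnect logic: `wait_for_server` is a hypothetical helper that only probes the TCP port, whereas a real client would also need to complete an MCP handshake.

```python
import socket
import time
from typing import Optional


def wait_for_server(host: str, port: int, timeout: float = 30.0,
                    poll: float = 0.25) -> Optional[float]:
    """Poll a TCP port until it accepts a connection or the deadline passes.

    Returns the recovery time in seconds, or None if the server never returned.
    """
    start = time.monotonic()
    deadline = start + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=poll):
                return time.monotonic() - start
        except OSError:
            time.sleep(poll)  # refused or timed out; try again shortly
    return None


if __name__ == "__main__":
    recovered = wait_for_server("127.0.0.1", 8000)
    print("never came back" if recovered is None else f"back in {recovered:.2f}s")
```

A `None` result corresponds to the unbounded-recovery case: the deadline passed and nothing was listening.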
