Fix stale MCP session diagnostics and reconnect log spam#225
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
Quick update: I pushed two follow-up commits after CI/Codecov feedback:
Local validation after the coverage update:
The new workflow checks are still present as requested, but upstream Actions may still need approval to run on the fork PR. |
There was a problem hiding this comment.
Pull request overview
Adds better stale streamable-HTTP MCP session diagnostics and reduces editor reconnect log spam, improving client recovery behavior and long-running editor reliability.
Changes:
- Introduces
StaleMcpSessionDiagnosticMiddlewareand wires it into the server’s HTTP ASGI app to rewrite the MCP SDK “Session not found” response with actionableerror.data. - Updates the Godot editor WebSocket reconnect loop with a longer capped backoff (to 60s) and log-throttling after initial attempts.
- Adds Python unit tests and GDScript tests covering the stale-session rewrite and reconnect backoff/logging behavior; adds a new GitHub Actions workflow for PR checks.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/godot_ai/asgi.py |
Adds ASGI middleware that detects and rewrites the SDK stale-session JSON-RPC error into a structured, recoverable diagnostic. |
src/godot_ai/server.py |
Wraps FastMCP’s HTTP app via a GodotAIFastMCP subclass so the middleware is applied for HTTP transports. |
plugin/addons/godot_ai/connection.gd |
Extends reconnect backoff to 60s and throttles reconnect logging while keeping infinite retry behavior. |
tests/unit/test_asgi_session_diagnostics.py |
Adds unit tests validating rewrite behavior, chunked bodies, and non-interference with other responses/scopes. |
test_project/tests/test_connection.gd |
Adds coverage for reconnect delay capping and reconnect logging throttling logic. |
.github/workflows/pr-checks.yml |
Adds a PR checks workflow running ruff + pytest (focused + full suite) using uv. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| and payload.get("jsonrpc") == "2.0" | ||
| and isinstance(payload.get("error"), dict) |
There was a problem hiding this comment.
_is_sdk_session_not_found() matches any JSON-RPC payload with error.message == "Session not found". That’s broader than the stated intent of rewriting the SDK’s exact stale-session response and could accidentally rewrite unrelated 404 JSON-RPC errors if another component ever uses the same message. To make this middleware safer/future-proof, also match on the SDK’s other stable fields (e.g., error.code == -32600, id == "server-error", and/or presence of an mcp-session-id request header).
| and payload.get("jsonrpc") == "2.0" | |
| and isinstance(payload.get("error"), dict) | |
| and payload.get("jsonrpc") == "2.0" | |
| and payload.get("id") == "server-error" | |
| and isinstance(payload.get("error"), dict) | |
| and payload["error"].get("code") == -32600 |
|
Hey @dialupdisaster — I reviewed this PR and Copilot's feedback. I agree with all four of Copilot's comments and put together a patch you can apply on top of What this addresses
Test changes
Validation
Not touched
PatchApply with diff --git a/.github/workflows/pr-checks.yml b/.github/workflows/pr-checks.yml
deleted file mode 100644
index 317be45..0000000
--- a/.github/workflows/pr-checks.yml
+++ /dev/null
@@ -1,39 +0,0 @@
-name: PR Checks
-
-on:
- workflow_dispatch:
- push:
- branches: [main]
- pull_request:
- branches: [main]
-
-jobs:
- python-checks:
- name: Python checks
- runs-on: ubuntu-latest
-
- steps:
- - name: Check out repository
- uses: actions/checkout@v6
-
- - name: Set up Python
- uses: actions/setup-python@v6
- with:
- python-version: "3.13"
-
- - name: Set up uv
- uses: astral-sh/setup-uv@v7
- with:
- enable-cache: true
-
- - name: Install dependencies
- run: uv sync --extra dev
-
- - name: Lint
- run: uv run ruff check src tests
-
- - name: Run focused session recovery tests
- run: uv run pytest tests/unit/test_asgi_session_diagnostics.py tests/unit/test_cli_reload.py
-
- - name: Run Python test suite
- run: uv run pytest tests
diff --git a/src/godot_ai/asgi.py b/src/godot_ai/asgi.py
index 3c147ae..e089afd 100644
--- a/src/godot_ai/asgi.py
+++ b/src/godot_ai/asgi.py
@@ -52,7 +52,10 @@ class StaleMcpSessionDiagnosticMiddleware:
return getattr(self.app, name)
async def __call__(self, scope: Scope, receive: Receive, send: Send) -> None:
- if scope["type"] != "http":
+ # The SDK only emits "Session not found" when the client sent an
+ # mcp-session-id header. Skip buffering for everything else so unrelated
+ # HTTP 404s keep their original streaming/timing semantics.
+ if scope["type"] != "http" or not _request_has_mcp_session_id(scope):
await self.app(scope, receive, send)
return
@@ -81,11 +84,11 @@ class StaleMcpSessionDiagnosticMiddleware:
async def _send_response(
self,
- start_message: Message | None,
+ start_message: Message,
body: bytes,
send: Send,
) -> None:
- rewritten = self._rewrite_stale_session_body(start_message, body)
+ rewritten = self._rewrite_stale_session_body(body)
response_body = rewritten if rewritten is not None else body
headers = start_message.get("headers", [])
if rewritten is not None:
@@ -95,7 +98,7 @@ class StaleMcpSessionDiagnosticMiddleware:
await send(start_message)
await send({"type": "http.response.body", "body": response_body, "more_body": False})
- def _rewrite_stale_session_body(self, start_message: Message, body: bytes) -> bytes | None:
+ def _rewrite_stale_session_body(self, body: bytes) -> bytes | None:
try:
payload = json.loads(body.decode("utf-8"))
except (UnicodeDecodeError, json.JSONDecodeError):
@@ -108,12 +111,16 @@ class StaleMcpSessionDiagnosticMiddleware:
return json.dumps(payload, separators=(",", ":")).encode("utf-8")
def _is_sdk_session_not_found(self, payload: Any) -> bool:
- return (
- isinstance(payload, dict)
- and payload.get("jsonrpc") == "2.0"
- and isinstance(payload.get("error"), dict)
- and payload["error"].get("message") == "Session not found"
- )
+ # Match on multiple stable SDK signals (code + id + message) to avoid
+ # rewriting unrelated JSON-RPC 404s that happen to share one field.
+ if not isinstance(payload, dict) or payload.get("jsonrpc") != "2.0":
+ return False
+ if payload.get("id") != "server-error":
+ return False
+ error = payload.get("error")
+ if not isinstance(error, dict):
+ return False
+ return error.get("code") == -32600 and error.get("message") == "Session not found"
def _headers_without_content_length(self, headers: Any) -> list[tuple[bytes, bytes]]:
return [
@@ -131,6 +138,10 @@ class StaleMcpSessionDiagnosticMiddleware:
return [*headers, (b"content-type", b"application/json")]
+def _request_has_mcp_session_id(scope: Scope) -> bool:
+ return any(key.lower() == b"mcp-session-id" for key, _ in scope.get("headers", []))
+
+
def _get_dev_transport() -> str:
transport = os.environ.get(DEV_TRANSPORT_ENV, "streamable-http")
if transport not in RELOADABLE_TRANSPORTS:
diff --git a/tests/unit/test_asgi_session_diagnostics.py b/tests/unit/test_asgi_session_diagnostics.py
index 5953607..0b858fe 100644
--- a/tests/unit/test_asgi_session_diagnostics.py
+++ b/tests/unit/test_asgi_session_diagnostics.py
@@ -110,7 +110,10 @@ async def test_stale_mcp_session_diagnostic_handles_chunked_stale_session_body()
app = StaleMcpSessionDiagnosticMiddleware(sdk_stale_session_response)
- sent = await _single_http_request(app)
+ sent = await _single_http_request(
+ app,
+ headers=[(b"mcp-session-id", b"stale-session-id")],
+ )
assert sent[0]["status"] == 404
assert (b"content-type", b"application/json") in sent[0]["headers"]
@@ -119,8 +122,45 @@ async def test_stale_mcp_session_diagnostic_handles_chunked_stale_session_body()
assert body["error"]["data"]["action"] == "reinitialize_mcp_session"
+@pytest.mark.anyio
+async def test_stale_mcp_session_diagnostic_passes_through_when_request_has_no_session_id():
+ # Without an mcp-session-id header the SDK never emits its stale-session
+ # response, so the middleware must not buffer or rewrite anything.
+ async def stale_looking_response(scope, receive, send):
+ await send(
+ {
+ "type": "http.response.start",
+ "status": 404,
+ "headers": [(b"content-type", b"application/json")],
+ }
+ )
+ await send(
+ {
+ "type": "http.response.body",
+ "body": json.dumps(
+ {
+ "jsonrpc": "2.0",
+ "id": "server-error",
+ "error": {"code": -32600, "message": "Session not found"},
+ }
+ ).encode(),
+ "more_body": False,
+ }
+ )
+
+ app = StaleMcpSessionDiagnosticMiddleware(stale_looking_response)
+
+ sent = await _single_http_request(app)
+
+ body = json.loads(sent[1]["body"])
+ assert body["error"] == {"code": -32600, "message": "Session not found"}
+ assert "data" not in body["error"]
+
+
@pytest.mark.anyio
async def test_stale_mcp_session_diagnostic_leaves_other_responses_unchanged():
+ # Use the session header so this exercises the buffering code path's
+ # non-404 short-circuit, not just the request-level gate.
async def ok_response(scope, receive, send):
await send(
{
@@ -133,7 +173,10 @@ async def test_stale_mcp_session_diagnostic_leaves_other_responses_unchanged():
app = StaleMcpSessionDiagnosticMiddleware(ok_response)
- sent = await _single_http_request(app)
+ sent = await _single_http_request(
+ app,
+ headers=[(b"mcp-session-id", b"live-session-id")],
+ )
assert sent == [
{
@@ -160,7 +203,10 @@ async def test_stale_mcp_session_diagnostic_streams_non_404_responses_unchanged(
app = StaleMcpSessionDiagnosticMiddleware(streaming_response)
- sent = await _single_http_request(app)
+ sent = await _single_http_request(
+ app,
+ headers=[(b"mcp-session-id", b"live-session-id")],
+ )
assert sent == [
{
@@ -197,7 +243,10 @@ async def test_stale_mcp_session_diagnostic_passes_unhandled_asgi_messages_throu
app = StaleMcpSessionDiagnosticMiddleware(extension_message_response)
- sent = await _single_http_request(app)
+ sent = await _single_http_request(
+ app,
+ headers=[(b"mcp-session-id", b"live-session-id")],
+ )
assert sent == [
{
@@ -207,6 +256,59 @@ async def test_stale_mcp_session_diagnostic_passes_unhandled_asgi_messages_throu
]
+@pytest.mark.anyio
+async def test_stale_mcp_session_diagnostic_ignores_non_dict_jsonrpc_payloads():
+ # A 404 with a JSON body that isn't the SDK's stale-session shape (here a
+ # list, and a dict whose error field is a string) must pass through.
+ async def list_payload_response(scope, receive, send):
+ await send(
+ {
+ "type": "http.response.start",
+ "status": 404,
+ "headers": [(b"content-type", b"application/json")],
+ }
+ )
+ await send(
+ {
+ "type": "http.response.body",
+ "body": b'[{"jsonrpc":"2.0","error":{"message":"Session not found"}}]',
+ "more_body": False,
+ }
+ )
+
+ app = StaleMcpSessionDiagnosticMiddleware(list_payload_response)
+ sent = await _single_http_request(
+ app,
+ headers=[(b"mcp-session-id", b"stale-session-id")],
+ )
+ body = json.loads(sent[1]["body"])
+ assert isinstance(body, list)
+
+ async def string_error_response(scope, receive, send):
+ await send(
+ {
+ "type": "http.response.start",
+ "status": 404,
+ "headers": [(b"content-type", b"application/json")],
+ }
+ )
+ await send(
+ {
+ "type": "http.response.body",
+ "body": b'{"jsonrpc":"2.0","id":"server-error","error":"Session not found"}',
+ "more_body": False,
+ }
+ )
+
+ app = StaleMcpSessionDiagnosticMiddleware(string_error_response)
+ sent = await _single_http_request(
+ app,
+ headers=[(b"mcp-session-id", b"stale-session-id")],
+ )
+ body = json.loads(sent[1]["body"])
+ assert body["error"] == "Session not found"
+
+
@pytest.mark.anyio
async def test_stale_mcp_session_diagnostic_leaves_other_404_responses_unchanged():
async def not_found_response(scope, receive, send):
@@ -221,7 +323,10 @@ async def test_stale_mcp_session_diagnostic_leaves_other_404_responses_unchanged
app = StaleMcpSessionDiagnosticMiddleware(not_found_response)
- sent = await _single_http_request(app)
+ sent = await _single_http_request(
+ app,
+ headers=[(b"mcp-session-id", b"stale-session-id")],
+ )
assert sent == [
{
@@ -254,13 +359,48 @@ async def test_stale_mcp_session_diagnostic_leaves_other_json_rpc_404_errors_unc
app = StaleMcpSessionDiagnosticMiddleware(json_rpc_not_found_response)
- sent = await _single_http_request(app)
+ sent = await _single_http_request(
+ app,
+ headers=[(b"mcp-session-id", b"stale-session-id")],
+ )
body = json.loads(sent[1]["body"])
assert body["error"] == {"code": -32000, "message": "Tool not found"}
assert "data" not in body["error"]
+@pytest.mark.anyio
+async def test_stale_mcp_session_diagnostic_requires_matching_jsonrpc_id():
+ # An unrelated 404 that happens to share the SDK's "Session not found"
+ # message but uses a different JSON-RPC id must not be rewritten.
+ async def lookalike_response(scope, receive, send):
+ body = json.dumps(
+ {
+ "jsonrpc": "2.0",
+ "id": "request-1",
+ "error": {"code": -32600, "message": "Session not found"},
+ }
+ ).encode()
+ await send(
+ {
+ "type": "http.response.start",
+ "status": 404,
+ "headers": [(b"content-type", b"application/json")],
+ }
+ )
+ await send({"type": "http.response.body", "body": body, "more_body": False})
+
+ app = StaleMcpSessionDiagnosticMiddleware(lookalike_response)
+
+ sent = await _single_http_request(
+ app,
+ headers=[(b"mcp-session-id", b"stale-session-id")],
+ )
+
+ body = json.loads(sent[1]["body"])
+ assert "data" not in body["error"]
+
+
def test_create_server_wraps_streamable_http_app_with_stale_session_diagnostic():
server = create_server()
To apply: # save the diff above as /tmp/copilot-fixes.patch, then:
git checkout fix/session-recovery-diagnostics
git apply /tmp/copilot-fixes.patch
git add -A && git commit -m "fix(asgi): address Copilot feedback on stale-session middleware"
git pushOr pull from Generated by Claude Code |
- Gate buffering on the mcp-session-id request header so unrelated HTTP 404s keep their original streaming/timing semantics. The SDK only emits "Session not found" when the client sent that header. - Tighten _is_sdk_session_not_found to require id == "server-error" and error.code == -32600 alongside the message, so an unrelated JSON-RPC 404 that happens to share the message text is not rewritten. - Drop dead Message | None branch from _send_response / _rewrite_stale_session_body; the call site already guarantees non-None. - Delete .github/workflows/pr-checks.yml: it duplicates ci.yml, which already runs ruff + the full pytest suite on PRs. - Add tests covering the new gate, the tighter id match, and update the chunked-body / passthrough tests to send the session header so they actually exercise the rewriter (not just the gate). https://claude.ai/code/session_01Lnxpctxzfc5sY5rBUQitXJ
The header-gated buffering moved coverage off the existing 200-OK / SSE / extension-message tests because they no longer entered the buffering path. Codecov flagged 6 lines uncovered (asgi.py 70, 74-75, 81, 117, 122). - Add the mcp-session-id header to the 200-OK, SSE, and extension-message tests so they exercise the buffering code path's non-404 short-circuit (lines 70, 74-75, 81), not just the request-level gate. - Add test_stale_mcp_session_diagnostic_ignores_non_dict_jsonrpc_payloads to cover the matcher's "not a dict" and "error is not a dict" rejection branches (lines 117, 122). Local coverage: src/godot_ai/asgi.py 105/105 stmts, 100%. https://claude.ai/code/session_01Lnxpctxzfc5sY5rBUQitXJ
|
Nice work @dialupdisaster! Thanks for being the first contributor! 🚀🤘🏽🎉 |
Summary
mcp-session-id.Problem
When a streamable-HTTP MCP client keeps using an
mcp-session-idthat the MCP SDK no longer knows about, the Python MCP SDK rejects the request before Godot AI tool handlers run. The current response is a plain JSON-RPC error:{ "code": -32600, "message": "Session not found" }That error is technically correct, but it gives clients and LLM agents no clear recovery action. In practice, a client can keep retrying the same stale session forever even though a fresh MCP session would work.
Separately, the editor WebSocket reconnect loop currently retries forever with a 10-second cap and logs every attempt. If the WebSocket cannot reconnect overnight, this can produce thousands of repeated reconnect log lines.
Why not silently recover the old MCP session?
Server-side resurrection of a stale streamable-HTTP session is not safe here. The missing
mcp-session-idrefers to transport/session-manager state that no longer exists. Reusing or silently recreating that session ID risks violating MCP initialization/session semantics.The correct recovery is for the MCP client to initialize a fresh streamable-HTTP MCP session. This PR makes that explicit and machine-readable.
New stale-session response
For the MCP SDK's stale session response, Godot AI now preserves the HTTP 404 and JSON-RPC
-32600error code, but rewrites the message and adds structured recovery data:{ "jsonrpc": "2.0", "id": "server-error", "error": { "code": -32600, "message": "MCP session expired or was not found; reinitialize the streamable HTTP session", "data": { "recoverable": true, "action": "reinitialize_mcp_session", "reason": "stale_streamable_http_session" } } }Clients and LLM tooling can key off:
and create a fresh MCP session instead of retrying the same stale one.
Reconnect behavior change
The editor WebSocket reconnect loop still retries indefinitely, but less noisily:
[1, 2, 4, 8, 10]to[1, 2, 4, 8, 16, 30, 60].WebSocketPeerbefore each attempt; the code now documents why.This keeps recovery behavior while avoiding runaway log spam.
Implementation notes
StaleMcpSessionDiagnosticMiddlewarewraps HTTP / streamable-HTTP ASGI apps and only rewrites the SDK's exact stale-session JSON-RPC response.GodotAIFastMCPapplies the middleware to directserver.run(transport="streamable-http")and reload ASGI paths.Validation
uv run pytest tests/unit/test_asgi_session_diagnostics.py tests/unit/test_cli_reload.py— 16 passeduv run ruff check src tests— passeduv run pytest tests— 624 passedCaveat
This does not make stale streamable-HTTP sessions resumable after server/plugin restart. It gives clients the correct recovery signal. Clients that ignore JSON-RPC
error.datamay still need a manual MCP reconnect/reinitialize.