Review pass #2 caught a cluster of silent-failure concerns around the
registry hot paths and the eviction hooks that now wire it to gateway
mutations, downstream DELETE, classification, and SIGHUP. Fixes land as
one commit because they share a design: narrow the catches, raise the
log level where a failure leaves state wrong, and introduce a dedicated
exception type so "registry not initialised" stops hiding other
RuntimeErrors.
mcpgateway/services/upstream_session_registry.py:
- New RegistryNotInitializedError(RuntimeError) so catch-sites can
distinguish the "not started yet" case from other runtime errors
(e.g. "Event loop is closed" during shutdown). Inherits RuntimeError
for backwards compatibility with catch-sites written pre-split.
- _probe_health: narrow the catch-all "except Exception → recreate"
to (OSError, TimeoutError, McpError). AttributeError from MCP SDK
drift, authorization errors, and other genuinely-unexpected
conditions now propagate instead of driving an infinite reconnect
loop against the same failure.
- _default_session_factory.owner(): change except BaseException to
except Exception so SystemExit / KeyboardInterrupt / CancelledError
propagate promptly during shutdown. Add an add_done_callback that
logs a warning if the owner task exits unexpectedly — previously
a post-init upstream death silently left an orphaned session in
self._sessions.
- is_closed: bump the MCP-internals introspection except from pass
to logger.debug with the exception type, so SDK drift is visible.
- acquire(): wrap the yield in try/except (OSError,
anyio.ClosedResourceError, anyio.BrokenResourceError). On
transport-level errors from the caller body, evict so the next
acquire rebuilds instead of handing back a dead session. Tool-
level errors still leave the session in place.
- close_all(): asyncio.gather the per-key evictions (with
return_exceptions=True). Previously serial with a per-key 5s cap,
so 50 stuck sessions meant a 250s (4+ minute) shutdown stall.
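The close_all() change above can be sketched roughly as follows. This is a
minimal illustration, not the code in upstream_session_registry.py:
SketchRegistry, _evict, and demo are hypothetical names, and a sleep stands
in for a session close that hangs. It shows why gathering the evictions
bounds the total wait by one per-key cap instead of the sum of all caps.

```python
import asyncio
import time


class SketchRegistry:
    """Hypothetical stand-in for the upstream session registry."""

    def __init__(self, evict_timeout: float = 5.0) -> None:
        # key -> simulated time (seconds) the session takes to close
        self._sessions: dict[str, float] = {}
        self._evict_timeout = evict_timeout

    async def _evict(self, key: str) -> None:
        delay = self._sessions.pop(key)
        # wait_for cancels the simulated close and raises a timeout error
        # once the per-key cap is exceeded.
        await asyncio.wait_for(asyncio.sleep(delay), timeout=self._evict_timeout)

    async def close_all(self) -> list[BaseException]:
        keys = list(self._sessions)
        # All evictions run concurrently; return_exceptions=True keeps one
        # failed eviction from aborting the rest.
        results = await asyncio.gather(
            *(self._evict(key) for key in keys), return_exceptions=True
        )
        return [r for r in results if isinstance(r, BaseException)]


async def demo() -> tuple[int, float]:
    registry = SketchRegistry(evict_timeout=0.05)
    for i in range(20):
        registry._sessions[f"k{i}"] = 10.0  # every session is "stuck"
    start = time.monotonic()
    failures = await registry.close_all()
    return len(failures), time.monotonic() - start
```

With a serial loop the 20 stuck sessions in demo() would take about
20 x 0.05s; gathered, they all time out together after roughly one cap.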
mcpgateway/services/gateway_service.py _evict_upstream_sessions_for_gateway:
- Catch RegistryNotInitializedError specifically for the tests/early-
startup no-op case. Bump the generic-exception branch from debug
to warning with gateway_id + exception type — this fires POST-
commit, so a silent eviction failure leaves persisted stale
credentials/URL/TLS material pinned on in-flight upstream sessions.
mcpgateway/cache/session_registry.py remove_session:
- Same RegistryNotInitializedError / warning treatment. An orphaned
upstream after DELETE /mcp is otherwise invisible to ops.
mcpgateway/services/server_classification_service.py _perform_classification:
- Bump the Redis-purge catch from debug to warning. The entire point
of this method is to KEEP classification keys absent so
should_poll_server falls through to "always poll". A sustained
purge failure re-opens the very regression this method exists to
prevent (previous commit's Codex review fix).
mcpgateway/handlers/signal_handlers.py sighup_reload:
- Add upstream-registry drain between the SSL cache clear and the
affinity-mapping drain. Previously SIGHUP only refreshed SSL
contexts and cleared the affinity map — registry-held upstream
ClientSessions kept their stale TLS material on the socket.
- Catch RegistryNotInitializedError at debug for the uninitialised
case; warning for other drain failures.
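The three-step SIGHUP drain might look something like the sketch below. The
callables clear_ssl_cache, get_registry, and drain_affinity are hypothetical
parameters, not the handler's real dependencies, and the sketch assumes that
a failed registry drain does not block the affinity drain.

```python
import asyncio
import logging

logger = logging.getLogger("sketch.sighup")


class RegistryNotInitializedError(RuntimeError):
    """Raised when the registry is used before startup has completed."""


async def sighup_reload_sketch(clear_ssl_cache, get_registry, drain_affinity) -> list[str]:
    """Run the three drain steps in order; return the steps that completed."""
    steps: list[str] = []

    # Step 1: drop cached SSL contexts so new connections pick up fresh material.
    clear_ssl_cache()
    steps.append("ssl")

    # Step 2: close registry-held upstream sessions so they stop using old TLS state.
    try:
        await get_registry().close_all()
        steps.append("registry")
    except RegistryNotInitializedError:
        logger.debug("upstream registry not initialised; skipping drain")
    except Exception as exc:
        logger.warning("upstream registry drain failed: %s", type(exc).__name__)

    # Step 3: clear the affinity map (assumed to run even if step 2 failed).
    await drain_affinity()
    steps.append("affinity")
    return steps
```

Returning the list of completed steps is only a convenience for the sketch;
it makes the ordering easy to assert in a test.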
Tests:
- test_upstream_session_registry.py FakeClientSession now raises
OSError (was RuntimeError) to match the narrowed _probe_health
catch — the test's intent was "broken transport → recreate" and
OSError is the accurate stand-in.
- test_main_sighup.py: rewritten for the new three-step drain.
Asserts SSL cache clear + registry.close_all() + affinity drain
all fire, with the new log-message strings. Added a test covering
the RegistryNotInitializedError debug-path branch.
529 related tests pass across registry + lifecycle + classification +
tool_service + cache + sighup suites.
Signed-off-by: Jonathan Springer <jps@s390x.com>