fix: session pool resource exhaustion#3952
Conversation
Lang-Akshay
left a comment
There was a problem hiding this comment.
Thanks @marekdano for the PR.
Please implement following findings
Status: init_mcp_session_pool parameters prevents deployment.
| Finding | Status | Notes |
|---|---|---|
| Finding 1 — Global cap | ✅ Done | max_total_keys and max_total_sessions added to MCPSessionPool.__init__ with enforcement in _get_or_create_pool() and acquire(). However, init_mcp_session_pool() does not accept or forward these params → they're silently ineffective (actually crashes at startup, see Bug #2 below). |
| Finding 2 — Safer defaults | ✅ Done | MAX_PER_KEY reduced from 200 to 10 in docker-compose.yml with extensive blast-radius documentation. |
| Finding 3 — JWT identity extractor | ✅ Done | _create_jwt_identity_extractor() implemented and wired into lifespan. Blocked by the same startup crash. |
🔴 Critical Bugs
Bug 1 — NameError in check_permission() breaks all RBAC checks
File: permission_service.py:127, 137 | Severity: Critical | CWE: CWE-755
token_teams was removed from the check_permission() signature but is still referenced in the function body. This causes a NameError on every call, making every RBAC-protected endpoint return 403. The application fails closed, but is entirely non-functional.
Bug 2 — TypeError at startup crashes the application
File: mcp_session_pool.py:2115–2132 | Severity: Critical | CWE: CWE-453
init_mcp_session_pool() is missing the max_total_keys and max_total_sessions parameters. When main.py passes them at startup, a TypeError is raised and the application fails to boot entirely.
⚠️ Medium / Low Findings
Finding 3 — JWT decoded without signature verification
File: main.py:1539 | Severity: Medium | CWE: CWE-345
JWT is decoded without signature verification. This is acceptable for session pool bucketing (not a security boundary), but crafted JWTs could influence pool key selection. Add a defensive comment or restrict to JWT-authenticated requests only.
Finding 4 — Dead code: duplicate narrowed-admin guards
File: tokens.py:557–559, 654–657 | Severity: Low | CWE: CWE-561
Duplicate narrowed-admin guards — the second check is dead code. The first guard at L555/L651 already catches token_teams is not None. Test assertions will match the wrong error message.
Finding 5 — RuntimeError masked as asyncio.TimeoutError
File: mcp_session_pool.py:840 | Severity: Low | CWE: CWE-390
RuntimeError is converted to asyncio.TimeoutError, masking capacity exhaustion as a timeout in monitoring.
Invariant Violations
-
Two-layer model broken: Layer 2 (
check_permission) is entirely non-functional due to theNameError(Bug 1). Even after a fix, removingtoken_teamsfrom Layer 2 means public-only admin tokens get an admin bypass — this contradicts the security comment atpermission_service.py:124–126. Either fully remove the deadtoken_teamslogic from the body, or re-add the parameter. -
Deny-path gap: No integration test calls
check_permission()without mocking, so theNameErroris hidden. The existing test mocks the service, concealing this bug entirely.
Redundant Code to Remove
| # | File | Line(s) | Type | Description | Suggestion |
|---|---|---|---|---|---|
| 1 | tokens.py |
557–560 | Dead code | Second narrowed-admin guard after first guard already catches same condition | Remove lines 557–560 |
| 2 | tokens.py |
654–657 | Dead code | Same duplicate guard in admin_revoke_token |
Remove lines 654–657 |
| 3 | permission_service.py |
124–137 | Unreachable / broken | token_teams logic references undefined variable; if fixed by removing, the whole block is dead code since no caller passes token_teams anymore |
Either remove the block or re-add parameter forwarding |
| 4 | test_main_extended.py |
228–229 | Broken structure | TestConditionalPaths emptied to just pass; its methods are now accidentally inside TestJwtIdentityExtractor |
Restore TestConditionalPaths body properly |
|
Thanks for your review, @Lang-Akshay! Fixes:
|
80d7953 to
22d931c
Compare
There was a problem hiding this comment.
Thanks @marekdano for the contribution! Please implement the following two changes:
1. Enable JWT identity extraction (MCP_SESSION_POOL_JWT_IDENTITY_EXTRACTION=true) Default. This is the most impactful change.
2. Incorrect use of built-in callable instead of Callable from typing
Location
main.py, line 1519
Current Code
def _create_jwt_identity_extractor() -> callable:Problem
Uses the lowercase built-in callable instead of the proper Callable type from the typing module.
Correct Fix
- Add
Callableto the imports fromtypingon line ~42:
from typing import Callable- Update line
1519to:
def _create_jwt_identity_extractor() -> Callable[[dict], Optional[str]]:11096f9 to
20e4c4f
Compare
Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
…allable from typing Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
20e4c4f to
93f237e
Compare
crivetimihai
left a comment
There was a problem hiding this comment.
Thanks @marekdano for a solid contribution — the blast-radius analysis in the PR description is excellent and the multi-layered mitigation approach is well-thought-out. I've rebased this onto main and made a few fixes on top (commit 11488dc65). Here's what I changed and why:
Changes made during review
1. Restored token_teams forwarding in PermissionChecker.has_permission() (rbac.py)
The PR removed token_teams from has_permission() but left it in has_any_permission() and the @require_permission decorator. This created an inconsistency: a public-only admin token (token_teams=[]) going through the RPC path (main.py:1013 via PermissionChecker.has_permission) would bypass the admin bypass suppression in check_permission(), since token_teams would default to None (unrestricted).
The @require_permission decorator still passes token_teams at lines 749/764, so the two-layer model was already being enforced for HTTP endpoints — just not for the PermissionChecker manual path. Restored the parameter to keep all paths consistent.
Also restored the corresponding tests in test_rbac.py and test_permission_service.py that verify public-only tokens suppress admin bypass.
2. Guarded release() against RuntimeError from _get_or_create_pool() (mcp_session_pool.py)
release() calls _get_or_create_pool() in two places (lines 1034, 1065). If max_total_keys is configured and the pool key was evicted while a session was checked out, _get_or_create_pool raises RuntimeError — which would leak the session and semaphore slot. Added try/except RuntimeError to gracefully discard the session and close it instead. Added test coverage for this path.
3. Fixed Bearer prefix stripping (main.py)
str.replace("Bearer ", "") replaces all occurrences, not just the prefix. Switched to startswith() + slice, which is the standard approach and handles edge cases correctly.
4. Fixed mixed types in metrics (mcp_session_pool.py)
max_total_keys and max_total_sessions in get_metrics() returned int or "unlimited" string depending on the value. Consumers expecting a consistent type (JSON schema validation, Prometheus exporters) would break. Changed to always return int (0 = unlimited), consistent with the config semantics.
5. Removed orphaned test methods from TestJwtIdentityExtractor (test_main_extended.py)
Five methods from TestConditionalPaths (which exists on main) were accidentally duplicated inside TestJwtIdentityExtractor — an orphaned docstring """Test conditional code paths...""" at line 451 without a class statement caused the methods to become part of the wrong class. The test_client/auth_headers fixtures wouldn't resolve correctly under TestJwtIdentityExtractor. Removed the duplicates; the originals in TestConditionalPaths are untouched.
6. Added missing assertion in test_jwt_identity_extractor_exception_falls_back
The test computed identity_hash but had no assert — it passed vacuously. Added assert identity_hash != "anonymous" and await pool.close_all() for cleanup.
Everything else looks good
- Token router security split (separate
is_admin+token_teamschecks) — correct and improves error messages - JWT decode without verification — acceptable for pool bucketing with good security comment
- Thread safety / lock analysis — correct asyncio.Lock usage, no deadlock risk
identity_extractorintegration — proper fallback chain in_compute_identity_hash- Global caps with soft-cap TOCTOU documented accurately
init_mcp_session_poolkeyword-only*— good API hygiene- Test coverage is comprehensive across all new code paths
All 1457 affected tests passing.
- Restore token_teams forwarding in PermissionChecker.has_permission() to prevent privilege escalation for public-only admin tokens via RPC path (consistent with has_any_permission and @require_permission) - Guard release() against RuntimeError from _get_or_create_pool() when max_total_keys limit is hit for a previously-evicted pool key - Fix Bearer prefix stripping to use startswith() instead of replace() - Use consistent int type for max_total_keys/max_total_sessions metrics - Remove orphaned TestConditionalPaths methods from TestJwtIdentityExtractor - Add missing assertion in test_jwt_identity_extractor_exception_falls_back - Add test for release() graceful handling under max_total_keys limit - Restore test coverage for public-only token admin bypass suppression Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
93f237e to
3c0864e
Compare
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: session pool resource exhaustion Signed-off-by: Marek Dano <Marek.Dano@ibm.com> * fix: lint issue Signed-off-by: Marek Dano <Marek.Dano@ibm.com> * fix: reported bugs by reviewer Signed-off-by: Marek Dano <Marek.Dano@ibm.com> * fix: lint and test coverage issues Signed-off-by: Marek Dano <Marek.Dano@ibm.com> * fix: lint issue Signed-off-by: Marek Dano <Marek.Dano@ibm.com> * fix: enable JWT identity extraction by default, replace callable by Callable from typing Signed-off-by: Marek Dano <Marek.Dano@ibm.com> * fix: address review findings for session pool resource exhaustion - Restore token_teams forwarding in PermissionChecker.has_permission() to prevent privilege escalation for public-only admin tokens via RPC path (consistent with has_any_permission and @require_permission) - Guard release() against RuntimeError from _get_or_create_pool() when max_total_keys limit is hit for a previously-evicted pool key - Fix Bearer prefix stripping to use startswith() instead of replace() - Use consistent int type for max_total_keys/max_total_sessions metrics - Remove orphaned TestConditionalPaths methods from TestJwtIdentityExtractor - Add missing assertion in test_jwt_identity_extractor_exception_falls_back - Add test for release() graceful handling under max_total_keys limit - Restore test coverage for public-only token admin bypass suppression Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * chore: update .secrets.baseline Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Marek Dano <Marek.Dano@ibm.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Marek Dano <Marek.Dano@ibm.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: session pool resource exhaustion Signed-off-by: Marek Dano <Marek.Dano@ibm.com> * fix: lint issue Signed-off-by: Marek Dano <Marek.Dano@ibm.com> * fix: reported bugs by reviewer Signed-off-by: Marek Dano <Marek.Dano@ibm.com> * fix: lint and test coverage issues Signed-off-by: Marek Dano <Marek.Dano@ibm.com> * fix: lint issue Signed-off-by: Marek Dano <Marek.Dano@ibm.com> * fix: enable JWT identity extraction by default, replace callable by Callable from typing Signed-off-by: Marek Dano <Marek.Dano@ibm.com> * fix: address review findings for session pool resource exhaustion - Restore token_teams forwarding in PermissionChecker.has_permission() to prevent privilege escalation for public-only admin tokens via RPC path (consistent with has_any_permission and @require_permission) - Guard release() against RuntimeError from _get_or_create_pool() when max_total_keys limit is hit for a previously-evicted pool key - Fix Bearer prefix stripping to use startswith() instead of replace() - Use consistent int type for max_total_keys/max_total_sessions metrics - Remove orphaned TestConditionalPaths methods from TestJwtIdentityExtractor - Add missing assertion in test_jwt_identity_extractor_exception_falls_back - Add test for release() graceful handling under max_total_keys limit - Restore test coverage for public-only token admin bypass suppression Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * chore: update .secrets.baseline Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Marek Dano <Marek.Dano@ibm.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Marek Dano <Marek.Dano@ibm.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
🐛 Bug-fix PR
Closes #3777
📝 Summary
This PR fixes session pool resource exhaustion by implementing global session caps and JWT identity extraction to prevent bucket explosion from rotating tokens.
Problem: The session pool could create unbounded connections due to exponential bucket growth:
Example blast radius: 10 users × 3 servers × 2 transports × 32 token rotations × 200 sessions = 384,000 open connections
Solution: Multi-layered mitigation:
MCP_SESSION_POOL_MAX_TOTAL_KEYSandMCP_SESSION_POOL_MAX_TOTAL_SESSIONSto hard-limit resource usageMCP_SESSION_POOL_JWT_IDENTITY_EXTRACTIONextracts stable user ID (sub/email/user_id) to eliminate token rotation factorMAX_PER_KEYfrom 200 to 10🏷️ Type of Change
🔧 Changes
Core Session Pool (
mcpgateway/services/mcp_session_pool.py)max_total_keyslimit in_get_or_create_pool()- raisesRuntimeErrorwhen exceededmax_total_sessionslimit inacquire()- raisesasyncio.TimeoutErrorwhen exceededtotal_active_sessions,max_total_keys,max_total_sessionsConfiguration (
mcpgateway/config.py)MCP_SESSION_POOL_MAX_TOTAL_KEYS: Global cap on pool keys (default 0 = unlimited)MCP_SESSION_POOL_MAX_TOTAL_SESSIONS: Global cap on active sessions (default 0 = unlimited, soft cap)MCP_SESSION_POOL_JWT_IDENTITY_EXTRACTION: Extract stable user ID from JWT to prevent bucket explosionJWT Identity Extraction (
mcpgateway/main.py)_create_jwt_identity_extractor()function decodes JWT without signature verificationsub,email, oruser_idclaimsjti/exp/iatSecurity Hardening (
mcpgateway/routers/tokens.py)list_all_tokens,admin_revoke_token) now require un-narrowed admin accesstoken_teams != None) from token managementRBAC Layer Separation
token_teamsparameter from RBAC permission checks (mcpgateway/middleware/rbac.py,mcpgateway/services/permission_service.py)Docker Compose (
docker-compose.yml)MCP_SESSION_POOL_MAX_PER_KEYfrom 200 to 10 (safer default)Test Coverage (
tests/unit/mcpgateway/services/test_mcp_session_pool_coverage.py)TestMaxTotalKeysLimit: Tests RuntimeError enforcement and conversion to TimeoutErrorTestMaxTotalSessionsLimit: Tests soft cap enforcement and session reuse after releaseTestJwtIdentityExtractorNone: Tests fallback behavior when JWT extraction fails🧪 Verification
make lintmake testmake coverage✅ Checklist
make black isort pre-commit)📓 Mitigation Strategies
Operators should configure based on expected load:
Strategy 1: JWT Identity Extraction (Recommended)
Eliminates token rotation factor, allows higher per-key limits.
Strategy 2: Global Caps
Hard limits prevent unbounded growth.
Strategy 3: Conservative Defaults (Current)
MCP_SESSION_POOL_MAX_PER_KEY=10 # Global caps disabled (0 = unlimited)Safe for small deployments, requires manual tuning for scale.
🔄 Backward Compatibility
✅ Fully backward compatible - all new limits default to 0 (unlimited) to preserve existing behavior. Operators must explicitly configure limits based on their deployment scale.
📊 Impact