fix: session pool resource exhaustion by marekdano · Pull Request #3952 · IBM/mcp-context-forge

marekdano · 2026-03-31T15:15:56Z

🐛 Bug-fix PR

📝 Summary

This PR fixes session pool resource exhaustion by implementing global session caps and JWT identity extraction to prevent bucket explosion from rotating tokens.

Problem: The session pool could create an effectively unbounded number of connections because the number of pool buckets grows multiplicatively with each factor:

max_connections = num_users × num_servers × num_transports × num_token_rotations × MAX_PER_KEY

Example blast radius: 10 users × 3 servers × 2 transports × 32 token rotations × 200 sessions = 384,000 open connections
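The arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula as stated in this description; `worst_case_connections` is a hypothetical helper, not a function in the gateway.

```python
from math import prod


def worst_case_connections(users, servers, transports, token_rotations, max_per_key):
    """Upper bound on open connections when every factor yields a distinct pool key."""
    return prod([users, servers, transports, token_rotations, max_per_key])


# Example from the description: MAX_PER_KEY=200 and 32 token rotations per user.
before = worst_case_connections(10, 3, 2, 32, 200)

# Same deployment with the rotation factor removed (identity extraction)
# and MAX_PER_KEY lowered to 10.
after = worst_case_connections(10, 3, 2, 1, 10)

print(before, after)
```

Removing the rotation factor and lowering the per-key limit shrink the bound independently, which is why the PR treats them as separate mitigations.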

Solution: Multi-layered mitigation:

Global caps: MCP_SESSION_POOL_MAX_TOTAL_KEYS and MCP_SESSION_POOL_MAX_TOTAL_SESSIONS to hard-limit resource usage
JWT identity extraction: MCP_SESSION_POOL_JWT_IDENTITY_EXTRACTION extracts stable user ID (sub/email/user_id) to eliminate token rotation factor
Safer defaults: Reduced MAX_PER_KEY from 200 to 10
Security hardening: Token oversight endpoints now require un-narrowed admin access

🏷️ Type of Change

🔧 Changes

Core Session Pool (`mcpgateway/services/mcp_session_pool.py`)

✅ Enforce max_total_keys limit in _get_or_create_pool() - raises RuntimeError when exceeded
✅ Enforce max_total_sessions limit in acquire() - raises asyncio.TimeoutError when exceeded
✅ Emit warning at 80% capacity threshold
✅ Convert pool key limit errors to user-friendly timeout errors
✅ Updated stats to include total_active_sessions, max_total_keys, max_total_sessions
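As a rough illustration of the key-cap behavior listed above, the following sketch models only the `max_total_keys` check and the 80% warning. The class and method names are hypothetical and the real `MCPSessionPool` is asynchronous and considerably more involved; treat this as a simplified model, not the actual implementation.

```python
import logging
from typing import Dict, List

logger = logging.getLogger("session_pool_sketch")


class KeyCappedPool:
    """Toy pool registry that models only the global key cap (hypothetical)."""

    def __init__(self, max_total_keys: int = 0, warn_ratio: float = 0.8) -> None:
        self.max_total_keys = max_total_keys  # 0 means unlimited
        self.warn_ratio = warn_ratio
        self._pools: Dict[str, List[object]] = {}

    def get_or_create_pool(self, key: str) -> List[object]:
        existing = self._pools.get(key)
        if existing is not None:
            return existing  # known keys are never rejected by the cap
        if self.max_total_keys > 0:
            if len(self._pools) >= self.max_total_keys:
                raise RuntimeError(
                    f"pool key limit reached ({self.max_total_keys} keys)"
                )
            # Warn whenever usage, counting the key being added, is at or
            # above the warning threshold.
            if len(self._pools) + 1 >= self.warn_ratio * self.max_total_keys:
                logger.warning(
                    "session pool keys at %d of %d",
                    len(self._pools) + 1,
                    self.max_total_keys,
                )
        pool: List[object] = []
        self._pools[key] = pool
        return pool
```

With `max_total_keys=0` the check is skipped entirely, which matches the "0 = unlimited" semantics described in the configuration section.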

Configuration (`mcpgateway/config.py`)

✅ MCP_SESSION_POOL_MAX_TOTAL_KEYS: Global cap on pool keys (default 0 = unlimited)
✅ MCP_SESSION_POOL_MAX_TOTAL_SESSIONS: Global cap on active sessions (default 0 = unlimited, soft cap)
✅ MCP_SESSION_POOL_JWT_IDENTITY_EXTRACTION: Extract stable user ID from JWT to prevent bucket explosion

JWT Identity Extraction (`mcpgateway/main.py`)

✅ New _create_jwt_identity_extractor() function decodes JWT without signature verification
✅ Extracts stable user ID from sub, email, or user_id claims
✅ Prevents bucket explosion when using short-lived JWTs with rotating jti/exp/iat
✅ Integrated into session pool initialization when enabled
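To make the idea concrete, here is a minimal sketch of extracting a stable identity from a JWT payload without verifying the signature. It uses only the standard library rather than the JWT library the gateway uses, and `extract_stable_identity` is a hypothetical name. The result is suitable only as a pool bucketing key, never as proof of who the caller is.

```python
import base64
import json
from typing import Optional


def extract_stable_identity(token: str) -> Optional[str]:
    """Return sub, email, or user_id from a JWT payload, or None if unreadable.

    The signature is NOT verified, so the value must only be used to choose
    a pool bucket and never to make an authorization decision.
    """
    parts = token.split(".")
    if len(parts) != 3:
        return None
    payload_b64 = parts[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    except ValueError:
        # Covers base64 errors, invalid JSON, and invalid UTF-8.
        return None
    if not isinstance(claims, dict):
        return None
    # Rotating claims such as jti, exp, and iat are deliberately ignored.
    for claim in ("sub", "email", "user_id"):
        value = claims.get(claim)
        if isinstance(value, str) and value:
            return value
    return None
```

Two tokens issued to the same user with different `jti` or `exp` values map to the same identity, so they share one pool bucket instead of creating a new one per rotation.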

Security Hardening (`mcpgateway/routers/tokens.py`)

✅ Token oversight endpoints (list_all_tokens, admin_revoke_token) now require un-narrowed admin access
✅ Blocks narrowed/public-only admin sessions (token_teams != None) from token management
✅ Prevents API tokens with team-scoped admin privileges from enumerating/managing tokens outside their scope
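The guard described above can be sketched as two separate checks. The function name and the use of `PermissionError` are assumptions made for illustration; the actual router presumably raises an HTTP error instead.

```python
from typing import Optional, Sequence


def require_unnarrowed_admin(is_admin: bool, token_teams: Optional[Sequence[str]]) -> None:
    """Allow only admin sessions whose token carries no team narrowing."""
    if not is_admin:
        raise PermissionError("admin access required")
    # None means the session is not narrowed. An empty list (public-only)
    # or a list of team IDs both count as narrowed and are rejected.
    if token_teams is not None:
        raise PermissionError("narrowed admin sessions cannot manage tokens")
```

Keeping the two checks separate lets each failure produce its own error message.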

RBAC Layer Separation

✅ Removed token_teams parameter from RBAC permission checks (mcpgateway/middleware/rbac.py, mcpgateway/services/permission_service.py)
✅ Clarified that token-level narrowing (Layer 1) is enforced separately via data filtering
✅ High-risk endpoints enforce their own narrowing guards where required

Docker Compose (`docker-compose.yml`)

✅ Reduced default MCP_SESSION_POOL_MAX_PER_KEY from 200 to 10 (safer default)
✅ Added extensive documentation about blast radius calculation
✅ Documented mitigation strategies with recommended starting values

Test Coverage (`tests/unit/mcpgateway/services/test_mcp_session_pool_coverage.py`)

✅ TestMaxTotalKeysLimit: Tests RuntimeError enforcement and conversion to TimeoutError
✅ TestMaxTotalSessionsLimit: Tests soft cap enforcement and session reuse after release
✅ TestJwtIdentityExtractorNone: Tests fallback behavior when JWT extraction fails
✅ Tests for 80% warning threshold emission
✅ +225 lines of new test coverage

🧪 Verification

| Check | Command | Status |
| --- | --- | --- |
| Lint suite | `make lint` | ✅ |
| Unit tests | `make test` | ✅ |
| Coverage ≥ 80% | `make coverage` | ✅ |

✅ Checklist

Code formatted (make black isort pre-commit)
Tests added/updated for changes
Documentation updated (docker-compose.yml comments)
No secrets or credentials committed
Signed commits (DCO)

📓 Mitigation Strategies

Operators should configure based on expected load:

Strategy 1: JWT Identity Extraction (Recommended)

MCP_SESSION_POOL_JWT_IDENTITY_EXTRACTION=true
MCP_SESSION_POOL_MAX_PER_KEY=20

Eliminates token rotation factor, allows higher per-key limits.

Strategy 2: Global Caps

MCP_SESSION_POOL_MAX_TOTAL_KEYS=1000
MCP_SESSION_POOL_MAX_TOTAL_SESSIONS=5000
MCP_SESSION_POOL_MAX_PER_KEY=10

Hard limits prevent unbounded growth.

Strategy 3: Conservative Defaults (Current)

MCP_SESSION_POOL_MAX_PER_KEY=10
# Global caps disabled (0 = unlimited)

Safe for small deployments, requires manual tuning for scale.

🔄 Backward Compatibility

✅ Fully backward compatible - all new limits default to 0 (unlimited) to preserve existing behavior. Operators must explicitly configure limits based on their deployment scale.

📊 Impact

Files changed: 13 files (+650 lines, -55 lines)
Test coverage: +225 lines of comprehensive test coverage
Breaking changes: None
Migration required: No (opt-in configuration)

Lang-Akshay

Thanks @marekdano for the PR.
Please address the following findings.
Status: ⚠️ Partial — 3 findings addressed in code, but a startup crash from missing init_mcp_session_pool parameters prevents deployment.

| Finding | Status | Notes |
| --- | --- | --- |
| Finding 1: Global cap | ✅ Done | `max_total_keys` and `max_total_sessions` added to `MCPSessionPool.__init__` with enforcement in `_get_or_create_pool()` and `acquire()`. However, `init_mcp_session_pool()` does not accept or forward these params, so the application crashes at startup (see Bug 2 below). |
| Finding 2: Safer defaults | ✅ Done | `MAX_PER_KEY` reduced from `200` to `10` in `docker-compose.yml` with extensive blast-radius documentation. |
| Finding 3: JWT identity extractor | ✅ Done | `_create_jwt_identity_extractor()` implemented and wired into `lifespan`. Blocked by the same startup crash. |

🔴 Critical Bugs

Bug 1 — `NameError` in `check_permission()` breaks all RBAC checks

File: permission_service.py:127, 137 | Severity: Critical | CWE: CWE-755

token_teams was removed from the check_permission() signature but is still referenced in the function body. This causes a NameError on every call, making every RBAC-protected endpoint return 403. The application fails closed, but is entirely non-functional.

Bug 2 — `TypeError` at startup crashes the application

File: mcp_session_pool.py:2115–2132 | Severity: Critical | CWE: CWE-453

init_mcp_session_pool() is missing the max_total_keys and max_total_sessions parameters. When main.py passes them at startup, a TypeError is raised and the application fails to boot entirely.
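The failure mode can be reproduced in isolation. The two functions below are hypothetical stand-ins for `init_mcp_session_pool()` before and after the fix; only the parameter names are taken from this review.

```python
def init_pool_before_fix(*, max_per_key: int = 10) -> dict:
    """Stand-in for the initializer that lacks the new parameters."""
    return {"max_per_key": max_per_key}


def init_pool_after_fix(
    *, max_per_key: int = 10, max_total_keys: int = 0, max_total_sessions: int = 0
) -> dict:
    """Stand-in for the initializer that accepts and forwards the new caps."""
    return {
        "max_per_key": max_per_key,
        "max_total_keys": max_total_keys,
        "max_total_sessions": max_total_sessions,
    }


settings = {"max_per_key": 10, "max_total_keys": 1000, "max_total_sessions": 5000}

startup_error = None
try:
    # The caller passes the new caps, but the old signature rejects them.
    init_pool_before_fix(**settings)
except TypeError as exc:
    startup_error = str(exc)

configured = init_pool_after_fix(**settings)
```

Because the call happens during startup, the `TypeError` surfaces before any request is served, which is consistent with the application failing to boot.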

⚠️ Medium / Low Findings

Finding 3 — JWT decoded without signature verification

File: main.py:1539 | Severity: Medium | CWE: CWE-345

JWT is decoded without signature verification. This is acceptable for session pool bucketing (not a security boundary), but crafted JWTs could influence pool key selection. Add a defensive comment or restrict to JWT-authenticated requests only.

Finding 4 — Dead code: duplicate narrowed-admin guards

File: tokens.py:557–559, 654–657 | Severity: Low | CWE: CWE-561

Duplicate narrowed-admin guards — the second check is dead code. The first guard at L555/L651 already catches token_teams is not None. Test assertions will match the wrong error message.

Finding 5 — `RuntimeError` masked as `asyncio.TimeoutError`

File: mcp_session_pool.py:840 | Severity: Low | CWE: CWE-390

RuntimeError is converted to asyncio.TimeoutError, masking capacity exhaustion as a timeout in monitoring.
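One way to keep the conversion while leaving the root cause observable is exception chaining. This is a sketch of that option under assumed names (`acquire` and `get_or_create_pool` here are simplified stand-ins), not a description of what the PR does.

```python
import asyncio
from typing import Callable, List


async def acquire(get_or_create_pool: Callable[[str], List[object]], key: str) -> List[object]:
    """Convert a capacity error into a timeout while preserving the cause."""
    try:
        return get_or_create_pool(key)
    except RuntimeError as exc:
        # Callers already handle timeouts; "from exc" keeps the original
        # capacity error available as __cause__ for logs and monitoring.
        raise asyncio.TimeoutError(f"session pool at capacity: {exc}") from exc
```

Monitoring code can then inspect `__cause__` to tell capacity exhaustion apart from a genuine timeout.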

Invariant Violations

Two-layer model broken: Layer 2 (check_permission) is entirely non-functional due to the NameError (Bug 1). Even after a fix, removing token_teams from Layer 2 means public-only admin tokens get an admin bypass — this contradicts the security comment at permission_service.py:124–126. Either fully remove the dead token_teams logic from the body, or re-add the parameter.
Deny-path gap: No integration test calls check_permission() without mocking, so the NameError is hidden. The existing test mocks the service, concealing this bug entirely.
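The deny-path gap is easy to demonstrate. In this sketch, `PermissionService` and `endpoint_allows` are hypothetical reductions of the real code; the point is only that a mocked service never executes the broken body.

```python
from unittest.mock import MagicMock


class PermissionService:
    """Hypothetical service with the Bug 1 shape: a removed parameter is still read."""

    def check_permission(self, is_admin: bool) -> bool:
        # token_teams is no longer a parameter, so this lookup fails at call time.
        if token_teams is not None and len(token_teams) == 0:  # noqa: F821
            return False
        return is_admin


def endpoint_allows(service: PermissionService, is_admin: bool) -> bool:
    return service.check_permission(is_admin)


# A test that mocks the service never runs the broken body, so it passes.
mocked = MagicMock(spec=PermissionService)
mocked.check_permission.return_value = True
mocked_result = endpoint_allows(mocked, True)

# Calling the real method surfaces the defect immediately.
try:
    endpoint_allows(PermissionService(), True)
    real_error = None
except NameError as exc:
    real_error = exc
```

At least one test per protected path that exercises the real `check_permission()` would have caught the `NameError`.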

Redundant Code to Remove

| # | File | Line(s) | Type | Description | Suggestion |
| --- | --- | --- | --- | --- | --- |
| 1 | `tokens.py` | 557–560 | Dead code | Second narrowed-admin guard after first guard already catches same condition | Remove lines 557–560 |
| 2 | `tokens.py` | 654–657 | Dead code | Same duplicate guard in `admin_revoke_token` | Remove lines 654–657 |
| 3 | `permission_service.py` | 124–137 | Unreachable / broken | `token_teams` logic references undefined variable; if fixed by removing, the whole block is dead code since no caller passes `token_teams` anymore | Either remove the block or re-add parameter forwarding |
| 4 | `test_main_extended.py` | 228–229 | Broken structure | `TestConditionalPaths` emptied to just `pass`; its methods are now accidentally inside `TestJwtIdentityExtractor` | Restore `TestConditionalPaths` body properly |

marekdano · 2026-03-31T17:28:55Z

Thanks for your review, @Lang-Akshay!

Fixes:

Bug 1: token_teams restored to check_permission() signature and body — NameError resolved
Bug 2: max_total_keys / max_total_sessions added to init_mcp_session_pool() and forwarded in main.py — startup crash resolved
Finding 3: Defensive SECURITY NOTE comment added around the verify_signature=False decode in main.py:1535
Finding 4: Dead duplicate guards removed from tokens.py — both list_all_tokens and admin_revoke_token now use clean two-step checks
Finding 5: Explanatory comment added at mcp_session_pool.py:839 for the intentional RuntimeError → TimeoutError conversion
TestConditionalPaths: Restored the 5 original test methods (test_import_uses_rust_mcp_proxy_when_enabled, test_import_keeps_python_transport_, test_import_warns_, test_redis_initialization_path, test_event_loop_task_creation) that had been replaced with pass

Lang-Akshay

Thanks @marekdano for the contribution! Please implement the following two changes:

1. Enable JWT identity extraction (`MCP_SESSION_POOL_JWT_IDENTITY_EXTRACTION=true`) by default. This is the most impactful change.

2. Incorrect use of built-in `callable` instead of `Callable` from `typing`

Location

main.py, line 1519

Current Code

def _create_jwt_identity_extractor() -> callable:

Problem

The return annotation uses the lowercase built-in callable, which is a function and not a type, instead of the Callable type from the typing module.

Correct Fix

Add Callable to the imports from typing on line ~42:

    from typing import Callable

Update line 1519 to:

    def _create_jwt_identity_extractor() -> Callable[[dict], Optional[str]]:

crivetimihai

Thanks @marekdano for a solid contribution — the blast-radius analysis in the PR description is excellent and the multi-layered mitigation approach is well-thought-out. I've rebased this onto main and made a few fixes on top (commit 11488dc65). Here's what I changed and why:

Changes made during review

1. Restored `token_teams` forwarding in `PermissionChecker.has_permission()` (`rbac.py`)

The PR removed token_teams from has_permission() but left it in has_any_permission() and the @require_permission decorator. This created an inconsistency: a public-only admin token (token_teams=[]) going through the RPC path (main.py:1013 via PermissionChecker.has_permission) would skip the admin-bypass suppression in check_permission(), because token_teams would default to None (unrestricted).

The @require_permission decorator still passes token_teams at lines 749/764, so the two-layer model was already being enforced for HTTP endpoints — just not for the PermissionChecker manual path. Restored the parameter to keep all paths consistent.

Also restored the corresponding tests in test_rbac.py and test_permission_service.py that verify public-only tokens suppress admin bypass.

2. Guarded `release()` against `RuntimeError` from `_get_or_create_pool()` (`mcp_session_pool.py`)

release() calls _get_or_create_pool() in two places (lines 1034, 1065). If max_total_keys is configured and the pool key was evicted while a session was checked out, _get_or_create_pool raises RuntimeError — which would leak the session and semaphore slot. Added try/except RuntimeError to gracefully discard the session and close it instead. Added test coverage for this path.
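A simplified model of the guarded release path is sketched below. `FakeSession`, `release`, and the way the semaphore is passed in are all assumptions for illustration; the real method has two call sites and more state to manage.

```python
import asyncio
from typing import Callable, List


class FakeSession:
    """Minimal stand-in for a pooled session."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def release(
    session: FakeSession,
    key: str,
    get_or_create_pool: Callable[[str], List[FakeSession]],
    slots: asyncio.Semaphore,
) -> bool:
    """Return True if the session went back to the pool, False if discarded."""
    try:
        try:
            pool = get_or_create_pool(key)
        except RuntimeError:
            # The key was evicted and cannot be re-created under the cap:
            # close the session instead of leaking it.
            await session.close()
            return False
        pool.append(session)
        return True
    finally:
        # The slot is freed on both paths so the semaphore never leaks.
        slots.release()
```

The `finally` block is the important part: whichever branch runs, the checked-out slot is returned.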

3. Fixed Bearer prefix stripping (`main.py`)

str.replace("Bearer ", "") replaces all occurrences, not just the prefix. Switched to startswith() + slice, which is the standard approach and handles edge cases correctly.
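The difference between the two approaches shows up as soon as the token text itself contains the prefix string. A small sketch, with `strip_bearer` as a hypothetical helper name:

```python
def strip_bearer(header_value: str) -> str:
    """Remove a leading 'Bearer ' prefix and leave the rest untouched."""
    prefix = "Bearer "
    if header_value.startswith(prefix):
        return header_value[len(prefix):]
    return header_value


header = "Bearer abc Bearer def"
with_replace = header.replace("Bearer ", "")  # removes every occurrence
with_prefix_strip = strip_bearer(header)      # removes only the leading one
```

On Python 3.9 and later, `str.removeprefix("Bearer ")` gives the same result as the `startswith()` plus slice form.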

4. Fixed mixed types in metrics (`mcp_session_pool.py`)

max_total_keys and max_total_sessions in get_metrics() returned int or "unlimited" string depending on the value. Consumers expecting a consistent type (JSON schema validation, Prometheus exporters) would break. Changed to always return int (0 = unlimited), consistent with the config semantics.

5. Removed orphaned test methods from `TestJwtIdentityExtractor` (`test_main_extended.py`)

Five methods from TestConditionalPaths (which exists on main) were accidentally duplicated inside TestJwtIdentityExtractor — an orphaned docstring """Test conditional code paths...""" at line 451 without a class statement caused the methods to become part of the wrong class. The test_client/auth_headers fixtures wouldn't resolve correctly under TestJwtIdentityExtractor. Removed the duplicates; the originals in TestConditionalPaths are untouched.

6. Added missing assertion in `test_jwt_identity_extractor_exception_falls_back`

The test computed identity_hash but had no assert — it passed vacuously. Added assert identity_hash != "anonymous" and await pool.close_all() for cleanup.

Everything else looks good

Token router security split (separate is_admin + token_teams checks) — correct and improves error messages
JWT decode without verification — acceptable for pool bucketing with good security comment
Thread safety / lock analysis — correct asyncio.Lock usage, no deadlock risk
identity_extractor integration — proper fallback chain in _compute_identity_hash
Global caps with soft-cap TOCTOU documented accurately
init_mcp_session_pool keyword-only * — good API hygiene
Test coverage is comprehensive across all new code paths

All 1457 affected tests passing.

- Restore token_teams forwarding in PermissionChecker.has_permission() to prevent privilege escalation for public-only admin tokens via RPC path (consistent with has_any_permission and @require_permission)
- Guard release() against RuntimeError from _get_or_create_pool() when max_total_keys limit is hit for a previously-evicted pool key
- Fix Bearer prefix stripping to use startswith() instead of replace()
- Use consistent int type for max_total_keys/max_total_sessions metrics
- Remove orphaned TestConditionalPaths methods from TestJwtIdentityExtractor
- Add missing assertion in test_jwt_identity_extractor_exception_falls_back
- Add test for release() graceful handling under max_total_keys limit
- Restore test coverage for public-only token admin bypass suppression

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: session pool resource exhaustion
  Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
* fix: lint issue
  Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
* fix: reported bugs by reviewer
  Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
* fix: lint and test coverage issues
  Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
* fix: lint issue
  Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
* fix: enable JWT identity extraction by default, replace callable by Callable from typing
  Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
* fix: address review findings for session pool resource exhaustion
  - Restore token_teams forwarding in PermissionChecker.has_permission() to prevent privilege escalation for public-only admin tokens via RPC path (consistent with has_any_permission and @require_permission)
  - Guard release() against RuntimeError from _get_or_create_pool() when max_total_keys limit is hit for a previously-evicted pool key
  - Fix Bearer prefix stripping to use startswith() instead of replace()
  - Use consistent int type for max_total_keys/max_total_sessions metrics
  - Remove orphaned TestConditionalPaths methods from TestJwtIdentityExtractor
  - Add missing assertion in test_jwt_identity_extractor_exception_falls_back
  - Add test for release() graceful handling under max_total_keys limit
  - Restore test coverage for public-only token admin bypass suppression
  Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* chore: update .secrets.baseline
  Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Marek Dano <Marek.Dano@ibm.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>

marekdano requested review from crivetimihai, kevalmahajan and madhav165 as code owners March 31, 2026 15:15

marekdano requested a review from Lang-Akshay March 31, 2026 15:16

Lang-Akshay requested changes Mar 31, 2026

marekdano requested a review from Lang-Akshay March 31, 2026 17:52

marekdano added this to the Release 1.0.0-RC3 milestone Apr 1, 2026

marekdano force-pushed the 3777-fix-session-pool-resource-exhaustion branch from 80d7953 to 22d931c Compare April 1, 2026 14:38

Lang-Akshay requested changes Apr 2, 2026

marekdano force-pushed the 3777-fix-session-pool-resource-exhaustion branch from 11096f9 to 20e4c4f Compare April 2, 2026 13:50

marekdano requested a review from Lang-Akshay April 2, 2026 14:44

crivetimihai self-assigned this Apr 6, 2026

Marek Dano added 6 commits April 6, 2026 11:18

fix: session pool resource exhaustion

a059c43

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>

fix: lint issue

ba6292b

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>

fix: reported bugs by reviewer

e582d08

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>

fix: lint and test coverage issues

05bd749

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>

fix: lint issue

47efa5d

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>

fix: enable JWT identity extraction by default, replace callable by Callable from typing

cb4886b

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>

crivetimihai force-pushed the 3777-fix-session-pool-resource-exhaustion branch from 20e4c4f to 93f237e Compare April 6, 2026 12:21

crivetimihai previously approved these changes Apr 6, 2026

crivetimihai dismissed their stale review via 3c0864e April 6, 2026 12:44

crivetimihai force-pushed the 3777-fix-session-pool-resource-exhaustion branch from 93f237e to 3c0864e Compare April 6, 2026 12:44

chore: update .secrets.baseline

57914fa

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

crivetimihai merged commit 0332ef2 into main Apr 6, 2026
27 checks passed

crivetimihai deleted the 3777-fix-session-pool-resource-exhaustion branch April 6, 2026 13:11

crivetimihai merged 8 commits into main from 3777-fix-session-pool-resource-exhaustion
