fix: session pool resource exhaustion #3952

Merged
crivetimihai merged 8 commits into main from 3777-fix-session-pool-resource-exhaustion
Apr 6, 2026

@marekdano
Collaborator

🐛 Bug-fix PR

Closes #3777


📝 Summary

This PR fixes session pool resource exhaustion by implementing global session caps and JWT identity extraction to prevent bucket explosion from rotating tokens.

Problem: The session pool could create an unbounded number of connections because the bucket count grows multiplicatively across several factors:

max_connections = num_users × num_servers × num_transports × num_token_rotations × MAX_PER_KEY

Example blast radius: 10 users × 3 servers × 2 transports × 32 token rotations × 200 sessions = 384,000 open connections
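
As a quick check of the arithmetic above, the bound can be computed directly. This is only an illustrative sketch: the helper name `worst_case_connections` is hypothetical, and the second call assumes that a stable identity reduces the rotation factor to 1 and that the per-key limit is the new default of 10.

```python
from math import prod


def worst_case_connections(users, servers, transports, token_rotations, max_per_key):
    """Upper bound on open connections if every bucket fills to max_per_key."""
    return prod((users, servers, transports, token_rotations, max_per_key))


# Figures from the example above.
before = worst_case_connections(10, 3, 2, 32, 200)

# Assumption: identity extraction collapses the rotation factor to 1,
# and MAX_PER_KEY is lowered to 10.
after = worst_case_connections(10, 3, 2, 1, 10)

print(before, after)  # 384000 600
```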

Solution: Multi-layered mitigation:

  1. Global caps: MCP_SESSION_POOL_MAX_TOTAL_KEYS and MCP_SESSION_POOL_MAX_TOTAL_SESSIONS to hard-limit resource usage
  2. JWT identity extraction: MCP_SESSION_POOL_JWT_IDENTITY_EXTRACTION extracts stable user ID (sub/email/user_id) to eliminate token rotation factor
  3. Safer defaults: Reduced MAX_PER_KEY from 200 to 10
  4. Security hardening: Token oversight endpoints now require un-narrowed admin access

🏷️ Type of Change

  • Bug fix
  • Feature / Enhancement
  • Documentation
  • Refactor
  • Chore (deps, CI, tooling)
  • Other (describe below)

🔧 Changes

Core Session Pool (mcpgateway/services/mcp_session_pool.py)

  • ✅ Enforce max_total_keys limit in _get_or_create_pool() - raises RuntimeError when exceeded
  • ✅ Enforce max_total_sessions limit in acquire() - raises asyncio.TimeoutError when exceeded
  • ✅ Emit warning at 80% capacity threshold
  • ✅ Convert pool key limit errors to user-friendly timeout errors
  • ✅ Updated stats to include total_active_sessions, max_total_keys, max_total_sessions
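
The enforcement described in this list could look roughly like the toy pool below. It is a sketch under stated assumptions, not the code in mcp_session_pool.py: the class `CappedPool` and its attributes are hypothetical, it only tracks counters rather than real sessions, and the 80% warning is applied to pool keys only.

```python
import asyncio
import logging

logger = logging.getLogger("pool_sketch")


class CappedPool:
    """Toy pool that tracks counts only; no real connections are opened."""

    def __init__(self, max_total_keys=0, max_total_sessions=0):
        self.max_total_keys = max_total_keys          # 0 = unlimited
        self.max_total_sessions = max_total_sessions  # 0 = unlimited
        self._pools = {}   # pool key -> sessions currently checked out
        self._active = 0   # sessions checked out across all keys

    def _get_or_create_pool(self, key):
        if key in self._pools:
            return
        if self.max_total_keys and len(self._pools) >= self.max_total_keys:
            raise RuntimeError(f"pool key limit reached ({self.max_total_keys})")
        self._pools[key] = 0
        # Warn once usage reaches 80% of the configured key limit.
        if self.max_total_keys and len(self._pools) >= 0.8 * self.max_total_keys:
            logger.warning("pool keys at %d/%d", len(self._pools), self.max_total_keys)

    async def acquire(self, key):
        try:
            self._get_or_create_pool(key)
        except RuntimeError as exc:
            # Surface capacity exhaustion as a timeout, keeping the cause attached.
            raise asyncio.TimeoutError(str(exc)) from exc
        if self.max_total_sessions and self._active >= self.max_total_sessions:
            raise asyncio.TimeoutError("global session limit reached")
        self._pools[key] += 1
        self._active += 1

    async def release(self, key):
        self._pools[key] -= 1
        self._active -= 1


async def demo():
    pool = CappedPool(max_total_keys=1, max_total_sessions=1)
    await pool.acquire("user-a")
    causes = []
    for key in ("user-a", "user-b"):
        try:
            await pool.acquire(key)
        except asyncio.TimeoutError as exc:
            causes.append(type(exc.__cause__).__name__)
    await pool.release("user-a")
    await pool.acquire("user-a")  # succeeds again once the session is released
    return causes


causes = asyncio.run(demo())
```

In this sketch a caller can still tell the two failure modes apart through `__cause__`: the session cap raises a bare timeout, while the key cap carries the original RuntimeError.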

Configuration (mcpgateway/config.py)

  • MCP_SESSION_POOL_MAX_TOTAL_KEYS: Global cap on pool keys (default 0 = unlimited)
  • MCP_SESSION_POOL_MAX_TOTAL_SESSIONS: Global cap on active sessions (default 0 = unlimited, soft cap)
  • MCP_SESSION_POOL_JWT_IDENTITY_EXTRACTION: Extract stable user ID from JWT to prevent bucket explosion

JWT Identity Extraction (mcpgateway/main.py)

  • ✅ New _create_jwt_identity_extractor() function decodes JWT without signature verification
  • ✅ Extracts stable user ID from sub, email, or user_id claims
  • ✅ Prevents bucket explosion when using short-lived JWTs with rotating jti/exp/iat
  • ✅ Integrated into session pool initialization when enabled
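
A minimal sketch of the idea, assuming the extractor receives a headers dict with an `Authorization` key and returns the first non-empty claim among sub, email, and user_id. It uses only the standard library to read the payload segment, so it is not the PR's implementation; `make_identity_extractor` and `fake_jwt` are hypothetical names. Because the signature is never checked, the result is suitable for bucketing only, never for authorization.

```python
import base64
import json
from typing import Callable, Optional


def make_identity_extractor() -> Callable[[dict], Optional[str]]:
    def extract(headers: dict) -> Optional[str]:
        auth = headers.get("Authorization", "")
        if not auth.startswith("Bearer "):
            return None
        parts = auth[len("Bearer "):].split(".")
        if len(parts) != 3:
            return None
        try:
            # JWT segments are base64url without padding; restore it first.
            padded = parts[1] + "=" * (-len(parts[1]) % 4)
            claims = json.loads(base64.urlsafe_b64decode(padded))
        except ValueError:  # covers bad base64, bad UTF-8 and bad JSON
            return None
        if not isinstance(claims, dict):
            return None
        for name in ("sub", "email", "user_id"):
            value = claims.get(name)
            if value:
                return str(value)
        return None

    return extract


def fake_jwt(claims: dict) -> str:
    """Build an unsigned token for the demo; the signature part is a placeholder."""
    def enc(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return f"{enc({'alg': 'none'})}.{enc(claims)}.sig"


extract = make_identity_extractor()
first = extract({"Authorization": "Bearer " + fake_jwt({"sub": "alice", "jti": "1"})})
second = extract({"Authorization": "Bearer " + fake_jwt({"sub": "alice", "jti": "2"})})
print(first, second)  # alice alice
```

Two tokens that differ only in jti map to the same identity, so they would share one pool bucket instead of creating two.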

Security Hardening (mcpgateway/routers/tokens.py)

  • ✅ Token oversight endpoints (list_all_tokens, admin_revoke_token) now require un-narrowed admin access
  • ✅ Blocks narrowed/public-only admin sessions (token_teams != None) from token management
  • ✅ Prevents API tokens with team-scoped admin privileges from enumerating/managing tokens outside their scope
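
The guard described here can be pictured as a two-step check. The sketch below is an assumption about its shape, not the code in tokens.py: `Forbidden`, `require_unnarrowed_admin`, and `is_allowed` are hypothetical names, and `token_teams=None` is taken to mean an un-narrowed session, as stated above.

```python
from typing import Optional, Sequence


class Forbidden(Exception):
    """Stand-in for an HTTP 403 response."""


def require_unnarrowed_admin(is_admin: bool, token_teams: Optional[Sequence[str]]) -> None:
    # Step 1: the caller must be an admin at all.
    if not is_admin:
        raise Forbidden("admin access required")
    # Step 2: any narrowing, including an empty (public-only) team list, is rejected.
    if token_teams is not None:
        raise Forbidden("un-narrowed admin access required")


def is_allowed(is_admin: bool, token_teams: Optional[Sequence[str]]) -> bool:
    try:
        require_unnarrowed_admin(is_admin, token_teams)
    except Forbidden:
        return False
    return True
```

Keeping the two checks separate lets each failure carry its own error message.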

RBAC Layer Separation

  • ✅ Removed token_teams parameter from RBAC permission checks (mcpgateway/middleware/rbac.py, mcpgateway/services/permission_service.py)
  • ✅ Clarified that token-level narrowing (Layer 1) is enforced separately via data filtering
  • ✅ High-risk endpoints enforce their own narrowing guards where required

Docker Compose (docker-compose.yml)

  • ✅ Reduced default MCP_SESSION_POOL_MAX_PER_KEY from 200 to 10 (safer default)
  • ✅ Added extensive documentation about blast radius calculation
  • ✅ Documented mitigation strategies with recommended starting values

Test Coverage (tests/unit/mcpgateway/services/test_mcp_session_pool_coverage.py)

  • TestMaxTotalKeysLimit: Tests RuntimeError enforcement and conversion to TimeoutError
  • TestMaxTotalSessionsLimit: Tests soft cap enforcement and session reuse after release
  • TestJwtIdentityExtractorNone: Tests fallback behavior when JWT extraction fails
  • ✅ Tests for 80% warning threshold emission
  • ✅ +225 lines of new test coverage

🧪 Verification

Check | Command | Status
Lint suite | make lint |
Unit tests | make test |
Coverage ≥ 80% | make coverage |

✅ Checklist

  • Code formatted (make black isort pre-commit)
  • Tests added/updated for changes
  • Documentation updated (docker-compose.yml comments)
  • No secrets or credentials committed
  • Signed commits (DCO)

📓 Mitigation Strategies

Operators should configure based on expected load:

Strategy 1: JWT Identity Extraction (Recommended)

MCP_SESSION_POOL_JWT_IDENTITY_EXTRACTION=true
MCP_SESSION_POOL_MAX_PER_KEY=20

Eliminates token rotation factor, allows higher per-key limits.

Strategy 2: Global Caps

MCP_SESSION_POOL_MAX_TOTAL_KEYS=1000
MCP_SESSION_POOL_MAX_TOTAL_SESSIONS=5000
MCP_SESSION_POOL_MAX_PER_KEY=10

Hard limits prevent unbounded growth.

Strategy 3: Conservative Defaults (Current)

MCP_SESSION_POOL_MAX_PER_KEY=10
# Global caps disabled (0 = unlimited)

Safe for small deployments, requires manual tuning for scale.
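
To compare the strategies, a rough ceiling on pooled sessions can be derived from the three settings. This sketch treats the session cap as exact even though it is described as a soft cap, so the real peak may briefly exceed it; `session_ceiling` is a hypothetical helper.

```python
def session_ceiling(max_per_key, max_total_keys=0, max_total_sessions=0):
    """Approximate ceiling on pooled sessions, or None when nothing bounds it."""
    bounds = []
    if max_total_keys:
        bounds.append(max_total_keys * max_per_key)
    if max_total_sessions:
        bounds.append(max_total_sessions)
    return min(bounds) if bounds else None


strategy_2 = session_ceiling(10, max_total_keys=1000, max_total_sessions=5000)
strategy_3 = session_ceiling(10)  # global caps left at 0 (unlimited)
print(strategy_2, strategy_3)  # 5000 None
```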


🔄 Backward Compatibility

Fully backward compatible - all new limits default to 0 (unlimited) to preserve existing behavior. Operators must explicitly configure limits based on their deployment scale.


📊 Impact

  • Files changed: 13 files (+650 lines, -55 lines)
  • Test coverage: +225 lines of comprehensive test coverage
  • Breaking changes: None
  • Migration required: No (opt-in configuration)

Collaborator

@Lang-Akshay Lang-Akshay left a comment

Thanks @marekdano for the PR.
Please address the following findings.
Status: ⚠️ Partial — 3 findings addressed in code, but a startup crash from missing init_mcp_session_pool parameters prevents deployment.

Finding | Status | Notes
Finding 1 — Global cap | ✅ Done | max_total_keys and max_total_sessions added to MCPSessionPool.__init__ with enforcement in _get_or_create_pool() and acquire(). However, init_mcp_session_pool() does not accept or forward these params → they're silently ineffective (actually crashes at startup, see Bug #2 below).
Finding 2 — Safer defaults | ✅ Done | MAX_PER_KEY reduced from 200 to 10 in docker-compose.yml with extensive blast-radius documentation.
Finding 3 — JWT identity extractor | ✅ Done | _create_jwt_identity_extractor() implemented and wired into lifespan. Blocked by the same startup crash.

🔴 Critical Bugs

Bug 1 — NameError in check_permission() breaks all RBAC checks

File: permission_service.py:127, 137 | Severity: Critical | CWE: CWE-755

token_teams was removed from the check_permission() signature but is still referenced in the function body. This causes a NameError on every call, making every RBAC-protected endpoint return 403. The application fails closed, but is entirely non-functional.


Bug 2 — TypeError at startup crashes the application

File: mcp_session_pool.py:2115–2132 | Severity: Critical | CWE: CWE-453

init_mcp_session_pool() is missing the max_total_keys and max_total_sessions parameters. When main.py passes them at startup, a TypeError is raised and the application fails to boot entirely.
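
The failure mode can be reproduced in isolation. The two functions below are hypothetical stand-ins for the initializer before and after the fix, with simplified signatures; only the keyword names are taken from the review.

```python
def init_pool_before(*, max_per_key=10):
    """Stand-in for the initializer that lacks the new parameters."""
    return {"max_per_key": max_per_key}


def init_pool_after(*, max_per_key=10, max_total_keys=0, max_total_sessions=0):
    """Stand-in for the initializer once the parameters are accepted."""
    return {
        "max_per_key": max_per_key,
        "max_total_keys": max_total_keys,
        "max_total_sessions": max_total_sessions,
    }


settings = {"max_per_key": 10, "max_total_keys": 1000, "max_total_sessions": 5000}

try:
    init_pool_before(**settings)
    startup_error = None
except TypeError as exc:
    # Python rejects keyword arguments the callee does not declare.
    startup_error = str(exc)

config = init_pool_after(**settings)
```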


⚠️ Medium / Low Findings

Finding 3 — JWT decoded without signature verification

File: main.py:1539 | Severity: Medium | CWE: CWE-345

JWT is decoded without signature verification. This is acceptable for session pool bucketing (not a security boundary), but crafted JWTs could influence pool key selection. Add a defensive comment or restrict to JWT-authenticated requests only.


Finding 4 — Dead code: duplicate narrowed-admin guards

File: tokens.py:557–559, 654–657 | Severity: Low | CWE: CWE-561

Duplicate narrowed-admin guards — the second check is dead code. The first guard at L555/L651 already catches token_teams is not None. Test assertions will match the wrong error message.


Finding 5 — RuntimeError masked as asyncio.TimeoutError

File: mcp_session_pool.py:840 | Severity: Low | CWE: CWE-390

RuntimeError is converted to asyncio.TimeoutError, masking capacity exhaustion as a timeout in monitoring.


Invariant Violations

  • Two-layer model broken: Layer 2 (check_permission) is entirely non-functional due to the NameError (Bug 1). Even after a fix, removing token_teams from Layer 2 means public-only admin tokens get an admin bypass — this contradicts the security comment at permission_service.py:124–126. Either fully remove the dead token_teams logic from the body, or re-add the parameter.

  • Deny-path gap: No integration test calls check_permission() without mocking, so the NameError is hidden. The existing test mocks the service, concealing this bug entirely.


Redundant Code to Remove

# | File | Line(s) | Type | Description | Suggestion
1 | tokens.py | 557–560 | Dead code | Second narrowed-admin guard after first guard already catches same condition | Remove lines 557–560
2 | tokens.py | 654–657 | Dead code | Same duplicate guard in admin_revoke_token | Remove lines 654–657
3 | permission_service.py | 124–137 | Unreachable / broken | token_teams logic references undefined variable; if fixed by removing, the whole block is dead code since no caller passes token_teams anymore | Either remove the block or re-add parameter forwarding
4 | test_main_extended.py | 228–229 | Broken structure | TestConditionalPaths emptied to just pass; its methods are now accidentally inside TestJwtIdentityExtractor | Restore TestConditionalPaths body properly

@marekdano
Collaborator Author

Thanks for your review, @Lang-Akshay!

Fixes:

  • Bug 1: token_teams restored to check_permission() signature and body — NameError resolved
  • Bug 2: max_total_keys / max_total_sessions added to init_mcp_session_pool() and forwarded in main.py — startup crash resolved
  • Finding 3: Defensive SECURITY NOTE comment added around the verify_signature=False decode in main.py:1535
  • Finding 4: Dead duplicate guards removed from tokens.py — both list_all_tokens and admin_revoke_token now use clean two-step checks
  • Finding 5: Explanatory comment added at mcp_session_pool.py:839 for the intentional RuntimeError → TimeoutError conversion
  • TestConditionalPaths: Restored the 5 original test methods (test_import_uses_rust_mcp_proxy_when_enabled, test_import_keeps_python_transport_, test_import_warns_, test_redis_initialization_path, test_event_loop_task_creation) that had been replaced with pass

@marekdano marekdano requested a review from Lang-Akshay March 31, 2026 17:52
@marekdano marekdano added this to the Release 1.0.0-RC3 milestone Apr 1, 2026
@marekdano marekdano force-pushed the 3777-fix-session-pool-resource-exhaustion branch from 80d7953 to 22d931c Compare April 1, 2026 14:38
Collaborator

@Lang-Akshay Lang-Akshay left a comment

Thanks @marekdano for the contribution! Please implement the following two changes:


1. Enable JWT identity extraction by default (MCP_SESSION_POOL_JWT_IDENTITY_EXTRACTION=true). This is the most impactful change.


2. Incorrect use of built-in callable instead of Callable from typing

Location

main.py, line 1519

Current Code

def _create_jwt_identity_extractor() -> callable:

Problem

Uses the lowercase built-in callable instead of the proper Callable type from the typing module.

Correct Fix

  1. Add Callable to the imports from typing on line ~42:
    from typing import Callable
  2. Update line 1519 to:
    def _create_jwt_identity_extractor() -> Callable[[dict], Optional[str]]:

@marekdano marekdano force-pushed the 3777-fix-session-pool-resource-exhaustion branch from 11096f9 to 20e4c4f Compare April 2, 2026 13:50
@marekdano marekdano requested a review from Lang-Akshay April 2, 2026 14:44
@crivetimihai crivetimihai self-assigned this Apr 6, 2026
Marek Dano added 6 commits April 6, 2026 11:18
Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
…allable from typing

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
@crivetimihai crivetimihai force-pushed the 3777-fix-session-pool-resource-exhaustion branch from 20e4c4f to 93f237e Compare April 6, 2026 12:21
crivetimihai previously approved these changes Apr 6, 2026
Member

@crivetimihai crivetimihai left a comment

Thanks @marekdano for a solid contribution — the blast-radius analysis in the PR description is excellent and the multi-layered mitigation approach is well-thought-out. I've rebased this onto main and made a few fixes on top (commit 11488dc65). Here's what I changed and why:

Changes made during review

1. Restored token_teams forwarding in PermissionChecker.has_permission() (rbac.py)

The PR removed token_teams from has_permission() but left it in has_any_permission() and the @require_permission decorator. This created an inconsistency: a public-only admin token (token_teams=[]) going through the RPC path (main.py:1013 via PermissionChecker.has_permission) would skip the admin-bypass suppression in check_permission(), because token_teams would default to None (unrestricted).

The @require_permission decorator still passes token_teams at lines 749/764, so the two-layer model was already being enforced for HTTP endpoints — just not for the PermissionChecker manual path. Restored the parameter to keep all paths consistent.

Also restored the corresponding tests in test_rbac.py and test_permission_service.py that verify public-only tokens suppress admin bypass.

2. Guarded release() against RuntimeError from _get_or_create_pool() (mcp_session_pool.py)

release() calls _get_or_create_pool() in two places (lines 1034, 1065). If max_total_keys is configured and the pool key was evicted while a session was checked out, _get_or_create_pool raises RuntimeError — which would leak the session and semaphore slot. Added try/except RuntimeError to gracefully discard the session and close it instead. Added test coverage for this path.
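
A simplified picture of this guard, using a hypothetical `TinyPool` in which a plain counter stands in for the semaphore and eviction is simulated by deleting the key. It is a sketch of the pattern, not the code in mcp_session_pool.py.

```python
import asyncio


class FakeSession:
    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


class TinyPool:
    def __init__(self, max_total_keys):
        self.max_total_keys = max_total_keys
        self.idle = {}    # pool key -> idle sessions
        self.in_use = 0   # stands in for the semaphore slots

    def _get_or_create_pool(self, key):
        if key not in self.idle:
            if self.max_total_keys and len(self.idle) >= self.max_total_keys:
                raise RuntimeError("pool key limit reached")
            self.idle[key] = []
        return self.idle[key]

    async def release(self, key, session):
        try:
            self._get_or_create_pool(key).append(session)
        except RuntimeError:
            # The key was evicted and cannot be recreated: close the session
            # instead of leaking it.
            await session.close()
        finally:
            self.in_use -= 1  # always give the slot back


async def demo():
    pool = TinyPool(max_total_keys=1)
    pool._get_or_create_pool("a")
    session = FakeSession()
    pool.in_use += 1               # session checked out under key "a"
    del pool.idle["a"]             # key "a" is evicted while the session is out
    pool._get_or_create_pool("b")  # another key takes the only key slot
    await pool.release("a", session)
    return session.closed, pool.in_use


closed, in_use = asyncio.run(demo())
print(closed, in_use)  # True 0
```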

3. Fixed Bearer prefix stripping (main.py)

str.replace("Bearer ", "") replaces all occurrences, not just the prefix. Switched to startswith() + slice, which is the standard approach and handles edge cases correctly.
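
The difference shows up on any header value where the prefix text also appears later in the string. The helper name `strip_bearer` is hypothetical, and the input is contrived to make the effect visible.

```python
def strip_bearer(header_value: str) -> str:
    """Remove a leading 'Bearer ' prefix and leave the rest untouched."""
    prefix = "Bearer "
    if header_value.startswith(prefix):
        return header_value[len(prefix):]
    return header_value


odd = "Bearer abc.Bearer def"
naive = odd.replace("Bearer ", "")  # removes every occurrence
fixed = strip_bearer(odd)           # removes only the leading prefix
print(naive)  # abc.def
print(fixed)  # abc.Bearer def
```

On Python 3.9 and later, `str.removeprefix("Bearer ")` has the same effect as the slice.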

4. Fixed mixed types in metrics (mcp_session_pool.py)

max_total_keys and max_total_sessions in get_metrics() returned int or "unlimited" string depending on the value. Consumers expecting a consistent type (JSON schema validation, Prometheus exporters) would break. Changed to always return int (0 = unlimited), consistent with the config semantics.
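
With an integer in every case, a consumer can do arithmetic on the limit without first checking its type. The functions below are a hypothetical illustration with a reduced set of fields, not the actual get_metrics() output.

```python
from typing import Dict, Optional


def get_metrics(pool_keys: int, max_total_keys: int) -> Dict[str, int]:
    # The limit is always an int; 0 means unlimited, matching the config semantics.
    return {"pool_keys": pool_keys, "max_total_keys": max_total_keys}


def key_utilization(metrics: Dict[str, int]) -> Optional[float]:
    """Fraction of the key limit in use, or None when the limit is disabled."""
    limit = metrics["max_total_keys"]
    return None if limit == 0 else metrics["pool_keys"] / limit


capped = key_utilization(get_metrics(800, 1000))
uncapped = key_utilization(get_metrics(5, 0))
print(capped, uncapped)  # 0.8 None
```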

5. Removed orphaned test methods from TestJwtIdentityExtractor (test_main_extended.py)

Five methods from TestConditionalPaths (which exists on main) were accidentally duplicated inside TestJwtIdentityExtractor — an orphaned docstring """Test conditional code paths...""" at line 451 without a class statement caused the methods to become part of the wrong class. The test_client/auth_headers fixtures wouldn't resolve correctly under TestJwtIdentityExtractor. Removed the duplicates; the originals in TestConditionalPaths are untouched.

6. Added missing assertion in test_jwt_identity_extractor_exception_falls_back

The test computed identity_hash but had no assert — it passed vacuously. Added assert identity_hash != "anonymous" and await pool.close_all() for cleanup.

Everything else looks good

  • Token router security split (separate is_admin + token_teams checks) — correct and improves error messages
  • JWT decode without verification — acceptable for pool bucketing with good security comment
  • Thread safety / lock analysis — correct asyncio.Lock usage, no deadlock risk
  • identity_extractor integration — proper fallback chain in _compute_identity_hash
  • Global caps with soft-cap TOCTOU documented accurately
  • init_mcp_session_pool keyword-only * — good API hygiene
  • Test coverage is comprehensive across all new code paths

All 1457 affected tests passing.

- Restore token_teams forwarding in PermissionChecker.has_permission()
  to prevent privilege escalation for public-only admin tokens via RPC
  path (consistent with has_any_permission and @require_permission)
- Guard release() against RuntimeError from _get_or_create_pool() when
  max_total_keys limit is hit for a previously-evicted pool key
- Fix Bearer prefix stripping to use startswith() instead of replace()
- Use consistent int type for max_total_keys/max_total_sessions metrics
- Remove orphaned TestConditionalPaths methods from TestJwtIdentityExtractor
- Add missing assertion in test_jwt_identity_extractor_exception_falls_back
- Add test for release() graceful handling under max_total_keys limit
- Restore test coverage for public-only token admin bypass suppression

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai crivetimihai force-pushed the 3777-fix-session-pool-resource-exhaustion branch from 93f237e to 3c0864e Compare April 6, 2026 12:44
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai crivetimihai merged commit 0332ef2 into main Apr 6, 2026
27 checks passed
@crivetimihai crivetimihai deleted the 3777-fix-session-pool-resource-exhaustion branch April 6, 2026 13:11
jonpspri pushed a commit that referenced this pull request Apr 10, 2026
* fix: session pool resource exhaustion

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>

* fix: lint issue

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>

* fix: reported bugs by reviewer

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>

* fix: lint and test coverage issues

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>

* fix: lint issue

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>

* fix: enable JWT identity extraction by default, replace callable by Callable from typing

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>

* fix: address review findings for session pool resource exhaustion

- Restore token_teams forwarding in PermissionChecker.has_permission()
  to prevent privilege escalation for public-only admin tokens via RPC
  path (consistent with has_any_permission and @require_permission)
- Guard release() against RuntimeError from _get_or_create_pool() when
  max_total_keys limit is hit for a previously-evicted pool key
- Fix Bearer prefix stripping to use startswith() instead of replace()
- Use consistent int type for max_total_keys/max_total_sessions metrics
- Remove orphaned TestConditionalPaths methods from TestJwtIdentityExtractor
- Add missing assertion in test_jwt_identity_extractor_exception_falls_back
- Add test for release() graceful handling under max_total_keys limit
- Restore test coverage for public-only token admin bypass suppression

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* chore: update .secrets.baseline

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Marek Dano <Marek.Dano@ibm.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
claudia-gray pushed a commit that referenced this pull request Apr 13, 2026
Development

Successfully merging this pull request may close these issues.

[BUG][PERFORMANCE]: Session pool resource exhaustion — no global cap, high per-bucket limit, and rotating JWT identity explosion

3 participants