[FEATURE]: Export MCP session pool metrics to Prometheus #2118

@crivetimihai

Description

Summary

The MCP session pool tracks internal metrics (_hits, _misses, _evictions, _health_check_failures) but does not expose them to Prometheus for monitoring.

Current State

Location: mcpgateway/services/mcp_session_pool.py

The session pool maintains these metrics internally:

class MCPSessionPool:
    def __init__(self):
        # Internal counters (not exposed)
        self._hits = 0
        self._misses = 0
        self._evictions = 0
        self._health_check_failures = 0
        self._created_sessions = 0
        self._active_sessions = 0

These are valuable for understanding:

  • Session reuse efficiency (hit rate)
  • Pool pressure (miss rate, evictions)
  • Upstream MCP server health (health check failures)
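As a rough illustration of the first two points, the hit rate can be derived from the raw counters alone. This is a minimal sketch: `PoolCounters` is a hypothetical snapshot type, not part of `mcp_session_pool.py`, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolCounters:
    """Hypothetical read-only snapshot of the pool's internal counters."""

    hits: int
    misses: int
    evictions: int
    health_check_failures: int

    @property
    def hit_rate(self) -> float:
        """Fraction of lookups served from the pool (0.0 if there were none)."""
        lookups = self.hits + self.misses
        return self.hits / lookups if lookups else 0.0


# Made-up values: 90 of 100 lookups reused an existing session.
snapshot = PoolCounters(hits=90, misses=10, evictions=3, health_check_failures=1)
print(f"hit rate: {snapshot.hit_rate:.0%}")
```

Without an exporter, a value like this is only visible from inside the process, which is the gap this issue is about.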

Proposed Solution

Add Prometheus metrics for session pool observability:

from prometheus_client import Counter, Gauge

# Counters
mcp_session_pool_hits = Counter(
    "mcp_session_pool_hits_total",
    "Number of session pool cache hits",
    ["gateway_id"]
)
mcp_session_pool_misses = Counter(
    "mcp_session_pool_misses_total",
    "Number of session pool cache misses",
    ["gateway_id"]
)
mcp_session_pool_evictions = Counter(
    "mcp_session_pool_evictions_total",
    "Number of sessions evicted from pool",
    ["gateway_id", "reason"]  # reason: ttl_expired, circuit_breaker, manual
)
mcp_session_pool_health_check_failures = Counter(
    "mcp_session_pool_health_check_failures_total",
    "Number of session health check failures",
    ["gateway_id"]
)

# Gauges
mcp_session_pool_active_sessions = Gauge(
    "mcp_session_pool_active_sessions",
    "Number of currently active sessions in pool",
    ["gateway_id"]
)
mcp_session_pool_size = Gauge(
    "mcp_session_pool_size",
    "Total number of sessions in pool",
    ["gateway_id"]
)
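To see what a labeled counter looks like once scraped, the sketch below registers the eviction counter on a private `CollectorRegistry` and renders the text exposition format. The gateway ID `gw-1` is made up, and the private registry is an assumption made only to keep the example isolated; the gateway itself may well use the default registry.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# A private registry keeps this sketch separate from the process-wide default.
registry = CollectorRegistry()

evictions = Counter(
    "mcp_session_pool_evictions_total",
    "Number of sessions evicted from pool",
    ["gateway_id", "reason"],
    registry=registry,
)

# "gw-1" is a made-up gateway ID; the reasons come from the comment above.
evictions.labels(gateway_id="gw-1", reason="ttl_expired").inc()
evictions.labels(gateway_id="gw-1", reason="ttl_expired").inc()
evictions.labels(gateway_id="gw-1", reason="manual").inc()

# Render the text format that a /metrics endpoint would serve.
exposition = generate_latest(registry).decode("utf-8")
print(exposition)
```

Each distinct (`gateway_id`, `reason`) pair becomes its own time series, so keeping `reason` to a small fixed set of values keeps cardinality bounded.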

Usage Example

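async def get_session(self, gateway_id: str, ...):
    # `key` is the pool lookup key; how it is derived is elided here
    if cached := self._get_from_pool(key):
        # Hit: an existing session is reused
        mcp_session_pool_hits.labels(gateway_id=gateway_id).inc()
        return cached

    # Miss: a new session has to be created
    mcp_session_pool_misses.labels(gateway_id=gateway_id).inc()
    session = await self._create_session(...)
    return session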
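A self-contained version of the same idea might look like the following. `SketchPool` is a hypothetical stand-in for `MCPSessionPool`: it keys sessions by gateway ID only, creates placeholder objects instead of real MCP sessions, and uses a private registry so it can run in isolation.

```python
import asyncio

from prometheus_client import CollectorRegistry, Counter, Gauge

registry = CollectorRegistry()  # isolated registry, an assumption of this sketch
hits = Counter("mcp_session_pool_hits_total",
               "Number of session pool cache hits",
               ["gateway_id"], registry=registry)
misses = Counter("mcp_session_pool_misses_total",
                 "Number of session pool cache misses",
                 ["gateway_id"], registry=registry)
pool_size = Gauge("mcp_session_pool_size",
                  "Total number of sessions in pool",
                  ["gateway_id"], registry=registry)


class SketchPool:
    """Hypothetical stand-in for MCPSessionPool, keyed by gateway ID only."""

    def __init__(self) -> None:
        self._sessions: dict[str, object] = {}

    async def _create_session(self, gateway_id: str) -> object:
        await asyncio.sleep(0)  # placeholder for real connection setup
        return object()

    async def get_session(self, gateway_id: str) -> object:
        cached = self._sessions.get(gateway_id)
        if cached is not None:
            hits.labels(gateway_id=gateway_id).inc()
            return cached

        misses.labels(gateway_id=gateway_id).inc()
        session = await self._create_session(gateway_id)
        self._sessions[gateway_id] = session
        pool_size.labels(gateway_id=gateway_id).inc()  # one more pooled session
        return session


async def main() -> None:
    pool = SketchPool()
    first = await pool.get_session("gw-1")   # miss: creates the session
    second = await pool.get_session("gw-1")  # hit: reuses it
    assert first is second


asyncio.run(main())
```

Incrementing the gauge at the point where the session is stored, rather than recomputing it elsewhere, is one way to keep it in step with the actual pool state; a real implementation would also need to decrement it on eviction.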

Grafana Dashboard Queries

# Session pool hit rate
sum(rate(mcp_session_pool_hits_total[5m])) / 
(sum(rate(mcp_session_pool_hits_total[5m])) + sum(rate(mcp_session_pool_misses_total[5m])))

# Health check failure rate
sum(rate(mcp_session_pool_health_check_failures_total[5m])) by (gateway_id)

# Active sessions by gateway
mcp_session_pool_active_sessions
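The hit rate query divides two per-second rates taken over the same window, so the window length cancels and the result is simply hits divided by total lookups within that window. The sketch below reproduces that arithmetic from two counter readings. `Scrape` and `windowed_hit_rate` are illustrative names, the sample values are made up, and counter resets are ignored for brevity.

```python
from typing import NamedTuple, Optional


class Scrape(NamedTuple):
    """Counter values observed at one scrape (hypothetical sample data)."""

    hits: float
    misses: float


def windowed_hit_rate(earlier: Scrape, later: Scrape) -> Optional[float]:
    """Hit rate over the interval between two scrapes.

    Returns None when no lookups happened in the window, which is the
    case where the PromQL ratio evaluates 0/0 and yields NaN.
    """
    hit_delta = later.hits - earlier.hits
    miss_delta = later.misses - earlier.misses
    total = hit_delta + miss_delta
    if total <= 0:
        return None
    return hit_delta / total


# 75 new hits and 25 new misses between the two scrapes.
print(windowed_hit_rate(Scrape(hits=100, misses=20), Scrape(hits=175, misses=45)))
```

Dashboards that show this ratio may want to handle the idle case explicitly, since a panel with no traffic would otherwise display no value.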

Acceptance Criteria

  • Session pool metrics exposed at /metrics endpoint
  • Metrics labeled by gateway_id for per-gateway visibility
  • Hit/miss/eviction counters increment correctly
  • Active session gauge reflects actual pool state
  • Documentation for new metrics

Related


From: todo/performance-review.md - MCP session pool analysis

Labels

  • SHOULD - P2: Important but not vital; high-value items that are not crucial for the immediate release
  • enhancement - New feature or request
  • observability - Observability, logging, monitoring
  • performance - Performance related items
  • python - Python / backend development (FastAPI)