
[FEATURE][UI]: Built-in MCP server health dashboard #547

@crivetimihai

Description


Epic: Built-in MCP Server Health Dashboard

🎯 Overview

Summary

Implement a comprehensive real-time health monitoring dashboard for MCP servers, providing visibility into server status, performance metrics, error rates, and usage patterns with configurable alerting capabilities.

Problem Statement

Currently, monitoring MCP server health requires external tools or manual log analysis with several limitations:

  • No real-time visibility into server performance
  • Difficult to identify performance degradation before failures
  • No centralized view of distributed MCP server health
  • Limited historical trend analysis capabilities
  • Manual correlation between errors and system events

Solution

Create an integrated health monitoring dashboard that provides:

  • Real-time server status and connection monitoring
  • Performance metrics with historical trending
  • Error rate tracking per endpoint and server
  • Configurable alert thresholds with notifications
  • Usage pattern analysis and capacity planning
  • Federated view for multiple MCP servers

Dependencies

👥 User Stories

Story 1: Real-Time Server Status Overview

As a system administrator
I want a real-time overview of all MCP servers
So that I can quickly identify unhealthy servers and take action

Acceptance Criteria:

Scenario: View server status grid
  Given I access the health dashboard
  When I view the main dashboard
  Then I see a grid showing all MCP servers with:
    - Server name and endpoint
    - Current status (Healthy/Warning/Critical/Offline)
    - Response time (last 1 min average)
    - Active connections count
    - Last health check timestamp

Scenario: Server goes offline
  Given a server was healthy
  When the server fails 3 consecutive health checks
  Then the status changes to "Offline"
  And the card turns red
  And an alert is triggered if configured

Scenario: Quick server actions
  Given I see an unhealthy server
  When I click on the server card
  Then I see detailed diagnostics
  And I can perform actions:
    - Force health check
    - Disable server temporarily
    - View recent logs

UI Mockup:

┌─────────────────────────────────────────────────────────────────────┐
│ MCP Server Health Dashboard                    Last update: 2s ago  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│ Overview: 12 Healthy | 2 Warning | 1 Critical | 0 Offline          │
│                                                                     │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐       │
│ │ ✅ Tool Server  │ │ ⚠️ Resource API │ │ ❌ Legacy Bridge │       │
│ │ tools.mcp.local │ │ res.mcp.local   │ │ bridge.mcp.local│       │
│ │                 │ │                 │ │                 │       │
│ │ Response: 45ms  │ │ Response: 850ms │ │ Response: ---   │       │
│ │ Connections: 23 │ │ Connections: 5  │ │ Connections: 0  │       │
│ │ Uptime: 15d 3h  │ │ Uptime: 2d 14h  │ │ Uptime: ---     │       │
│ │                 │ │                 │ │                 │       │
│ │ [View Details]  │ │ [View Details]  │ │ [View Details]  │       │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘       │
│                                                                     │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐       │
│ │ ✅ Prompt Store │ │ ✅ Auth Service │ │ ✅ Vector DB    │       │
│ │ prompts.local   │ │ auth.mcp.local  │ │ vectors.local   │       │
│ │                 │ │                 │ │                 │       │
│ │ Response: 32ms  │ │ Response: 28ms  │ │ Response: 156ms │       │
│ │ Connections: 8  │ │ Connections: 45 │ │ Connections: 12 │       │
│ │ Uptime: 45d 2h  │ │ Uptime: 45d 2h  │ │ Uptime: 8d 19h  │       │
│ │                 │ │                 │ │                 │       │
│ │ [View Details]  │ │ [View Details]  │ │ [View Details]  │       │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘       │
└─────────────────────────────────────────────────────────────────────┘

Story 2: Performance Metrics Visualization

As a performance engineer
I want detailed performance metrics with historical trends
So that I can identify performance degradation and optimize server configuration

Acceptance Criteria:

Scenario: View response time trends
  Given I select a server from the dashboard
  When I click "Performance" tab
  Then I see time-series graphs showing:
    - Response time (p50, p95, p99)
    - Requests per second
    - Active connections over time
    - CPU and memory usage (if available)

Scenario: Compare time periods
  Given I'm viewing performance metrics
  When I select "Compare with last week"
  Then I see overlay graphs comparing:
    - Current period vs previous period
    - Percentage change indicators
    - Anomaly highlights

Scenario: Drill down to endpoint level
  Given I'm viewing server performance
  When I click "Endpoint Breakdown"
  Then I see metrics per endpoint:
    - /tools/list average response time
    - /tools/call average response time
    - Error rates per endpoint
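
The p50/p95/p99 series in the graphs can be derived from raw response-time samples with the standard library; the function name is illustrative:

```python
# Sketch: compute the p50/p95/p99 response-time percentiles shown in the
# time-series graphs from a window of raw samples (stdlib only).
from statistics import median, quantiles


def response_time_percentiles(samples_ms: list[float]) -> dict[str, float]:
    # quantiles(n=100) returns the 99 interior cut points p1..p99
    q = quantiles(samples_ms, n=100)
    return {"p50": median(samples_ms), "p95": q[94], "p99": q[98]}
```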

UI Mockup:

┌─────────────────────────────────────────────────────────────────────┐
│ Tool Server - Performance Metrics              [1h][6h][24h][7d]    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│ Response Time (last 24h)                    📊 Compare with: [---] │
│ ┌───────────────────────────────────────────────────────────────┐   │
│ │     ms                                                         │   │
│ │    200│                                      ╱╲                │   │
│ │    150│                  ╱╲                 ╱  ╲   p99         │   │
│ │    100│      ╱╲         ╱  ╲___________╱╲_╱    ╲              │   │
│ │     50│─────╱──╲───────╱────────────────────────╲── p95       │   │
│ │     25│─────────────────────────────────────────── p50        │   │
│ │      0└────┴────┴────┴────┴────┴────┴────┴────┴────┴          │   │
│ │       00:00  04:00  08:00  12:00  16:00  20:00  24:00        │   │
│ └───────────────────────────────────────────────────────────────┘   │
│                                                                     │
│ Requests/Second                          Active Connections         │
│ ┌─────────────────────────────┐         ┌─────────────────────────┐ │
│ │ req/s                       │         │ count                   │ │
│ │   500│    ╱╲    ╱╲         │         │    50│      ___        │ │
│ │   250│   ╱  ╲__╱  ╲___     │         │    25│  ___╱   ╲___    │ │
│ │     0└───┴────┴────┴───    │         │     0└──┴────┴────┴──  │ │
│ └─────────────────────────────┘         └─────────────────────────┘ │
│                                                                     │
│ Endpoint Breakdown (last hour)                                      │
│ ┌─────────────────────┬──────────┬─────────┬──────────┬─────────┐  │
│ │ Endpoint            │ Avg (ms) │ P95 (ms)│ Req/min  │ Errors  │  │
│ ├─────────────────────┼──────────┼─────────┼──────────┼─────────┤  │
│ │ /tools/list         │ 23       │ 45      │ 120      │ 0       │  │
│ │ /tools/call         │ 156      │ 234     │ 450      │ 2 (0.4%)│  │
│ │ /protocol/initialize│ 12       │ 18      │ 15       │ 0       │  │
│ └─────────────────────┴──────────┴─────────┴──────────┴─────────┘  │
└─────────────────────────────────────────────────────────────────────┘

Story 3: Error Tracking and Analysis

As a DevOps engineer
I want detailed error tracking with root cause analysis
So that I can quickly identify and resolve issues

Acceptance Criteria:

Scenario: View error trends
  Given I access the errors tab
  When I view the error dashboard
  Then I see:
    - Error rate trends over time
    - Top error types with counts
    - Error distribution by server
    - Recent error samples with stack traces

Scenario: Error correlation
  Given an error spike occurs
  When I click on the spike in the graph
  Then I see:
    - All errors in that time window
    - Correlated events (deployments, config changes)
    - Affected endpoints and servers
    - Similar historical incidents

Scenario: Error alerting
  Given I configure an alert threshold
  When error rate exceeds 5% for 5 minutes
  Then an alert is triggered
  And I receive notification via configured channel
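
The "error rate exceeds 5% for 5 minutes" rule can be evaluated with a rolling window of request outcomes; this is a hypothetical helper, not the gateway's actual alert engine:

```python
# Sketch: fire an alert when the rolling error rate stays above the
# threshold for the whole evaluation period. Hypothetical helper class.
from collections import deque


class ErrorRateAlert:
    def __init__(self, threshold: float = 0.05, period_s: float = 300.0):
        self.threshold = threshold
        self.period_s = period_s
        self.events: deque = deque()  # (timestamp, ok) pairs
        self.breach_start = None      # when the rate first exceeded threshold

    def record(self, now: float, ok: bool) -> bool:
        """Record one request outcome; return True if the alert fires."""
        self.events.append((now, ok))
        # Drop outcomes older than the evaluation window
        while self.events and self.events[0][0] < now - self.period_s:
            self.events.popleft()
        errors = sum(1 for _, e_ok in self.events if not e_ok)
        rate = errors / len(self.events)
        if rate > self.threshold:
            if self.breach_start is None:
                self.breach_start = now
            # Fire only after the breach has persisted for the full period
            return now - self.breach_start >= self.period_s
        self.breach_start = None
        return False
```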

Story 4: Usage Pattern Analysis

As a capacity planner
I want usage pattern insights
So that I can optimize resource allocation and predict scaling needs

Acceptance Criteria:

Scenario: View usage patterns
  Given I access the usage analytics
  When I select a 30-day view
  Then I see:
    - Peak usage times (hourly/daily)
    - Most used endpoints
    - Client distribution
    - Protocol version usage

Scenario: Capacity forecasting
  Given historical usage data exists
  When I view capacity planning
  Then I see:
    - Growth trends
    - Predicted capacity needs
    - Resource utilization patterns
    - Scaling recommendations

Story 5: Alert Configuration and Management

As an operations manager
I want configurable alerts for various health metrics
So that I can be proactively notified of issues

Acceptance Criteria:

Scenario: Configure metric alert
  Given I access alert configuration
  When I create a new alert for "Response Time > 500ms"
  And I set evaluation period to "5 minutes"
  Then the alert is created
  And it monitors all servers by default

Scenario: Alert notification channels
  Given I have configured alerts
  When I set up notification channels
  Then I can choose:
    - Email notifications
    - Webhook calls
    - Admin UI notifications
    - Log entries

Scenario: Alert history
  Given alerts have been triggered
  When I view alert history
  Then I see:
    - All triggered alerts with timestamps
    - Resolution status
    - Actions taken
    - Related metrics at alert time

Story 6: Federated Health View

As a platform administrator
I want a unified view of health across federated gateways
So that I can monitor the entire MCP ecosystem

Acceptance Criteria:

Scenario: View federated health
  Given multiple gateways are federated
  When I access federated view
  Then I see:
    - Health status per gateway
    - Cross-gateway metrics
    - Federation link health
    - Global error rates

Scenario: Drill down to specific gateway
  Given I'm in federated view
  When I click on a gateway
  Then I see that gateway's detailed dashboard
  And I can navigate back to federated view

📊 Architecture

flowchart TB
    subgraph "Data Collection Layer"
        MS1[MCP Server 1] -->|Metrics| MC[Metrics Collector]
        MS2[MCP Server 2] -->|Metrics| MC
        MS3[MCP Server N] -->|Metrics| MC
        
        MC -->|Store| TS[(Time Series DB)]
        MC -->|Real-time| WS[WebSocket Server]
    end
    
    subgraph "Health Check System"
        HC[Health Checker] -->|Probe| MS1
        HC -->|Probe| MS2
        HC -->|Probe| MS3
        HC -->|Status| HS[(Health Status)]
        HC -->|Alerts| AS[Alert Service]
    end
    
    subgraph "Analytics Engine"
        TS -->|Query| AE[Analytics Engine]
        AE -->|Patterns| ML[ML Analyzer]
        AE -->|Trends| TP[Trend Processor]
        ML -->|Anomalies| AS
    end
    
    subgraph "Dashboard UI"
        WS -->|Live Data| UI[Dashboard UI]
        HS -->|Status| UI
        AE -->|Historical| UI
        AS -->|Alerts| UI
        
        UI -->|Display| OV[Overview Grid]
        UI -->|Display| PM[Performance Metrics]
        UI -->|Display| ER[Error Reports]
        UI -->|Display| UA[Usage Analytics]
    end
    
    subgraph "Alert Channels"
        AS -->|Send| EMAIL[Email]
        AS -->|Send| WEBHOOK[Webhooks]
        AS -->|Send| LOG[Audit Logs]
    end
    
    style MC fill:#90EE90
    style HS fill:#87CEEB
    style AS fill:#FFB6C1
    style UI fill:#DDA0DD
    style ML fill:#FFD700

🏗️ Technical Design

Database Schema

-- Server health status
CREATE TABLE server_health (
    id INTEGER PRIMARY KEY,
    server_id VARCHAR(255) NOT NULL,
    server_name VARCHAR(255) NOT NULL,
    endpoint_url TEXT NOT NULL,
    status VARCHAR(20) NOT NULL, -- healthy, warning, critical, offline
    last_check TIMESTAMP NOT NULL,
    response_time_ms INTEGER,
    error_message TEXT,
    metadata JSON
);
-- Indexes declared separately for SQLite/PostgreSQL portability
-- (inline INDEX clauses inside CREATE TABLE are MySQL-specific)
CREATE INDEX idx_server_status ON server_health (server_id, status);
CREATE INDEX idx_last_check ON server_health (last_check);

-- Time series metrics
CREATE TABLE server_metrics (
    id INTEGER PRIMARY KEY,
    server_id VARCHAR(255) NOT NULL,
    metric_name VARCHAR(100) NOT NULL,
    metric_value DOUBLE PRECISION NOT NULL,
    tags JSON,
    timestamp TIMESTAMP NOT NULL
);
CREATE INDEX idx_server_time ON server_metrics (server_id, timestamp);
CREATE INDEX idx_metric_time ON server_metrics (metric_name, timestamp);

-- Error tracking
CREATE TABLE error_events (
    id INTEGER PRIMARY KEY,
    server_id VARCHAR(255) NOT NULL,
    endpoint VARCHAR(255),
    error_type VARCHAR(100),
    error_message TEXT,
    stack_trace TEXT,
    request_id VARCHAR(100),
    client_info JSON,
    timestamp TIMESTAMP NOT NULL
);
CREATE INDEX idx_server_errors ON error_events (server_id, timestamp);
CREATE INDEX idx_error_type ON error_events (error_type, timestamp);

-- Alert configuration
CREATE TABLE alert_rules (
    id INTEGER PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    description TEXT,
    metric_name VARCHAR(100) NOT NULL,
    condition VARCHAR(20) NOT NULL, -- gt, lt, eq, gte, lte
    threshold DOUBLE PRECISION NOT NULL,
    evaluation_period INTEGER NOT NULL, -- seconds
    servers JSON, -- null means all servers
    enabled BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Alert history
CREATE TABLE alert_history (
    id INTEGER PRIMARY KEY,
    rule_id INTEGER REFERENCES alert_rules(id),
    server_id VARCHAR(255),
    triggered_at TIMESTAMP NOT NULL,
    resolved_at TIMESTAMP,
    metric_value DOUBLE PRECISION,
    notification_sent BOOLEAN DEFAULT FALSE
);
CREATE INDEX idx_triggered ON alert_history (triggered_at);
CREATE INDEX idx_server_alerts ON alert_history (server_id, triggered_at);
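
For local development the schema can be smoke-tested against SQLite, where VARCHAR/JSON column types degrade gracefully to SQLite's type affinities. A minimal sketch (hypothetical sample data):

```python
# Minimal sketch: exercise the server_health table with in-memory SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE server_health (
        id INTEGER PRIMARY KEY,
        server_id VARCHAR(255) NOT NULL,
        server_name VARCHAR(255) NOT NULL,
        endpoint_url TEXT NOT NULL,
        status VARCHAR(20) NOT NULL,
        last_check TIMESTAMP NOT NULL,
        response_time_ms INTEGER,
        error_message TEXT,
        metadata JSON
    )
""")
conn.execute("CREATE INDEX idx_server_status ON server_health (server_id, status)")

# Record one health-check result and read back the latest status
conn.execute(
    "INSERT INTO server_health (server_id, server_name, endpoint_url, status,"
    " last_check, response_time_ms) VALUES (?, ?, ?, ?, datetime('now'), ?)",
    ("tool-server", "Tool Server", "http://tools.mcp.local", "healthy", 45),
)
row = conn.execute(
    "SELECT status, response_time_ms FROM server_health"
    " WHERE server_id = ? ORDER BY last_check DESC LIMIT 1",
    ("tool-server",),
).fetchone()
```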

Metrics Collection Configuration

# health_config.py
from pydantic import Field
from pydantic_settings import BaseSettings  # pydantic v2; in v1, import BaseSettings from pydantic

class HealthMonitoringConfig(BaseSettings):
    # Collection intervals
    health_check_interval: int = Field(default=30, description="Health check interval in seconds")
    metrics_collection_interval: int = Field(default=10, description="Metrics collection interval in seconds")
    
    # Retention policies
    metrics_retention_days: int = Field(default=30, description="Days to retain metrics")
    error_retention_days: int = Field(default=90, description="Days to retain error logs")
    
    # Health check configuration
    health_check_timeout: int = Field(default=5, description="Health check timeout in seconds")
    health_check_retries: int = Field(default=3, description="Number of retries before marking offline")
    
    # Thresholds
    response_time_warning_ms: int = Field(default=500, description="Warning threshold")
    response_time_critical_ms: int = Field(default=1000, description="Critical threshold")
    error_rate_warning_percent: float = Field(default=1.0, description="Warning error rate")
    error_rate_critical_percent: float = Field(default=5.0, description="Critical error rate")
    
    # Dashboard settings
    dashboard_refresh_interval: int = Field(default=5, description="Dashboard refresh in seconds")
    max_timeline_points: int = Field(default=1000, description="Max data points in timeline")
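
The warning/critical response-time thresholds above imply a simple classification rule; a stdlib sketch using the default values (function name is illustrative):

```python
# Sketch: map a measured response time to a health status using the
# configured thresholds (defaults mirror the config above).
def classify_response_time(ms: int, warning_ms: int = 500, critical_ms: int = 1000) -> str:
    if ms >= critical_ms:
        return "critical"
    if ms >= warning_ms:
        return "warning"
    return "healthy"
```

This matches the mockup: 45ms is healthy, 850ms is a warning.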

Metric Definitions

# metrics.py
HEALTH_METRICS = {
    "response_time": {
        "type": "histogram",
        "unit": "milliseconds",
        "buckets": [10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
    },
    "request_rate": {
        "type": "counter",
        "unit": "requests/second"
    },
    "active_connections": {
        "type": "gauge",
        "unit": "connections"
    },
    "error_rate": {
        "type": "counter",
        "unit": "errors/second",
        "labels": ["error_type", "endpoint"]
    },
    "cpu_usage": {
        "type": "gauge",
        "unit": "percent"
    },
    "memory_usage": {
        "type": "gauge",
        "unit": "megabytes"
    }
}

# Health check endpoints
HEALTH_CHECK_ENDPOINTS = {
    "basic": "/health",
    "detailed": "/health/detailed",
    "mcp_ping": "/protocol/ping"
}
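
The Prometheus export mentioned under Standards Compliance could render these definitions in the text exposition format. A sketch covering gauges only (histograms additionally need `_bucket`/`_sum`/`_count` series); the `mcp_` prefix and function name are assumptions:

```python
# Sketch: render gauge metrics in the Prometheus text exposition format.
HEALTH_METRICS = {
    "active_connections": {"type": "gauge", "unit": "connections"},
    "cpu_usage": {"type": "gauge", "unit": "percent"},
}


def render_prometheus(values: dict) -> str:
    lines = []
    for name, value in values.items():
        lines.append(f"# TYPE mcp_{name} {HEALTH_METRICS[name]['type']}")
        lines.append(f"mcp_{name} {value}")
    return "\n".join(lines) + "\n"
```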

🛠️ Implementation Tasks

Phase 1: Core Infrastructure

  • Create database schema for health metrics
  • Implement metrics collector service
  • Build health check probe system
  • Create time-series data storage layer
  • Implement WebSocket server for real-time updates
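
The WebSocket fan-out in this phase can be prototyped framework-agnostically as an asyncio broadcaster: each dashboard client subscribes to a queue, and every metrics update is pushed to all queues. A hypothetical sketch:

```python
# Sketch: fan out live health updates to all connected dashboard clients.
# In the real service each queue would feed one WebSocket connection.
import asyncio


class Broadcaster:
    def __init__(self) -> None:
        self.subscribers: set = set()

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self.subscribers.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self.subscribers.discard(q)

    def publish(self, update: dict) -> None:
        # Non-blocking push to every subscriber's queue
        for q in self.subscribers:
            q.put_nowait(update)
```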

Phase 2: Data Collection

  • Add metrics instrumentation to MCP endpoints
  • Implement health check scheduler
  • Create error event collector
  • Build metrics aggregation pipeline
  • Add data retention policies

Phase 3: Dashboard UI

  • Create main dashboard layout
  • Implement server status grid component
  • Build real-time metric charts
  • Add WebSocket client for live updates
  • Create responsive mobile layout

Phase 4: Analytics Features

  • Implement performance trending
  • Add error correlation engine
  • Create usage pattern analyzer
  • Build capacity forecasting
  • Add anomaly detection

Phase 5: Alerting System

  • Create alert rule configuration UI
  • Implement alert evaluation engine
  • Add notification channels
  • Build alert history viewer
  • Create alert suppression logic

Phase 6: Federation Support

  • Add cross-gateway health aggregation
  • Implement federated dashboard view
  • Create gateway topology visualization
  • Add federation link monitoring

📋 Acceptance Criteria

Performance Requirements

  • Dashboard loads in < 2 seconds
  • Real-time updates with < 5 second delay
  • Support 100+ monitored servers
  • Metrics query response < 500ms
  • Minimal overhead on monitored servers (< 1% CPU)

Functionality

  • All health metrics collected accurately
  • Historical data retained per policy
  • Alerts trigger within evaluation period
  • Error correlation works across servers
  • Federation view shows all gateways

User Experience

  • Intuitive navigation between views
  • Clear visual health indicators
  • Responsive on mobile devices
  • Exportable metrics and reports
  • Customizable dashboard layouts

🚫 Out of Scope

  • Log aggregation (separate feature)
  • Distributed tracing
  • Application Performance Monitoring (APM)
  • Infrastructure monitoring (CPU, disk, network)
  • Custom metric definitions

📊 Success Metrics

  • 99.9% health check reliability
  • < 5 minute MTTR improvement
  • 90% of issues detected before user impact
  • 100% critical alerts delivered

🔗 Standards Compliance

  • ✅ Uses standard MCP health endpoints
  • ✅ Compatible with existing monitoring tools
  • ✅ Exports metrics in Prometheus format
  • ✅ Follows OpenTelemetry standards

📝 Notes

  • Consider integration with existing monitoring stacks
  • Plan for high-cardinality metrics
  • Implement gradual rollout for federation
  • Document performance impact on servers
  • Create runbooks for common alerts

Metadata

Assignees: none

Labels: COULD (P3: nice-to-have, included if time permits), enhancement, frontend, python, ui

Milestone: none · Projects: none · Relationships: none · Development: no branches or pull requests