
[FEATURE][UI]: Built-in MCP server health dashboard #547

@crivetimihai

Description


Epic: Built-in MCP Server Health Dashboard

🎯 Overview

Summary

Implement a comprehensive real-time health monitoring dashboard for MCP servers, providing visibility into server status, performance metrics, error rates, and usage patterns with configurable alerting capabilities.

Problem Statement

Currently, monitoring MCP server health requires external tools or manual log analysis with several limitations:

  • No real-time visibility into server performance
  • Difficult to identify performance degradation before failures
  • No centralized view of distributed MCP server health
  • Limited historical trend analysis capabilities
  • Manual correlation between errors and system events

Solution

Create an integrated health monitoring dashboard that provides:

  • Real-time server status and connection monitoring
  • Performance metrics with historical trending
  • Error rate tracking per endpoint and server
  • Configurable alert thresholds with notifications
  • Usage pattern analysis and capacity planning
  • Federated view for multiple MCP servers

Dependencies

👥 User Stories

Story 1: Real-Time Server Status Overview

As a system administrator
I want a real-time overview of all MCP servers
So that I can quickly identify unhealthy servers and take action

Acceptance Criteria:

Scenario: View server status grid
  Given I access the health dashboard
  When I view the main dashboard
  Then I see a grid showing all MCP servers with:
    - Server name and endpoint
    - Current status (Healthy/Warning/Critical/Offline)
    - Response time (last 1 min average)
    - Active connections count
    - Last health check timestamp

Scenario: Server goes offline
  Given a server was healthy
  When the server fails 3 consecutive health checks
  Then the status changes to "Offline"
  And the card turns red
  And an alert is triggered if configured

Scenario: Quick server actions
  Given I see an unhealthy server
  When I click on the server card
  Then I see detailed diagnostics
  And I can perform actions:
    - Force health check
    - Disable server temporarily
    - View recent logs

UI Mockup:

┌─────────────────────────────────────────────────────────────────────┐
│ MCP Server Health Dashboard                    Last update: 2s ago  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│ Overview: 12 Healthy | 2 Warning | 1 Critical | 0 Offline          │
│                                                                     │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐       │
│ │ ✅ Tool Server  │ │ ⚠️ Resource API │ │ ❌ Legacy Bridge │       │
│ │ tools.mcp.local │ │ res.mcp.local   │ │ bridge.mcp.local│       │
│ │                 │ │                 │ │                 │       │
│ │ Response: 45ms  │ │ Response: 850ms │ │ Response: ---   │       │
│ │ Connections: 23 │ │ Connections: 5  │ │ Connections: 0  │       │
│ │ Uptime: 15d 3h  │ │ Uptime: 2d 14h  │ │ Uptime: ---     │       │
│ │                 │ │                 │ │                 │       │
│ │ [View Details]  │ │ [View Details]  │ │ [View Details]  │       │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘       │
│                                                                     │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐       │
│ │ ✅ Prompt Store │ │ ✅ Auth Service │ │ ✅ Vector DB    │       │
│ │ prompts.local   │ │ auth.mcp.local  │ │ vectors.local   │       │
│ │                 │ │                 │ │                 │       │
│ │ Response: 32ms  │ │ Response: 28ms  │ │ Response: 156ms │       │
│ │ Connections: 8  │ │ Connections: 45 │ │ Connections: 12 │       │
│ │ Uptime: 45d 2h  │ │ Uptime: 45d 2h  │ │ Uptime: 8d 19h  │       │
│ │                 │ │                 │ │                 │       │
│ │ [View Details]  │ │ [View Details]  │ │ [View Details]  │       │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘       │
└─────────────────────────────────────────────────────────────────────┘

Story 2: Performance Metrics Visualization

As a performance engineer
I want detailed performance metrics with historical trends
So that I can identify performance degradation and optimize server configuration

Acceptance Criteria:

Scenario: View response time trends
  Given I select a server from the dashboard
  When I click "Performance" tab
  Then I see time-series graphs showing:
    - Response time (p50, p95, p99)
    - Requests per second
    - Active connections over time
    - CPU and memory usage (if available)

Scenario: Compare time periods
  Given I'm viewing performance metrics
  When I select "Compare with last week"
  Then I see overlay graphs comparing:
    - Current period vs previous period
    - Percentage change indicators
    - Anomaly highlights

Scenario: Drill down to endpoint level
  Given I'm viewing server performance
  When I click "Endpoint Breakdown"
  Then I see metrics per endpoint:
    - /tools/list average response time
    - /tools/call average response time
    - Error rates per endpoint
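
The p50/p95/p99 series in the graphs can be derived from raw response-time samples with the standard library; the function name is illustrative:

```python
# Sketch: compute the p50/p95/p99 response-time percentiles shown in the
# time-series graphs from a window of raw samples (stdlib only).
from statistics import median, quantiles


def response_time_percentiles(samples_ms: list[float]) -> dict[str, float]:
    # quantiles(n=100) returns the 99 interior cut points p1..p99
    q = quantiles(samples_ms, n=100)
    return {"p50": median(samples_ms), "p95": q[94], "p99": q[98]}
```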

UI Mockup:

┌─────────────────────────────────────────────────────────────────────┐
│ Tool Server - Performance Metrics              [1h][6h][24h][7d]    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│ Response Time (last 24h)                    📊 Compare with: [---] │
│ ┌───────────────────────────────────────────────────────────────┐   │
│ │     ms                                                         │   │
│ │    200│                                      ╱╲                │   │
│ │    150│                  ╱╲                 ╱  ╲   p99         │   │
│ │    100│      ╱╲         ╱  ╲___________╱╲_╱    ╲              │   │
│ │     50│─────╱──╲───────╱────────────────────────╲── p95       │   │
│ │     25│─────────────────────────────────────────── p50        │   │
│ │      0└────┴────┴────┴────┴────┴────┴────┴────┴────┴          │   │
│ │       00:00  04:00  08:00  12:00  16:00  20:00  24:00        │   │
│ └───────────────────────────────────────────────────────────────┘   │
│                                                                     │
│ Requests/Second                          Active Connections         │
│ ┌─────────────────────────────┐         ┌─────────────────────────┐ │
│ │ req/s                       │         │ count                   │ │
│ │   500│    ╱╲    ╱╲         │         │    50│      ___        │ │
│ │   250│   ╱  ╲__╱  ╲___     │         │    25│  ___╱   ╲___    │ │
│ │     0└───┴────┴────┴───    │         │     0└──┴────┴────┴──  │ │
│ └─────────────────────────────┘         └─────────────────────────┘ │
│                                                                     │
│ Endpoint Breakdown (last hour)                                      │
│ ┌─────────────────────┬──────────┬─────────┬──────────┬─────────┐  │
│ │ Endpoint            │ Avg (ms) │ P95 (ms)│ Req/min  │ Errors  │  │
│ ├─────────────────────┼──────────┼─────────┼──────────┼─────────┤  │
│ │ /tools/list         │ 23       │ 45      │ 120      │ 0       │  │
│ │ /tools/call         │ 156      │ 234     │ 450      │ 2 (0.4%)│  │
│ │ /protocol/initialize│ 12       │ 18      │ 15       │ 0       │  │
│ └─────────────────────┴──────────┴─────────┴──────────┴─────────┘  │
└─────────────────────────────────────────────────────────────────────┘

Story 3: Error Tracking and Analysis

As a DevOps engineer
I want detailed error tracking with root cause analysis
So that I can quickly identify and resolve issues

Acceptance Criteria:

Scenario: View error trends
  Given I access the errors tab
  When I view the error dashboard
  Then I see:
    - Error rate trends over time
    - Top error types with counts
    - Error distribution by server
    - Recent error samples with stack traces

Scenario: Error correlation
  Given an error spike occurs
  When I click on the spike in the graph
  Then I see:
    - All errors in that time window
    - Correlated events (deployments, config changes)
    - Affected endpoints and servers
    - Similar historical incidents

Scenario: Error alerting
  Given I configure an alert threshold
  When error rate exceeds 5% for 5 minutes
  Then an alert is triggered
  And I receive notification via configured channel
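
The "error rate exceeds 5% for 5 minutes" rule can be evaluated with a rolling window of request outcomes; this is a hypothetical helper, not the gateway's actual alert engine:

```python
# Sketch: fire an alert when the rolling error rate stays above the
# threshold for the whole evaluation period. Hypothetical helper class.
from collections import deque


class ErrorRateAlert:
    def __init__(self, threshold: float = 0.05, period_s: float = 300.0):
        self.threshold = threshold
        self.period_s = period_s
        self.events: deque = deque()  # (timestamp, ok) pairs
        self.breach_start = None      # when the rate first exceeded threshold

    def record(self, now: float, ok: bool) -> bool:
        """Record one request outcome; return True if the alert fires."""
        self.events.append((now, ok))
        # Drop outcomes older than the evaluation window
        while self.events and self.events[0][0] < now - self.period_s:
            self.events.popleft()
        errors = sum(1 for _, e_ok in self.events if not e_ok)
        rate = errors / len(self.events)
        if rate > self.threshold:
            if self.breach_start is None:
                self.breach_start = now
            # Fire only after the breach has persisted for the full period
            return now - self.breach_start >= self.period_s
        self.breach_start = None
        return False
```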

Story 4: Usage Pattern Analysis

As a capacity planner
I want usage pattern insights
So that I can optimize resource allocation and predict scaling needs

Acceptance Criteria:

Scenario: View usage patterns
  Given I access the usage analytics
  When I select a 30-day view
  Then I see:
    - Peak usage times (hourly/daily)
    - Most used endpoints
    - Client distribution
    - Protocol version usage

Scenario: Capacity forecasting
  Given historical usage data exists
  When I view capacity planning
  Then I see:
    - Growth trends
    - Predicted capacity needs
    - Resource utilization patterns
    - Scaling recommendations

Story 5: Alert Configuration and Management

As an operations manager
I want configurable alerts for various health metrics
So that I can be proactively notified of issues

Acceptance Criteria:

Scenario: Configure metric alert
  Given I access alert configuration
  When I create a new alert for "Response Time > 500ms"
  And I set evaluation period to "5 minutes"
  Then the alert is created
  And it monitors all servers by default

Scenario: Alert notification channels
  Given I have configured alerts
  When I set up notification channels
  Then I can choose:
    - Email notifications
    - Webhook calls
    - Admin UI notifications
    - Log entries

Scenario: Alert history
  Given alerts have been triggered
  When I view alert history
  Then I see:
    - All triggered alerts with timestamps
    - Resolution status
    - Actions taken
    - Related metrics at alert time

Story 6: Federated Health View

As a platform administrator
I want a unified view of health across federated gateways
So that I can monitor the entire MCP ecosystem

Acceptance Criteria:

Scenario: View federated health
  Given multiple gateways are federated
  When I access federated view
  Then I see:
    - Health status per gateway
    - Cross-gateway metrics
    - Federation link health
    - Global error rates

Scenario: Drill down to specific gateway
  Given I'm in federated view
  When I click on a gateway
  Then I see that gateway's detailed dashboard
  And I can navigate back to federated view

📊 Architecture

flowchart TB
    subgraph "Data Collection Layer"
        MS1[MCP Server 1] -->|Metrics| MC[Metrics Collector]
        MS2[MCP Server 2] -->|Metrics| MC
        MS3[MCP Server N] -->|Metrics| MC
        
        MC -->|Store| TS[(Time Series DB)]
        MC -->|Real-time| WS[WebSocket Server]
    end
    
    subgraph "Health Check System"
        HC[Health Checker] -->|Probe| MS1
        HC -->|Probe| MS2
        HC -->|Probe| MS3
        HC -->|Status| HS[(Health Status)]
        HC -->|Alerts| AS[Alert Service]
    end
    
    subgraph "Analytics Engine"
        TS -->|Query| AE[Analytics Engine]
        AE -->|Patterns| ML[ML Analyzer]
        AE -->|Trends| TP[Trend Processor]
        ML -->|Anomalies| AS
    end
    
    subgraph "Dashboard UI"
        WS -->|Live Data| UI[Dashboard UI]
        HS -->|Status| UI
        AE -->|Historical| UI
        AS -->|Alerts| UI
        
        UI -->|Display| OV[Overview Grid]
        UI -->|Display| PM[Performance Metrics]
        UI -->|Display| ER[Error Reports]
        UI -->|Display| UA[Usage Analytics]
    end
    
    subgraph "Alert Channels"
        AS -->|Send| EMAIL[Email]
        AS -->|Send| WEBHOOK[Webhooks]
        AS -->|Send| LOG[Audit Logs]
    end
    
    style MC fill:#90EE90
    style HS fill:#87CEEB
    style AS fill:#FFB6C1
    style UI fill:#DDA0DD
    style ML fill:#FFD700

🏗️ Technical Design

Database Schema

-- Server health status
CREATE TABLE server_health (
    id INTEGER PRIMARY KEY,
    server_id VARCHAR(255) NOT NULL,
    server_name VARCHAR(255) NOT NULL,
    endpoint_url TEXT NOT NULL,
    status VARCHAR(20) NOT NULL, -- healthy, warning, critical, offline
    last_check TIMESTAMP NOT NULL,
    response_time_ms INTEGER,
    error_message TEXT,
    metadata JSON
);
-- Indexes declared separately for SQLite/PostgreSQL portability
-- (inline INDEX clauses inside CREATE TABLE are MySQL-specific)
CREATE INDEX idx_server_status ON server_health (server_id, status);
CREATE INDEX idx_last_check ON server_health (last_check);

-- Time series metrics
CREATE TABLE server_metrics (
    id INTEGER PRIMARY KEY,
    server_id VARCHAR(255) NOT NULL,
    metric_name VARCHAR(100) NOT NULL,
    metric_value DOUBLE PRECISION NOT NULL,
    tags JSON,
    timestamp TIMESTAMP NOT NULL
);
CREATE INDEX idx_server_time ON server_metrics (server_id, timestamp);
CREATE INDEX idx_metric_time ON server_metrics (metric_name, timestamp);

-- Error tracking
CREATE TABLE error_events (
    id INTEGER PRIMARY KEY,
    server_id VARCHAR(255) NOT NULL,
    endpoint VARCHAR(255),
    error_type VARCHAR(100),
    error_message TEXT,
    stack_trace TEXT,
    request_id VARCHAR(100),
    client_info JSON,
    timestamp TIMESTAMP NOT NULL
);
CREATE INDEX idx_server_errors ON error_events (server_id, timestamp);
CREATE INDEX idx_error_type ON error_events (error_type, timestamp);

-- Alert configuration
CREATE TABLE alert_rules (
    id INTEGER PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    description TEXT,
    metric_name VARCHAR(100) NOT NULL,
    condition VARCHAR(20) NOT NULL, -- gt, lt, eq, gte, lte
    threshold DOUBLE PRECISION NOT NULL,
    evaluation_period INTEGER NOT NULL, -- seconds
    servers JSON, -- null means all servers
    enabled BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Alert history
CREATE TABLE alert_history (
    id INTEGER PRIMARY KEY,
    rule_id INTEGER REFERENCES alert_rules(id),
    server_id VARCHAR(255),
    triggered_at TIMESTAMP NOT NULL,
    resolved_at TIMESTAMP,
    metric_value DOUBLE PRECISION,
    notification_sent BOOLEAN DEFAULT FALSE
);
CREATE INDEX idx_triggered ON alert_history (triggered_at);
CREATE INDEX idx_server_alerts ON alert_history (server_id, triggered_at);
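
For local development the schema can be smoke-tested against SQLite, where VARCHAR/JSON column types degrade gracefully to SQLite's type affinities. A minimal sketch (hypothetical sample data):

```python
# Minimal sketch: exercise the server_health table with in-memory SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE server_health (
        id INTEGER PRIMARY KEY,
        server_id VARCHAR(255) NOT NULL,
        server_name VARCHAR(255) NOT NULL,
        endpoint_url TEXT NOT NULL,
        status VARCHAR(20) NOT NULL,
        last_check TIMESTAMP NOT NULL,
        response_time_ms INTEGER,
        error_message TEXT,
        metadata JSON
    )
""")
conn.execute("CREATE INDEX idx_server_status ON server_health (server_id, status)")

# Record one health-check result and read back the latest status
conn.execute(
    "INSERT INTO server_health (server_id, server_name, endpoint_url, status,"
    " last_check, response_time_ms) VALUES (?, ?, ?, ?, datetime('now'), ?)",
    ("tool-server", "Tool Server", "http://tools.mcp.local", "healthy", 45),
)
row = conn.execute(
    "SELECT status, response_time_ms FROM server_health"
    " WHERE server_id = ? ORDER BY last_check DESC LIMIT 1",
    ("tool-server",),
).fetchone()
```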

Metrics Collection Configuration

# health_config.py
from pydantic import Field
from pydantic_settings import BaseSettings  # pydantic v2; in v1, import BaseSettings from pydantic

class HealthMonitoringConfig(BaseSettings):
    # Collection intervals
    health_check_interval: int = Field(default=30, description="Health check interval in seconds")
    metrics_collection_interval: int = Field(default=10, description="Metrics collection interval in seconds")
    
    # Retention policies
    metrics_retention_days: int = Field(default=30, description="Days to retain metrics")
    error_retention_days: int = Field(default=90, description="Days to retain error logs")
    
    # Health check configuration
    health_check_timeout: int = Field(default=5, description="Health check timeout in seconds")
    health_check_retries: int = Field(default=3, description="Number of retries before marking offline")
    
    # Thresholds
    response_time_warning_ms: int = Field(default=500, description="Warning threshold")
    response_time_critical_ms: int = Field(default=1000, description="Critical threshold")
    error_rate_warning_percent: float = Field(default=1.0, description="Warning error rate")
    error_rate_critical_percent: float = Field(default=5.0, description="Critical error rate")
    
    # Dashboard settings
    dashboard_refresh_interval: int = Field(default=5, description="Dashboard refresh in seconds")
    max_timeline_points: int = Field(default=1000, description="Max data points in timeline")
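
The warning/critical response-time thresholds above imply a simple classification rule; a stdlib sketch using the default values (function name is illustrative):

```python
# Sketch: map a measured response time to a health status using the
# configured thresholds (defaults mirror the config above).
def classify_response_time(ms: int, warning_ms: int = 500, critical_ms: int = 1000) -> str:
    if ms >= critical_ms:
        return "critical"
    if ms >= warning_ms:
        return "warning"
    return "healthy"
```

This matches the mockup: 45ms is healthy, 850ms is a warning.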

Metric Definitions

# metrics.py
HEALTH_METRICS = {
    "response_time": {
        "type": "histogram",
        "unit": "milliseconds",
        "buckets": [10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
    },
    "request_rate": {
        "type": "counter",
        "unit": "requests/second"
    },
    "active_connections": {
        "type": "gauge",
        "unit": "connections"
    },
    "error_rate": {
        "type": "counter",
        "unit": "errors/second",
        "labels": ["error_type", "endpoint"]
    },
    "cpu_usage": {
        "type": "gauge",
        "unit": "percent"
    },
    "memory_usage": {
        "type": "gauge",
        "unit": "megabytes"
    }
}

# Health check endpoints
HEALTH_CHECK_ENDPOINTS = {
    "basic": "/health",
    "detailed": "/health/detailed",
    "mcp_ping": "/protocol/ping"
}
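
The Prometheus export mentioned under Standards Compliance could render these definitions in the text exposition format. A sketch covering gauges only (histograms additionally need `_bucket`/`_sum`/`_count` series); the `mcp_` prefix and function name are assumptions:

```python
# Sketch: render gauge metrics in the Prometheus text exposition format.
HEALTH_METRICS = {
    "active_connections": {"type": "gauge", "unit": "connections"},
    "cpu_usage": {"type": "gauge", "unit": "percent"},
}


def render_prometheus(values: dict) -> str:
    lines = []
    for name, value in values.items():
        lines.append(f"# TYPE mcp_{name} {HEALTH_METRICS[name]['type']}")
        lines.append(f"mcp_{name} {value}")
    return "\n".join(lines) + "\n"
```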

🛠️ Implementation Tasks

Phase 1: Core Infrastructure

  • Create database schema for health metrics
  • Implement metrics collector service
  • Build health check probe system
  • Create time-series data storage layer
  • Implement WebSocket server for real-time updates
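
The WebSocket fan-out in this phase can be prototyped framework-agnostically as an asyncio broadcaster: each dashboard client subscribes to a queue, and every metrics update is pushed to all queues. A hypothetical sketch:

```python
# Sketch: fan out live health updates to all connected dashboard clients.
# In the real service each queue would feed one WebSocket connection.
import asyncio


class Broadcaster:
    def __init__(self) -> None:
        self.subscribers: set = set()

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self.subscribers.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self.subscribers.discard(q)

    def publish(self, update: dict) -> None:
        # Non-blocking push to every subscriber's queue
        for q in self.subscribers:
            q.put_nowait(update)
```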

Phase 2: Data Collection

  • Add metrics instrumentation to MCP endpoints
  • Implement health check scheduler
  • Create error event collector
  • Build metrics aggregation pipeline
  • Add data retention policies

Phase 3: Dashboard UI

  • Create main dashboard layout
  • Implement server status grid component
  • Build real-time metric charts
  • Add WebSocket client for live updates
  • Create responsive mobile layout

Phase 4: Analytics Features

  • Implement performance trending
  • Add error correlation engine
  • Create usage pattern analyzer
  • Build capacity forecasting
  • Add anomaly detection

Phase 5: Alerting System

  • Create alert rule configuration UI
  • Implement alert evaluation engine
  • Add notification channels
  • Build alert history viewer
  • Create alert suppression logic

Phase 6: Federation Support

  • Add cross-gateway health aggregation
  • Implement federated dashboard view
  • Create gateway topology visualization
  • Add federation link monitoring

📋 Acceptance Criteria

Performance Requirements

  • Dashboard loads in < 2 seconds
  • Real-time updates with < 5 second delay
  • Support 100+ monitored servers
  • Metrics query response < 500ms
  • Minimal overhead on monitored servers (< 1% CPU)

Functionality

  • All health metrics collected accurately
  • Historical data retained per policy
  • Alerts trigger within evaluation period
  • Error correlation works across servers
  • Federation view shows all gateways

User Experience

  • Intuitive navigation between views
  • Clear visual health indicators
  • Responsive on mobile devices
  • Exportable metrics and reports
  • Customizable dashboard layouts

🚫 Out of Scope

  • Log aggregation (separate feature)
  • Distributed tracing
  • Application Performance Monitoring (APM)
  • Infrastructure monitoring (CPU, disk, network)
  • Custom metric definitions

📊 Success Metrics

  • 99.9% health check reliability
  • < 5 minute MTTR improvement
  • 90% of issues detected before user impact
  • 100% critical alerts delivered

🔗 Standards Compliance

  • ✅ Uses standard MCP health endpoints
  • ✅ Compatible with existing monitoring tools
  • ✅ Exports metrics in Prometheus format
  • ✅ Follows OpenTelemetry standards

📝 Notes

  • Consider integration with existing monitoring stacks
  • Plan for high-cardinality metrics
  • Implement gradual rollout for federation
  • Document performance impact on servers
  • Create runbooks for common alerts

Metadata

Assignees: none

Labels: COULD (P3: nice-to-have, included if time permits), enhancement, frontend, python, ui

Milestone: none · Projects: none · Relationships: none · Development: no branches or pull requests