[EPIC][ADMIN]: Admin alerting and alert management #2311

@crivetimihai

Description

Overview

Provide administrators with tools to define, manage, and respond to system alerts based on thresholds, anomalies, and operational events.

User Stories

Alert Rules

  • As an admin, I want to define alert rules based on metrics thresholds
  • As an admin, I want to create alerts for specific events (server down, auth failures)
  • As an admin, I want to set alert severity levels (critical, warning, info)
  • As an admin, I want to test alert rules before enabling them

Alert Conditions

  • As an admin, I want threshold-based alerts (e.g., error rate > 5%)
  • As an admin, I want availability alerts (server unreachable for X minutes)
  • As an admin, I want rate limit alerts (limits approached or exceeded)
  • As an admin, I want certificate expiration alerts
  • As an admin, I want anomaly detection alerts (unusual patterns)
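A threshold condition such as "error rate > 5%" could be evaluated along these lines. This is a minimal sketch: the condition keys (`metric`, `operator`, `threshold`) and the `condition_met` helper are hypothetical, since the issue does not define a schema for the rule's `condition` JSON.

```python
import operator
from typing import Any, Mapping

# Hypothetical set of comparison operators a rule condition might allow.
_OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def condition_met(condition: Mapping[str, Any], metrics: Mapping[str, float]) -> bool:
    """Return True when the observed metric crosses the rule's threshold."""
    value = metrics.get(condition["metric"])
    if value is None:
        # No sample for this metric: do not fire rather than guess.
        return False
    compare = _OPERATORS[condition["operator"]]
    return compare(value, condition["threshold"])


# "error rate > 5%" expressed as a fraction (assumed representation).
rule_condition = {"metric": "error_rate", "operator": ">", "threshold": 0.05}
print(condition_met(rule_condition, {"error_rate": 0.08}))  # True
print(condition_met(rule_condition, {"error_rate": 0.01}))  # False
```

Treating a missing metric as "not firing" is one possible choice; an availability rule might instead want missing data to fire.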

Alert Routing

  • As an admin, I want to route alerts to specific users or teams
  • As an admin, I want to configure delivery channels (email, webhook, SMS)
  • As an admin, I want integration with PagerDuty, Opsgenie, and Slack
  • As an admin, I want different routing based on severity or type
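Severity-based routing could be as simple as a lookup table. The sketch below is illustrative only: the channel names and the `channels_for` helper are assumptions, not part of the proposal.

```python
from typing import Dict, List

# Hypothetical routing table: severity -> delivery channels.
DEFAULT_ROUTES: Dict[str, List[str]] = {
    "critical": ["pagerduty", "slack", "email"],
    "warning": ["slack", "email"],
    "info": ["email"],
}


def channels_for(severity: str, routes: Dict[str, List[str]] = DEFAULT_ROUTES) -> List[str]:
    """Return the delivery channels for a severity; unknown severities get none."""
    # Copy the list so callers cannot mutate the shared routing table.
    return list(routes.get(severity, []))


print(channels_for("critical"))  # ['pagerduty', 'slack', 'email']
print(channels_for("debug"))     # []
```

Routing by alert type as well as severity would need a richer key, for example a `(type, severity)` tuple.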

Escalation Policies

  • As an admin, I want to define escalation chains
  • As an admin, I want automatic escalation if not acknowledged in X minutes
  • As an admin, I want on-call schedules for alert routing
  • As an admin, I want escalation override for critical alerts
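The "escalate if not acknowledged in X minutes" story comes down to a time comparison. A minimal sketch, assuming a hypothetical `should_escalate` helper and timezone-aware timestamps:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_escalate(
    triggered_at: datetime,
    acknowledged_at: Optional[datetime],
    now: datetime,
    ack_timeout: timedelta,
) -> bool:
    """Escalate only if the alert is still unacknowledged after ack_timeout."""
    if acknowledged_at is not None:
        return False
    return now - triggered_at >= ack_timeout


fired = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
later = fired + timedelta(minutes=16)
print(should_escalate(fired, None, later, timedelta(minutes=15)))  # True
print(should_escalate(fired, fired + timedelta(minutes=5), later, timedelta(minutes=15)))  # False
```

Passing `now` in explicitly, rather than calling the clock inside the function, keeps the check easy to test.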

Alert Management

  • As an admin, I want to view all active alerts in a dashboard
  • As an admin, I want to acknowledge alerts
  • As an admin, I want to resolve alerts with resolution notes
  • As an admin, I want to snooze/silence alerts temporarily

Maintenance Windows

  • As an admin, I want to define maintenance windows (suppress alerts)
  • As an admin, I want to schedule recurring maintenance windows
  • As an admin, I want alerts to resume automatically after the window ends
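Suppression during a maintenance window can be sketched as an interval check. The `MaintenanceWindow` class and `is_suppressed` helper below are hypothetical; recurring windows are out of scope for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable


@dataclass(frozen=True)
class MaintenanceWindow:
    start: datetime
    end: datetime

    def covers(self, moment: datetime) -> bool:
        # Half-open interval [start, end): alerting resumes exactly at `end`.
        return self.start <= moment < self.end


def is_suppressed(moment: datetime, windows: Iterable[MaintenanceWindow]) -> bool:
    """Return True if any maintenance window covers the given moment."""
    return any(window.covers(moment) for window in windows)


start = datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc)
window = MaintenanceWindow(start=start, end=start + timedelta(hours=1))
print(is_suppressed(start + timedelta(minutes=30), [window]))  # True
print(is_suppressed(start + timedelta(hours=1), [window]))     # False
```

Because suppression is computed from the window bounds at evaluation time, alerts resume on their own once the window has passed, with no cleanup step.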

Alert History

  • As an admin, I want to view alert history and trends
  • As an admin, I want to see mean time to acknowledge (MTTA)
  • As an admin, I want to see mean time to resolve (MTTR)
  • As an admin, I want to export alert reports
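MTTA and MTTR can both be computed as the mean of a timestamp difference. The sketch below assumes both are measured from `triggered_at` (a common convention, not stated in the issue); `mean_delta` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, Optional, Tuple


def mean_delta(
    pairs: Iterable[Tuple[datetime, Optional[datetime]]],
) -> Optional[timedelta]:
    """Mean of (end - start) over pairs with an end; None if there are none."""
    seconds = [
        (end - start).total_seconds()
        for start, end in pairs
        if end is not None  # skip alerts not yet acknowledged or resolved
    ]
    return timedelta(seconds=mean(seconds)) if seconds else None


t0 = datetime(2025, 1, 1, 12, 0)
# (triggered_at, acknowledged_at) pairs; the third alert is still unacknowledged.
ack_pairs = [
    (t0, t0 + timedelta(minutes=2)),
    (t0, t0 + timedelta(minutes=4)),
    (t0, None),
]
print(mean_delta(ack_pairs))  # 0:03:00
```

MTTR follows the same pattern with `(triggered_at, resolved_at)` pairs.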

Configuration

MCPGATEWAY_ALERTING_ENABLED=true
MCPGATEWAY_ALERT_EMAIL_ENABLED=true
MCPGATEWAY_ALERT_WEBHOOK_URL=https://hooks.slack.com/...
MCPGATEWAY_PAGERDUTY_API_KEY=...
MCPGATEWAY_OPSGENIE_API_KEY=...

Data Model

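# Sketch using SQLAlchemy 2.0 declarative mapping; table names are placeholders.
import uuid
from datetime import datetime
from typing import Optional

from sqlalchemy import JSON, ForeignKey
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class AlertRule(Base):
    __tablename__ = "alert_rules"

    id: Mapped[uuid.UUID] = mapped_column(primary_key=True, default=uuid.uuid4)
    name: Mapped[str]
    description: Mapped[str]
    condition: Mapped[dict] = mapped_column(JSON)  # threshold, event type, etc.
    severity: Mapped[str]  # critical, warning, info
    enabled: Mapped[bool]
    routing: Mapped[dict] = mapped_column(JSON)  # who to notify
    escalation_policy_id: Mapped[Optional[uuid.UUID]]


class Alert(Base):
    __tablename__ = "alerts"

    id: Mapped[uuid.UUID] = mapped_column(primary_key=True, default=uuid.uuid4)
    rule_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("alert_rules.id"))
    status: Mapped[str]  # firing, acknowledged, resolved
    severity: Mapped[str]
    message: Mapped[str]
    triggered_at: Mapped[datetime]
    acknowledged_at: Mapped[Optional[datetime]]
    acknowledged_by: Mapped[Optional[str]]
    resolved_at: Mapped[Optional[datetime]]
    resolved_by: Mapped[Optional[str]]
    resolution_notes: Mapped[Optional[str]]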
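The `status` values noted on `Alert` (firing, acknowledged, resolved) imply a small state machine. One possible set of transitions is sketched below; the allowed moves and the `transition` helper are assumptions, since the issue does not say, for example, whether a firing alert may be resolved without first being acknowledged.

```python
from typing import Dict, Set

# Hypothetical transition table for Alert.status.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "firing": {"acknowledged", "resolved"},
    "acknowledged": {"resolved"},
    "resolved": set(),  # terminal state
}


def transition(current: str, target: str) -> str:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move alert from {current!r} to {target!r}")
    return target


status = transition("firing", "acknowledged")
status = transition(status, "resolved")
print(status)  # resolved
```

Centralizing the table makes it straightforward to reject invalid updates, such as re-acknowledging a resolved alert, before they reach the database.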

Acceptance Criteria

  • Alert rules can be created and tested
  • Alerts fire when conditions are met
  • Routing delivers alerts to configured channels
  • Escalation works when alerts are not acknowledged
  • Maintenance windows suppress alerts correctly
  • Alert history and metrics are tracked

Milestone

Release 1.4.0 - Enterprise Features

Metadata

Assignees

No one assigned

Labels

  • COULD (P3: Nice-to-have features with minimal impact if left out; included if time permits)
  • enhancement (New feature or request)
  • epic (Large feature spanning multiple issues)
  • frontend (Frontend development: HTML, CSS, JavaScript)
  • observability (Observability, logging, monitoring)
  • ui (User Interface)