AiSOC Multi-Region Operations

Audience: Platform / SRE teams running AiSOC in production across multiple cloud regions.

Last updated: auto-generated (see scripts/generate_runbook.py)


Table of contents

  1. Architecture overview
  2. Region topology
  3. Data residency & replication strategy
  4. Traffic routing & failover
  5. Deployment procedures
  6. Observability & alerting
  7. Runbooks
  8. Recovery objectives (RTO / RPO)
  9. Chaos engineering checklist
  10. Contact & escalation matrix

1. Architecture overview

AiSOC is deployed as a set of independent microservices managed by Helm. In a multi-region setup, each region runs a full replica of the control plane, backed by:

  • Active–passive PostgreSQL: one writer in the primary region; read replicas in secondary regions promoted on failover.
  • Active–active ClickHouse: distributed cluster with per-shard replicas across regions; ZooKeeper or ClickHouse Keeper runs in every region.
  • Active–active ingest pipeline: events are fanned out to the Kafka cluster in every region; correlation happens locally.
  • Global load balancer (e.g. Cloudflare, AWS Global Accelerator, or GCP Traffic Director) directing API traffic to the nearest healthy region.
                ┌──────────────────────────────────────────────────────┐
                │             Global Load Balancer / Anycast DNS       │
                └───────────┬──────────────────────┬───────────────────┘
                            │                      │
              ┌─────────────▼──────────┐  ┌────────▼──────────────┐
              │  Region: us-east-1     │  │  Region: eu-west-1    │
              │ ─────────────────────  │  │ ───────────────────── │
              │  Kubernetes cluster    │  │  Kubernetes cluster   │
              │  ├─ api (×2)           │  │  ├─ api (×2)          │
              │  ├─ ingest (×3)        │  │  ├─ ingest (×3)       │
              │  ├─ enrichment (×2)    │  │  ├─ enrichment (×2)   │
              │  ├─ alert-fusion (×2)  │  │  ├─ alert-fusion (×2) │
              │  └─ agents (×2)        │  │  └─ agents (×2)       │
              │                        │  │                       │
              │  PostgreSQL PRIMARY ────────► PostgreSQL REPLICA  │
              │  ClickHouse shard 1    │  │  ClickHouse shard 2   │
              │  Redis (leader)  ────────►  Redis (replica)       │
              └────────────────────────┘  └───────────────────────┘

2. Region topology

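| Region label   | Cloud / zone         | Role      | Postgres      | ClickHouse shards        |
|----------------|----------------------|-----------|---------------|--------------------------|
| us-east-1      | AWS us-east-1a/b     | Primary   | Writer        | Shard 1 (1 replica each) |
| eu-west-1      | AWS eu-west-1a/b     | Secondary | Async replica | Shard 2 (1 replica each) |
| ap-southeast-1 | AWS ap-southeast-1a  | DR-only   | Async replica |                          |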

Adding a new region

# 1. Provision cluster (Terraform / eksctl / etc.)
# 2. Install cert-manager, nginx-ingress, external-secrets
# 3. Deploy AiSOC chart pointing to existing secrets store
#    (quote the --set value so the shell does not glob-expand the [0])
helm upgrade --install aisoc infra/helm/aisoc \
  --namespace aisoc \
  --create-namespace \
  --set global.environment=production \
  --set 'ingress.hosts[0].host=aisoc-eu.example.com' \
  -f infra/helm/aisoc/values-eu-west-1.yaml

# 4. Register region in Global LB (health-check /api/health)
# 5. Seed the new Postgres replica with pg_basebackup, then start WAL streaming
# 6. Extend ClickHouse cluster config to include new shard

3. Data residency & replication strategy

PostgreSQL

  • Streaming replication (wal_level=replica, max_wal_senders=5).
  • Replication lag target: < 5 s. Alert at 30 s, page at 2 min.
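  • Failover: automatic with Patroni or managed RDS. Note that RDS Multi-AZ only covers failures within a region; cross-region failover promotes the read replica (see §4). The replica becomes the writer; the old writer rejoins as a standby once recovered.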
  • GDPR: tenants with EU data residency requirements are assigned to the eu-west-1 writer via per-tenant routing in tenant_sla_config.
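
The lag thresholds above can be read as a simple classification. The following is a minimal sketch, not AiSOC code: the function and enum names are hypothetical, and treating each threshold as inclusive ("alert at 30 s" meaning lag >= 30 s) is an assumption. One way to obtain the input on a replica is `SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())`, which overstates lag while the primary is idle.

```python
from enum import Enum


class LagLevel(Enum):
    OK = "ok"              # within the < 5 s target
    ELEVATED = "elevated"  # above target, below the alert threshold
    ALERT = "alert"        # 30 s or more
    PAGE = "page"          # 2 min or more


# Thresholds taken from the PostgreSQL list above, in seconds.
TARGET_S = 5.0
ALERT_S = 30.0
PAGE_S = 120.0


def classify_replication_lag(lag_seconds: float) -> LagLevel:
    """Map a measured replica lag (seconds) to an alerting level.

    Assumption: thresholds are inclusive lower bounds for each level.
    """
    if lag_seconds < 0:
        raise ValueError("lag cannot be negative")
    if lag_seconds >= PAGE_S:
        return LagLevel.PAGE
    if lag_seconds >= ALERT_S:
        return LagLevel.ALERT
    if lag_seconds >= TARGET_S:
        return LagLevel.ELEVATED
    return LagLevel.OK
```

The same shape would fit the ClickHouse thresholds by swapping the constants.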

ClickHouse

  • Distributed table events_dist over all shards.
  • Each shard has a replica in the same region; cross-region replication is coordinated through the ZooKeeper / ClickHouse Keeper quorum.
  • Replication lag target: < 10 s. Alert at 60 s.

Redis

  • Read replicas in secondary regions for cache warming; Redis Sentinel for HA within a region.
  • Session data: replicated. Ephemeral rate-limit keys: local only.

Object storage (S3/R2)

  • Backup objects replicated to a second bucket in an alternate region via bucket replication rules.
  • Plugin artifacts: single bucket with multi-region access enabled.

4. Traffic routing & failover

Healthy-region selection

The global LB runs active health checks every 10 s against GET /api/health. A region is removed from rotation if:

  • HTTP status ≠ 200 for 3 consecutive checks, or
  • Latency p99 > 2 s for 5 consecutive checks.
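
The removal rule above amounts to counting trailing consecutive failures. A minimal sketch of that logic follows; the real decision is made by the global LB, so the `HealthCheck` type and function names here are hypothetical and only illustrate the two conditions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class HealthCheck:
    status: int           # HTTP status returned by GET /api/health
    p99_latency_s: float  # p99 latency observed for this check, in seconds


def trailing_run(checks: Sequence[HealthCheck],
                 predicate: Callable[[HealthCheck], bool]) -> int:
    """Count how many of the most recent checks satisfy predicate in a row."""
    count = 0
    for check in reversed(checks):
        if not predicate(check):
            break
        count += 1
    return count


def should_remove_region(checks: Sequence[HealthCheck],
                         max_bad_status: int = 3,
                         max_slow: int = 5,
                         latency_limit_s: float = 2.0) -> bool:
    """Apply the two removal conditions to a region's check history (oldest first)."""
    bad_status = trailing_run(checks, lambda c: c.status != 200)
    slow = trailing_run(checks, lambda c: c.p99_latency_s > latency_limit_s)
    return bad_status >= max_bad_status or slow >= max_slow
```

With the 10 s check interval, this sketch would evict a hard-down region after about 30 s and a slow one after about 50 s.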

Planned failover (maintenance)

# Drain traffic from us-east-1 before maintenance window
# 1. Weight us-east-1 to 0 in LB config
# 2. Wait for in-flight requests to drain (~60 s)
# 3. Perform maintenance
# 4. Restore weight

# Cloudflare example (assumes a record type whose weight lives under "data", e.g. SRV)
cf_zone_id="<ZONE_ID>"
cf_record_id="<RECORD_ID>"
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/${cf_zone_id}/dns_records/${cf_record_id}" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"data":{"weight":0}}'

Unplanned failover

  1. PagerDuty alert fires (region_health_check rule).
  2. On-call SRE confirms outage (./scripts/health_check.sh --region us-east-1).
  3. Execute runbook RB-003-region-failover (auto-generated; see §7).
  4. Postgres: promote replica via Patroni or aws rds promote-read-replica.
  5. Update DATABASE_URL secret in secondary region to the new writer endpoint.
  6. Restart API pods: kubectl rollout restart deployment -n aisoc -l app.kubernetes.io/name=api.
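
During an incident it can help to print the commands in order before running anything. The sketch below builds such a dry-run plan from steps 2, 4, and 6. It is illustrative only: `build_failover_plan`, `render_plan`, and the replica identifier are hypothetical, the RDS variant of step 4 is assumed, and step 5 is left out because this document does not say how the DATABASE_URL secret is updated.

```python
import shlex


def build_failover_plan(failed_region: str,
                        replica_id: str,
                        namespace: str = "aisoc") -> list[list[str]]:
    """Return the failover commands in execution order, without running them."""
    return [
        # Step 2: confirm the outage.
        ["./scripts/health_check.sh", "--region", failed_region],
        # Step 4: promote the replica (RDS variant; use Patroni if self-managed).
        ["aws", "rds", "promote-read-replica",
         "--db-instance-identifier", replica_id],
        # Step 6: restart API pods so they pick up the updated DATABASE_URL.
        ["kubectl", "rollout", "restart", "deployment",
         "-n", namespace, "-l", "app.kubernetes.io/name=api"],
    ]


def render_plan(plan: list[list[str]]) -> str:
    """Render each command as one safely quoted shell line."""
    return "\n".join(shlex.join(cmd) for cmd in plan)


if __name__ == "__main__":
    # "aisoc-eu-west-1-replica" is a placeholder identifier.
    print(render_plan(build_failover_plan("us-east-1", "aisoc-eu-west-1-replica")))
```

Keeping the plan as argument lists rather than strings means it could later be passed to `subprocess.run` without re-parsing.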

5. Deployment procedures

Rolling update (standard)

# Bump image tag in CI/CD (GitHub Actions), then:
helm upgrade aisoc infra/helm/aisoc \
  --namespace aisoc \
  --atomic \
  --timeout 5m \
  --set services.api.image.tag=${GIT_SHA} \
  --set services.ingest.image.tag=${GIT_SHA}

maxUnavailable: 0 and maxSurge: 1 are enforced in deployment.yaml; pods are updated one at a time.
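
To see what these two settings imply for capacity, here is a small worked sketch (the function name is hypothetical). It assumes both values are absolute pod counts, as in the text, rather than percentages.

```python
def rollout_bounds(replicas: int,
                   max_surge: int = 1,
                   max_unavailable: int = 0) -> tuple[int, int]:
    """Return (min_available, max_total) pod counts during a rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        # Kubernetes rejects this combination: the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    min_available = max(replicas - max_unavailable, 0)
    max_total = replicas + max_surge
    return min_available, max_total


# api runs 2 replicas and ingest runs 3 in the architecture diagram.
print(rollout_bounds(2))  # (2, 3)
print(rollout_bounds(3))  # (3, 4)
```

In other words, serving capacity never drops below the desired replica count, but each node pool needs headroom for one extra pod per Deployment while a rollout is in progress.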

Blue/green release

  1. Deploy new version to a parallel namespace (aisoc-green).
  2. Run smoke tests against green ingress host.
  3. Switch LB to green namespace via weighted routing.
  4. Keep blue idle for 1 hour (rollback window).
  5. Delete blue namespace.

Rollback

helm rollback aisoc 0 --namespace aisoc   # 0 = previous release
# or target a specific revision:
helm history aisoc --namespace aisoc
helm rollback aisoc <REVISION> --namespace aisoc

6. Observability & alerting

AiSOC emits OpenTelemetry traces, metrics, and structured logs to a configurable OTLP endpoint (global.otelEndpoint in values.yaml).

Key SLIs

| Service      | Metric                    | SLO target |
|--------------|---------------------------|------------|
| api          | http_request_duration_p99 | < 500 ms   |
| api          | http_error_rate           | < 0.5 %    |
| ingest       | event_ingestion_lag_p99   | < 2 s      |
| alert-fusion | alert_fusion_latency_p99  | < 5 s      |
| agents       | agent_run_duration_p95    | < 30 s     |
| All          | Pod ready ratio           | > 99 %     |

Recommended dashboards

  • Service map: trace-based topology from OTLP backend (Tempo, Jaeger, Honeycomb).
  • Golden signals: per-service latency / error / saturation / traffic (Grafana aisoc-golden-signals.json).
  • SLA tracker: AiSOC built-in /sla dashboard (/apps/web/src/app/(app)/sla/page.tsx).

Alerting rules (Prometheus/Alertmanager)

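# Example PrometheusRule
- alert: AiSOCHighErrorRate
  # sum() both sides so the ratio is 5xx traffic over all traffic,
  # not a per-series division that only matches identical label sets.
  expr: |
    sum(rate(http_requests_total{service="aisoc-api",status=~"5.."}[5m]))
    / sum(rate(http_requests_total{service="aisoc-api"}[5m])) > 0.005
  for: 3m
  labels:
    severity: page
  annotations:
    summary: "AiSOC API error rate > 0.5%"

- alert: AiSOCIngestLag
  expr: histogram_quantile(0.99, sum by (le) (rate(ingest_lag_seconds_bucket[5m]))) > 2
  for: 5m
  labels:
    severity: warn
  annotations:
    summary: "AiSOC ingest lag p99 > 2 s"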
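
The intent of AiSOCHighErrorRate is the ratio of the 5xx request rate to the total request rate over five minutes, compared against 0.005. The sketch below reproduces that arithmetic on plain counter samples so the threshold can be sanity-checked offline. It is a simplification, not how Prometheus evaluates `rate()`: counter resets and extrapolation are ignored, and the function names are hypothetical.

```python
import re


def counter_rate(first: float, last: float, window_s: float) -> float:
    """Per-second increase of a counter over a window (resets are ignored)."""
    return max(last - first, 0.0) / window_s


def error_ratio(samples: dict[str, tuple[float, float]],
                window_s: float = 300.0) -> float:
    """Ratio of 5xx rate to total rate.

    samples maps an HTTP status code to (counter at window start,
    counter at window end) for http_requests_total.
    """
    total = sum(counter_rate(a, b, window_s) for a, b in samples.values())
    errors = sum(
        counter_rate(a, b, window_s)
        for status, (a, b) in samples.items()
        if re.fullmatch(r"5..", status)  # same pattern as status=~"5.."
    )
    return errors / total if total else 0.0


# 990 successful and 10 failed requests in the window: 1 % errors, above 0.5 %.
ratio = error_ratio({"200": (1000.0, 1990.0), "500": (0.0, 10.0)})
print(ratio > 0.005)
```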

7. Runbooks

Runbooks are auto-generated from live OTel trace data by scripts/generate_runbook.py. The output lives in docs/operations/runbooks/. Each runbook follows the format:

RB-NNN-<slug>.md
  Title
  Trigger condition
  Impact assessment
  Diagnosis steps (from trace topology)
  Remediation steps
  Verification steps
  Escalation path
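
As an illustration of the naming scheme and section order, the sketch below builds a filename and an empty skeleton. It is not the logic of scripts/generate_runbook.py, which this document does not show; the function names, the slug validation rule, and the Markdown heading markup in the output are all assumptions.

```python
import re

# Section order taken from the runbook format above.
SECTIONS = [
    "Trigger condition",
    "Impact assessment",
    "Diagnosis steps",
    "Remediation steps",
    "Verification steps",
    "Escalation path",
]


def runbook_filename(number: int, slug: str) -> str:
    """Build RB-NNN-<slug>.md with a zero-padded three-digit number."""
    if not 0 < number < 1000:
        raise ValueError("runbook number must be between 1 and 999")
    # Assumed slug rule: lowercase words separated by single hyphens.
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", slug):
        raise ValueError(f"invalid slug: {slug!r}")
    return f"RB-{number:03d}-{slug}.md"


def runbook_skeleton(title: str) -> str:
    """Return an empty runbook body with every required section."""
    lines = [f"# {title}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)


print(runbook_filename(3, "region-failover"))  # RB-003-region-failover.md
```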

Available runbooks

| ID     | Slug                  | Trigger                           |
|--------|-----------------------|-----------------------------------|
| RB-001 | api-high-latency      | http_request_duration_p99 > 500ms |
| RB-002 | postgres-replica-lag  | Replication lag > 30 s            |
| RB-003 | region-failover       | Region health check failure       |
| RB-004 | ingest-pipeline-stall | Ingest lag > 30 s                 |
| RB-005 | agent-runner-oom      | OOMKilled in agents pods          |
| RB-006 | cert-expiry           | TLS cert expires in < 14 days     |

To regenerate all runbooks:

OTEL_ENDPOINT=http://tempo:4317 \
  python scripts/generate_runbook.py \
  --output docs/operations/runbooks/ \
  --lookback-hours 168    # 1 week of traces

8. Recovery objectives (RTO / RPO)

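| Failure scenario               | RPO     | RTO                     |
|--------------------------------|---------|-------------------------|
| Single pod crash               | 0 s     | < 30 s (K8s restarts)   |
| Availability zone failure      | < 30 s  | < 5 min                 |
| Region failure (warm standby)  | < 60 s  | < 15 min                |
| Region failure (cold DR)       | < 5 min | < 60 min                |
| Total data loss (from backup)  | 24 h    | < 4 h                   |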
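
Drill results can be checked against these objectives mechanically. The following sketch encodes the table in seconds and compares a measured data-loss window and recovery time against it. The scenario keys and function name are hypothetical, and treating every objective as an inclusive upper bound is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    rpo_s: float  # maximum tolerated data loss, in seconds
    rto_s: float  # maximum tolerated recovery time, in seconds


# Values from the table above, converted to seconds.
OBJECTIVES = {
    "single-pod-crash": Objective(rpo_s=0, rto_s=30),
    "az-failure": Objective(rpo_s=30, rto_s=5 * 60),
    "region-failure-warm": Objective(rpo_s=60, rto_s=15 * 60),
    "region-failure-cold": Objective(rpo_s=5 * 60, rto_s=60 * 60),
    "total-data-loss": Objective(rpo_s=24 * 3600, rto_s=4 * 3600),
}


def drill_meets_objective(scenario: str,
                          data_loss_s: float,
                          recovery_s: float) -> bool:
    """True if a drill's measured data loss and recovery time fit the objective."""
    objective = OBJECTIVES[scenario]
    return data_loss_s <= objective.rpo_s and recovery_s <= objective.rto_s


# A warm-standby drill that lost 20 s of data and recovered in 10 min passes.
print(drill_meets_objective("region-failure-warm", 20, 600))
```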

9. Chaos engineering checklist

Run monthly in the staging environment:

  • Kill 50 % of api pods; confirm remaining pods handle load and HPA scales up.
  • Inject 500 ms network latency to postgres; confirm circuit breaker opens.
  • Stop Kafka consumer group for ingest; confirm lag alert fires within 5 min.
  • Simulate AZ failure by cordoning one node group; confirm pods reschedule.
  • Promote Postgres replica; confirm API recovers within RTO.
  • Delete and restore from backup; confirm RPO.

10. Contact & escalation matrix

| Severity                  | First responder         | Escalate to      | SLA               |
|---------------------------|-------------------------|------------------|-------------------|
| P1 – Production down      | On-call SRE (PagerDuty) | Engineering lead | 15 min response   |
| P2 – Degraded performance | On-call SRE             | Engineering lead | 1 h response      |
| P3 – Non-critical issue   | Slack #aisoc-ops        |                  | Next business day |
| P4 – Informational        | Ticketing system        |                  | Best effort       |

This document is maintained alongside the codebase. Run scripts/generate_runbook.py --update-toc to refresh section links after adding runbooks.