Audience: Platform / SRE teams running AiSOC in production across multiple cloud regions. Last updated: auto-generated (see
scripts/generate_runbook.py)
- Architecture overview
- Region topology
- Data residency & replication strategy
- Traffic routing & failover
- Deployment procedures
- Observability & alerting
- Runbooks
- Recovery objectives (RTO / RPO)
- Chaos engineering checklist
- Contact & escalation matrix
AiSOC is deployed as a set of independent microservices managed by Helm. In a multi-region setup each region runs a full replica of the control plane with:
- Active–passive PostgreSQL: one writer in the primary region; read replicas in secondary regions promoted on failover.
- Active–active ClickHouse: distributed cluster with per-shard replicas across regions; ZooKeeper or ClickHouse Keeper runs in every region.
- Active–active ingest pipeline: events are fan-out written to all region Kafka clusters; correlation happens locally.
- Global load balancer (e.g. Cloudflare, AWS Global Accelerator, or GCP Traffic Director) directing API traffic to the nearest healthy region.
┌──────────────────────────────────────────────────────┐
│ Global Load Balancer / Anycast DNS │
└───────────┬──────────────────────┬───────────────────┘
│ │
┌─────────────▼──────────┐ ┌────────▼──────────────┐
│ Region: us-east-1 │ │ Region: eu-west-1 │
│ ───────────────────── │ │ ──────────────────── │
│ Kubernetes cluster │ │ Kubernetes cluster │
│ ├─ api (×2) │ │ ├─ api (×2) │
│ ├─ ingest (×3) │ │ ├─ ingest (×3) │
│ ├─ enrichment (×2) │ │ ├─ enrichment (×2) │
│ ├─ alert-fusion (×2) │ │ ├─ alert-fusion (×2) │
│ └─ agents (×2) │ │ └─ agents (×2) │
│ │ │ │
│ PostgreSQL PRIMARY ─────────► PostgreSQL REPLICA │
│ ClickHouse shard 1 │ │ ClickHouse shard 2 │
│ Redis (leader) ────────► Redis (replica) │
└────────────────────────┘ └───────────────────────┘
| Region label | Cloud / zone | Role | Postgres | ClickHouse shards |
|---|---|---|---|---|
us-east-1 |
AWS us-east-1a/b | Primary | Writer | Shard 1 (1 replica each) |
eu-west-1 |
AWS eu-west-1a/b | Secondary | Async replica | Shard 2 (1 replica each) |
ap-southeast-1 |
AWS ap-southeast-1a | DR-only | Async replica | — |
# 1. Provision cluster (Terraform / eksctl / etc.)
# 2. Install cert-manager, nginx-ingress, external-secrets
# 3. Deploy AiSOC chart pointing to existing secrets store
helm upgrade --install aisoc infra/helm/aisoc \
--namespace aisoc \
--create-namespace \
--set global.environment=production \
--set ingress.hosts[0].host=aisoc-eu.example.com \
-f infra/helm/aisoc/values-eu-west-1.yaml
# 4. Register region in Global LB (health-check /api/health)
# 5. Stream Postgres WAL to new replica (pg_basebackup)
# 6. Extend ClickHouse cluster config to include new shard- Streaming replication (
wal_level=replica,max_wal_senders=5). - Replication lag target: < 5 s. Alert at 30 s, page at 2 min.
- Failover: automatic with Patroni or managed RDS Multi-AZ. Replica becomes writer; old writer enters standby when recovered.
- GDPR: tenants with EU data residency requirements are assigned to the
eu-west-1writer via per-tenant routing intenant_sla_config.
- Distributed table
events_distover all shards. - Each shard has a replica in the same region; cross-region replication runs over ZooKeeper quorum.
- Replication lag target: < 10 s. Alert at 60 s.
- Read replicas in secondary regions for cache warming; sentinel setup for HA within a region.
- Session data: replicated. Ephemeral rate-limit keys: local only.
- Backup objects replicated to a second bucket in an alternate region via bucket replication rules.
- Plugin artifacts: single bucket with multi-region access enabled.
The global LB runs active health checks every 10 s against GET /api/health. A region is removed from rotation if:
- HTTP status ≠ 200 for 3 consecutive checks, or
- Latency p99 > 2 s for 5 consecutive checks.
# Drain traffic from us-east-1 before maintenance window
# 1. Weight us-east-1 to 0 in LB config
# 2. Wait for in-flight requests to drain (~60 s)
# 3. Perform maintenance
# 4. Restore weight
# CloudFlare example
cf_zone_id=<ZONE_ID>
cf_record_id=<RECORD_ID>
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/${cf_zone_id}/dns_records/${cf_record_id}" \
-H "Authorization: Bearer ${CF_API_TOKEN}" \
-d '{"data":{"weight":0}}'- PagerDuty alert fires (
region_health_checkrule). - On-call SRE confirms outage (
./scripts/health_check.sh --region us-east-1). - Execute runbook
RB-003-region-failover(auto-generated; see §7). - Postgres: promote replica via Patroni or
aws rds promote-read-replica. - Update
DATABASE_URLsecret in secondary region to the new writer endpoint. - Restart API pods:
kubectl rollout restart deployment -n aisoc -l app.kubernetes.io/name=api.
# Bump image tag in CI/CD (GitHub Actions), then:
helm upgrade aisoc infra/helm/aisoc \
--namespace aisoc \
--atomic \
--timeout 5m \
--set services.api.image.tag=${GIT_SHA} \
--set services.ingest.image.tag=${GIT_SHA}maxUnavailable: 0 and maxSurge: 1 are enforced in deployment.yaml; pods are updated one at a time.
- Deploy new version to a parallel namespace (
aisoc-green). - Run smoke tests against
greeningress host. - Switch LB to
greennamespace via weighted routing. - Keep
blueidle for 1 hour (rollback window). - Delete
bluenamespace.
helm rollback aisoc 0 --namespace aisoc # 0 = previous release
# or target a specific revision:
helm history aisoc --namespace aisoc
helm rollback aisoc <REVISION> --namespace aisocAiSOC emits OpenTelemetry traces, metrics, and structured logs to a configurable OTLP endpoint (global.otelEndpoint in values.yaml).
| Service | Metric | SLO target |
|---|---|---|
api |
http_request_duration_p99 |
< 500 ms |
api |
http_error_rate |
< 0.5 % |
ingest |
event_ingestion_lag_p99 |
< 2 s |
alert-fusion |
alert_fusion_latency_p99 |
< 5 s |
agents |
agent_run_duration_p95 |
< 30 s |
| All | Pod ready ratio | > 99 % |
- Service map: trace-based topology from OTLP backend (Tempo, Jaeger, Honeycomb).
- Golden signals: per-service latency / error / saturation / traffic (Grafana
aisoc-golden-signals.json). - SLA tracker: AiSOC built-in
/sladashboard (/apps/web/src/app/(app)/sla/page.tsx).
# Example PrometheusRule
- alert: AiSOCHighErrorRate
expr: |
rate(http_requests_total{service="aisoc-api",status=~"5.."}[5m])
/ rate(http_requests_total{service="aisoc-api"}[5m]) > 0.005
for: 3m
labels:
severity: page
annotations:
summary: "AiSOC API error rate > 0.5%"
- alert: AiSOCIngestLag
expr: histogram_quantile(0.99, rate(ingest_lag_seconds_bucket[5m])) > 2
for: 5m
labels:
severity: warnRunbooks are auto-generated from live OTel trace data by scripts/generate_runbook.py. The output lives in docs/operations/runbooks/. Each runbook follows the format:
RB-NNN-<slug>.md
Title
Trigger condition
Impact assessment
Diagnosis steps (from trace topology)
Remediation steps
Verification steps
Escalation path
| ID | Slug | Trigger |
|---|---|---|
| RB-001 | api-high-latency |
http_request_duration_p99 > 500ms |
| RB-002 | postgres-replica-lag |
Replication lag > 30 s |
| RB-003 | region-failover |
Region health check failure |
| RB-004 | ingest-pipeline-stall |
Ingest lag > 30 s |
| RB-005 | agent-runner-oom |
OOMKilled in agents pods |
| RB-006 | cert-expiry |
TLS cert expires in < 14 days |
To regenerate all runbooks:
OTEL_ENDPOINT=http://tempo:4317 \
python scripts/generate_runbook.py \
--output docs/operations/runbooks/ \
--lookback-hours 168 # 1 week of traces| Failure scenario | RPO | RTO |
|---|---|---|
| Single pod crash | 0 s | < 30 s (K8s restarts) |
| Availability zone failure | < 30 s | < 5 min |
| Region failure (warm standby) | < 60 s | < 15 min |
| Region failure (cold DR) | < 5 min | < 60 min |
| Total data loss (from backup) | 24 h | < 4 h |
Run monthly in the staging environment:
- Kill 50 % of
apipods; confirm remaining pods handle load and HPA scales up. - Inject 500 ms network latency to
postgres; confirm circuit breaker opens. - Stop Kafka consumer group for
ingest; confirm lag alert fires within 5 min. - Simulate AZ failure by cordoning one node group; confirm pods reschedule.
- Promote Postgres replica; confirm API recovers within RTO.
- Delete and restore from backup; confirm RPO.
| Severity | First responder | Escalate to | SLA |
|---|---|---|---|
| P1 – Production down | On-call SRE (PagerDuty) | Engineering lead | 15 min response |
| P2 – Degraded performance | On-call SRE | Engineering lead | 1 h response |
| P3 – Non-critical issue | Slack #aisoc-ops |
— | Next business day |
| P4 – Informational | Ticketing system | — | Best effort |
This document is maintained alongside the codebase. Run scripts/generate_runbook.py --update-toc to refresh section links after adding runbooks.