AiSOC Multi-Region Operations

Audience: Platform / SRE teams running AiSOC in production across multiple cloud regions.
Last updated: auto-generated (see scripts/generate_runbook.py)

Architecture overview
Region topology
Data residency & replication strategy
Traffic routing & failover
Deployment procedures
Observability & alerting
Runbooks
Recovery objectives (RTO / RPO)
Chaos engineering checklist
Contact & escalation matrix

1. Architecture overview

AiSOC is deployed as a set of independent microservices managed by Helm. In a multi-region setup each region runs a full replica of the control plane with:

Active–passive PostgreSQL: one writer in the primary region; read replicas in secondary regions promoted on failover.
Active–active ClickHouse: distributed cluster with per-shard replicas across regions; ZooKeeper or ClickHouse Keeper runs in every region.
Active–active ingest pipeline: events are fanned out to the Kafka cluster in every region; correlation happens locally.
Global load balancer (e.g. Cloudflare, AWS Global Accelerator, or GCP Traffic Director) directing API traffic to the nearest healthy region.

                ┌──────────────────────────────────────────────────────┐
                │             Global Load Balancer / Anycast DNS       │
                └───────────┬──────────────────────┬───────────────────┘
                            │                      │
              ┌─────────────▼──────────┐  ┌────────▼──────────────┐
              │  Region: us-east-1      │  │  Region: eu-west-1    │
              │ ─────────────────────  │  │ ──────────────────── │
              │  Kubernetes cluster    │  │  Kubernetes cluster   │
              │  ├─ api (×2)           │  │  ├─ api (×2)          │
              │  ├─ ingest (×3)        │  │  ├─ ingest (×3)       │
              │  ├─ enrichment (×2)    │  │  ├─ enrichment (×2)   │
              │  ├─ alert-fusion (×2)  │  │  ├─ alert-fusion (×2) │
              │  └─ agents (×2)        │  │  └─ agents (×2)       │
              │                        │  │                        │
              │  PostgreSQL PRIMARY ─────────► PostgreSQL REPLICA  │
              │  ClickHouse shard 1    │  │  ClickHouse shard 2   │
              │  Redis (leader)  ────────►  Redis (replica)       │
              └────────────────────────┘  └───────────────────────┘

2. Region topology

Region label	Cloud / zone	Role	Postgres	ClickHouse shards
`us-east-1`	AWS us-east-1a/b	Primary	Writer	Shard 1 (1 replica each)
`eu-west-1`	AWS eu-west-1a/b	Secondary	Async replica	Shard 2 (1 replica each)
`ap-southeast-1`	AWS ap-southeast-1a	DR-only	Async replica	—

Adding a new region

# 1. Provision cluster (Terraform / eksctl / etc.)
# 2. Install cert-manager, nginx-ingress, external-secrets
# 3. Deploy AiSOC chart pointing to existing secrets store
helm upgrade --install aisoc infra/helm/aisoc \
  --namespace aisoc \
  --create-namespace \
  --set global.environment=production \
  --set 'ingress.hosts[0].host=aisoc-eu.example.com' \
  -f infra/helm/aisoc/values-eu-west-1.yaml

# 4. Register region in Global LB (health-check /api/health)
# 5. Seed the new Postgres replica with pg_basebackup, then start WAL streaming
# 6. Extend ClickHouse cluster config to include new shard

3. Data residency & replication strategy

PostgreSQL

Streaming replication (wal_level=replica, max_wal_senders=5).
Replication lag target: < 5 s. Alert at 30 s, page at 2 min.
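Failover: automatic with Patroni or a managed service such as RDS Multi-AZ (Multi-AZ covers AZ failures within a region; cross-region promotion is described in §4). The promoted replica becomes the writer; the old writer rejoins as a standby once it recovers.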
GDPR: tenants with EU data residency requirements are assigned to the eu-west-1 writer via per-tenant routing in tenant_sla_config.
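The lag thresholds above can be expressed as a small classifier. This is a minimal sketch, not AiSOC code: `classify_replication_lag` and the severity names are hypothetical, and treating each threshold as inclusive (lag of exactly 30 s alerts) is an assumption. Only the 5 s / 30 s / 2 min values come from this section.

```python
from enum import Enum


class LagSeverity(Enum):
    OK = "ok"              # within the < 5 s target
    ELEVATED = "elevated"  # above target, below the alert threshold
    ALERT = "alert"        # 30 s or more
    PAGE = "page"          # 2 min or more


def classify_replication_lag(lag_seconds: float) -> LagSeverity:
    """Map a replication lag measurement to a severity (hypothetical helper)."""
    if lag_seconds < 0:
        raise ValueError("lag_seconds must be non-negative")
    if lag_seconds >= 120:
        return LagSeverity.PAGE
    if lag_seconds >= 30:
        return LagSeverity.ALERT
    if lag_seconds >= 5:
        return LagSeverity.ELEVATED
    return LagSeverity.OK
```

Checking the highest threshold first keeps each branch a single comparison and makes the page condition take precedence over the alert condition.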

ClickHouse

Distributed table events_dist over all shards.
Each shard has a replica in the same region; cross-region replication is coordinated through the ZooKeeper / ClickHouse Keeper quorum.
Replication lag target: < 10 s. Alert at 60 s.

Redis

Read replicas in secondary regions for cache warming; Redis Sentinel provides HA within a region.
Session data: replicated. Ephemeral rate-limit keys: local only.

Object storage (S3/R2)

Backup objects replicated to a second bucket in an alternate region via bucket replication rules.
Plugin artifacts: single bucket with multi-region access enabled.

4. Traffic routing & failover

Healthy-region selection

The global LB runs active health checks every 10 s against GET /api/health. A region is removed from rotation if:

HTTP status ≠ 200 for 3 consecutive checks, or
Latency p99 > 2 s for 5 consecutive checks.
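The two removal conditions amount to a pair of consecutive-failure counters. The sketch below illustrates that logic only; it is not how any particular load balancer implements it. `RegionHealth` and its fields are hypothetical names, and the sketch assumes a single healthy check resets the matching counter, which the section does not specify.

```python
from dataclasses import dataclass


@dataclass
class RegionHealth:
    """Tracks consecutive failing checks for one region (illustrative only)."""

    status_failures: int = 0   # consecutive checks with HTTP status != 200
    latency_failures: int = 0  # consecutive checks with p99 latency > 2 s

    def record(self, http_status: int, p99_latency_s: float) -> bool:
        """Record one health check; return True if the region stays in rotation."""
        # A passing check resets the counter (assumption, see lead-in).
        self.status_failures = self.status_failures + 1 if http_status != 200 else 0
        self.latency_failures = self.latency_failures + 1 if p99_latency_s > 2.0 else 0
        return self.in_rotation

    @property
    def in_rotation(self) -> bool:
        # Removed after 3 consecutive bad statuses or 5 consecutive slow checks.
        return self.status_failures < 3 and self.latency_failures < 5
```

With checks every 10 s, three consecutive 503 responses take the region out of rotation on the third check, roughly 30 s after the first failure:

```python
region = RegionHealth()
results = [region.record(503, 0.1) for _ in range(3)]
# results == [True, True, False]
```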

Planned failover (maintenance)

# Drain traffic from us-east-1 before maintenance window
# 1. Weight us-east-1 to 0 in LB config
# 2. Wait for in-flight requests to drain (~60 s)
# 3. Perform maintenance
# 4. Restore weight

# Cloudflare example
cf_zone_id="<ZONE_ID>"
cf_record_id="<RECORD_ID>"
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/${cf_zone_id}/dns_records/${cf_record_id}" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"data":{"weight":0}}'

Unplanned failover

PagerDuty alert fires (region_health_check rule).
On-call SRE confirms outage (./scripts/health_check.sh --region us-east-1).
Execute runbook RB-003-region-failover (auto-generated; see §7).
Postgres: promote replica via Patroni or aws rds promote-read-replica.
Update DATABASE_URL secret in secondary region to the new writer endpoint.
Restart API pods: kubectl rollout restart deployment -n aisoc -l app.kubernetes.io/name=api.

5. Deployment procedures

Rolling update (standard)

# Bump image tag in CI/CD (GitHub Actions), then:
helm upgrade aisoc infra/helm/aisoc \
  --namespace aisoc \
  --atomic \
  --timeout 5m \
  --set services.api.image.tag=${GIT_SHA} \
  --set services.ingest.image.tag=${GIT_SHA}

maxUnavailable: 0 and maxSurge: 1 are enforced in deployment.yaml; pods are updated one at a time.
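To see why these settings update pods one at a time without losing capacity, the rollout can be simulated. This is a simplified model, not the Kubernetes controller: it assumes every new pod becomes ready immediately, and `rolling_update_steps` is a hypothetical helper.

```python
from typing import Iterator, Tuple


def rolling_update_steps(
    replicas: int, max_surge: int = 1, max_unavailable: int = 0
) -> Iterator[Tuple[int, int]]:
    """Yield (old_pods, new_pods) after each change in a simplified rolling update."""
    if replicas < 1 or max_surge < 0 or max_unavailable < 0:
        raise ValueError("replicas must be >= 1 and limits must be non-negative")
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("max_surge and max_unavailable cannot both be 0")
    old, new = replicas, 0
    yield old, new
    while old > 0 or new < replicas:
        # Scale up: total pods may not exceed replicas + max_surge.
        surge = min(replicas + max_surge - (old + new), replicas - new)
        if surge > 0:
            new += surge
            yield old, new
        # Scale down: total pods may not drop below replicas - max_unavailable.
        removable = min(old, (old + new) - (replicas - max_unavailable))
        if removable > 0:
            old -= removable
            yield old, new
```

For the `api` deployment (2 replicas) the default arguments give `[(2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]`: the total never falls below 2 and never exceeds 3.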

Blue/green release

Deploy new version to a parallel namespace (aisoc-green).
Run smoke tests against green ingress host.
Switch LB to green namespace via weighted routing.
Keep blue idle for 1 hour (rollback window).
Delete blue namespace.

Rollback

helm rollback aisoc 0 --namespace aisoc   # 0 = previous release
# or target a specific revision:
helm history aisoc --namespace aisoc
helm rollback aisoc <REVISION> --namespace aisoc

6. Observability & alerting

AiSOC emits OpenTelemetry traces, metrics, and structured logs to a configurable OTLP endpoint (global.otelEndpoint in values.yaml).

Key SLIs

Service	Metric	SLO target
`api`	`http_request_duration_p99`	< 500 ms
`api`	`http_error_rate`	< 0.5 %
`ingest`	`event_ingestion_lag_p99`	< 2 s
`alert-fusion`	`alert_fusion_latency_p99`	< 5 s
`agents`	`agent_run_duration_p95`	< 30 s
All	Pod ready ratio	> 99 %
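The per-service targets in the table can be checked mechanically, for example in a release gate. The following is a rough sketch under stated assumptions: `SLO_TARGETS` and `slo_violations` are hypothetical names, measurements are assumed to arrive in the units shown in the table, and the cluster-wide pod ready ratio is left out because it is not tied to a single service.

```python
import operator
from typing import Callable, Dict, List, Tuple

MetricKey = Tuple[str, str]  # (service, metric)

# Targets copied from the SLI table; units are noted per entry.
SLO_TARGETS: Dict[MetricKey, Tuple[Callable[[float, float], bool], float]] = {
    ("api", "http_request_duration_p99"): (operator.lt, 500.0),        # ms
    ("api", "http_error_rate"): (operator.lt, 0.5),                    # percent
    ("ingest", "event_ingestion_lag_p99"): (operator.lt, 2.0),         # s
    ("alert-fusion", "alert_fusion_latency_p99"): (operator.lt, 5.0),  # s
    ("agents", "agent_run_duration_p95"): (operator.lt, 30.0),         # s
}


def slo_violations(measurements: Dict[MetricKey, float]) -> List[MetricKey]:
    """Return the (service, metric) pairs whose measured value misses its target."""
    violations = []
    for key, value in measurements.items():
        if key not in SLO_TARGETS:
            continue  # unknown metrics are ignored in this sketch
        meets_target, threshold = SLO_TARGETS[key]
        if not meets_target(value, threshold):
            violations.append(key)
    return sorted(violations)
```

Because the targets are strict inequalities, a measured error rate of exactly 0.5 % counts as a violation.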

Recommended dashboards

Service map: trace-based topology from OTLP backend (Tempo, Jaeger, Honeycomb).
Golden signals: per-service latency / error / saturation / traffic (Grafana aisoc-golden-signals.json).
SLA tracker: AiSOC built-in /sla dashboard (/apps/web/src/app/(app)/sla/page.tsx).

Alerting rules (Prometheus/AlertManager)

# Example PrometheusRule
- alert: AiSOCHighErrorRate
  expr: |
    sum(rate(http_requests_total{service="aisoc-api",status=~"5.."}[5m]))
    / sum(rate(http_requests_total{service="aisoc-api"}[5m])) > 0.005
  for: 3m
  labels:
    severity: page
  annotations:
    summary: "AiSOC API error rate > 0.5%"

- alert: AiSOCIngestLag
  expr: histogram_quantile(0.99, sum by (le) (rate(ingest_lag_seconds_bucket[5m]))) > 2
  for: 5m
  labels:
    severity: warn
  annotations:
    summary: "AiSOC ingest lag p99 > 2 s"

7. Runbooks

Runbooks are auto-generated from live OTel trace data by scripts/generate_runbook.py. The output lives in docs/operations/runbooks/. Each runbook follows the format:

RB-NNN-<slug>.md
  Title
  Trigger condition
  Impact assessment
  Diagnosis steps (from trace topology)
  Remediation steps
  Verification steps
  Escalation path
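The naming scheme and section list above are easy to validate in CI. The helpers below are a hypothetical sketch and are not part of scripts/generate_runbook.py; the slug pattern (lowercase words joined by hyphens) is inferred from the listed runbooks, and the section check is a crude case-insensitive substring match.

```python
import re
from typing import List

RUNBOOK_SECTIONS = [
    "Title",
    "Trigger condition",
    "Impact assessment",
    "Diagnosis steps",
    "Remediation steps",
    "Verification steps",
    "Escalation path",
]

# Assumed slug shape: lowercase alphanumeric words separated by single hyphens.
_SLUG_RE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def runbook_filename(number: int, slug: str) -> str:
    """Build an RB-NNN-<slug>.md file name (hypothetical helper)."""
    if not 1 <= number <= 999:
        raise ValueError("runbook number must fit in three digits")
    if not _SLUG_RE.fullmatch(slug):
        raise ValueError(f"invalid slug: {slug!r}")
    return f"RB-{number:03d}-{slug}.md"


def missing_sections(runbook_text: str) -> List[str]:
    """List the expected sections that do not appear anywhere in the text."""
    lowered = runbook_text.lower()
    return [s for s in RUNBOOK_SECTIONS if s.lower() not in lowered]
```

For example, `runbook_filename(3, "region-failover")` returns `"RB-003-region-failover.md"`.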

Available runbooks

ID	Slug	Trigger
RB-001	`api-high-latency`	`http_request_duration_p99 > 500ms`
RB-002	`postgres-replica-lag`	Replication lag > 30 s
RB-003	`region-failover`	Region health check failure
RB-004	`ingest-pipeline-stall`	Ingest lag > 30 s
RB-005	`agent-runner-oom`	OOMKilled in `agents` pods
RB-006	`cert-expiry`	TLS cert expires in < 14 days

To regenerate all runbooks:

OTEL_ENDPOINT=http://tempo:4317 \
  python scripts/generate_runbook.py \
  --output docs/operations/runbooks/ \
  --lookback-hours 168    # 1 week of traces

8. Recovery objectives (RTO / RPO)

Failure scenario	RPO	RTO
Single pod crash	0 s	< 30 s (K8s restarts)
Availability zone failure	< 30 s	< 5 min
Region failure (warm standby)	< 60 s	< 15 min
Region failure (cold DR)	< 5 min	< 60 min
Total data loss (from backup)	24 h	< 4 h
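After a failover drill, the observed data loss and downtime can be compared against this table. The sketch below is illustrative only: the scenario keys, `Bound`, and `meets_objectives` are hypothetical names. Entries written with "<" in the table are treated as strict bounds, and entries without it (0 s, 24 h) as inclusive.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict, Tuple


@dataclass(frozen=True)
class Bound:
    limit: timedelta
    strict: bool = True  # True means "< limit", False means "<= limit"

    def allows(self, observed: timedelta) -> bool:
        return observed < self.limit if self.strict else observed <= self.limit


# scenario -> (RPO, RTO), copied from the table above
OBJECTIVES: Dict[str, Tuple[Bound, Bound]] = {
    "single_pod_crash": (Bound(timedelta(0), strict=False), Bound(timedelta(seconds=30))),
    "az_failure": (Bound(timedelta(seconds=30)), Bound(timedelta(minutes=5))),
    "region_failure_warm": (Bound(timedelta(seconds=60)), Bound(timedelta(minutes=15))),
    "region_failure_cold": (Bound(timedelta(minutes=5)), Bound(timedelta(minutes=60))),
    "total_data_loss": (Bound(timedelta(hours=24), strict=False), Bound(timedelta(hours=4))),
}


def meets_objectives(scenario: str, data_loss: timedelta, downtime: timedelta) -> bool:
    """Return True if a drill result satisfies both the RPO and the RTO."""
    rpo, rto = OBJECTIVES[scenario]
    return rpo.allows(data_loss) and rto.allows(downtime)
```

A warm-standby region failover that loses 20 s of writes and takes 12 min to recover passes; the same drill with 15 min of downtime does not, because the RTO is a strict bound.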

9. Chaos engineering checklist

Run monthly in the staging environment:

Kill 50 % of api pods; confirm remaining pods handle load and HPA scales up.
Inject 500 ms network latency to postgres; confirm circuit breaker opens.
Stop Kafka consumer group for ingest; confirm lag alert fires within 5 min.
Simulate AZ failure by cordoning one node group; confirm pods reschedule.
Promote Postgres replica; confirm API recovers within RTO.
Delete and restore from backup; confirm RPO.

10. Contact & escalation matrix

Severity	First responder	Escalate to	SLA
P1 – Production down	On-call SRE (PagerDuty)	Engineering lead	15 min response
P2 – Degraded performance	On-call SRE	Engineering lead	1 h response
P3 – Non-critical issue	Slack `#aisoc-ops`	—	Next business day
P4 – Informational	Ticketing system	—	Best effort

This document is maintained alongside the codebase. Run scripts/generate_runbook.py --update-toc to refresh section links after adding runbooks.
