Operator documentation for deploying kodiai with graceful shutdown and webhook queue replay.
When Azure sends SIGTERM (during a deploy or scale-down), kodiai follows this sequence:

- SIGTERM received -- shutdown manager sets `isShuttingDown = true`
- Stop new work -- new webhook dispatches are blocked; incoming webhooks are queued to PostgreSQL (the `webhook_queue` table) instead of being dispatched
- Drain in-flight work -- wait for active HTTP requests and background jobs to complete within the grace window
- Grace window -- controlled by `SHUTDOWN_GRACE_MS` (default: 300000 ms = 5 minutes)
  - If drain times out, the grace window extends once (doubles to 10 minutes)
  - If the extended drain also times out, the process force-exits with code 1, logging abandoned work counts
- Clean exit -- close the PostgreSQL connection pool, exit with code 0
- Next startup -- queued webhooks are replayed sequentially before accepting new traffic
Check for active reviews in container logs:

```shell
az containerapp logs show -n ca-kodiai -g rg-kodiai --tail 50 | grep "Job execution started"
```

If active reviews are running, consider waiting for them to complete, or accept that they will be abandoned if they exceed the grace window.
Run the deploy script:

```shell
./deploy.sh
```

This script:

- Validates required environment variables (including `DATABASE_URL`)
- Builds the container image via Azure Container Registry (remote build)
- Updates secrets and environment variables
- Creates a new revision with the updated image
- Runs a post-deploy health check against `/healthz`
Azure then:
- Sends SIGTERM to the old revision
- Waits for the termination grace period (330 seconds) before force-killing
- Starts the new revision
- Runs the startup probe (`/healthz`, every 5s, up to 40 failures = ~200s for cold start with queue replay)
With a single replica, there is a brief downtime gap between the old revision shutting down and the new revision becoming ready. This is acceptable because:
- GitHub retries webhooks that receive non-2xx responses
- Slack retries event deliveries that fail
Webhooks arriving during the drain period are queued to PostgreSQL and replayed on the next startup.
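The sequential replay on startup can be sketched as below. This is a hedged model: the real implementation reads `webhook_queue` rows from PostgreSQL ordered by `queued_at`; here an in-memory array stands in for the table, and `replayQueue`/`dispatch` are illustrative names. Only the `pending`/`processing` statuses and the `queuedWebhooksProcessed` count come from this document.

```typescript
// Sequential startup replay: process pending rows one at a time, marking
// each 'processing' while in flight and 'done' on success. Rows stand in
// for webhook_queue table entries.
type QueueRow = { id: number; status: "pending" | "processing" | "done" };

async function replayQueue(
  rows: QueueRow[],
  dispatch: (row: QueueRow) => Promise<void>,
): Promise<number> {
  let processed = 0;
  for (const row of rows.filter((r) => r.status === "pending")) {
    row.status = "processing"; // real app: UPDATE webhook_queue SET status='processing'
    await dispatch(row);       // replay the stored payload to the normal handler
    row.status = "done";
    processed++;               // reported as queuedWebhooksProcessed in the logs
  }
  return processed;
}
```

Replaying sequentially (one `await` per row) preserves delivery order and keeps a crash mid-replay recoverable: unfinished rows remain `pending` or `processing` in the table.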
- Verify the health endpoint returns 200:

  ```shell
  curl -s https://<FQDN>/healthz | jq .
  # Expected: { "status": "ok", "db": "connected" }
  ```

- Check container logs for the startup summary:

  ```shell
  az containerapp logs show -n ca-kodiai -g rg-kodiai --tail 20
  # Look for: "Startup webhook queue replay complete" with the queuedWebhooksProcessed count
  ```

Symptom: Container takes a long time to shut down, and logs show "Drain timeout, extending grace window once".
Cause: Long-running review jobs have not completed within the grace window.
Resolution:
- Check logs for the active job counts: `activeRequests`, `activeJobs`, `activeTotal`
- If this happens frequently, consider reducing `SHUTDOWN_GRACE_MS` to force earlier exits
- The extended grace window (2x) gives additional time, after which the process force-exits
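The extend-once drain behavior can be sketched as below. `activeTotal` is the counter named in the logs; the polling loop, function names, and short timings are illustrative assumptions (the real implementation presumably reacts to events rather than polling).

```typescript
// Drain with a single grace-window extension: wait up to graceMs for
// active work to hit zero; on timeout, double the window once, then
// give up. A sketch, not kodiai's actual shutdown manager.
async function drain(
  activeTotal: () => number,
  graceMs: number,
): Promise<"clean" | "forced"> {
  const waitUntilIdle = async (deadline: number): Promise<boolean> => {
    while (Date.now() < deadline) {
      if (activeTotal() === 0) return true;
      await new Promise((r) => setTimeout(r, 5)); // poll; real code is event-driven
    }
    return activeTotal() === 0;
  };
  if (await waitUntilIdle(Date.now() + graceMs)) return "clean";
  // "Drain timeout, extending grace window once" -- window doubles
  if (await waitUntilIdle(Date.now() + graceMs * 2)) return "clean";
  return "forced"; // real app: log abandoned counts, process.exit(1)
}
```

With the documented default, the two windows are 5 minutes and then 10 minutes, after which exit code 1 signals abandoned work.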
Symptom: After deploy, webhooks from the drain period are not being processed.
Cause: Startup replay may have failed or there were no queued webhooks.
Resolution:
- Check logs for "Dequeued pending webhooks for replay" or "Startup webhook queue replay complete"
- Check the `webhook_queue` table for rows with status = 'pending' or 'processing'
- If rows are stuck in 'processing', they may need a manual reset to 'pending'
```sql
-- Check queued webhooks
SELECT id, source, status, queued_at FROM webhook_queue ORDER BY queued_at DESC LIMIT 20;

-- Reset stuck processing entries (if needed)
UPDATE webhook_queue SET status = 'pending' WHERE status = 'processing';
```

Symptom: /healthz returns 503 with { "status": "unhealthy", "db": "unreachable" }.
Cause: PostgreSQL connection is not working.
Resolution:
- Verify `DATABASE_URL` is set correctly in the container app secrets
- Check that the PostgreSQL server is running and accessible from the container network
- Check connection pool limits (max 10 connections configured)
- Review container logs for PostgreSQL connection errors at startup
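The liveness check behind this symptom can be sketched as follows. The response bodies match the ones documented above; the handler shape and the injected `query` function are illustrative (the real app presumably calls `pool.query` on a pg pool).

```typescript
// Liveness check: a trivial SELECT 1 proves the pool can reach PostgreSQL.
// Returns 200 { status: "ok", db: "connected" } on success, otherwise
// 503 { status: "unhealthy", db: "unreachable" }. A sketch only.
type HealthResult = { code: number; body: { status: string; db: string } };

async function healthz(
  query: (sql: string) => Promise<unknown>,
): Promise<HealthResult> {
  try {
    await query("SELECT 1"); // cheap round-trip through the connection pool
    return { code: 200, body: { status: "ok", db: "connected" } };
  } catch {
    return { code: 503, body: { status: "unhealthy", db: "unreachable" } };
  }
}
```

Because the check goes through the same pool as real traffic, pool exhaustion (the 10-connection limit above) can surface here as a 503 even when the server itself is up.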
If a deploy goes wrong, roll back to the previous revision:
```shell
# List revisions
az containerapp revision list -n ca-kodiai -g rg-kodiai -o table

# Route all traffic to the previous revision
az containerapp ingress traffic set \
  -n ca-kodiai -g rg-kodiai \
  --revision-weight <previous-revision-name>=100
```

| Variable | Required | Default | Description |
|---|---|---|---|
| `DATABASE_URL` | Yes | -- | PostgreSQL connection string |
| `SHUTDOWN_GRACE_MS` | No | 300000 (5 min) | Grace window for drain before force-exit |
All other required environment variables are documented in deploy.sh.
| Message | Meaning |
|---|---|
| `Shutdown signal received, starting graceful drain` | SIGTERM/SIGINT received, drain starting |
| `Graceful drain completed successfully` | All in-flight work finished, clean exit |
| `Drain timeout, extending grace window once` | First drain timed out, extending |
| `Force exit after extended grace timeout, work abandoned` | Both drain attempts failed |
| `Webhook queued to PostgreSQL for drain-time replay` | Webhook saved during shutdown |
| `Dequeued pending webhooks for replay` | Startup found queued webhooks |
| `Startup webhook queue replay complete` | All queued webhooks processed |
| `Health check failed: PostgreSQL unreachable` | DB connection issue on liveness probe |
| Code | Meaning |
|---|---|
| 0 | Clean shutdown -- all work drained, DB closed |
| 1 | Forced shutdown -- extended grace timeout exceeded, some work abandoned |
| Endpoint | Type | What it checks |
|---|---|---|
| `/healthz` | Liveness | Process up + PostgreSQL `SELECT 1` |
| `/readiness` | Readiness | GitHub API connectivity |
| `/health` | Alias | Same as `/healthz` (backward compatibility) |