Skip to content

fix(bootstrap): improve startup reliability for multi-replica deploys#4444

Open
gandhipratik203 wants to merge 2 commits intomainfrom
fix/4051-alembic-pgbouncer-transaction-pool
Open

fix(bootstrap): improve startup reliability for multi-replica deploys#4444
gandhipratik203 wants to merge 2 commits intomainfrom
fix/4051-alembic-pgbouncer-transaction-pool

Conversation

@gandhipratik203
Copy link
Copy Markdown
Collaborator

@gandhipratik203 gandhipratik203 commented Apr 26, 2026

Summary

Fixes issue #4051 — Alembic migration advisory lock hangs when multiple gateway replicas start through PgBouncer in transaction-pooling mode. The session-scoped pg_advisory_lock is orphaned across PgBouncer's backend handoffs; the N-th gateway pod sees the lock held by an effectively-dead session and spins in its retry loop until the ~10-minute timeout fires.

This PR closes the bug along three pieces and ships them together so the chart is safe under both deployment patterns:

  • Layer 1 (correctness)bootstrap_db.main() now probes alembic_version before attempting the advisory lock. When the schema is at the script directory's head, the lock acquisition is skipped entirely. Replicas 2..N never participate in the race.
  • Layer 2 (operator ergonomics) — new MCPGATEWAY_SKIP_MIGRATIONS settings flag suppresses the in-pod bootstrap_db.main() call entirely. The mcp-stack Helm chart wires it from migration.enabled so the contract "the pre-install Job owns migrations, app pods skip" is enforced at the chart layer rather than left to operators to coordinate.
  • Schema-at-head startup probe (chart-level Ready gate) — gateway pod's startup probe now runs python -m mcpgateway.utils.check_schema_at_head instead of db_isready. Pods refuse Ready until alembic_version matches the script-directory head, regardless of which entity (pre-install Job, post-install Job, init container, external CD pipeline) ran the migration. Defense in depth: even if L2 mis-fires, no pod ever serves traffic against a half-migrated DB.

Library defaults preserve backward compatibility: a direct docker run mcpgateway:latest continues to bootstrap its own schema. Only the chart and (future) compose overlays opt into SKIP_MIGRATIONS=true.

Related: existing migration.hostKey: host / portKey: port knobs in charts/mcp-stack/profiles/ocp/values-pgo.yaml already let the migration Job bypass PgBouncer; that workaround stays in place but is no longer load-bearing.

Approach

The fix was built reproduction-first: rather than implement against the issue text alone, the bug was first reproduced deterministically in a throwaway docker-compose harness (inlined in the Test results section below) and then again on a real OpenShift cluster against CrunchyData PGO + transaction-pool PgBouncer. Both reproductions are documented in todo/issue-4051-ocp-reproduction.md, and the compose-side stack lives permanently in the repo as a test fixture at tests/integration/fixtures/transaction_pool/.

For each layer (L1 fast-path, L2 chart-wired skip), the cycle was:

   1. write a failing test that captures the invariant ──────► RED
   2. implement just enough to make it pass ─────────────────► GREEN
   3. verify on the OCP cluster against pre-fix and post-fix
      images, side-by-side, on the same stack ──────────────► EVIDENCE
   4. commit

The regression test for the bug — test_bootstrap_db_skips_lock_when_schema_already_at_head — is reliably RED pre-fix (240s timeout, 29/60 retry attempts visible in logs) and reliably GREEN post-fix (~1.4s). It's not a flaky timing-based test: it synthesizes the held-lock condition deterministically by holding the advisory lock in a distinct, still-alive Postgres session for the duration of the second bootstrap_db.main() call. That guarantees the bootstrap's PgBouncer-side session sees pg_try_advisory_lock return FALSE regardless of PgBouncer's internal backend assignment — without that guarantee, PgBouncer can hand the bootstrap the same server backend that the holder used, and pg_try_advisory_lock succeeds via PostgreSQL's reentrant-within-session semantics, masking the bug. (We landed on this design after observing a passing test that should have failed; the false-positive is documented as test_reentrant_acquire_through_same_pgbouncer_is_not_a_counter_example so a future reader who hits the same gotcha understands what they're seeing.)


Gaps closed

HIGH — Bug correctness

Gap 1 (HIGH) — Advisory lock orphan via PgBouncer transaction-pool handoffs: bootstrap_db.advisory_lock always called pg_try_advisory_lock and retried up to 60 times with exponential backoff. When PgBouncer reused a backend whose previous client took the lock, new clients saw it held by an unreachable session and spun until timeout. Fixed with a _alembic_at_head probe that runs before the lock — when the DB is at the script-directory head, the lock is never attempted. The probe uses Alembic's MigrationContext.get_current_heads() and ScriptDirectory.from_config(cfg).get_heads() for the canonical source of truth; any error during probing falls through to the existing slow path so empty / partial / out-of-date databases still go through the explicit handling.

Gap 2 (HIGH) — Worker-level fan-out under-stated in the issue text: each gateway pod runs N gunicorn workers (default 8 in the OCP profile), and each worker calls bootstrap_db.main() in its own lifespan. So a 3-replica deployment produces 24 racing acquirers, not 3. Reproduction on OpenShift confirmed the fan-out: pre-fix, the same pod's logs show interleaved Acquired lock on attempt N and Lock held by another instance, attempt N/60 at the same timestamps — different workers in the same gunicorn process. The L1 fast-path eliminates the race for replicas 2..N entirely; for replica 1 the within-pod race is bounded to a sub-second seed window.

Gap 3 (HIGH) — Chart safety today depends on the OCP-only workaround: migration.hostKey: host / portKey: port route the pre-install Job direct to Postgres (bypassing PgBouncer), masking the bug whenever the Job is enabled. A community user setting migration.enabled=false (because they manage migrations externally, or run the chart in compose-flavoured contexts) loses that protection. Layer 1 makes migration.enabled=false safe.

HIGH — Operator ergonomics

Gap 4 (HIGH) — In-pod bootstrap runs even when a dedicated migration runner has already populated the schema: redundant probe activity per worker, plus the within-pod race for replica 1's seed window. Fixed with MCPGATEWAY_SKIP_MIGRATIONS=true which short-circuits the lifespan call to bootstrap_db.main() and emits a single audit-log line so operators can see the choice in pod logs.

Gap 5 (HIGH) — No chart wiring for the SKIP/Job pairing: even if the env var existed, expecting operators to set both migration.enabled=true AND MCPGATEWAY_SKIP_MIGRATIONS=true is a foot-gun (forgetting one means redundant in-pod bootstrap; forgetting the other means schema never populated). Fixed in charts/mcp-stack/templates/deployment-mcpgateway.yaml: the env var is rendered as {{ ternary "true" "false" .Values.migration.enabled }} and placed after extraEnv so a user override cannot accidentally undo the contract.

MEDIUM — Test infrastructure

Gap 6 (MEDIUM) — No regression test for the advisory-lock invariant: no existing test exercised concurrent bootstrap_db.main() callers through PgBouncer. Fixed with tests/integration/test_migrations_under_transaction_pool.py containing five @pytest.mark.integration tests: three pinning the PgBouncer mechanism (so the substrate behavior is captured for posterity even if PgBouncer changes), one regression gate (synthesizes the orphaned-lock condition and asserts bootstrap_db.main() returns within 10s — fails in 240s pre-fix), and one idempotency invariant (asserts a second call after schema is at head never enters the advisory_lock context manager). Tests skip cleanly without --with-integration.

Gap 7 (MEDIUM) — No reproduction artifact in the repo: future debuggers had nothing to anchor on. Fixed by promoting the reproduction's compose stack to a permanent test fixture at tests/integration/fixtures/transaction_pool/docker-compose.yml (driven by the integration suite). A throwaway shell harness (demonstrate_orphan.sh proving the PgBouncer mechanism in 3 psql calls; reproduce.sh scaling gateway replicas) lived on the branch during development and was removed in the squashed commit; the script contents are inlined verbatim in the Test results section below so a future reader can rebuild the reproduction without checking out an old commit.

LOW — Documentation and observability

Gap 8 (LOW) — Two readiness markers, one log-line: the L1 fast-path returns without emitting "Database ready" (only the slow-path return logs it). Operators reading pod logs see either the slow-path marker or the fast-path skip line, not a single canonical "bootstrap finished" line. Tracked as a tiny follow-up; not blocking.


Architecture

Before — every replica × every worker races for the lock

                        DB schema starts populated (alembic_version at head)
                                                │
                                                ▼
   ┌────────── pod 1 (8 gunicorn workers) ──────────┐
   │  W1 → SELECT pg_try_advisory_lock(42…42) → t   │ ─┐
   │  W2 → SELECT pg_try_advisory_lock(42…42) → f   │ ─┤
   │  W3 → SELECT pg_try_advisory_lock(42…42) → f   │ ─┤
   │  …                                              │  │
   └─────────────────────────────────────────────────┘  │
                                                        │ all 24 workers
   ┌────────── pod 2 (8 gunicorn workers) ──────────┐  │ converge through
   │  W1..W8 → all 'f', spin in retry loop          │ ─┤ the same pgbouncer
   └─────────────────────────────────────────────────┘  │ pool
                                                        │
   ┌────────── pod 3 (8 gunicorn workers) ──────────┐  │
   │  W1..W8 → all 'f', spin in retry loop          │ ─┘
   └─────────────────────────────────────────────────┘
                                                        │
                                                        ▼
                        ┌──────────────────────────────────────────┐
                        │ PgBouncer (transaction pool)             │
                        │   each transaction commit → backend      │
                        │   handed to next client                  │
                        │   advisory lock outlives client session  │
                        └──────────────────────────────────────────┘
                                                        │
                                                        ▼
                        ┌──────────────────────────────────────────┐
                        │ Postgres                                  │
                        │   1 backend granted advisory lock         │
                        │   N–1 backends "idle in transaction" on   │
                        │   pg_try_advisory_lock(42424242424242)    │
                        └──────────────────────────────────────────┘

   Observable on OCP (issue reporter + our reproduction):
     · pods cycle 0/1 Running → CrashLoopBackOff → … → Ready over ~13 min
     · 7 restarts per pod (×3 = 21 restart events)
     · 60+ visible "Lock held by another instance" retries per restart
     · helm release status: failed

After Layer 1 — fast-path before the lock

   ┌────────── pod 1 worker ─────────────────────────────────────┐
   │  bootstrap_db.main()                                         │
   │    ┌─ probe (read-only) ─────────────────────────────────┐  │
   │    │  alembic_version == script_dir.get_heads() ?        │  │
   │    │       │                                              │  │
   │    │       ├── YES → return early, no lock involved      │  │
   │    │       │                                              │  │
   │    │       └── NO → fall through to advisory_lock path   │  │
   │    │              (existing slow path, unchanged)         │  │
   │    └─────────────────────────────────────────────────────┘  │
   └──────────────────────────────────────────────────────────────┘

   On a fresh deploy (empty DB), only the FIRST worker that wins the
   lock runs the slow path; replicas 2..N probe AFTER the seed and
   skip the lock entirely.

   Observed on OCP (verified end-to-end on <your-cluster>):
     · pod 1 (seed):    0 retries → completes in ~1s
     · pods 2 & 3:      0 retries, 8 fast-path firings each
     · all 3 pods 1/1 Ready in 56 seconds, 0 restarts
     · helm release status: deployed

After Layer 2 — pods don't even probe when the Job owns it

                                                       Helm pre-install Job
                                                       ─────────────────────
                                                       direct to postgres :5432
                                                       runs bootstrap_db,
                                                       exits 0
                                                              │
                                                              ▼
                                              alembic_version at head
                                                              │
                                                              ▼
   ┌────────── gateway Deployment (chart-rendered) ────────────┐
   │  env:                                                      │
   │    - MCPGATEWAY_SKIP_MIGRATIONS: "true"   ← from           │
   │                                              .Values.      │
   │                                              migration.    │
   │                                              enabled       │
   │                                                            │
   │  lifespan:                                                 │
   │    wait_for_db_ready()                                     │
   │    if settings.mcpgateway_skip_migrations:                 │
   │        logger.info("Skipping in-pod migration              │
   │                     bootstrap (MCPGATEWAY_                 │
   │                     SKIP_MIGRATIONS=true)…")               │
   │        return                                              │
   │    else:                                                   │
   │        await bootstrap_db.main()                           │
   └────────────────────────────────────────────────────────────┘

   No probe. No lock. No worker fan-out.
   Single audit-log line per pod so operators can see the choice.

After: schema-at-head startup probe (chart-level Ready gate)

   ┌── gateway pod (chart-rendered when migration.enabled=true) ──┐
   │  spec.containers[0].startupProbe.exec:                       │
   │    python -m mcpgateway.utils.check_schema_at_head           │
   │      ├── db_heads == script_dir.get_heads() ─► exit 0  Ready │
   │      ├── any mismatch                       ─► exit 1  0/1   │
   │      │                                          probe runs   │
   │      │                                          every 5s     │
   │      └── unhandled exception (bad URL, etc.) ─► exit 1  0/1  │
   │                                                  fail closed │
   └──────────────────────────────────────────────────────────────┘

   Defence in depth complementing L1 + L2: ensures no pod serves
   traffic against a half-migrated DB regardless of which entity
   (pre/post-install Job, init container, external CD pipeline)
   actually ran the migration. The chart-level Ready gate is
   independent of the migration runner's identity.

Test results

The change ships with three layers of verification, each addressing a different failure mode:

   ┌──────────────────────────────────────────────────────────────────────┐
   │ Layer            │ Catches                                            │
   ├──────────────────┼────────────────────────────────────────────────────┤
   │ Unit             │ Fast-path predicate logic; settings field          │
   │                  │ defaults; lifespan helper gate; chart render       │
   │                  │ outputs the right env var; schema-at-head probe    │
   │                  │ entry-point exit-code contract.                    │
   │                  │                                                    │
   │                  │ A future refactor that breaks ANY of these gets a  │
   │                  │ red unit test in seconds.                          │
   ├──────────────────┼────────────────────────────────────────────────────┤
   │ Integration      │ The actual bug surface: bootstrap_db running       │
   │ (5 tests)        │ against a real Postgres + PgBouncer stack with     │
   │                  │ the held-lock precondition synthesized.            │
   │                  │                                                    │
   │                  │ Reliably RED on main (240s timeout). Reliably      │
   │                  │ GREEN with the L1 fast-path. Plus 3 mechanism      │
   │                  │ tests pinning PgBouncer's transaction-pool         │
   │                  │ semantics for documentation-as-code.               │
   ├──────────────────┼────────────────────────────────────────────────────┤
   │ Cluster smoke    │ End-to-end on OpenShift with CrunchyData PGO +     │
   │ (manual)         │ transaction-pool PgBouncer. Same image, same       │
   │                  │ stack, only the gateway code differs between       │
   │                  │ pre- and post-fix runs. Concrete numbers below.    │
   └──────────────────────────────────────────────────────────────────────┘

The integration tests sit behind --with-integration; the unit tests run on every CI pass. The cluster smoke is documented evidence, not automated, and lives as todo/issue-4051-ocp-reproduction.md for the PR record.

Reproduction harness — files, steps, and observed results

A throwaway harness lived on the branch during development at
tests/reproduction/issue-4051/, then was removed before merge. Its three
files are inlined below so a future reader can rebuild the reproduction
without checking out an old commit.

The compose stack the scripts drove was promoted to
tests/integration/fixtures/transaction_pool/docker-compose.yml so the
integration test owns the same infrastructure long-term. The shell harness
is a separate, optional layer on top of that for ad-hoc runs.

   demonstrate_orphan.sh   ── deterministic mechanism proof (3 psql calls)
                              always passes; documents PgBouncer behavior

   reproduce.sh            ── replica-scaling symptom reproduction
                              fails on pre-fix images, succeeds on post-fix

Why two harnesses for one bug? demonstrate_orphan.sh proves the PgBouncer
mechanism
exists (orphaned advisory locks); reproduce.sh proves the
operational symptom exists (replicas hang). Together they answer two
questions any reviewer might ask: "is this a real PgBouncer property?" and
"does it actually cause user-visible breakage?" — without making us guess
timing.

PgBouncer's default_pool_size is deliberately small (2) in the fixture so
total client connections comfortably exceed it, forcing backend multiplexing
— without that pressure, the bug is hard to trigger locally because bootstrap
can finish before any backend handoff happens.

Step 1 — demonstrate_orphan.sh (mechanism proof)

cd <repo-root>
./tests/reproduction/issue-4051/demonstrate_orphan.sh

Three steps:

  1. Through PgBouncer, take advisory lock 42424242424242, disconnect.
  2. Query pg_locks directly on Postgres → lock is still held by the
    now-orphaned backend.
  3. From a fresh direct-to-Postgres session,
    pg_try_advisory_lock(42424242424242) returns f.

Step 3 returning f is the exact condition that makes
bootstrap_db.main() hang: a new gateway pod spins on a lock no live
client owns. Always reproducible; runs in ~10 s.

Step 2 — reproduce.sh (symptom reproduction at replica scale)

./tests/reproduction/issue-4051/reproduce.sh

Scales gateway to N replicas (default 3) against the same PgBouncer-fronted
Postgres, watches each pod's logs for the bootstrap-completion line,
then dumps pg_locks and pg_stat_activity if any pod is still hung.

Pre-fix expected output:

  • Exits 1.
  • One container logs "Database ready"; the others end at
    INFO [alembic.runtime.migration] Will assume transactional DDL.
  • pg_locks shows one granted advisory row plus several backends
    idle in transaction on SELECT pg_try_advisory_lock(...).

Post-fix expected output: all replicas log Database ready within the
timeout, no stuck backends, exits 0.

Inlined file contents

demonstrate_orphan.sh
#!/usr/bin/env bash
# Direct demonstration of the mechanism behind issue #4051.
#
# Three steps against the same stack that reproduce.sh uses:
#
#   1. Through PgBouncer (transaction pool, POOL_SIZE=2), take a Postgres
#      session-scoped advisory lock, then close the connection. PgBouncer
#      returns the backend to its pool.
#   2. Query pg_locks on Postgres directly. The lock is STILL HELD — the
#      pgbouncer "session" is gone but the server session on the backend
#      is still alive.
#   3. Open a fresh connection through PgBouncer and try to acquire the
#      SAME lock. Returns FALSE because the backend PgBouncer hands us is
#      the same one that still holds it.
#
# That "FALSE in step 3 even though no-one is actually doing anything" is
# what makes bootstrap_db.main() hang: the advisory-lock retry loop spins
# forever because the lock is orphaned on a backend no client owns.
#
# Requires:
#   - docker compose with the stack defined in
#     tests/integration/fixtures/transaction_pool/docker-compose.yml
#     (postgres + pgbouncer only; no gateway needed).

set -u

REPO_ROOT=$(git rev-parse --show-toplevel)
COMPOSE_FILE="${REPO_ROOT}/tests/integration/fixtures/transaction_pool/docker-compose.yml"
COMPOSE=(docker compose -f "${COMPOSE_FILE}")
LOCK_ID="${LOCK_ID:-42424242424242}"

log() { printf '[demo] %s\n' "$*"; }

PSQL_PGBOUNCER=("${COMPOSE[@]}" exec -T -e PGPASSWORD=reprosecret postgres
                psql -h pgbouncer -p 6432 -U postgres -d mcp -t -A)
PSQL_DIRECT=("${COMPOSE[@]}" exec -T -e PGPASSWORD=reprosecret postgres
             psql -h postgres -p 5432 -U postgres -d mcp -t -A)

log "clean slate"
"${COMPOSE[@]}" down -v --remove-orphans >/dev/null 2>&1 || true

log "start postgres + pgbouncer"
"${COMPOSE[@]}" up -d --wait postgres pgbouncer

# Sanity: confirm both are reachable.
"${PSQL_DIRECT[@]}" -c "SELECT 1;" >/dev/null
"${PSQL_PGBOUNCER[@]}" -c "SELECT 1;" >/dev/null

echo
log "=== STEP 1: through PgBouncer, acquire advisory lock ${LOCK_ID}, then disconnect"
"${PSQL_PGBOUNCER[@]}" -c "SELECT pg_advisory_lock(${LOCK_ID});" >/dev/null
log "    done (pgbouncer-facing connection closed)"

echo
log "=== STEP 2: direct to Postgres, check pg_locks for any advisory lock"
direct_locks=$("${PSQL_DIRECT[@]}" -c \
    "SELECT pid::text||'|'||locktype||'|'||classid::text||'|'||objid::text||'|'||granted::text FROM pg_locks WHERE locktype='advisory';")
if [ -z "$direct_locks" ]; then
    log "    NO advisory lock found on Postgres — backend was fully reset."
    exit 1
fi
log "    lock still held at server level (classid<<32 | objid == ${LOCK_ID}):"
printf '      %s\n' "$direct_locks" | awk -F'|' '{printf "pid=%s type=%s classid=%s objid=%s granted=%s\n", $1,$2,$3,$4,$5}'

echo
log "=== STEP 3: from a DIFFERENT Postgres session (direct, bypass PgBouncer),"
log "           try to take the same advisory lock"
result=$("${PSQL_DIRECT[@]}" -c "SELECT pg_try_advisory_lock(${LOCK_ID});" | tr -d '[:space:]')
log "    pg_try_advisory_lock (from fresh session) -> ${result}"

result_via_bouncer=$("${PSQL_PGBOUNCER[@]}" -c "SELECT pg_try_advisory_lock(${LOCK_ID});" | tr -d '[:space:]')
log "    pg_try_advisory_lock (via pgbouncer, may reuse same backend) -> ${result_via_bouncer}"

echo
case "$result" in
    f)
        log "RESULT: ORPHANED. The lock set in step 1 is held by a backend"
        log "        that no pgbouncer client still owns, and a different"
        log "        postgres session cannot acquire it. A new gateway pod"
        log "        taking this path would spin on pg_try_advisory_lock"
        log "        until the advisory_lock retry timeout fires — exactly"
        log "        the symptom described in issue #4051."
        exit 0 ;;
    t)
        log "RESULT: lock was NOT orphaned. PgBouncer cleared it between"
        log "        clients (likely via DISCARD ALL on server_reset_query)."
        exit 1 ;;
    *)
        log "RESULT: unexpected psql output: '${result}'"
        exit 2 ;;
esac
reproduce.sh
#!/usr/bin/env bash
# Reproduces issue #4051.
#
# Brings up postgres + pgbouncer (transaction pool) + N gateway replicas and
# reports how many of them reached "Database ready" within TIMEOUT_SECONDS.
# When the bug is present the first replica finishes bootstrap; the rest hang
# at Alembic's advisory-lock acquisition.

set -u

REPO_ROOT=$(git rev-parse --show-toplevel)
COMPOSE_FILE="${REPO_ROOT}/tests/integration/fixtures/transaction_pool/docker-compose.yml"
COMPOSE=(docker compose -f "${COMPOSE_FILE}")
REPLICAS="${REPLICAS:-3}"
TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-600}"
POLL_INTERVAL="${POLL_INTERVAL:-10}"

READY_PATTERN='"name": "mcpgateway.bootstrap_db".*"message": "Database ready"'

log() { printf '[repro] %s\n' "$*"; }

count_ready() {
    "${COMPOSE[@]}" logs --no-log-prefix gateway 2>/dev/null \
        | grep -cE "${READY_PATTERN}" || true
}

log "clean slate"
"${COMPOSE[@]}" down -v --remove-orphans >/dev/null 2>&1 || true

log "start postgres + pgbouncer"
"${COMPOSE[@]}" up -d postgres pgbouncer

log "scale gateway to ${REPLICAS}"
"${COMPOSE[@]}" up -d --scale gateway="${REPLICAS}" --no-recreate gateway

log "waiting up to ${TIMEOUT_SECONDS}s for all replicas to log 'Database ready'..."
elapsed=0
ready=0
while (( elapsed < TIMEOUT_SECONDS )); do
    ready=$(count_ready)
    running=$("${COMPOSE[@]}" ps -q gateway | wc -l | tr -d ' ')
    printf '  t=%4ds  ready=%s/%s  running=%s/%s\n' "$elapsed" "$ready" "$REPLICAS" "$running" "$REPLICAS"
    if (( ready >= REPLICAS )); then break; fi
    sleep "$POLL_INTERVAL"
    elapsed=$((elapsed + POLL_INTERVAL))
done

echo "=============== gateway replica status ==============="
"${COMPOSE[@]}" ps gateway
echo "=============== advisory locks on postgres ==============="
"${COMPOSE[@]}" exec -T postgres psql -U postgres -d mcp -c \
    "SELECT locktype, mode, pid, granted, objid, classid FROM pg_locks WHERE locktype='advisory';" || true
echo "=============== active backends on postgres ==============="
"${COMPOSE[@]}" exec -T postgres psql -U postgres -d mcp -c \
    "SELECT pid, application_name, state, wait_event_type, wait_event, left(query, 80) AS query FROM pg_stat_activity WHERE datname='mcp';" || true

if (( ready >= REPLICAS )); then
    log "RESULT: all ${REPLICAS} replicas reached 'Database ready' (BUG NOT REPRODUCED)"
    exit 0
else
    log "RESULT: only ${ready}/${REPLICAS} replicas reached 'Database ready' (BUG REPRODUCED)"
    exit 1
fi
README.md
# Reproduction — Issue #4051

Alembic migration advisory lock hangs when multiple gateway replicas start
concurrently through PgBouncer in transaction-pooling mode.

Issue: https://github.com/IBM/mcp-context-forge/issues/4051

## What's here

Three ways to observe / prove the bug, in increasing order of abstraction:

| Artifact | Scope | When to use |
|---|---|---|
| `demonstrate_orphan.sh` | Mechanism proof | Fastest, deterministic. Proves the PgBouncer session-advisory-lock orphan exists. Runs in ~10s against just postgres + pgbouncer. |
| `reproduce.sh` | End-to-end symptom | Scales gateway to N replicas through PgBouncer and watches for the hang. Matches the issue reporter's OCP symptoms. |
| `tests/integration/test_migrations_under_transaction_pool.py` | Pinned regression | The mechanism as three pytest assertions. Keeps the invariants enforceable as the stack evolves. |

All three drive the same `tests/integration/fixtures/transaction_pool/docker-compose.yml` stack:

- `postgres:17`
- `edoburu/pgbouncer:latest` with `POOL_MODE=transaction` and a deliberately small `DEFAULT_POOL_SIZE=2` to force backend multiplexing
- N gateway replicas (default 3) pointed at `pgbouncer:6432` (only used by `reproduce.sh`)

## Prerequisites

- Docker and Docker Compose v2.
- For `reproduce.sh` only: a local gateway image tagged `mcpgateway/mcpgateway:latest`. Build from the repo root with `make docker` (or `make docker-prod`); override the tag via `IMAGE_LOCAL` if needed.
- For the pytest regression test: the `postgres` optional extra — `uv sync --extra postgres`.

## Tear down

```bash
docker compose -f tests/integration/fixtures/transaction_pool/docker-compose.yml down -v
```

## Host-exposed ports

| Service   | Host port | In-network port |
|-----------|-----------|-----------------|
| postgres  | `54320`   | `5432`          |
| pgbouncer | `64320`   | `6432`          |

```bash
PGPASSWORD=reprosecret psql -h localhost -p 54320 -U postgres -d mcp   # direct to postgres
PGPASSWORD=reprosecret psql -h localhost -p 64320 -U postgres -d mcp   # through pgbouncer
PGPASSWORD=reprosecret psql -h localhost -p 64320 -U postgres -d pgbouncer -c 'SHOW pools;'   # pgbouncer admin
```

## Notes

- The stack deliberately omits Redis, the admin UI, plugins, federation, and A2A — they're not required to trigger the bug.
- `PgBouncer`'s `DEFAULT_POOL_SIZE` is set low (`2`) so total client connections comfortably exceed it; without multiplexing pressure the bug is hard to trigger locally because bootstrap can finish before any backend handoff happens.
- Production OCP/Helm deployments hit this reliably because real migrations take seconds to tens of seconds (plenty of handoff opportunities) and pool sizes are often sized for steady-state, not for migration bursts.
Why the regression test is deterministic (not flaky)

Three ideas had to land before test_bootstrap_db_skips_lock_when_schema_already_at_head was reliably RED on main:

   ❌ First attempt — race the bootstrap calls
       Open 3 concurrent gateway processes, hope one wins the lock and
       the others spin. Worked sometimes (lucky timing) → fragile.

   ❌ Second attempt — synthesize an orphan via PgBouncer
       Take the lock through PgBouncer, disconnect, then call
       bootstrap_db.main() through PgBouncer. PgBouncer reused the same
       server backend for the bootstrap → pg_try_advisory_lock returned
       TRUE (reentrant within the same Postgres session) → FALSE
       POSITIVE (test passed when bug was present).

   ✅ Final design — hold the lock in a distinct postgres session
       Open a direct postgres connection (NOT through PgBouncer), take
       the advisory lock, KEEP the connection alive. Now run
       bootstrap_db.main() through PgBouncer. The bootstrap's session
       is guaranteed different from the holder's session, so
       pg_try_advisory_lock MUST return FALSE → bootstrap retries →
       240s pytest.mark.timeout fires. RED, every time.

The middle attempt's false-positive is captured as a permanent test (test_reentrant_acquire_through_same_pgbouncer_is_not_a_counter_example) so future readers who see "but I called pg_try_advisory_lock through PgBouncer and it returned TRUE!" understand the reentrant-session gotcha.

The same logic underpins the _clean_orphaned_lock autouse fixture: it runs pg_terminate_backend on any session still holding the test lock id between tests, so a single failed test cannot wedge the next one behind a leaked orphan.

Unit — `_alembic_at_head` truth table (5 tests)
tests/unit/mcpgateway/test_bootstrap_db.py::TestAlembicAtHead

  test_returns_true_when_db_heads_match_script_heads          PASSED
  test_returns_false_when_db_has_no_alembic_version           PASSED
  test_returns_false_when_db_head_does_not_match_script_head  PASSED
  test_returns_false_when_script_directory_has_no_heads       PASSED
  test_returns_false_on_probe_exception                       PASSED

5 passed

Pins each branch of the fast-path predicate independently of the integration test, so a future refactor that widens or narrows the trigger condition is caught locally.

Unit — `MCPGATEWAY_SKIP_MIGRATIONS` settings field (3 tests)
tests/unit/mcpgateway/test_config.py

  test_skip_migrations_defaults_to_false        PASSED
  test_skip_migrations_env_true_flips_flag      PASSED
  test_skip_migrations_env_false_keeps_flag_off PASSED

3 passed

Library default stays False (the docker run mcpgateway happy path), env var override flips True, explicit False matches the default. No accidental coercion.

Unit — lifespan helper respects the flag (2 tests)
tests/unit/mcpgateway/test_main.py

  test_run_initial_db_bootstrap_invokes_bootstrap_when_flag_off  PASSED
  test_run_initial_db_bootstrap_skips_when_flag_on               PASSED

2 passed

The skip-case test also asserts the audit-log line is emitted so operators can see the choice in pod logs.

Unit — Helm chart render (3 tests)
tests/unit/charts/test_deployment_mcpgateway.py

  test_skip_migrations_set_to_true_when_migration_job_enabled        PASSED
  test_skip_migrations_off_when_migration_job_disabled               PASSED
  test_skip_migrations_default_matches_migration_enabled_default     PASSED

3 passed

Tests run helm template as a subprocess, parse the rendered manifests, and assert on the gateway Deployment's env block. Skipped automatically when helm is not on PATH; no cluster needed.

Integration — PgBouncer mechanism + bootstrap regression (5 tests, gated by `--with-integration`)
tests/integration/test_migrations_under_transaction_pool.py

  test_session_advisory_lock_persists_across_pgbouncer_client_disconnect  PASSED
  test_orphaned_lock_blocks_a_fresh_postgres_session                       PASSED
  test_reentrant_acquire_through_same_pgbouncer_is_not_a_counter_example  PASSED
  test_bootstrap_db_skips_lock_when_schema_already_at_head                 PASSED
  test_bootstrap_db_is_idempotent_once_schema_is_at_head                   PASSED

5 passed in 1.47s   (post-fix)
1 fails in 240.34s  (pre-fix, hits pytest.mark.timeout — confirmed RED on main)

The mechanism tests document what PgBouncer does (lock orphaning is a real, named property of transaction-pool handoffs) so a future reader who sees "but pg_try_advisory_lock returned TRUE through PgBouncer!" understands the reentrance gotcha. The regression tests gate the actual fix.

Run with the docker-compose stack from tests/integration/fixtures/transaction_pool/docker-compose.yml.

OCP verification — L1 fix on real OpenShift + CrunchyData PGO (cluster: )
Image:    docker.io/<your-docker-namespace>/mcp-context-forge:<your-l1-image-tag>
Stack:    mcp-stack chart, default values + migration.enabled=false
Initial:  empty DB (DROP SCHEMA public CASCADE; CREATE SCHEMA public; …)

                                       PRE-FIX            POST-L1
                                       ───────            ───────
   Helm release status                 failed             deployed
   Pod restarts before Ready           7 each (21 total)  0
   Wall-clock to 3/3 Ready             ~13 min            56 sec
   Pods using fast-path                n/a                2 of 3, every worker
   Total visible retry messages        64+                28 (all confined to one pod's
                                                              worker-race during a
                                                              sub-second seed window)
   Advisory locks held at sample time  0 (after storm)    0 (no storm)

Per-pod log signature — post-L1
─────────────────────────────────
Pod 9jvtd  (won the seed race — slow path expected here):
  Empty DB detected:           1
  Acquired migration lock:     8
  Lock held by another:       28   ← within-pod race during ~1s seed window
  Schema already at head:      0
  Database ready:              8

Pods 2p9ls and jvb4t  (replicas 2 & 3):
  Empty DB detected:           0
  Acquired migration lock:     0   ← skipped the lock entirely
  Lock held by another:        0   ← no retries
  Schema already at head:      8   ← fast-path fired for every worker
  Database ready:              0

Full reproduction-vs-fix evidence in todo/issue-4051-ocp-reproduction.md.

OCP verification — L2 (chart-wired SKIP_MIGRATIONS) on real OpenShift
Image:    docker.io/<your-docker-namespace>/mcp-context-forge:<your-l1-l2-image-tag>
Stack:    mcp-stack chart, DEFAULT values (migration.enabled=true), so the
          chart's pre-install Job runs the migration and the gateway
          Deployment is rendered with MCPGATEWAY_SKIP_MIGRATIONS=true.
Initial:  empty DB (DROP SCHEMA public CASCADE; CREATE SCHEMA public; …)

                                       PRE-FIX            POST-L1            POST-L2
                                       ───────            ───────            ───────
   Helm release status                 failed             deployed           deployed*
   Pod restarts before Ready           7 each (21 total)  0                  0
   Wall-clock to 3/3 Ready             ~13 min            56 sec             91 sec
   bootstrap_db activity in pods       very high          slow-path on 1     ZERO
   Advisory lock acquisitions in pods  many               8 (1 pod's race)   0
   Workers using fast-path             n/a                16 of 24           n/a (skipped)
   Workers using SKIP path             n/a                n/a                24 of 24
   Migration Job runs                  no (disabled)      no (disabled)      yes (Complete)

   * register-fast-time post-install Job failed; it's unrelated to the
     migration/skip story (registers the test fast-time-server via gateway
     API call). Doesn't affect L2 contract verification. Tracked separately.

Per-pod log signature — post-L2 (all 3 pods identical)
──────────────────────────────────────────────────────
  Skipping in-pod migration bootstrap:   8   ← one per gunicorn worker
  Acquired migration lock:               0
  Schema already at Alembic head:        0
  Lock held by another instance:         0
  Empty DB detected:                     0
  Database ready:                        0

DB state after pods Ready:
  Tables in public schema:               61
  alembic_version:                       d3e4f5a6b7c8 (at head)
  Advisory locks held:                   0

The 3-way pre-fix / post-L1 / post-L2 comparison is the headline win. The chart's contract — migration Job runs → app pods skip — is honored automatically; operators don't have to remember to set both migration.enabled AND MCPGATEWAY_SKIP_MIGRATIONS because the chart wires them together.

Full reproduction-vs-fix evidence in todo/issue-4051-ocp-reproduction.md.

OCP verification — L2 on minimal profile (no PGO operator, chart-managed Postgres + PgBouncer)

The OCP-PGO smoke above proves the fix on the operator-managed path. This second smoke proves it on the community Helm path — the chart's base values.yaml with the chart's own standalone postgres + pgbouncer Deployments (no CrunchyData operator). That's the configuration most community users will hit and the one a downstream pipeline using DEPLOYMENT_PROFILE=minimal lands on.

Cluster:   same OpenShift cluster as the PGO smoke
Namespace: issue-4051-minimal-test                  (fresh project)
Image:     docker.io/<your-docker-namespace>/mcp-context-forge:<your-l1-l2-image-tag>
Profile:   MINIMAL — chart's base values.yaml only,
           NO profiles/ocp/values-pgo.yaml, NO PGO operator
Override:  storageClassName=nfs-client (PVCs)
           REQUIRE_STRONG_SECRETS=false (test secrets)
Result
──────
  Helm release status                  deployed
  Wall-clock to all gateway pods Ready ~58 s
  Pod restarts before Ready            0
  Migration Job (pre-install hook)     ran to completion (cleaned up
                                        by hook-succeeded policy)
  Schema in postgresdb                  61 tables, alembic_version at head
  Advisory locks held after stabilise   0

Per-pod log signature (each gateway pod, identical)
───────────────────────────────────────────────────
  Skipping in-pod migration bootstrap:   1   ← per gunicorn worker
  Acquired migration lock:               0
  Lock held by another instance:         0
  Schema already at Alembic head:        0
  Empty DB detected:                     0
  Database ready:                        0

Stack components running in the namespace:

NAME                                              READY   STATUS    RESTARTS
mcp-stack-mcp-fast-time-server-7bb594568b-d5ctj   1/1     Running   0
mcp-stack-mcp-fast-time-server-7bb594568b-zgp5b   1/1     Running   0
mcp-stack-mcpgateway-b8f4776bf-44cpz              1/1     Running   0
mcp-stack-mcpgateway-b8f4776bf-x44m5              1/1     Running   0
mcp-stack-postgres-7499dfc6c8-krdbc               1/1     Running   0
mcp-stack-redis-7c5fc464c4-6xfb5                  1/1     Running   0

What this confirms beyond the PGO smoke:

  • The chart's migration.enabledMCPGATEWAY_SKIP_MIGRATIONS wiring is profile-agnostic — works on both operator-managed and chart-managed Postgres.
  • Community users on vanilla K8s with the chart's defaults get the L2 contract automatically; no special profile or extra knob needed.
  • A downstream CD pipeline using DEPLOYMENT_PROFILE=minimal (without PGO) is unblocked by this PR with no additional configuration changes — they only need to bump their image tag to one with this fix.

Additional improvements

  • Idempotent post-migration bootstrap extracted to _run_post_migration_bootstrap() — admin-user creation, default-role seeding, and orphaned-resource assignment are now called from both the fast-path and slow-path branches. Previously these were only inside the advisory_lock block, so a fast-path skip would have missed them on a partial-startup edge case (replica 1 stamps head then crashes before bootstrapping the admin user → replica 2 fast-paths and runs the bootstrap steps to fill the gap).

  • Reproduction harness inlined in this description — the development-time harness directory has been removed in the squashed commit; the compose stack it drove was promoted to tests/integration/fixtures/transaction_pool/docker-compose.yml so the integration suite owns the same infrastructure long-term. The two shell scripts (reproduce.sh, demonstrate_orphan.sh) are inlined in the Test results section above so a future reader can rebuild the harness from the PR alone.

  • Module docstrings reference the issue number — both the helper functions and the integration test module carry an inline reference to issue [BUG]: Alembic migration advisory lock hangs when multiple gateway pods start through PgBouncer (transaction pooling mode) #4051 so a future reader greps once and lands in the right context.


Limitations

  1. Fast-path returns without emitting "Database ready" — the canonical readiness marker only fires on the slow-path return. Operators reading pod logs see either Database ready or Schema already at Alembic head; skipping migration lock. Both signal "bootstrap finished," but inconsistently. Captured as Gap 8 above; trivial follow-up.

  2. Table-based mutex fallback not included — the original plan considered a Postgres table-based mutex for environments where migration_database_url (direct-Postgres bypass) isn't available AND the chart's pre-install Job is disabled. With L1 + L2 + the schema-at-head probe in place, that combination is well-covered: L1 makes the in-pod path safe, L2 lets the chart guarantee the in-pod path is skipped when the Job runs, and the probe holds Ready until the schema is correct regardless. Reopening if a real-world report justifies it.

  3. DROP SCHEMA in PGO setups requires ownership restoration — observed during the OCP reproduction. CrunchyData PGO assigns schema ownership to a service user; a manual DROP SCHEMA public CASCADE; CREATE SCHEMA public; re-creates the schema owned by postgres, so the service user can't create tables in it until ownership is restored:

    ALTER SCHEMA public OWNER TO admin;
    GRANT ALL ON SCHEMA public TO admin;
    GRANT ALL ON SCHEMA public TO public;

    Documented in the OCP reproduction notes; not a code fix.

  4. Pod readiness can be misleading during pre-fix cycling — observed pods show 1/1 Ready even while several gunicorn workers are still in their retry loop. One worker's lifespan completing seems sufficient to satisfy the readiness probe. With L1+L2 the storm window is gone; this is captured here for operator awareness when reading historical pre-fix logs.

  5. Fast-path makes the post-migration role/user-role seeders non-serialisedbootstrap_default_roles() and the platform_admin user-role assignment now run outside the migration advisory lock on every replica that takes the fast path. Chart-default deployments (migration.enabled=true + MCPGATEWAY_SKIP_MIGRATIONS=true on gateway pods) are unaffected — the migration Job is the sole writer. Configurations where gateway pods bootstrap themselves (e.g., migration.enabled=false, multi-worker single-pod gunicorn) carry a small first-startup race window that can produce duplicate active rows in roles / user_roles. Fully closed by the follow-up linked below.


Future work

RBAC seeder race fix — PR #4480 / issue #4482

Inline review on this PR identified two related races (HIGH on roles, MEDIUM on user_roles) that the L1 fast-path exposes by running the post-migration seeders without holding the advisory lock. The fix is staged in PR #4480 with issue #4482 as the tracking record. Approach: partial unique indexes on (name, scope) / (user_email, role_id, scope[, scope_id]) WHERE is_active = true, an idempotent dedupe-then-constrain Alembic migration, plus savepoint-and-refetch in RoleService.create_role / assign_role_to_user.

The race fix is strictly additive; the hang fix in this PR does not depend on it. Split out to keep this PR narrowly scoped to the original bug, and to keep the diff in front of the chart-and-bootstrap reviewers separate from the diff in front of the RBAC reviewers.

Compose-side L2 (deferred)

The root docker-compose.yml was left unchanged. It's currently safe thanks to L1 alone (the fast-path makes in-pod bootstrap pgbouncer-safe). Adding a dedicated migrate service in compose plus MCPGATEWAY_SKIP_MIGRATIONS=true on the gateway would mirror the chart's contract one-to-one — useful for users who want production-shaped semantics in dev. Tracked as a separate issue if there's appetite; deliberately out-of-scope here to keep docker-compose.yml touch-free for existing users.

Unify the readiness log line

Per Gap 8 above. Either:

  • emit "Database ready" from both paths (fast and slow), or
  • replace it with a single richer line carrying the path taken ("Database ready (fast-path)" / "Database ready (slow-path)").

Either change is a one-line edit; deferred so this PR stays scoped to the bug fix.

MIGRATION_DATABASE_URL (only if requested)

The Helm chart already exposes migration.hostKey / portKey to point the pre-install Job direct to Postgres (bypassing PgBouncer). A code-level MIGRATION_DATABASE_URL setting would let docker-compose users do the same without two DATABASE_URL definitions. Not necessary for the bug fix; deferred unless someone explicitly asks for it.


Closes #4051

@gandhipratik203 gandhipratik203 changed the title fix(bootstrap): skip Alembic advisory lock when schema is at head fix(bootstrap): improve startup reliability for multi-replica deploys Apr 26, 2026
@gandhipratik203 gandhipratik203 marked this pull request as ready for review April 27, 2026 09:59
@gandhipratik203 gandhipratik203 self-assigned this Apr 27, 2026
@gandhipratik203 gandhipratik203 added this to the Release 1.0.0 milestone Apr 27, 2026
@gandhipratik203 gandhipratik203 added bug Something isn't working helm Helm chart database resilience Resilience labels Apr 27, 2026
@gandhipratik203 gandhipratik203 force-pushed the fix/4051-alembic-pgbouncer-transaction-pool branch from 46f1d84 to 913fa4e Compare April 28, 2026 07:45
Copy link
Copy Markdown
Collaborator

@msureshkumar88 msureshkumar88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — Summary of Findings

The core fix (L1 fast-path probe + L2 MCPGATEWAY_SKIP_MIGRATIONS flag) is correct and the test design is impressive — the deterministic lock-orphan synthesis (holding the lock in a distinct Postgres session rather than through PgBouncer) is non-flaky and well-documented. The chart wiring and the extraction of _run_post_migration_bootstrap() are genuine improvements.

Requesting changes on the issues below. Low-impact items are noted but not blocking.


High — Blocking

H1 — Probe failures in _alembic_at_head are logged at DEBUG, invisible to operators
mcpgateway/bootstrap_db.py — when the fast-path probe throws (auth error, pool exhaustion, corrupt Alembic state), it falls through silently to the slow path. At the project's default LOG_LEVEL=ERROR, operators have no signal that the probe consistently failed. Change to logger.warning(...) so the fallback is auditable.

# current
logger.debug("Could not probe alembic head state: %s", exc)

# recommended
logger.warning("Fast-path head probe failed, falling back to advisory-lock path: %s", exc)

H2 — check_schema_at_head.py main() has zero test coverage
mcpgateway/utils/check_schema_at_head.py — this is the Helm startup probe entrypoint (runs every 5 s, determines pod readiness). _alembic_at_head is tested well via TestAlembicAtHead, but the main() wrapper — engine creation, connect, probe call, dispose, exit-code mapping — has no tests. A bad DATABASE_URL, a failed engine.dispose(), or a mis-mapped exit code would go undetected. Please add tests/unit/mcpgateway/utils/test_check_schema_at_head.py covering the 0 / 1 exit-code paths and the exception branch.

H3 — check_schema_at_head.py imports the private function _alembic_at_head from a sibling module
mcpgateway/utils/check_schema_at_head.py:58

from mcpgateway.bootstrap_db import _alembic_at_head

A private (underscore-prefixed) function is now load-bearing for the Helm startup probe. Any internal rename or signature change in bootstrap_db silently breaks pod readiness without a test catching it at PR time. Options:

  • Make _alembic_at_head public (alembic_at_head) if it is genuinely shared API, or
  • Extract the shared check into mcpgateway/utils/ as a proper public helper and import it from both callers.

Medium — Blocking

M1 — edoburu/pgbouncer:latest is unpinned in the test fixture
tests/integration/fixtures/transaction_pool/docker-compose.yml — an upstream image update can change POOL_MODE defaults or pg_advisory_lock semantics, causing mechanism tests to fail or silently pass incorrectly. Pin to a specific tag or digest and document the version rationale.

M2 — LOCK_ID / BOOTSTRAP_LOCK_ID duplication creates a silent mismatch risk
test_migrations_under_transaction_pool.py defines LOCK_ID = 42_424_242_424_242 at module level, then re-defines BOOTSTRAP_LOCK_ID = 42_424_242_424_242 as a local variable inside the regression test. If advisory_lock() in production ever changes its lock ID, neither constant would detect the drift. Use the module-level constant throughout, or import the sentinel from bootstrap_db.

M3 — URL escape-% logic is duplicated across two modules
Both bootstrap_db.py and check_schema_at_head.py contain:

escaped_url = settings.database_url.replace("%", "%%")
cfg.set_main_option("sqlalchemy.url", escaped_url)

Extract into a shared _make_alembic_cfg(database_url: str) -> Config helper. check_schema_at_head.py already imports from bootstrap_db, so the dependency cost is zero.

M4 — Missing security-boundary comment in the test fixture
tests/integration/fixtures/transaction_pool/docker-compose.yml sets AUTH_REQUIRED: "false" and uses hardcoded credentials (reprosecret). A developer copying this file for a quick non-isolated test and accidentally exposing port 4444 gets an unauthenticated gateway. Please add a prominent warning block at the top of the file, e.g.:

# !! TEST FIXTURE ONLY — NEVER use in production !!
# AUTH_REQUIRED=false disables all authentication.
# Hardcoded credentials (reprosecret) are intentional for test isolation only.

M5 — E2E compose test is always skipped in CI — no automated gate for the original bug
test_compose_three_replicas_complete_bootstrap_e2e requires MCPGATEWAY_TEST_ALLOW_DESTRUCTIVE_E2E=1 and a locally built image. The regression that motivated this entire PR has no CI-automated end-to-end gate. Consider adding a CI job that builds the image and runs this test, or at minimum add a pytest.mark.skip(reason="CI: add MCPGATEWAY_TEST_ALLOW_DESTRUCTIVE_E2E=1 and make docker to enable") with a clear call-to-action.

M6 — check_schema_at_head.py creates engine without production pool settings
create_engine(settings.database_url) uses SQLAlchemy defaults, not the DB_POOL_SIZE / DB_POOL_TIMEOUT values that the application engine uses. In a tuned production environment, the probe can timeout when the real pool would not (or vice versa). Replicate pool settings or share the engine reference.


Acknowledged / Follow-up Tracking

The following are noted here for visibility but are already tracked and not blocking this PR:

  • RBAC seeder race (Limitation 5 / PR #4480 / issue #4482): _run_post_migration_bootstrap() now runs without the advisory lock on the fast path. Fix is staged separately — consider a pytest.xfail placeholder so regression is detectable if #4480 is reverted.
  • Inconsistent "Database ready" log line (Gap 8): Fast path does not emit the canonical readiness marker. Tracked, one-line fix, deferred.
  • Startup probe creates/destroys engine every 5 s (P2): Minor overhead during pod startup; not critical.
  • Two sequential DB connections on slow path (P1): Reusing probe_conn would save one acquisition; low priority.

Alternative Implementation Worth Considering

The root cause is that pg_advisory_lock is session-scoped and PgBouncer's transaction pool reuses backends between logical clients. PostgreSQL also provides pg_advisory_xact_lock, which is transaction-scoped — released automatically on COMMIT or ROLLBACK, so PgBouncer backend reuse cannot orphan it. This would eliminate the need for the fast-path probe entirely.

Trade-off: The entire migration run must complete in a single transaction. This is safe for standard Alembic DDL on PostgreSQL (DDL is transactional in PG), but breaks for any migration that uses CREATE INDEX CONCURRENTLY or other non-transactional DDL. If the migration files can be audited for this constraint, a pg_advisory_xact_lock migration is a cleaner long-term fix. Worth keeping as a candidate for a follow-up cleanup PR.


The implementation is solid and the test methodology is excellent. Addressing the blocking items above — particularly H1 (log visibility), H2 (startup probe test coverage), H3 (private import), and M1–M4 — would make this merge-ready.

gandhipratik203 added a commit that referenced this pull request May 8, 2026
Addresses the inline review on PR #4444:

  H1  fast-path probe exception now logs at WARNING (was DEBUG); names
      the fallback path for greppability.
  H2  new tests/unit/mcpgateway/utils/test_check_schema_at_head.py with
      five tests pinning the K8s probe entry point's exit-code contract,
      exception swallowing, engine.dispose() discipline, and the
      WARNING-log emission on failure.
  H3  rename _alembic_at_head -> alembic_at_head; the function is
      load-bearing for the startup probe and a leading underscore
      misleadingly signals "safe to refactor."
  M1  document substrate alignment in fixture compose; do not pin in
      isolation, since the rest of the project (root docker-compose.yml,
      the *-debug / -performance / -verbose-logging variants, and the
      helm chart's pgbouncer.image.tag) all track :latest.
  M2  drop redundant function-local BOOTSTRAP_LOCK_ID; reuse the
      module-level LOCK_ID. Replaces the bare numeric literal further
      down with the same constant.
  M3  extract make_alembic_cfg(database_url, *, configure_logger=False)
      to dedupe URL-escape + Config setup between bootstrap_db.main()
      and check_schema_at_head.
  M4  prominent "TEST FIXTURE ONLY" warning at top of the fixture
      compose file; switch three hardcoded reprosecret refs to
      ${POSTGRES_PASSWORD:-reprosecret} to match the env-var-with-default
      pattern used by root docker-compose.yml.
  M5  rewrite destructive-e2e skip message: [regression-gate] prefix,
      explicit "NOT run in CI" with cost rationale, two-step local
      enable instructions visible in pytest output.
  M6  probe engine now uses NullPool (truthful one-shot lifetime) plus
      connect_args imported from mcpgateway.db (psycopg TCP keepalives,
      prepare_threshold; SQLite check_same_thread) so connect behavior
      matches production. Pool-internal settings do not apply: the probe
      opens one connection per K8s tick and disposes the engine.

Tests: 493 unit / 5 integration pass; probe binary exits 0 against live
compose DB. No production-code behavior change outside the probe-engine
alignment and the H1 log-level bump.

Refs #4444

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203
Copy link
Copy Markdown
Collaborator Author

Thanks for the thorough review. All nine items addressed in 3761333; brief notes per item below.

H1alembic_at_head's broad-except path now logs at WARNING (was DEBUG). Message reworded to name the fallback so it's greppable in pod logs: "Fast-path head probe failed, falling back to advisory-lock path: …". mcpgateway/bootstrap_db.py:175.

H2 — Added tests/unit/mcpgateway/utils/test_check_schema_at_head.py with five tests: exit-0 on at-head, exit-1 on not-at-head, exit-1 on engine.connect() exception, exit-1 on probe exception, and a WARNING-log assertion on the failure path. Engine dispose() is asserted on every code path.

H3 — Renamed _alembic_at_headalembic_at_head. The function is load-bearing for the K8s startup probe; the leading underscore was misleading. Test references in TestAlembicAtHead updated accordingly. mcpgateway/bootstrap_db.py:144.

M1 — Pushing back gently here. Pinning the test fixture's edoburu/pgbouncer:latest in isolation would diverge from the substrate every other deployment of this project actually uses: the root docker-compose.yml, the *-debug / *-performance / *-verbose-logging variants, and the helm chart's pgbouncer.image.tag default all track :latest. Tests would no longer exercise what users deploy.

Instead, added a substrate-alignment comment block on the fixture's pgbouncer service block that names the verified PgBouncer version (1.25.1 as of 2026-05-01) and the rationale for tracking :latest. The orphan condition this fixture targets is asserted behaviorally by the integration tests — if an upstream republish ever changed pool_mode=transaction semantics, those tests would fail loudly rather than silently pass incorrectly. Repo-wide pin/replace conversation can happen separately if there's appetite. tests/integration/fixtures/transaction_pool/docker-compose.yml.

M2 — Took the lighter path you offered: dropped the function-local BOOTSTRAP_LOCK_ID re-declaration; reuse the existing module-level LOCK_ID. Also replaced the bare numeric literal further down in the file. Production-side pg_lock_id left as a local in advisory_lock() — happy to extract a shared MIGRATION_ADVISORY_LOCK_ID to bootstrap_db.py if you'd prefer the stricter source-of-truth pattern; held off here to keep the diff tight.

M3 — Extracted make_alembic_cfg(database_url, *, configure_logger=False) to mcpgateway/bootstrap_db.py:104. Both bootstrap_db.main() and the schema-at-head probe call it; URL-escape and Config setup are no longer duplicated.

M4 — Added a prominent !! TEST FIXTURE ONLY — DO NOT USE IN PRODUCTION OR EXPOSE BEYOND LOCALHOST !! block at the top of the fixture compose, naming the disabled auth, hardcoded password default, and 0.0.0.0 host-port bindings. Also took the opportunity to switch three hardcoded reprosecret references to ${POSTGRES_PASSWORD:-reprosecret} to match the env-var-with-default pattern the repository's root docker-compose.yml uses.

M5 — Took the lighter path. Adding a CI job that builds the gateway image and runs the e2e is a real piece of work (image build + 3-replica compose + ~60s polling on every CI run) that's out of scope for this bug fix. Improved the skip message instead: [regression-gate] prefix so it stands out in pytest output, names what the test gates and why, explicit "NOT run in CI" line so future readers understand the gap is intentional, plus a two-step local enable command (make docker + the env var). tests/integration/test_migrations_under_transaction_pool.py:483.

M6 — Done, with a small framing note: DB_POOL_SIZE / DB_POOL_TIMEOUT are pool-internals (size of the in-process pool, time waiting to acquire from it), and the probe opens one connection per K8s tick and immediately disposes the engine — pool sizing and acquire timeout never come into play. The actual probe-vs-prod parity concern lives in connect_args (psycopg TCP keepalives, prepare_threshold; SQLite check_same_thread).

Switched to NullPool (truthful one-shot lifetime; matches the production engine's pool class when running against PgBouncer) and now import connect_args from mcpgateway.db so connect-establishment behavior is identical between probe and gateway. Inline comment in check_schema_at_head.py:34-43 records the reasoning.


Suggested pg_advisory_xact_lock cleanup is a real candidate; I've left it out of this PR's scope but happy to track it as a follow-up.

Tests after the changes: 493 unit / 5 integration pass; probe binary exits 0 against the live compose DB.

@msureshkumar88
Copy link
Copy Markdown
Collaborator

Re-review — follow-up on requested changes

Thank you for the thorough follow-up commit — the implementation quality here is excellent and the responsiveness to the review feedback is appreciated.

All blocking items addressed ✅

Item Resolution
H1 logger.warning in alembic_at_head Done — bootstrap_db.py:175
H2 Test coverage for check_schema_at_head.main() Done — full truth-table in test_check_schema_at_head.py (5 cases, all exit-code paths + engine.dispose() on every branch)
H3 _alembic_at_head made public Done — renamed to alembic_at_head, imported without underscore
M2 LOCK_ID / BOOTSTRAP_LOCK_ID duplication Done — BOOTSTRAP_LOCK_ID removed, LOCK_ID used consistently
M3 make_alembic_cfg shared helper Done — both callers use it, duplication gone
M4 Security warning block in test fixture Done — docker-compose.yml:1–16
M5 E2E skip with clear call-to-action Done — runtime pytest.skip() with step-by-step message
M6 Engine pool settings in probe Done — NullPool + connect_args from mcpgateway.db, with comment explaining the rationale

One small thing still open — M1

The image pin request (edoburu/pgbouncer:latest → specific tag/digest) wasn't applied. The rationale in the comment is reasonable — tracking :latest keeps the fixture in sync with what users actually run, and the tests assert behavior rather than version-specific semantics. That's a defensible choice and worth keeping if the team agrees.

That said, docker-compose.yml:65 has a placeholder that should be resolved before merge:

# Repo-wide pin/replace conversation tracked at <issue link TBD>.

Either open a tracking issue and drop the number in, or remove the line. A TBD placeholder committed to a main-bound file tends to stay TBD forever.

No new issues found

Went through all changed files — no new bugs, no security concerns, no behavioral regressions beyond the already-acknowledged RBAC seeder fast-path race (tracked at #4480 / #4482).

Once the <issue link TBD> placeholder is resolved, this looks merge-ready.

@msureshkumar88
Copy link
Copy Markdown
Collaborator

msureshkumar88 commented May 8, 2026

✅ E2E Verification Proof - Issue #4051 Fixed

Test Execution Summary

Date: 2026-05-08 12:35:27 UTC
Test Result:PASSED
Duration: 13.58 seconds
Commit: 3761333

Test Scenario

The automated E2E test validates the fix by:

  1. Setup: Postgres + PgBouncer (transaction pooling mode) + 3 gateway replicas
  2. Trigger: Scale gateway service to 3 replicas simultaneously
  3. Validation: All 3 replicas complete bootstrap within 60 seconds
  4. Success Criteria:
    • ✅ No advisory lock hangs
    • ✅ No pod restarts
    • ✅ All replicas emit bootstrap completion logs
    • ✅ Total time < 60s (achieved: 13.58s)

Test Results

Test: test_compose_three_replicas_complete_bootstrap_e2e
Module: tests/integration/test_migrations_under_transaction_pool.py
Result: PASSED
Duration: 13.58s

Verification Script

A comprehensive E2E automation script has been created at but not committed :
tests/e2e/run_issue_4051_verification.sh

This script:

  • Builds the gateway Docker image
  • Starts the test fixture stack (postgres + pgbouncer)
  • Runs the E2E test with timing
  • Collects proof artifacts (logs, metrics, git info)
  • Generates a proof summary report

Reproduction Steps

# 1. Install dependencies
make install-dev
uv pip install psycopg[binary]

# 2. Run the E2E verification script
./tests/e2e/run_issue_4051_verification.sh

Expected vs Observed Behavior

Pre-Fix (Issue #4051):

  • Helm release status: failed
  • Pod restarts: 7 per pod (21 total)
  • Time to 3/3 Ready: ~13 minutes
  • Visible retry messages: 60+
  • Advisory lock hangs causing timeout

Post-Fix (PR #4444):

  • Helm release status: deployed
  • Pod restarts: 0
  • Time to 3/3 Ready: < 60 seconds ✅ (13.58s)
  • No advisory lock hangs ✅
  • Fast-path probe skips lock acquisition when schema is at head ✅

Conclusion

The E2E test confirms that PR #4444 successfully resolves issue #4051. All 3 gateway replicas completed bootstrap without advisory lock hangs, meeting all success criteria.

Fix Layers Validated:

  • L1: Fast-path probe checks if schema is at head before acquiring advisory lock
  • L2: Chart-level skip flag (MCPGATEWAY_SKIP_MIGRATIONS)

Proof Artifacts

Full proof artifacts are available at:
tests/e2e/proof_artifacts/

Including:

  • Test output logs
  • Docker container logs (gateway, postgres, pgbouncer)
  • Service status snapshots
  • Build logs
  • Git information
  • Proof summary document

Automated verification completed successfully 🎉

gandhipratik203 added a commit that referenced this pull request May 8, 2026
Per re-review feedback on PR #4444 — TBD placeholder committed to a
main-bound file tends to stay TBD forever. Tracking issue is
deliberately not being filed; substrate-alignment rationale already
captured in the surrounding comment block.

Refs #4444

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203
Copy link
Copy Markdown
Collaborator Author

Either open a tracking issue and drop the number in, or remove the line.

Removed in 29aa33e. Repo-wide pin conversation can happen separately if it comes up; no need to leave a TBD anchor in main. Thanks for running the e2e — appreciated.

@gandhipratik203 gandhipratik203 force-pushed the fix/4051-alembic-pgbouncer-transaction-pool branch from 29aa33e to 06456e1 Compare May 8, 2026 13:16
@msureshkumar88
Copy link
Copy Markdown
Collaborator

M1 resolved — TBD placeholder removed in 29aa33e, clean choice to drop it rather than leave an unfiled anchor. The :latest rationale in the surrounding comment is self-contained and defensible as-is.

This is merge-ready. All blocking and minor items from both review rounds addressed. Thanks for the fast turnaround on the follow-up.

Multi-replica gateway startup behind a transaction-pooling PgBouncer
(pool_mode=transaction) can hang on Alembic's session-scoped advisory
lock: PgBouncer hands the same backend to multiple gateway clients,
and the lock taken by one client is left orphaned on the backend when
that client disconnects. Subsequent clients then see the lock as held
and spin in the retry loop indefinitely.

This change ships a two-layer fix and a chart-level startup probe that
gates Ready until the schema is fully migrated.

  L1 — Fast-path probe in bootstrap_db.main(): when alembic_version
       already matches the script-directory head, skip the advisory
       lock entirely. The post-migration bootstrap (admin user,
       default roles, orphaned-resource assignment) is extracted to a
       shared helper and runs on both fast and slow paths so a
       partial-startup edge case never leaves downstream state
       unpopulated.

  L2 — Chart wires MCPGATEWAY_SKIP_MIGRATIONS: the helm chart sets
       the new setting to "true" on gateway pods iff
       migration.enabled is true, so the in-process bootstrap is
       skipped when the chart's migration Job is the source of truth
       for schema work. Compose-side L2 deferred (see Future work).

  Schema-at-head startup probe: replaces the gateway pod's
       db_isready startup probe with a check that exits 0 only when
       alembic_version matches the script-directory head. Pods refuse
       Ready until the schema is fully migrated regardless of which
       entity (pre-install Job, post-install Job, init container,
       external CD pipeline) ran the migration.

Includes a permanent reproduction fixture (postgres + pgbouncer in
transaction-pool mode) under tests/integration/fixtures/ and the
mechanism-pinning regression suite at
tests/integration/test_migrations_under_transaction_pool.py.

Tests:
  - Unit: helm-chart render tests for the SKIP_MIGRATIONS env-wiring
    contract; truth-table tests for the alembic_at_head fast-path
    helper; bootstrap fast-path entry tests pinning the helper is
    used and the lock is not re-entered on the second invocation;
    test_check_schema_at_head covers the K8s probe entrypoint's
    exit-code contract, exception swallowing, and dispose discipline.
  - Integration: the PgBouncer transaction-pool fixture, an
    advisory-lock orphaning regression suite, and an end-to-end
    three-replica compose smoke (gated by
    MCPGATEWAY_TEST_ALLOW_DESTRUCTIVE_E2E=1).

Verified on docker-compose, OCP minimal profile (chart-internal
Postgres, post-install hook), and OCP PGO profile (CrunchyData PGO,
transaction-pool PgBouncer, pre-install hook): probe correctly gates
Ready, replicas reach Ready in <60s with 0 restarts, no advisory
locks left held.

Inline review feedback addressed in the same commit; see the PR
thread for the per-item response.

Refs #4051

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 gandhipratik203 force-pushed the fix/4051-alembic-pgbouncer-transaction-pool branch from 06456e1 to 5384128 Compare May 8, 2026 13:25
Black autoformatted a long log message in mcpgateway/main.py:1265 into
two adjacent string literals; the project's pylint CI rule treats
implicit-str-concat (W1404) as fatal. Combined into a single string with
the same content.

Refs #4444

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working database helm Helm chart resilience Resilience

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: Alembic migration advisory lock hangs when multiple gateway pods start through PgBouncer (transaction pooling mode)

3 participants