fix(bootstrap): improve startup reliability for multi-replica deploys #4444
gandhipratik203 wants to merge 2 commits into main from
Conversation
Force-pushed from 46f1d84 to 913fa4e
msureshkumar88 left a comment
Code Review — Summary of Findings
The core fix (L1 fast-path probe + L2 MCPGATEWAY_SKIP_MIGRATIONS flag) is correct and the test design is impressive — the deterministic lock-orphan synthesis (holding the lock in a distinct Postgres session rather than through PgBouncer) is non-flaky and well-documented. The chart wiring and the extraction of _run_post_migration_bootstrap() are genuine improvements.
Requesting changes on the issues below. Low-impact items are noted but not blocking.
High — Blocking
H1 — Probe failures in _alembic_at_head are logged at DEBUG, invisible to operators
mcpgateway/bootstrap_db.py — when the fast-path probe throws (auth error, pool exhaustion, corrupt Alembic state), it falls through silently to the slow path. At the project's default LOG_LEVEL=ERROR, operators have no signal that the probe consistently failed. Change to logger.warning(...) so the fallback is auditable.
# current
logger.debug("Could not probe alembic head state: %s", exc)
# recommended
logger.warning("Fast-path head probe failed, falling back to advisory-lock path: %s", exc)
H2 — check_schema_at_head.py main() has zero test coverage
mcpgateway/utils/check_schema_at_head.py — this is the Helm startup probe entrypoint (runs every 5 s, determines pod readiness). _alembic_at_head is tested well via TestAlembicAtHead, but the main() wrapper — engine creation, connect, probe call, dispose, exit-code mapping — has no tests. A bad DATABASE_URL, a failed engine.dispose(), or a mis-mapped exit code would go undetected. Please add tests/unit/mcpgateway/utils/test_check_schema_at_head.py covering the 0 / 1 exit-code paths and the exception branch.
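A minimal sketch of what such a test file could look like, assuming main() signals its result via sys.exit and that the probe helper and create_engine are patchable at module level (the names inside check_schema_at_head are assumptions, not the PR's actual internals):

```python
# tests/unit/mcpgateway/utils/test_check_schema_at_head.py — illustrative only.
from unittest import mock

import pytest

from mcpgateway.utils import check_schema_at_head


@pytest.mark.parametrize("at_head,expected_code", [(True, 0), (False, 1)])
def test_main_maps_probe_result_to_exit_code(at_head, expected_code):
    with mock.patch.object(check_schema_at_head, "create_engine"), \
         mock.patch.object(check_schema_at_head, "_alembic_at_head", return_value=at_head):
        with pytest.raises(SystemExit) as excinfo:
            check_schema_at_head.main()
    assert excinfo.value.code == expected_code


def test_main_returns_nonzero_when_probe_raises():
    with mock.patch.object(check_schema_at_head, "create_engine"), \
         mock.patch.object(check_schema_at_head, "_alembic_at_head",
                           side_effect=RuntimeError("bad DATABASE_URL")):
        with pytest.raises(SystemExit) as excinfo:
            check_schema_at_head.main()
    assert excinfo.value.code == 1
```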
H3 — check_schema_at_head.py imports the private function _alembic_at_head from a sibling module
mcpgateway/utils/check_schema_at_head.py:58
from mcpgateway.bootstrap_db import _alembic_at_head
A private (underscore-prefixed) function is now load-bearing for the Helm startup probe. Any internal rename or signature change in bootstrap_db silently breaks pod readiness without a test catching it at PR time. Options:
- Make _alembic_at_head public (alembic_at_head) if it is genuinely shared API, or
- Extract the shared check into mcpgateway/utils/ as a proper public helper and import it from both callers.
Medium — Blocking
M1 — edoburu/pgbouncer:latest is unpinned in the test fixture
tests/integration/fixtures/transaction_pool/docker-compose.yml — an upstream image update can change POOL_MODE defaults or pg_advisory_lock semantics, causing mechanism tests to fail or silently pass incorrectly. Pin to a specific tag or digest and document the version rationale.
M2 — LOCK_ID / BOOTSTRAP_LOCK_ID duplication creates a silent mismatch risk
test_migrations_under_transaction_pool.py defines LOCK_ID = 42_424_242_424_242 at module level, then re-defines BOOTSTRAP_LOCK_ID = 42_424_242_424_242 as a local variable inside the regression test. If advisory_lock() in production ever changes its lock ID, neither constant would detect the drift. Use the module-level constant throughout, or import the sentinel from bootstrap_db.
M3 — URL escape-% logic is duplicated across two modules
Both bootstrap_db.py and check_schema_at_head.py contain:
escaped_url = settings.database_url.replace("%", "%%")
cfg.set_main_option("sqlalchemy.url", escaped_url)
Extract into a shared _make_alembic_cfg(database_url: str) -> Config helper. check_schema_at_head.py already imports from bootstrap_db, so the dependency cost is zero.
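A possible shape for that helper — the script_location value is an assumption about this repo's layout:

```python
from alembic.config import Config


def _make_alembic_cfg(database_url: str) -> Config:
    """Build an Alembic Config with a ConfigParser-safe URL."""
    cfg = Config()
    cfg.set_main_option("script_location", "mcpgateway:alembic")  # assumed path
    # '%' must be doubled or ConfigParser interpolation mangles
    # URL-encoded credentials.
    cfg.set_main_option("sqlalchemy.url", database_url.replace("%", "%%"))
    return cfg
```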
M4 — Missing security-boundary comment in the test fixture
tests/integration/fixtures/transaction_pool/docker-compose.yml sets AUTH_REQUIRED: "false" and uses hardcoded credentials (reprosecret). A developer copying this file for a quick non-isolated test and accidentally exposing port 4444 gets an unauthenticated gateway. Please add a prominent warning block at the top of the file, e.g.:
# !! TEST FIXTURE ONLY — NEVER use in production !!
# AUTH_REQUIRED=false disables all authentication.
# Hardcoded credentials (reprosecret) are intentional for test isolation only.
M5 — E2E compose test is always skipped in CI — no automated gate for the original bug
test_compose_three_replicas_complete_bootstrap_e2e requires MCPGATEWAY_TEST_ALLOW_DESTRUCTIVE_E2E=1 and a locally built image. The regression that motivated this entire PR has no CI-automated end-to-end gate. Consider adding a CI job that builds the image and runs this test, or at minimum add a pytest.mark.skip(reason="CI: add MCPGATEWAY_TEST_ALLOW_DESTRUCTIVE_E2E=1 and make docker to enable") with a clear call-to-action.
M6 — check_schema_at_head.py creates engine without production pool settings
create_engine(settings.database_url) uses SQLAlchemy defaults, not the DB_POOL_SIZE / DB_POOL_TIMEOUT values that the application engine uses. In a tuned production environment, the probe can time out when the real pool would not (or vice versa). Replicate pool settings or share the engine reference.
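One way to make the probe engine's one-shot lifetime explicit (NullPool) while keeping connect behavior aligned — the settings import and the probe's signature are assumptions:

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

from mcpgateway.config import settings  # assumed settings module


def probe() -> bool:
    # NullPool: one connection per probe tick, nothing cached between ticks.
    engine = create_engine(settings.database_url, poolclass=NullPool)
    try:
        with engine.connect() as conn:
            return _alembic_at_head(conn)  # assumed import from bootstrap_db
    finally:
        engine.dispose()
```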
Acknowledged / Follow-up Tracking
The following are noted here for visibility but are already tracked and not blocking this PR:
- RBAC seeder race (Limitation 5 / PR #4480 / issue #4482):
_run_post_migration_bootstrap() now runs without the advisory lock on the fast path. Fix is staged separately — consider a pytest.xfail placeholder so regression is detectable if #4480 is reverted.
- Inconsistent "Database ready" log line (Gap 8): Fast path does not emit the canonical readiness marker. Tracked, one-line fix, deferred.
- Startup probe creates/destroys engine every 5 s (P2): Minor overhead during pod startup; not critical.
- Two sequential DB connections on slow path (P1): Reusing probe_conn would save one acquisition; low priority.
Alternative Implementation Worth Considering
The root cause is that pg_advisory_lock is session-scoped and PgBouncer's transaction pool reuses backends between logical clients. PostgreSQL also provides pg_advisory_xact_lock, which is transaction-scoped — released automatically on COMMIT or ROLLBACK, so PgBouncer backend reuse cannot orphan it. This would eliminate the need for the fast-path probe entirely.
Trade-off: The entire migration run must complete in a single transaction. This is safe for standard Alembic DDL on PostgreSQL (DDL is transactional in PG), but breaks for any migration that uses CREATE INDEX CONCURRENTLY or other non-transactional DDL. If the migration files can be audited for this constraint, a pg_advisory_xact_lock migration is a cleaner long-term fix. Worth keeping as a candidate for a follow-up cleanup PR.
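A sketch of what the transaction-scoped variant could look like (lock id and migration runner are placeholders, not this repo's actual code):

```python
from sqlalchemy import create_engine, text

MIGRATION_LOCK_ID = 42_424_242_424_242  # placeholder sentinel


def migrate_with_xact_lock(database_url: str) -> None:
    engine = create_engine(database_url)
    # engine.begin() opens a single transaction: lock, all DDL, COMMIT.
    with engine.begin() as conn:
        conn.execute(text("SELECT pg_advisory_xact_lock(:id)"),
                     {"id": MIGRATION_LOCK_ID})
        run_migrations(conn)  # placeholder; must avoid CREATE INDEX CONCURRENTLY
    # COMMIT (or ROLLBACK on error) releases the lock automatically — a
    # PgBouncer backend handoff can never orphan it.
```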
The implementation is solid and the test methodology is excellent. Addressing the blocking items above — particularly H1 (log visibility), H2 (startup probe test coverage), H3 (private import), and M1–M4 — would make this merge-ready.
Addresses the inline review on PR #4444:
H1 — fast-path probe exception now logs at WARNING (was DEBUG); names the fallback path for greppability.
H2 — new tests/unit/mcpgateway/utils/test_check_schema_at_head.py with five tests pinning the K8s probe entry point's exit-code contract, exception swallowing, engine.dispose() discipline, and the WARNING-log emission on failure.
H3 — rename _alembic_at_head -> alembic_at_head; the function is load-bearing for the startup probe and a leading underscore misleadingly signals "safe to refactor."
M1 — document substrate alignment in fixture compose; do not pin in isolation, since the rest of the project (root docker-compose.yml, the *-debug / -performance / -verbose-logging variants, and the helm chart's pgbouncer.image.tag) all track :latest.
M2 — drop redundant function-local BOOTSTRAP_LOCK_ID; reuse the module-level LOCK_ID. Replaces the bare numeric literal further down with the same constant.
M3 — extract make_alembic_cfg(database_url, *, configure_logger=False) to dedupe URL-escape + Config setup between bootstrap_db.main() and check_schema_at_head.
M4 — prominent "TEST FIXTURE ONLY" warning at top of the fixture compose file; switch three hardcoded reprosecret refs to ${POSTGRES_PASSWORD:-reprosecret} to match the env-var-with-default pattern used by root docker-compose.yml.
M5 — rewrite destructive-e2e skip message: [regression-gate] prefix, explicit "NOT run in CI" with cost rationale, two-step local enable instructions visible in pytest output.
M6 — probe engine now uses NullPool (truthful one-shot lifetime) plus connect_args imported from mcpgateway.db (psycopg TCP keepalives, prepare_threshold; SQLite check_same_thread) so connect behavior matches production. Pool-internal settings do not apply: the probe opens one connection per K8s tick and disposes the engine.
Tests: 493 unit / 5 integration pass; probe binary exits 0 against live compose DB. No production-code behavior change outside the probe-engine alignment and the H1 log-level bump.
Refs #4444
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Thanks for the thorough review. All nine items addressed in 3761333; brief notes per item below.
H1 — Done; the probe exception now logs at WARNING and names the fallback path.
H2 — Added tests/unit/mcpgateway/utils/test_check_schema_at_head.py covering the 0 / 1 exit-code paths, the exception branch, and engine.dispose() discipline.
H3 — Renamed _alembic_at_head to alembic_at_head; it is load-bearing for the startup probe, so the underscore was misleading.
M1 — Pushing back gently here. Pinning the test fixture's pgbouncer image in isolation would diverge from the rest of the project — the root docker-compose.yml, its -debug / -performance / -verbose-logging variants, and the helm chart's pgbouncer.image.tag all track :latest. Instead, added a substrate-alignment comment block on the fixture's pgbouncer service block that names the verified PgBouncer version (1.25.1 as of 2026-05-01) and the rationale for tracking :latest.
M2 — Took the lighter path you offered: dropped the function-local BOOTSTRAP_LOCK_ID and reused the module-level LOCK_ID throughout.
M3 — Extracted make_alembic_cfg(database_url, *, configure_logger=False); both bootstrap_db.main() and check_schema_at_head now use it.
M4 — Added a prominent "TEST FIXTURE ONLY" warning block and switched the three hardcoded reprosecret references to ${POSTGRES_PASSWORD:-reprosecret}.
M5 — Took the lighter path. Adding a CI job that builds the gateway image and runs the e2e is a real piece of work (image build + 3-replica compose + ~60s polling on every CI run) that's out of scope for this bug fix. Improved the skip message instead: [regression-gate] prefix, explicit "NOT run in CI" with the cost rationale, and two-step local enable instructions.
M6 — Done, with a small framing note: switched to NullPool plus the connect_args from mcpgateway.db so connect behavior matches production; pool-internal settings don't apply because the probe opens one connection per K8s tick and disposes the engine.
Tests after the changes: 493 unit / 5 integration pass; probe binary exits 0 against the live compose DB.
Re-review — follow-up on requested changes
Thank you for the thorough follow-up commit — the implementation quality here is excellent and the responsiveness to the review feedback is appreciated. All blocking items addressed ✅
One small thing still open — M1
The image pin request was addressed with the substrate-alignment comment block rather than a pin — a reasonable trade-off. That said, the block ends with a placeholder line:
# Repo-wide pin/replace conversation tracked at <issue link TBD>.
Either open a tracking issue and drop the number in, or remove the line. A TBD committed to a main-bound file tends to stay TBD forever.
No new issues found
Went through all changed files — no new bugs, no security concerns, no behavioral regressions beyond the already-acknowledged RBAC seeder fast-path race (tracked at #4480 / #4482). Once the TBD line is resolved, this is merge-ready.
✅ E2E Verification Proof — Issue #4051 Fixed
Test Execution Summary
Date: 2026-05-08 12:35:27 UTC
Test Scenario
The automated E2E test validates the fix by:
Test Results
Verification Script
A comprehensive E2E automation script has been created (but not committed) at:
This script:
Reproduction Steps# 1. Install dependencies
make install-dev
uv pip install psycopg[binary]
# 2. Run the E2E verification script
./tests/e2e/run_issue_4051_verification.sh
Expected vs Observed Behavior
Pre-Fix (Issue #4051):
Post-Fix (PR #4444):
Conclusion
The E2E test confirms that PR #4444 successfully resolves issue #4051. All 3 gateway replicas completed bootstrap without advisory lock hangs, meeting all success criteria.
Fix Layers Validated:
Proof Artifacts
Full proof artifacts are available at:
Including:
Automated verification completed successfully 🎉
Per re-review feedback on PR #4444 — a TBD placeholder committed to a main-bound file tends to stay TBD forever. Tracking issue is deliberately not being filed; the substrate-alignment rationale is already captured in the surrounding comment block.
Refs #4444
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Removed in 29aa33e. Repo-wide pin conversation can happen separately if it comes up; no need to leave a TBD anchor in main. Thanks for running the e2e — appreciated.
Force-pushed from 29aa33e to 06456e1
M1 resolved — TBD placeholder removed in 29aa33e; clean choice to drop it rather than leave an unfiled anchor.
This is merge-ready. All blocking and minor items from both review rounds addressed. Thanks for the fast turnaround on the follow-up.
Multi-replica gateway startup behind a transaction-pooling PgBouncer
(pool_mode=transaction) can hang on Alembic's session-scoped advisory
lock: PgBouncer hands the same backend to multiple gateway clients,
and the lock taken by one client is left orphaned on the backend when
that client disconnects. Subsequent clients then see the lock as held
and spin in the retry loop indefinitely.
This change ships a two-layer fix and a chart-level startup probe that
gates Ready until the schema is fully migrated.
L1 — Fast-path probe in bootstrap_db.main(): when alembic_version
already matches the script-directory head, skip the advisory
lock entirely. The post-migration bootstrap (admin user,
default roles, orphaned-resource assignment) is extracted to a
shared helper and runs on both fast and slow paths so a
partial-startup edge case never leaves downstream state
unpopulated.
L2 — Chart wires MCPGATEWAY_SKIP_MIGRATIONS: the helm chart sets
the new setting to "true" on gateway pods iff
migration.enabled is true, so the in-process bootstrap is
skipped when the chart's migration Job is the source of truth
for schema work. Compose-side L2 deferred (see Future work).
Schema-at-head startup probe: replaces the gateway pod's
db_isready startup probe with a check that exits 0 only when
alembic_version matches the script-directory head. Pods refuse
Ready until the schema is fully migrated regardless of which
entity (pre-install Job, post-install Job, init container,
external CD pipeline) ran the migration.
Includes a permanent reproduction fixture (postgres + pgbouncer in
transaction-pool mode) under tests/integration/fixtures/ and the
mechanism-pinning regression suite at
tests/integration/test_migrations_under_transaction_pool.py.
Tests:
- Unit: helm-chart render tests for the SKIP_MIGRATIONS env-wiring
contract; truth-table tests for the alembic_at_head fast-path
helper; bootstrap fast-path entry tests pinning the helper is
used and the lock is not re-entered on the second invocation;
test_check_schema_at_head covers the K8s probe entrypoint's
exit-code contract, exception swallowing, and dispose discipline.
- Integration: the PgBouncer transaction-pool fixture, an
advisory-lock orphaning regression suite, and an end-to-end
three-replica compose smoke (gated by
MCPGATEWAY_TEST_ALLOW_DESTRUCTIVE_E2E=1).
Verified on docker-compose, OCP minimal profile (chart-internal
Postgres, post-install hook), and OCP PGO profile (CrunchyData PGO,
transaction-pool PgBouncer, pre-install hook): probe correctly gates
Ready, replicas reach Ready in <60s with 0 restarts, no advisory
locks left held.
Inline review feedback addressed in the same commit; see the PR
thread for the per-item response.
Refs #4051
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Force-pushed from 06456e1 to 5384128
Black autoformatted a long log message in mcpgateway/main.py:1265 into two adjacent string literals; the project's pylint CI rule treats implicit-str-concat (W1404) as fatal. Combined into a single string with the same content.
Refs #4444
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Summary
Fixes issue #4051 — Alembic migration advisory lock hangs when multiple gateway replicas start through PgBouncer in transaction-pooling mode. The session-scoped pg_advisory_lock is orphaned across PgBouncer's backend handoffs; the N-th gateway pod sees the lock held by an effectively-dead session and spins in its retry loop until the ~10-minute timeout fires.
This PR closes the bug along three pieces and ships them together so the chart is safe under both deployment patterns:
- L1 — bootstrap_db.main() now probes alembic_version before attempting the advisory lock. When the schema is at the script directory's head, the lock acquisition is skipped entirely. Replicas 2..N never participate in the race.
- L2 — a MCPGATEWAY_SKIP_MIGRATIONS settings flag suppresses the in-pod bootstrap_db.main() call entirely. The mcp-stack Helm chart wires it from migration.enabled so the contract "the pre-install Job owns migrations, app pods skip" is enforced at the chart layer rather than left to operators to coordinate.
- Startup probe — the gateway pod's startup probe runs python -m mcpgateway.utils.check_schema_at_head instead of db_isready. Pods refuse Ready until alembic_version matches the script-directory head, regardless of which entity (pre-install Job, post-install Job, init container, external CD pipeline) ran the migration. Defense in depth: even if L2 mis-fires, no pod ever serves traffic against a half-migrated DB.
Library defaults preserve backward compatibility: a direct docker run mcpgateway:latest continues to bootstrap its own schema. Only the chart and (future) compose overlays opt into SKIP_MIGRATIONS=true.
Related: the existing migration.hostKey: host / portKey: port knobs in charts/mcp-stack/profiles/ocp/values-pgo.yaml already let the migration Job bypass PgBouncer; that workaround stays in place but is no longer load-bearing.
Approach
The fix was built reproduction-first: rather than implement against the issue text alone, the bug was first reproduced deterministically in a throwaway docker-compose harness (inlined in the Test results section below) and then again on a real OpenShift cluster against CrunchyData PGO + transaction-pool PgBouncer. Both reproductions are documented in todo/issue-4051-ocp-reproduction.md, and the compose-side stack lives permanently in the repo as a test fixture at tests/integration/fixtures/transaction_pool/.
For each layer (L1 fast-path, L2 chart-wired skip), the cycle was:
The regression test for the bug — test_bootstrap_db_skips_lock_when_schema_already_at_head — is reliably RED pre-fix (240s timeout, 29/60 retry attempts visible in logs) and reliably GREEN post-fix (~1.4s). It's not a flaky timing-based test: it synthesizes the held-lock condition deterministically by holding the advisory lock in a distinct, still-alive Postgres session for the duration of the second bootstrap_db.main() call. That guarantees the bootstrap's PgBouncer-side session sees pg_try_advisory_lock return FALSE regardless of PgBouncer's internal backend assignment — without that guarantee, PgBouncer can hand the bootstrap the same server backend that the holder used, and pg_try_advisory_lock succeeds via PostgreSQL's reentrant-within-session semantics, masking the bug. (We landed on this design after observing a passing test that should have failed; the false-positive is documented as test_reentrant_acquire_through_same_pgbouncer_is_not_a_counter_example so a future reader who hits the same gotcha understands what they're seeing.)
Gaps closed
HIGH — Bug correctness
Gap 1 (HIGH) — Advisory lock orphan via PgBouncer transaction-pool handoffs: bootstrap_db.advisory_lock always called pg_try_advisory_lock and retried up to 60 times with exponential backoff. When PgBouncer reused a backend whose previous client took the lock, new clients saw it held by an unreachable session and spun until timeout. Fixed with a _alembic_at_head probe that runs before the lock — when the DB is at the script-directory head, the lock is never attempted. The probe uses Alembic's MigrationContext.get_current_heads() and ScriptDirectory.from_config(cfg).get_heads() for the canonical source of truth; any error during probing falls through to the existing slow path so empty / partial / out-of-date databases still go through the explicit handling.
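A sketch of the probe's shape as just described (the real helper lives in mcpgateway/bootstrap_db.py; the exact structure here is an assumption):

```python
from alembic.runtime.migration import MigrationContext
from alembic.script import ScriptDirectory


def _alembic_at_head(cfg, engine) -> bool:
    script_heads = set(ScriptDirectory.from_config(cfg).get_heads())
    with engine.connect() as conn:
        db_heads = set(MigrationContext.configure(conn).get_current_heads())
    # An empty DB yields an empty set of current heads -> the probe says
    # "not at head" and the caller falls through to the explicit slow path.
    return bool(script_heads) and db_heads == script_heads
```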
bootstrap_db.main()in its own lifespan. So a 3-replica deployment produces 24 racing acquirers, not 3. Reproduction on OpenShift confirmed the fan-out: pre-fix, the same pod's logs show interleavedAcquired lock on attempt NandLock held by another instance, attempt N/60at the same timestamps — different workers in the same gunicorn process. The L1 fast-path eliminates the race for replicas 2..N entirely; for replica 1 the within-pod race is bounded to a sub-second seed window.Gap 3 (HIGH) — Chart safety today depends on the OCP-only workaround:
migration.hostKey: host/portKey: portroute the pre-install Job direct to Postgres (bypassing PgBouncer), masking the bug whenever the Job is enabled. A community user settingmigration.enabled=false(because they manage migrations externally, or run the chart in compose-flavoured contexts) loses that protection. Layer 1 makesmigration.enabled=falsesafe.HIGH — Operator ergonomics
Gap 4 (HIGH) — In-pod bootstrap runs even when a dedicated migration runner has already populated the schema: redundant probe activity per worker, plus the within-pod race for replica 1's seed window. Fixed with MCPGATEWAY_SKIP_MIGRATIONS=true, which short-circuits the lifespan call to bootstrap_db.main() and emits a single audit-log line so operators can see the choice in pod logs.
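The short-circuit could look roughly like this — the settings attribute name and whether the bootstrap call is async are assumptions:

```python
import logging

from mcpgateway import bootstrap_db

logger = logging.getLogger(__name__)


async def _maybe_bootstrap(settings) -> None:
    if settings.mcpgateway_skip_migrations:  # wired from MCPGATEWAY_SKIP_MIGRATIONS
        # The single audit-log line that makes the choice visible in pod logs.
        logger.info("MCPGATEWAY_SKIP_MIGRATIONS=true: skipping in-pod bootstrap")
        return
    await bootstrap_db.main()  # async call is an assumption
```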
migration.enabled=trueANDMCPGATEWAY_SKIP_MIGRATIONS=trueis a foot-gun (forgetting one means redundant in-pod bootstrap; forgetting the other means schema never populated). Fixed incharts/mcp-stack/templates/deployment-mcpgateway.yaml: the env var is rendered as{{ ternary "true" "false" .Values.migration.enabled }}and placed afterextraEnvso a user override cannot accidentally undo the contract.MEDIUM — Test infrastructure
Gap 6 (MEDIUM) — No regression test for the advisory-lock invariant: no existing test exercised concurrent bootstrap_db.main() callers through PgBouncer. Fixed with tests/integration/test_migrations_under_transaction_pool.py containing five @pytest.mark.integration tests: three pinning the PgBouncer mechanism (so the substrate behavior is captured for posterity even if PgBouncer changes), one regression gate (synthesizes the orphaned-lock condition and asserts bootstrap_db.main() returns within 10s — fails in 240s pre-fix), and one idempotency invariant (asserts a second call after schema is at head never enters the advisory_lock context manager). Tests skip cleanly without --with-integration.
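A condensed sketch of the regression gate's core move — fixture names, DSNs, and the run_bootstrap wrapper are illustrative; the real test lives in the file above:

```python
import psycopg

LOCK_ID = 42_424_242_424_242  # assumed to match bootstrap_db's sentinel


def test_bootstrap_skips_lock_when_schema_at_head(direct_pg_dsn, run_bootstrap):
    # Hold the lock in a distinct, still-alive session that bypasses PgBouncer,
    # so the bootstrap can never be handed the holder's backend (reentrancy
    # would otherwise mask the bug).
    with psycopg.connect(direct_pg_dsn, autocommit=True) as holder:
        holder.execute("SELECT pg_advisory_lock(%s)", (LOCK_ID,))
        # Schema is already at head: the fast path must return without ever
        # touching the (held) advisory lock. Pre-fix this spun for ~240s.
        run_bootstrap(timeout=10)
```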
tests/integration/fixtures/transaction_pool/docker-compose.yml(driven by the integration suite). A throwaway shell harness (demonstrate_orphan.shproving the PgBouncer mechanism in 3 psql calls;reproduce.shscaling gateway replicas) lived on the branch during development and was removed in the squashed commit; the script contents are inlined verbatim in the Test results section below so a future reader can rebuild the reproduction without checking out an old commit.LOW — Documentation and observability
Gap 8 (LOW) — Two readiness markers, one log-line: the L1 fast-path returns without emitting "Database ready" (only the slow-path return logs it). Operators reading pod logs see either the slow-path marker or the fast-path skip line, not a single canonical "bootstrap finished" line. Tracked as a tiny follow-up; not blocking.
Architecture
Before — every replica × every worker races for the lock
After Layer 1 — fast-path before the lock
After Layer 2 — pods don't even probe when the Job owns it
After: schema-at-head startup probe (chart-level Ready gate)
Test results
The change ships with three layers of verification, each addressing a different failure mode:
The integration tests sit behind
--with-integration; the unit tests run on every CI pass. The cluster smoke is documented evidence, not automated, and lives as todo/issue-4051-ocp-reproduction.md for the PR record.
Reproduction harness — files, steps, and observed results
A throwaway harness lived on the branch during development at
tests/reproduction/issue-4051/, then was removed before merge. Its three files are inlined below so a future reader can rebuild the reproduction
without checking out an old commit.
The compose stack the scripts drove was promoted to
tests/integration/fixtures/transaction_pool/docker-compose.yml so the integration test owns the same infrastructure long-term. The shell harness
is a separate, optional layer on top of that for ad-hoc runs.
Why two harnesses for one bug?
demonstrate_orphan.sh proves the PgBouncer mechanism exists (orphaned advisory locks); reproduce.sh proves the operational symptom exists (replicas hang). Together they answer two
questions any reviewer might ask: "is this a real PgBouncer property?" and
"does it actually cause user-visible breakage?" — without making us guess
timing.
PgBouncer's
default_pool_size is deliberately small (2) in the fixture so total client connections comfortably exceed it, forcing backend multiplexing
— without that pressure, the bug is hard to trigger locally because bootstrap
can finish before any backend handoff happens.
Step 1 — demonstrate_orphan.sh (mechanism proof)
Three steps:
1. Through PgBouncer: take advisory lock 42424242424242, disconnect.
2. Query pg_locks directly on Postgres → lock is still held by the now-orphaned backend.
3. New connection through PgBouncer: pg_try_advisory_lock(42424242424242) returns f.
Step 3 returning f is the exact condition that makes bootstrap_db.main() hang: a new gateway pod spins on a lock no live client owns. Always reproducible; runs in ~10 s.
then dumps
pg_locksandpg_stat_activityif any pod is still hung.Pre-fix expected output:
1."Database ready"; the others end atINFO [alembic.runtime.migration] Will assume transactional DDL.pg_locksshows one granted advisory row plus several backendsidle in transactiononSELECT pg_try_advisory_lock(...).Post-fix expected output: all replicas log
Database readywithin thetimeout, no stuck backends, exits
0.Inlined file contents
demonstrate_orphan.shreproduce.shREADME.mdWhy the regression test is deterministic (not flaky)
Three ideas had to land before
test_bootstrap_db_skips_lock_when_schema_already_at_headwas reliably RED on main:The middle attempt's false-positive is captured as a permanent test (
test_reentrant_acquire_through_same_pgbouncer_is_not_a_counter_example) so future readers who see "but I called pg_try_advisory_lock through PgBouncer and it returned TRUE!" understand the reentrant-session gotcha.The same logic underpins the
_clean_orphaned_lockautouse fixture: it runspg_terminate_backendon any session still holding the test lock id between tests, so a single failed test cannot wedge the next one behind a leaked orphan.Unit — `_alembic_at_head` truth table (5 tests)
Pins each branch of the fast-path predicate independently of the integration test, so a future refactor that widens or narrows the trigger condition is caught locally.
Unit — `MCPGATEWAY_SKIP_MIGRATIONS` settings field (3 tests)
Library default stays False (the
docker run mcpgatewayhappy path), env var override flips True, explicit False matches the default. No accidental coercion.Unit — lifespan helper respects the flag (2 tests)
The skip-case test also asserts the audit-log line is emitted so operators can see the choice in pod logs.
Unit — Helm chart render (3 tests)
Tests run
helm templateas a subprocess, parse the rendered manifests, and assert on the gateway Deployment's env block. Skipped automatically whenhelmis not on PATH; no cluster needed.Integration — PgBouncer mechanism + bootstrap regression (5 tests, gated by `--with-integration`)
The mechanism tests document what PgBouncer does (lock orphaning is a real, named property of transaction-pool handoffs) so a future reader who sees "but pg_try_advisory_lock returned TRUE through PgBouncer!" understands the reentrance gotcha. The regression tests gate the actual fix.
Run with the docker-compose stack from
tests/integration/fixtures/transaction_pool/docker-compose.yml.OCP verification — L1 fix on real OpenShift + CrunchyData PGO (cluster: )
Full reproduction-vs-fix evidence in
todo/issue-4051-ocp-reproduction.md.
OCP verification — L2 (chart-wired SKIP_MIGRATIONS) on real OpenShift
The 3-way pre-fix / post-L1 / post-L2 comparison is the headline win. The chart's contract — migration Job runs → app pods skip — is honored automatically; operators don't have to remember to set both migration.enabled AND MCPGATEWAY_SKIP_MIGRATIONS because the chart wires them together.
todo/issue-4051-ocp-reproduction.md.OCP verification — L2 on minimal profile (no PGO operator, chart-managed Postgres + PgBouncer)
The OCP-PGO smoke above proves the fix on the operator-managed path. This second smoke proves it on the community Helm path — the chart's base
values.yamlwith the chart's own standalonepostgres+pgbouncerDeployments (no CrunchyData operator). That's the configuration most community users will hit and the one a downstream pipeline usingDEPLOYMENT_PROFILE=minimallands on.Stack components running in the namespace:
What this confirms beyond the PGO smoke:
migration.enabled→MCPGATEWAY_SKIP_MIGRATIONSwiring is profile-agnostic — works on both operator-managed and chart-managed Postgres.DEPLOYMENT_PROFILE=minimal(without PGO) is unblocked by this PR with no additional configuration changes — they only need to bump their image tag to one with this fix.Additional improvements
Idempotent post-migration bootstrap extracted to
_run_post_migration_bootstrap()— admin-user creation, default-role seeding, and orphaned-resource assignment are now called from both the fast-path and slow-path branches. Previously these were only inside theadvisory_lockblock, so a fast-path skip would have missed them on a partial-startup edge case (replica 1 stamps head then crashes before bootstrapping the admin user → replica 2 fast-paths and runs the bootstrap steps to fill the gap).Reproduction harness inlined in this description — the development-time harness directory has been removed in the squashed commit; the compose stack it drove was promoted to
tests/integration/fixtures/transaction_pool/docker-compose.ymlso the integration suite owns the same infrastructure long-term. The two shell scripts (reproduce.sh,demonstrate_orphan.sh) are inlined in the Test results section above so a future reader can rebuild the harness from the PR alone.Module docstrings reference the issue number — both the helper functions and the integration test module carry an inline reference to issue [BUG]: Alembic migration advisory lock hangs when multiple gateway pods start through PgBouncer (transaction pooling mode) #4051 so a future reader greps once and lands in the right context.
Limitations
Fast-path returns without emitting
"Database ready"— the canonical readiness marker only fires on the slow-path return. Operators reading pod logs see eitherDatabase readyorSchema already at Alembic head; skipping migration lock. Both signal "bootstrap finished," but inconsistently. Captured as Gap 8 above; trivial follow-up.Table-based mutex fallback not included — the original plan considered a Postgres table-based mutex for environments where
migration_database_url (direct-Postgres bypass) isn't available AND the chart's pre-install Job is disabled. With L1 + L2 + the schema-at-head probe in place, that combination is well-covered: L1 makes the in-pod path safe, L2 lets the chart guarantee the in-pod path is skipped when the Job runs, and the probe holds Ready until the schema is correct regardless. Reopening if a real-world report justifies it.
DROP SCHEMA in PGO setups requires ownership restoration — observed during the OCP reproduction. CrunchyData PGO assigns schema ownership to a service user; a manual
DROP SCHEMA public CASCADE; CREATE SCHEMA public; re-creates the schema owned by postgres, so the service user can't create tables in it until ownership is restored:
Pod readiness can be misleading during pre-fix cycling — observed pods show
1/1 Ready even while several gunicorn workers are still in their retry loop. One worker's lifespan completing seems sufficient to satisfy the readiness probe. With L1+L2 the storm window is gone; this is captured here for operator awareness when reading historical pre-fix logs.
Fast-path makes the post-migration role/user-role seeders non-serialised —
bootstrap_default_roles() and the platform_admin user-role assignment now run outside the migration advisory lock on every replica that takes the fast path. Chart-default deployments (migration.enabled=true + MCPGATEWAY_SKIP_MIGRATIONS=true on gateway pods) are unaffected — the migration Job is the sole writer. Configurations where gateway pods bootstrap themselves (e.g., migration.enabled=false, multi-worker single-pod gunicorn) carry a small first-startup race window that can produce duplicate active rows in roles / user_roles. Fully closed by the follow-up linked below.
RBAC seeder race fix — PR #4480 / issue #4482
Inline review on this PR identified two related races (HIGH on
roles, MEDIUM onuser_roles) that the L1 fast-path exposes by running the post-migration seeders without holding the advisory lock. The fix is staged in PR #4480 with issue #4482 as the tracking record. Approach: partial unique indexes on(name, scope)/(user_email, role_id, scope[, scope_id])WHERE is_active = true, an idempotent dedupe-then-constrain Alembic migration, plus savepoint-and-refetch inRoleService.create_role/assign_role_to_user.The race fix is strictly additive; the hang fix in this PR does not depend on it. Split out to keep this PR narrowly scoped to the original bug, and to keep the diff in front of the chart-and-bootstrap reviewers separate from the diff in front of the RBAC reviewers.
Compose-side L2 (deferred)
The root docker-compose.yml was left unchanged. It's currently safe thanks to L1 alone (the fast-path makes in-pod bootstrap pgbouncer-safe). Adding a dedicated migrate service in compose plus MCPGATEWAY_SKIP_MIGRATIONS=true on the gateway would mirror the chart's contract one-to-one — useful for users who want production-shaped semantics in dev. Tracked as a separate issue if there's appetite; deliberately out-of-scope here to keep docker-compose.yml touch-free for existing users.
Unify the readiness log line
Per Gap 8 above. Either:
"Database ready"from both paths (fast and slow), or"Database ready (fast-path)"/"Database ready (slow-path)").Either change is a one-line edit; deferred so this PR stays scoped to the bug fix.
MIGRATION_DATABASE_URL (only if requested)
The Helm chart already exposes migration.hostKey / portKey to point the pre-install Job direct to Postgres (bypassing PgBouncer). A code-level MIGRATION_DATABASE_URL setting would let docker-compose users do the same without two DATABASE_URL definitions. Not necessary for the bug fix; deferred unless someone explicitly asks for it.
Closes #4051