fix(rbac): block duplicate role/user_role rows on concurrent bootstrap#4480
Draft
gandhipratik203 wants to merge 12 commits intofix/4051-alembic-pgbouncer-transaction-poolfrom
Draft
fix(rbac): block duplicate role/user_role rows on concurrent bootstrap#4480gandhipratik203 wants to merge 12 commits intofix/4051-alembic-pgbouncer-transaction-poolfrom
gandhipratik203 wants to merge 12 commits intofix/4051-alembic-pgbouncer-transaction-poolfrom
Conversation
Adds tests/integration/test_migrations_under_transaction_pool.py — three @pytest.mark.integration assertions pinning the PgBouncer transaction-pool behavior that makes bootstrap_db.main() hang when gateway replicas start concurrently: - a disconnected pgbouncer client's advisory lock persists on Postgres, - a fresh Postgres session is blocked from acquiring the orphaned lock, - a same-backend reuse via pgbouncer is reentrant (documents the gotcha so future readers don't mistake TRUE for "no bug"). Tests are gated behind --with-integration so CI default runs are unaffected. The fixture stack (postgres + pgbouncer with pool_mode= transaction and default_pool_size=2) lives at tests/integration/fixtures/transaction_pool/docker-compose.yml. Also adds tests/reproduction/issue-4051/ with demonstrate_orphan.sh (deterministic mechanism proof) and reproduce.sh (replica-scaling symptom reproduction). These exist for PR-review evidence and local debugging during development; they are expected to be removed when the fix PR lands, with the evidence preserved in the issue thread. No runtime changes. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Multiple gateway replicas starting concurrently through a transaction- pooling connection pooler (PgBouncer pool_mode=transaction, pgcat, etc.) can hang indefinitely on bootstrap. A previous replica's pg_advisory_lock is session-scoped on its Postgres server backend; when the pooler returns that backend to its pool across a transaction boundary without clearing the lock, the next replica's pg_try_advisory_lock sees it held by an orphaned session and spins through its retry loop until the ~10-minute timeout fires. Add a fast-path in bootstrap_db.main() that probes alembic_version via MigrationContext before attempting the advisory lock: when the DB is already at the Alembic script directory's head, skip the lock entirely and run only the idempotent post-migration bootstrap steps (team visibility normalization, admin user, default roles, orphaned resource assignment). Replicas 2..N of a multi-pod deployment take this branch on every normal restart, so the advisory-lock orphan is never contended. The lock is still acquired for the slow path (empty DB, version mismatch, or any probe error), preserving the existing concurrent- migration protection. Post-migration bootstrap was extracted into _run_post_migration_bootstrap() so both paths share it. Adds tests/integration/test_migrations_under_transaction_pool.py:: test_bootstrap_db_skips_lock_when_schema_already_at_head as a regression gate: it synthesizes the held-lock precondition via a distinct direct-to-postgres session and runs bootstrap through pgbouncer. Pre-fix, the test times out after 240s; post-fix, it completes in under a second. Closes #4051 Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Adds TestAlembicAtHead with five unit tests covering the bool outcomes that decide whether main() skips the migration advisory lock: - at head → True (fast-path fires) - no alembic_version in DB → False (slow path handles empty/partial DBs) - head mismatch → False (migrations actually need to run) - script directory has zero heads → False (defensive) - any probe exception → False (fallback to slow path) Direct coverage independent of the integration test; narrows the blast radius of a future refactor to this one helper. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Adds test_bootstrap_db_is_idempotent_once_schema_is_at_head as an additional invariant for the issue #4051 fast-path: after one successful bootstrap, a subsequent call must NOT enter advisory_lock at all — it should take the fast-path exclusively. The test wraps bootstrap_db.advisory_lock with a spy and asserts zero entries on the second invocation. This catches regressions where a future refactor accidentally re-routes bootstrap through the lock even when the schema is at head (the specific failure mode that would bring issue #4051 back under transaction-pooling connection poolers). Also asserts the second call completes under 3s and leaves no advisory lock held. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…otstrap Adds a settings flag (`mcpgateway_skip_migrations`, default False) that gates the gateway's lifespan call to bootstrap_db.main(). When set, the gateway logs that it is yielding migration ownership to a separate runner and skips the in-pod path entirely. This is the operator-facing half of the issue #4051 fix. The L1 fast-path (already on this branch) makes in-pod bootstrap safe under transaction-pool connection poolers; this flag lets a Helm chart or init-container deployment turn the in-pod path off entirely once a dedicated migration runner has populated the schema. Library default stays False so direct `docker run mcpgateway` continues to bootstrap itself with no operator action. Wires the gate via a small lifespan helper (`_run_initial_db_bootstrap`) to keep the lifespan body unchanged and the gate trivially testable. Tests: - settings.skip_migrations defaults False - MCPGATEWAY_SKIP_MIGRATIONS=true → True - MCPGATEWAY_SKIP_MIGRATIONS=false → False - lifespan helper invokes bootstrap_db when flag off - lifespan helper skips bootstrap_db AND logs an audit line when on Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Wires the gateway Deployment env var introduced in the prior commit into the chart so the pairing "migration Job runs → app pods skip" is enforced at the chart layer rather than relying on operators to set both flags consistently. The env entry is placed after extraEnv so a user-supplied override cannot accidentally undo the contract. Behavior: migration.enabled=true → MCPGATEWAY_SKIP_MIGRATIONS="true" migration.enabled=false → MCPGATEWAY_SKIP_MIGRATIONS="false" default values → migration.enabled=true → SKIP="true" Adds tests/unit/charts/test_deployment_mcpgateway.py with three helm-template assertions pinning each branch. Tests skip automatically when helm is not on PATH; no cluster needed. Issue #4051 Layer 2. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Adds test_compose_three_replicas_complete_bootstrap_e2e — drives the fixture compose stack at the container level, scales the gateway service to 3 replicas through transaction-pool PgBouncer, and asserts all three reach bootstrap completion within 60s. This is the closest re-runnable analog of the manual OCP cluster smoke captured in todo/issue-4051-ocp-reproduction.md, exercising the full pod-startup path (gunicorn + lifespan + bootstrap_db.main()) inside real containers rather than calling the library entry point in-process. The other 5 tests in this module already cover the bug at the mechanism (PgBouncer orphaning) and library (bootstrap_db) levels. This e2e test catches regressions of the integration shape: gunicorn worker fan-out, lifespan ordering, image-level config loading, etc. The test is destructive — it drops the fixture's public schema. Skipped unless MCPGATEWAY_TEST_ALLOW_DESTRUCTIVE_E2E=1 is set, matching the opt-in pattern already used elsewhere in tests/ conftest.py for external-DB safety. Also skips cleanly when docker is missing, the gateway image isn't built, or the fixture postgres/ pgbouncer aren't running — each with an actionable message explaining the missing prerequisite. Verified locally: 14.78s post-fix, hangs to 240s pytest.mark.timeout pre-fix. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Issue numbers are pointers, not content — they rot once the issue is closed. The fix outlives the issue. Removes "issue #4051" / "Layer 2" references from code comments and test docstrings, replacing them with prose that names the BEHAVIOR being protected (advisory-lock orphaning behind transaction-pool poolers, MCPGATEWAY_SKIP_MIGRATIONS contract, etc). Issue references are kept where they belong — commit messages and the PR description — and removed where they don't (test docstrings, helper comments, file-level docstrings, the test fixture's compose file). Also renames the test fixture's docker compose project name from "mcf4051" to the more descriptive "transaction-pool-fixture" since that file has no surviving link to the originating issue. No behavior changes. 493 tests pass; 1 (e2e) skips cleanly without the destructive opt-in. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
CI's pre-commit run flagged two issues on this branch: 1. .secrets.baseline was out-of-date — the new mcpgateway_skip_migrations field in config.py shifted line numbers of pre-existing detected patterns by +42 lines, and detect-secrets refused to pass while the baseline still pointed at the old line numbers. Refreshed via `detect-secrets scan --update --use-all-plugins`. Four new entries from this PR's test fixtures (synthetic test passwords like `reprosecret`) are present in the baseline marked is_secret=False: they are clearly throwaway test values, not real credentials. 2. Black wanted minor reformatting in three of the test/source files we touch in this PR (mcpgateway/main.py, two test files). Applied. No behavior changes. 488 unit tests still pass. The detect-secrets and black pre-commit hooks now pass locally on the PR-relevant files. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Replaces the gateway Deployment's db_isready startup probe with a schema-at-head check so pods cannot serve traffic against an empty or mid-migration DB regardless of which entity ran the migration: the chart's pre-install Job, post-install Job, an init container, an external CD pipeline, or a manual operator step. Adds mcpgateway.utils.check_schema_at_head as the probe command. It exits 0 only when ``alembic_version`` matches the script-directory head, and exits 1 on any error so probe failures fail closed. Refs #4051 Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…t bootstrap
After the L1 fast-path landed, ``bootstrap_default_roles`` and the
platform_admin user-role assignment run outside the migration advisory
lock on every replica that takes the fast path. Both code paths were
SELECT-then-INSERT with no DB-tier uniqueness on (roles.name, roles.scope)
or (user_roles.user_email, role_id, scope, scope_id) — and ``Role.id`` /
``UserRole.id`` are client-generated UUIDs, so the PK never saved us.
Concurrent fast-path workers could write duplicate active rows, after
which ``get_role_by_name`` / ``get_user_role_assignment`` raise
``MultipleResultsFound``.
Adds:
* Partial unique indexes (``WHERE is_active = true``) on the relevant
column tuples — split into two for ``user_roles`` because ``scope_id``
is nullable and both Postgres and SQLite treat NULL as distinct in
ordinary unique indexes.
* Idempotent dedupe-then-constrain Alembic migration that soft-deletes
duplicate active rows (oldest by created_at / granted_at wins) so the
indexes can be added on databases that already raced.
* Savepoint-and-refetch handling in ``RoleService.create_role`` and
``RoleService.assign_role_to_user``: insert is wrapped in
``db.begin_nested()``, an IntegrityError that resolves to a winning
row is treated as success, and any other IntegrityError surfaces as
``ValueError``.
* Unit tests pinning the savepoint refetch path and the non-race
IntegrityError → ValueError surface.
* Integration test: 6 concurrent seeders against real Postgres must
converge on exactly one active row per (name, scope) and exactly one
active platform_admin user-role assignment.
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
This was referenced Apr 27, 2026
46f1d84 to
913fa4e
Compare
06456e1 to
5384128
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #4444. After the L1 fast-path landed in #4444,
bootstrap_default_rolesand the platform_admin user-role assignment run outside the migration advisory lock on every replica that takes the fast path. Both code paths were SELECT-then-INSERT with no DB-tier uniqueness, and the row IDs are client-generated UUIDs — so concurrent fast-path workers can write duplicate active rows. The nextget_role_by_name/get_user_role_assignmentthen raisesMultipleResultsFound.This PR closes the race by adding the missing DB-tier guards and graceful app-tier handling for any in-flight retries.
What changed
mcpgateway/db.py— partial unique indexes:roles(name, scope) WHERE is_active = trueuser_roles(user_email, role_id, scope) WHERE is_active = true AND scope_id IS NULLuser_roles(user_email, role_id, scope, scope_id) WHERE is_active = true AND scope_id IS NOT NULLTwo
user_rolesindexes are needed becausescope_idis nullable and both Postgres and SQLite treat NULL as distinct in plain unique indexes.mcpgateway/alembic/versions/c3d5e7f9a1b2_*.py— idempotent dedupe-then-constrain migration. Soft-deletes duplicate active rows viais_active = false(oldest bycreated_at/granted_atwins) so the indexes can be added on databases that already raced. Audit history preserved; downgrade drops the indexes only.mcpgateway/services/role_service.py—create_roleandassign_role_to_userwrap their inserts indb.begin_nested(). AnIntegrityErrorthat resolves to a winning row is treated as success and the winner is returned; anIntegrityErrorthat does not resolve to a winning row surfaces asValueErrorper the documented contract.Tests — 5 new unit tests pinning the savepoint refetch path and the non-race
IntegrityError→ValueErrorsurface, plus 1 new integration test (6 concurrent seeders against real Postgres → exactly 1 active row per(name, scope)and exactly 1 active platform_admin assignment).Verified on
post-install,pre-upgradehookPhase)pre-install,pre-upgradehookPhase)In all three: migration upgrade is clean, gateway pods reach Ready with 0 restarts, all three partial unique indexes are present, no duplicate active rows, no advisory locks held.
Negative control verified locally: removing the
rolespartial unique index makes the new integration test fail with three duplicateplatform_viewerrows.Notes
fix/4051-alembic-pgbouncer-transaction-pool(PR fix(bootstrap): improve startup reliability for multi-replica deploys #4444). Once fix(bootstrap): improve startup reliability for multi-replica deploys #4444 merges tomain, GitHub will auto-retarget this PR's base tomain.