fix(rbac): block duplicate role/user_role rows on concurrent bootstrap by gandhipratik203 · Pull Request #4480 · IBM/mcp-context-forge

gandhipratik203 · 2026-04-27T20:57:57Z

Summary

Follow-up to #4444. After the L1 fast-path landed in #4444, bootstrap_default_roles and the platform_admin user-role assignment run outside the migration advisory lock on every replica that takes the fast path. Both code paths were SELECT-then-INSERT with no DB-tier uniqueness, and the row IDs are client-generated UUIDs — so concurrent fast-path workers can write duplicate active rows. The next get_role_by_name / get_user_role_assignment then raises MultipleResultsFound.

This PR closes the race by adding the missing DB-tier guards and graceful app-tier handling for any in-flight retries.

What changed

mcpgateway/db.py — partial unique indexes:
- roles(name, scope) WHERE is_active = true
- user_roles(user_email, role_id, scope) WHERE is_active = true AND scope_id IS NULL
- user_roles(user_email, role_id, scope, scope_id) WHERE is_active = true AND scope_id IS NOT NULL
Two user_roles indexes are needed because scope_id is nullable and both Postgres and SQLite treat NULL as distinct in plain unique indexes.
mcpgateway/alembic/versions/c3d5e7f9a1b2_*.py — idempotent dedupe-then-constrain migration. Soft-deletes duplicate active rows via is_active = false (oldest by created_at / granted_at wins) so the indexes can be added on databases that already raced. Audit history preserved; downgrade drops the indexes only.
mcpgateway/services/role_service.py — create_role and assign_role_to_user wrap their inserts in db.begin_nested(). An IntegrityError that resolves to a winning row is treated as success and the winner is returned; an IntegrityError that does not resolve to a winning row surfaces as ValueError per the documented contract.
Tests — 5 new unit tests pinning the savepoint refetch path and the non-race IntegrityError → ValueError surface, plus 1 new integration test (6 concurrent seeders against real Postgres → exactly 1 active row per (name, scope) and exactly 1 active platform_admin assignment).

Verified on

Local docker-compose (Postgres + transaction-pool PgBouncer fixture)
OCP minimal profile (chart default post-install,pre-upgrade hookPhase)
OCP PGO profile (pre-install,pre-upgrade hookPhase)

In all three: migration upgrade is clean, gateway pods reach Ready with 0 restarts, all three partial unique indexes are present, no duplicate active rows, no advisory locks held.

Negative control verified locally: removing the roles partial unique index makes the new integration test fail with three duplicate platform_viewer rows.

Notes

This PR's base is fix/4051-alembic-pgbouncer-transaction-pool (PR fix(bootstrap): improve startup reliability for multi-replica deploys #4444). Once fix(bootstrap): improve startup reliability for multi-replica deploys #4444 merges to main, GitHub will auto-retarget this PR's base to main.
Closes [BUG] RBAC role/user_role seeder race when fast-path skips advisory lock #4482

Adds tests/integration/test_migrations_under_transaction_pool.py — three @pytest.mark.integration assertions pinning the PgBouncer transaction-pool behavior that makes bootstrap_db.main() hang when gateway replicas start concurrently: - a disconnected pgbouncer client's advisory lock persists on Postgres, - a fresh Postgres session is blocked from acquiring the orphaned lock, - a same-backend reuse via pgbouncer is reentrant (documents the gotcha so future readers don't mistake TRUE for "no bug"). Tests are gated behind --with-integration so CI default runs are unaffected. The fixture stack (postgres + pgbouncer with pool_mode= transaction and default_pool_size=2) lives at tests/integration/fixtures/transaction_pool/docker-compose.yml. Also adds tests/reproduction/issue-4051/ with demonstrate_orphan.sh (deterministic mechanism proof) and reproduce.sh (replica-scaling symptom reproduction). These exist for PR-review evidence and local debugging during development; they are expected to be removed when the fix PR lands, with the evidence preserved in the issue thread. No runtime changes. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

Multiple gateway replicas starting concurrently through a transaction- pooling connection pooler (PgBouncer pool_mode=transaction, pgcat, etc.) can hang indefinitely on bootstrap. A previous replica's pg_advisory_lock is session-scoped on its Postgres server backend; when the pooler returns that backend to its pool across a transaction boundary without clearing the lock, the next replica's pg_try_advisory_lock sees it held by an orphaned session and spins through its retry loop until the ~10-minute timeout fires. Add a fast-path in bootstrap_db.main() that probes alembic_version via MigrationContext before attempting the advisory lock: when the DB is already at the Alembic script directory's head, skip the lock entirely and run only the idempotent post-migration bootstrap steps (team visibility normalization, admin user, default roles, orphaned resource assignment). Replicas 2..N of a multi-pod deployment take this branch on every normal restart, so the advisory-lock orphan is never contended. The lock is still acquired for the slow path (empty DB, version mismatch, or any probe error), preserving the existing concurrent- migration protection. Post-migration bootstrap was extracted into _run_post_migration_bootstrap() so both paths share it. Adds tests/integration/test_migrations_under_transaction_pool.py:: test_bootstrap_db_skips_lock_when_schema_already_at_head as a regression gate: it synthesizes the held-lock precondition via a distinct direct-to-postgres session and runs bootstrap through pgbouncer. Pre-fix, the test times out after 240s; post-fix, it completes in under a second. Closes #4051 Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

Adds TestAlembicAtHead with five unit tests covering the bool outcomes that decide whether main() skips the migration advisory lock: - at head → True (fast-path fires) - no alembic_version in DB → False (slow path handles empty/partial DBs) - head mismatch → False (migrations actually need to run) - script directory has zero heads → False (defensive) - any probe exception → False (fallback to slow path) Direct coverage independent of the integration test; narrows the blast radius of a future refactor to this one helper. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

Adds test_bootstrap_db_is_idempotent_once_schema_is_at_head as an additional invariant for the issue #4051 fast-path: after one successful bootstrap, a subsequent call must NOT enter advisory_lock at all — it should take the fast-path exclusively. The test wraps bootstrap_db.advisory_lock with a spy and asserts zero entries on the second invocation. This catches regressions where a future refactor accidentally re-routes bootstrap through the lock even when the schema is at head (the specific failure mode that would bring issue #4051 back under transaction-pooling connection poolers). Also asserts the second call completes under 3s and leaves no advisory lock held. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

…otstrap Adds a settings flag (`mcpgateway_skip_migrations`, default False) that gates the gateway's lifespan call to bootstrap_db.main(). When set, the gateway logs that it is yielding migration ownership to a separate runner and skips the in-pod path entirely. This is the operator-facing half of the issue #4051 fix. The L1 fast-path (already on this branch) makes in-pod bootstrap safe under transaction-pool connection poolers; this flag lets a Helm chart or init-container deployment turn the in-pod path off entirely once a dedicated migration runner has populated the schema. Library default stays False so direct `docker run mcpgateway` continues to bootstrap itself with no operator action. Wires the gate via a small lifespan helper (`_run_initial_db_bootstrap`) to keep the lifespan body unchanged and the gate trivially testable. Tests: - settings.skip_migrations defaults False - MCPGATEWAY_SKIP_MIGRATIONS=true → True - MCPGATEWAY_SKIP_MIGRATIONS=false → False - lifespan helper invokes bootstrap_db when flag off - lifespan helper skips bootstrap_db AND logs an audit line when on Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

Wires the gateway Deployment env var introduced in the prior commit into the chart so the pairing "migration Job runs → app pods skip" is enforced at the chart layer rather than relying on operators to set both flags consistently. The env entry is placed after extraEnv so a user-supplied override cannot accidentally undo the contract. Behavior: migration.enabled=true → MCPGATEWAY_SKIP_MIGRATIONS="true" migration.enabled=false → MCPGATEWAY_SKIP_MIGRATIONS="false" default values → migration.enabled=true → SKIP="true" Adds tests/unit/charts/test_deployment_mcpgateway.py with three helm-template assertions pinning each branch. Tests skip automatically when helm is not on PATH; no cluster needed. Issue #4051 Layer 2. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

Adds test_compose_three_replicas_complete_bootstrap_e2e — drives the fixture compose stack at the container level, scales the gateway service to 3 replicas through transaction-pool PgBouncer, and asserts all three reach bootstrap completion within 60s. This is the closest re-runnable analog of the manual OCP cluster smoke captured in todo/issue-4051-ocp-reproduction.md, exercising the full pod-startup path (gunicorn + lifespan + bootstrap_db.main()) inside real containers rather than calling the library entry point in-process. The other 5 tests in this module already cover the bug at the mechanism (PgBouncer orphaning) and library (bootstrap_db) levels. This e2e test catches regressions of the integration shape: gunicorn worker fan-out, lifespan ordering, image-level config loading, etc. The test is destructive — it drops the fixture's public schema. Skipped unless MCPGATEWAY_TEST_ALLOW_DESTRUCTIVE_E2E=1 is set, matching the opt-in pattern already used elsewhere in tests/ conftest.py for external-DB safety. Also skips cleanly when docker is missing, the gateway image isn't built, or the fixture postgres/ pgbouncer aren't running — each with an actionable message explaining the missing prerequisite. Verified locally: 14.78s post-fix, hangs to 240s pytest.mark.timeout pre-fix. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

Issue numbers are pointers, not content — they rot once the issue is closed. The fix outlives the issue. Removes "issue #4051" / "Layer 2" references from code comments and test docstrings, replacing them with prose that names the BEHAVIOR being protected (advisory-lock orphaning behind transaction-pool poolers, MCPGATEWAY_SKIP_MIGRATIONS contract, etc). Issue references are kept where they belong — commit messages and the PR description — and removed where they don't (test docstrings, helper comments, file-level docstrings, the test fixture's compose file). Also renames the test fixture's docker compose project name from "mcf4051" to the more descriptive "transaction-pool-fixture" since that file has no surviving link to the originating issue. No behavior changes. 493 tests pass; 1 (e2e) skips cleanly without the destructive opt-in. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

CI's pre-commit run flagged two issues on this branch: 1. .secrets.baseline was out-of-date — the new mcpgateway_skip_migrations field in config.py shifted line numbers of pre-existing detected patterns by +42 lines, and detect-secrets refused to pass while the baseline still pointed at the old line numbers. Refreshed via `detect-secrets scan --update --use-all-plugins`. Four new entries from this PR's test fixtures (synthetic test passwords like `reprosecret`) are present in the baseline marked is_secret=False: they are clearly throwaway test values, not real credentials. 2. Black wanted minor reformatting in three of the test/source files we touch in this PR (mcpgateway/main.py, two test files). Applied. No behavior changes. 488 unit tests still pass. The detect-secrets and black pre-commit hooks now pass locally on the PR-relevant files. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

Replaces the gateway Deployment's db_isready startup probe with a schema-at-head check so pods cannot serve traffic against an empty or mid-migration DB regardless of which entity ran the migration: the chart's pre-install Job, post-install Job, an init container, an external CD pipeline, or a manual operator step. Adds mcpgateway.utils.check_schema_at_head as the probe command. It exits 0 only when ``alembic_version`` matches the script-directory head, and exits 1 on any error so probe failures fail closed. Refs #4051 Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

…t bootstrap After the L1 fast-path landed, ``bootstrap_default_roles`` and the platform_admin user-role assignment run outside the migration advisory lock on every replica that takes the fast path. Both code paths were SELECT-then-INSERT with no DB-tier uniqueness on (roles.name, roles.scope) or (user_roles.user_email, role_id, scope, scope_id) — and ``Role.id`` / ``UserRole.id`` are client-generated UUIDs, so the PK never saved us. Concurrent fast-path workers could write duplicate active rows, after which ``get_role_by_name`` / ``get_user_role_assignment`` raise ``MultipleResultsFound``. Adds: * Partial unique indexes (``WHERE is_active = true``) on the relevant column tuples — split into two for ``user_roles`` because ``scope_id`` is nullable and both Postgres and SQLite treat NULL as distinct in ordinary unique indexes. * Idempotent dedupe-then-constrain Alembic migration that soft-deletes duplicate active rows (oldest by created_at / granted_at wins) so the indexes can be added on databases that already raced. * Savepoint-and-refetch handling in ``RoleService.create_role`` and ``RoleService.assign_role_to_user``: insert is wrapped in ``db.begin_nested()``, an IntegrityError that resolves to a winning row is treated as success, and any other IntegrityError surfaces as ``ValueError``. * Unit tests pinning the savepoint refetch path and the non-race IntegrityError → ValueError surface. * Integration test: 6 concurrent seeders against real Postgres must converge on exactly one active row per (name, scope) and exactly one active platform_admin user-role assignment. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

gandhipratik203 added 12 commits April 24, 2026 17:25

chore: remove temporary reproduction harness for #4051

2ef5fe3

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

This was referenced Apr 27, 2026

[BUG] RBAC role/user_role seeder race when fast-path skips advisory lock #4482

Open

fix(bootstrap): improve startup reliability for multi-replica deploys #4444

Open

gandhipratik203 force-pushed the fix/4051-alembic-pgbouncer-transaction-pool branch from 46f1d84 to 913fa4e Compare April 28, 2026 07:45

gandhipratik203 force-pushed the fix/4051-alembic-pgbouncer-transaction-pool branch 2 times, most recently from 06456e1 to 5384128 Compare May 8, 2026 13:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(rbac): block duplicate role/user_role rows on concurrent bootstrap#4480

fix(rbac): block duplicate role/user_role rows on concurrent bootstrap#4480
gandhipratik203 wants to merge 12 commits intofix/4051-alembic-pgbouncer-transaction-poolfrom
fix/4051-followup-rbac-race-guard

gandhipratik203 commented Apr 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gandhipratik203 commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Verified on

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gandhipratik203 commented Apr 27, 2026 •

edited

Loading