Skip to content

fix(rbac): block duplicate role/user_role rows on concurrent bootstrap#4480

Draft
gandhipratik203 wants to merge 12 commits intofix/4051-alembic-pgbouncer-transaction-poolfrom
fix/4051-followup-rbac-race-guard
Draft

fix(rbac): block duplicate role/user_role rows on concurrent bootstrap#4480
gandhipratik203 wants to merge 12 commits intofix/4051-alembic-pgbouncer-transaction-poolfrom
fix/4051-followup-rbac-race-guard

Conversation

@gandhipratik203
Copy link
Copy Markdown
Collaborator

@gandhipratik203 gandhipratik203 commented Apr 27, 2026

Summary

Follow-up to #4444. After the L1 fast-path landed in #4444, bootstrap_default_roles and the platform_admin user-role assignment run outside the migration advisory lock on every replica that takes the fast path. Both code paths were SELECT-then-INSERT with no DB-tier uniqueness, and the row IDs are client-generated UUIDs — so concurrent fast-path workers can write duplicate active rows. The next get_role_by_name / get_user_role_assignment then raises MultipleResultsFound.

This PR closes the race by adding the missing DB-tier guards and graceful app-tier handling for any in-flight retries.

What changed

  • mcpgateway/db.py — partial unique indexes:

    • roles(name, scope) WHERE is_active = true
    • user_roles(user_email, role_id, scope) WHERE is_active = true AND scope_id IS NULL
    • user_roles(user_email, role_id, scope, scope_id) WHERE is_active = true AND scope_id IS NOT NULL

    Two user_roles indexes are needed because scope_id is nullable and both Postgres and SQLite treat NULL as distinct in plain unique indexes.

  • mcpgateway/alembic/versions/c3d5e7f9a1b2_*.py — idempotent dedupe-then-constrain migration. Soft-deletes duplicate active rows via is_active = false (oldest by created_at / granted_at wins) so the indexes can be added on databases that already raced. Audit history preserved; downgrade drops the indexes only.

  • mcpgateway/services/role_service.pycreate_role and assign_role_to_user wrap their inserts in db.begin_nested(). An IntegrityError that resolves to a winning row is treated as success and the winner is returned; an IntegrityError that does not resolve to a winning row surfaces as ValueError per the documented contract.

  • Tests — 5 new unit tests pinning the savepoint refetch path and the non-race IntegrityErrorValueError surface, plus 1 new integration test (6 concurrent seeders against real Postgres → exactly 1 active row per (name, scope) and exactly 1 active platform_admin assignment).

Verified on

  • Local docker-compose (Postgres + transaction-pool PgBouncer fixture)
  • OCP minimal profile (chart default post-install,pre-upgrade hookPhase)
  • OCP PGO profile (pre-install,pre-upgrade hookPhase)

In all three: migration upgrade is clean, gateway pods reach Ready with 0 restarts, all three partial unique indexes are present, no duplicate active rows, no advisory locks held.

Negative control verified locally: removing the roles partial unique index makes the new integration test fail with three duplicate platform_viewer rows.

Notes

Adds tests/integration/test_migrations_under_transaction_pool.py — three
@pytest.mark.integration assertions pinning the PgBouncer transaction-pool
behavior that makes bootstrap_db.main() hang when gateway replicas start
concurrently:

 - a disconnected pgbouncer client's advisory lock persists on Postgres,
 - a fresh Postgres session is blocked from acquiring the orphaned lock,
 - a same-backend reuse via pgbouncer is reentrant (documents the gotcha
   so future readers don't mistake TRUE for "no bug").

Tests are gated behind --with-integration so CI default runs are
unaffected. The fixture stack (postgres + pgbouncer with pool_mode=
transaction and default_pool_size=2) lives at
tests/integration/fixtures/transaction_pool/docker-compose.yml.

Also adds tests/reproduction/issue-4051/ with demonstrate_orphan.sh
(deterministic mechanism proof) and reproduce.sh (replica-scaling symptom
reproduction). These exist for PR-review evidence and local debugging
during development; they are expected to be removed when the fix PR
lands, with the evidence preserved in the issue thread.

No runtime changes.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Multiple gateway replicas starting concurrently through a transaction-
pooling connection pooler (PgBouncer pool_mode=transaction, pgcat, etc.)
can hang indefinitely on bootstrap. A previous replica's pg_advisory_lock
is session-scoped on its Postgres server backend; when the pooler returns
that backend to its pool across a transaction boundary without clearing
the lock, the next replica's pg_try_advisory_lock sees it held by an
orphaned session and spins through its retry loop until the ~10-minute
timeout fires.

Add a fast-path in bootstrap_db.main() that probes alembic_version via
MigrationContext before attempting the advisory lock: when the DB is
already at the Alembic script directory's head, skip the lock entirely
and run only the idempotent post-migration bootstrap steps (team
visibility normalization, admin user, default roles, orphaned resource
assignment). Replicas 2..N of a multi-pod deployment take this branch on
every normal restart, so the advisory-lock orphan is never contended.

The lock is still acquired for the slow path (empty DB, version
mismatch, or any probe error), preserving the existing concurrent-
migration protection. Post-migration bootstrap was extracted into
_run_post_migration_bootstrap() so both paths share it.

Adds tests/integration/test_migrations_under_transaction_pool.py::
test_bootstrap_db_skips_lock_when_schema_already_at_head as a
regression gate: it synthesizes the held-lock precondition via a
distinct direct-to-postgres session and runs bootstrap through
pgbouncer. Pre-fix, the test times out after 240s; post-fix, it
completes in under a second.

Closes #4051

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Adds TestAlembicAtHead with five unit tests covering the bool outcomes
that decide whether main() skips the migration advisory lock:

 - at head → True (fast-path fires)
 - no alembic_version in DB → False (slow path handles empty/partial DBs)
 - head mismatch → False (migrations actually need to run)
 - script directory has zero heads → False (defensive)
 - any probe exception → False (fallback to slow path)

Direct coverage independent of the integration test; narrows the blast
radius of a future refactor to this one helper.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Adds test_bootstrap_db_is_idempotent_once_schema_is_at_head as an
additional invariant for the issue #4051 fast-path: after one successful
bootstrap, a subsequent call must NOT enter advisory_lock at all — it
should take the fast-path exclusively.

The test wraps bootstrap_db.advisory_lock with a spy and asserts zero
entries on the second invocation. This catches regressions where a
future refactor accidentally re-routes bootstrap through the lock even
when the schema is at head (the specific failure mode that would bring
issue #4051 back under transaction-pooling connection poolers).

Also asserts the second call completes under 3s and leaves no advisory
lock held.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…otstrap

Adds a settings flag (`mcpgateway_skip_migrations`, default False) that
gates the gateway's lifespan call to bootstrap_db.main(). When set,
the gateway logs that it is yielding migration ownership to a separate
runner and skips the in-pod path entirely.

This is the operator-facing half of the issue #4051 fix. The L1
fast-path (already on this branch) makes in-pod bootstrap safe under
transaction-pool connection poolers; this flag lets a Helm chart or
init-container deployment turn the in-pod path off entirely once a
dedicated migration runner has populated the schema. Library default
stays False so direct `docker run mcpgateway` continues to bootstrap
itself with no operator action.

Wires the gate via a small lifespan helper (`_run_initial_db_bootstrap`)
to keep the lifespan body unchanged and the gate trivially testable.

Tests:
 - settings.skip_migrations defaults False
 - MCPGATEWAY_SKIP_MIGRATIONS=true → True
 - MCPGATEWAY_SKIP_MIGRATIONS=false → False
 - lifespan helper invokes bootstrap_db when flag off
 - lifespan helper skips bootstrap_db AND logs an audit line when on
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Wires the gateway Deployment env var introduced in the prior commit
into the chart so the pairing "migration Job runs → app pods skip" is
enforced at the chart layer rather than relying on operators to set
both flags consistently. The env entry is placed after extraEnv so a
user-supplied override cannot accidentally undo the contract.

Behavior:
  migration.enabled=true   → MCPGATEWAY_SKIP_MIGRATIONS="true"
  migration.enabled=false  → MCPGATEWAY_SKIP_MIGRATIONS="false"
  default values           → migration.enabled=true → SKIP="true"

Adds tests/unit/charts/test_deployment_mcpgateway.py with three
helm-template assertions pinning each branch. Tests skip automatically
when helm is not on PATH; no cluster needed. Issue #4051 Layer 2.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Adds test_compose_three_replicas_complete_bootstrap_e2e — drives the
fixture compose stack at the container level, scales the gateway
service to 3 replicas through transaction-pool PgBouncer, and asserts
all three reach bootstrap completion within 60s. This is the closest
re-runnable analog of the manual OCP cluster smoke captured in
todo/issue-4051-ocp-reproduction.md, exercising the full pod-startup
path (gunicorn + lifespan + bootstrap_db.main()) inside real
containers rather than calling the library entry point in-process.

The other 5 tests in this module already cover the bug at the
mechanism (PgBouncer orphaning) and library (bootstrap_db) levels.
This e2e test catches regressions of the integration shape:
gunicorn worker fan-out, lifespan ordering, image-level config
loading, etc.

The test is destructive — it drops the fixture's public schema.
Skipped unless MCPGATEWAY_TEST_ALLOW_DESTRUCTIVE_E2E=1 is set,
matching the opt-in pattern already used elsewhere in tests/
conftest.py for external-DB safety. Also skips cleanly when docker
is missing, the gateway image isn't built, or the fixture postgres/
pgbouncer aren't running — each with an actionable message
explaining the missing prerequisite.

Verified locally: 14.78s post-fix, hangs to 240s pytest.mark.timeout
pre-fix.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Issue numbers are pointers, not content — they rot once the issue is
closed. The fix outlives the issue. Removes "issue #4051" / "Layer 2"
references from code comments and test docstrings, replacing them with
prose that names the BEHAVIOR being protected (advisory-lock orphaning
behind transaction-pool poolers, MCPGATEWAY_SKIP_MIGRATIONS contract,
etc).

Issue references are kept where they belong — commit messages and the
PR description — and removed where they don't (test docstrings, helper
comments, file-level docstrings, the test fixture's compose file).

Also renames the test fixture's docker compose project name from
"mcf4051" to the more descriptive "transaction-pool-fixture" since
that file has no surviving link to the originating issue.

No behavior changes. 493 tests pass; 1 (e2e) skips cleanly without
the destructive opt-in.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
CI's pre-commit run flagged two issues on this branch:

1. .secrets.baseline was out-of-date — the new mcpgateway_skip_migrations
   field in config.py shifted line numbers of pre-existing detected
   patterns by +42 lines, and detect-secrets refused to pass while the
   baseline still pointed at the old line numbers. Refreshed via
   `detect-secrets scan --update --use-all-plugins`. Four new entries
   from this PR's test fixtures (synthetic test passwords like
   `reprosecret`) are present in the baseline marked is_secret=False:
   they are clearly throwaway test values, not real credentials.

2. Black wanted minor reformatting in three of the test/source files
   we touch in this PR (mcpgateway/main.py, two test files). Applied.

No behavior changes. 488 unit tests still pass. The detect-secrets and
black pre-commit hooks now pass locally on the PR-relevant files.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Replaces the gateway Deployment's db_isready startup probe with a
schema-at-head check so pods cannot serve traffic against an empty or
mid-migration DB regardless of which entity ran the migration: the chart's
pre-install Job, post-install Job, an init container, an external CD
pipeline, or a manual operator step.

Adds mcpgateway.utils.check_schema_at_head as the probe command. It exits
0 only when ``alembic_version`` matches the script-directory head, and
exits 1 on any error so probe failures fail closed.

Refs #4051

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…t bootstrap

After the L1 fast-path landed, ``bootstrap_default_roles`` and the
platform_admin user-role assignment run outside the migration advisory
lock on every replica that takes the fast path. Both code paths were
SELECT-then-INSERT with no DB-tier uniqueness on (roles.name, roles.scope)
or (user_roles.user_email, role_id, scope, scope_id) — and ``Role.id`` /
``UserRole.id`` are client-generated UUIDs, so the PK never saved us.
Concurrent fast-path workers could write duplicate active rows, after
which ``get_role_by_name`` / ``get_user_role_assignment`` raise
``MultipleResultsFound``.

Adds:

  * Partial unique indexes (``WHERE is_active = true``) on the relevant
    column tuples — split into two for ``user_roles`` because ``scope_id``
    is nullable and both Postgres and SQLite treat NULL as distinct in
    ordinary unique indexes.
  * Idempotent dedupe-then-constrain Alembic migration that soft-deletes
    duplicate active rows (oldest by created_at / granted_at wins) so the
    indexes can be added on databases that already raced.
  * Savepoint-and-refetch handling in ``RoleService.create_role`` and
    ``RoleService.assign_role_to_user``: insert is wrapped in
    ``db.begin_nested()``, an IntegrityError that resolves to a winning
    row is treated as success, and any other IntegrityError surfaces as
    ``ValueError``.
  * Unit tests pinning the savepoint refetch path and the non-race
    IntegrityError → ValueError surface.
  * Integration test: 6 concurrent seeders against real Postgres must
    converge on exactly one active row per (name, scope) and exactly one
    active platform_admin user-role assignment.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 gandhipratik203 force-pushed the fix/4051-alembic-pgbouncer-transaction-pool branch from 46f1d84 to 913fa4e Compare April 28, 2026 07:45
@gandhipratik203 gandhipratik203 force-pushed the fix/4051-alembic-pgbouncer-transaction-pool branch 2 times, most recently from 06456e1 to 5384128 Compare May 8, 2026 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant