Skip to content

perf(auth): cache get_user_by_email at service level to reduce DB hot-path queries#4595

Merged
brian-hussey merged 11 commits intomainfrom
fix/issue-3061-cache-get-user-by-email
May 6, 2026
Merged

perf(auth): cache get_user_by_email at service level to reduce DB hot-path queries#4595
brian-hussey merged 11 commits intomainfrom
fix/issue-3061-cache-get-user-by-email

Conversation

@msureshkumar88
Copy link
Copy Markdown
Collaborator

Summary

EmailAuthService.get_user_by_email() is called on every service operation that reads or verifies a user (login, password change, SSO upsert, OAuth token introspection, profile fetch, admin bootstrap, and more — 12+ internal call sites). Each call issued a synchronous SELECT … WHERE email = ? with no caching, even though the existing AuthCache already had an allocated _user_cache dict and the Redis key mcpgw:auth:user:{email} wired into invalidate_user() but never populated.

  • Adds get_user() / set_user() to AuthCache using the same L1 (in-memory) → L2 (Redis) write-through pattern as get_user_role() / set_user_role()
  • Wires those methods into EmailAuthService.get_user_by_email() so every call checks cache first and populates on a DB hit
  • Adds missing cache invalidation to update_user(), activate_user(), and deactivate_user() — three mutation methods that previously committed user changes without busting the cache
  • Exposes user_cache_size in AuthCache.stats() (was always 0 because the cache was never written)
  • All cache operations wrapped in try/except; Redis outage falls through to DB transparently

No new config knobs. Existing auth_cache_user_ttl (default 60 s) and auth_cache_enabled govern this path.

Closes #3061
Sub-issue of #2692 (DB query hot-path optimisations)

Files Changed

File What changed
mcpgateway/cache/auth_cache.py Added get_user() and set_user() methods; added user_cache_size to stats()
mcpgateway/services/email_auth_service.py Added _user_obj_to_dict() / _user_dict_to_obj() helpers; updated get_user_by_email() to use cache; added _invalidate_user_auth_cache() calls in update_user(), activate_user(), deactivate_user()
tests/unit/mcpgateway/cache/test_auth_cache_user.py 7 new unit tests for cache hit, miss, expiry, invalidation, Redis write-through, disabled cache, and stats
tests/unit/mcpgateway/services/test_email_auth_get_user_cache.py 8 new unit tests for helper round-trips, cache hit skips DB, cache miss populates cache, error fallback to DB, and all three mutation invalidations

Invalidation Coverage After This PR

Mutation path Method Invalidation
Password reset reset_password() existing
Password change change_password() existing
User update (name, admin flag, etc.) update_user() added here
Activate user activate_user() added here
Deactivate user deactivate_user() added here
User deletion delete_user() existing
User creation create_user() not needed

@msureshkumar88 msureshkumar88 force-pushed the fix/issue-3061-cache-get-user-by-email branch from c49c699 to 8f1d424 Compare May 5, 2026 09:25
@msureshkumar88 msureshkumar88 added enhancement New feature or request python Python / backend development (FastAPI) performance Performance related items COULD P3: Nice-to-have features with minimal impact if left out; included if time permits labels May 5, 2026
Copy link
Copy Markdown
Collaborator

@Lang-Akshay Lang-Akshay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR - @msureshkumar88

Please address the following findings

  • The PR stores a full user object (including password_hash) in Redis and returns a SQLAlchemy-detached object from cache hits. Any method that mutates the returned object (e.g. authenticate_user, unlock_user_account) silently discards those changes because the ORM session does not track the detached instance. This enables two critical authentication-path bypasses.
# File Line Severity CWE Description
1 mcpgateway/services/email_auth_service.py 897 Critical CWE-307 Brute-force lockout bypass via detached ORM object. authenticate_user calls get_user_by_email → cache hit → _user_dict_to_obj returns a new, session-detached EmailUser. Then user.increment_failed_attempts() is called and self.db.commit() is issued — but the detached object is not tracked, so the DB counter is never updated. Every request within the 60 s TTL window sees failed_login_attempts = <stale> and failed attempts are silently discarded, enabling unlimited password guesses during the TTL.
2 mcpgateway/services/email_auth_service.py 1162–1175 Critical CWE-613 unlock_user_account mutations are silently dropped + missing invalidation. get_user_by_email on a cache-hit returns a detached object. Assignments user.failed_login_attempts = 0 and user.locked_until = None then self.db.commit() silently no-op. Account is never actually unlocked in DB. No _invalidate_user_auth_cache call either, so stale locked state persists in cache.
3 mcpgateway/services/email_auth_service.py 76–91 High CWE-312 password_hash serialised into Redis. _user_obj_to_dict includes the Argon2id hash in the JSON payload written to Redis. A Redis compromise (SSRF, misconfigured bind, key dump) directly exposes all cached hashes. The existing module docstring states "JWT payloads are NOT cached (security risk)" — the same principle applies here.
4 mcpgateway/services/email_auth_service.py 121 Medium CWE-20 is_active defaults to True on deserialization. d.get("is_active", True) means a corrupted/missing key produces an active user — fails open. Should be False to fail closed.
5 mcpgateway/services/email_auth_service.py 579, 596 Low Duplicate import auth_cache inside get_user_by_email. from mcpgateway.cache.auth_cache import auth_cache appears in two separate try blocks within the same method call.
6 mcpgateway/cache/auth_cache.py ~413 Low CWE-362 L1 cache dict read outside lock. self._user_cache.get(email) at the top of get_user is done without acquiring self._lock, inconsistent with the write path inside the method.

Redundant Code

# File Line(s) Type Description Suggestion
1 mcpgateway/services/email_auth_service.py 579, 596 Duplicated import from mcpgateway.cache.auth_cache import auth_cache appears twice in consecutive try blocks inside get_user_by_email Import once at the top of the method or hoist above both try blocks
2 mcpgateway/cache/auth_cache.py ~428, ~456 Repeated lazy import import orjson inside both get_user and set_user try-blocks Move to top of auth_cache.py alongside other third-party imports (or at class level)

@msureshkumar88
Copy link
Copy Markdown
Collaborator Author

Thanks for the thorough review @Lang-Akshay — all six findings are addressed in the latest commits. Here's a summary of what changed:

Critical / High

# Finding Resolution
1 CWE-307 — brute-force bypass via detached ORM object in authenticate_user Added _fetch_user_from_db() helper; authenticate_user now calls it directly so increment_failed_attempts / reset_failed_attempts mutations are tracked by the session and committed durably
2 CWE-613 — unlock_user_account mutations silently dropped + missing invalidation Same fix: _fetch_user_from_db() for the session-attached object; _invalidate_user_auth_cache() call added after commit
3 CWE-312 — password_hash serialised into Redis Removed from _user_obj_to_dict; _user_dict_to_obj falls back to "" for read-only paths. Mutation paths (authenticate_user, unlock_user_account) always fetch from DB, so they never need the hash from cache

Medium / Low

# Finding Resolution
4 CWE-20 — is_active defaults to True on deserialisation Changed to False (fail-closed)
5 Duplicate from mcpgateway.cache.auth_cache import auth_cache in get_user_by_email Hoisted above both try blocks; import orjson moved to module-level in auth_cache.py
6 CWE-362 — L1 dict read outside self._lock in get_user Wrapped self._user_cache.get(email) inside with self._lock

Tests updated / added

  • test_auth_cache_user.py — added Redis-miss branch test (line 438 _redis_miss_count); removed password_hash from fixture dict
  • test_email_auth_get_user_cache.py — added 5 regression tests: password_hash exclusion, is_active fail-closed, _fetch_user_from_db hits DB directly, unlock_user_account invalidation; removed password_hash from fixture dict
  • test_email_auth_basic.py — updated two unlock_user_account tests to patch _fetch_user_from_db instead of get_user_by_email

All 17 286 unit tests pass.

@msureshkumar88 msureshkumar88 force-pushed the fix/issue-3061-cache-get-user-by-email branch from 95c6b89 to b2c7cbd Compare May 5, 2026 15:11
@msureshkumar88 msureshkumar88 requested a review from Lang-Akshay May 5, 2026 15:11
@msureshkumar88 msureshkumar88 force-pushed the fix/issue-3061-cache-get-user-by-email branch from 13ee4bb to 11b125e Compare May 5, 2026 15:41
Copy link
Copy Markdown
Collaborator

@Lang-Akshay Lang-Akshay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security hardening

Scope includes cache integration for user lookups and mutation invalidation. The changes introduce two password-reset regressions tied to cached detached objects.

# File Line Severity CWE Description Fix
1 mcpgateway/services/email_auth_service.py reset_password_with_token High CWE-610 reset_password_with_token() uses get_user_by_email(), which can return a detached cached user; writes to user.password_hash may not persist, leaving the previous password valid. Fetch a session-attached user in reset_password_with_token() via _fetch_user_from_db() (or merge the object into the session) before mutating and committing.
2 mcpgateway/services/email_auth_service.py reuse check Medium CWE-521 Password reuse checks can be bypassed on cache hits because cached users store an empty password_hash, making reuse comparisons always pass. Perform reuse checks using a DB-attached

@msureshkumar88
Copy link
Copy Markdown
Collaborator Author

Thanks @Lang-Akshay — both findings from the second review are addressed in the latest commit.

Finding 1 (High · CWE-610) — detached object in reset_password_with_token

get_user_by_email at line 1123 could return a cached, session-detached EmailUser. All subsequent field assignments (password_hash, password_change_required, password_changed_at, failed_login_attempts, locked_until) were written to an object not tracked by self.db, so self.db.commit() silently discarded them — the old password remained valid.

Fixed by replacing await self.get_user_by_email(...) with self._fetch_user_from_db(...), which always returns a session-attached object. Same pattern as authenticate_user and unlock_user_account.

Finding 2 (Medium · CWE-521) — reuse check bypassed via empty cached hash

_user_obj_to_dict intentionally excludes password_hash from the cache payload (CWE-312 fix from the first review). _user_dict_to_obj therefore reconstructs the object with password_hash="". The reuse check verify_password_async(new_password, "") always returned False, so any password passed the reuse guard on cache hits.

Resolved by the same fix above — _fetch_user_from_db returns the real password_hash from the DB, so the reuse check operates on the actual credential material.

Tests updated

Four reset_password_with_token test cases in test_email_auth_basic.py were patching get_user_by_email (async). Updated to patch _fetch_user_from_db (sync, no AsyncMock) — all four pass.

@msureshkumar88 msureshkumar88 force-pushed the fix/issue-3061-cache-get-user-by-email branch from 68d7ba2 to 88670cc Compare May 6, 2026 10:55
@msureshkumar88 msureshkumar88 requested a review from Lang-Akshay May 6, 2026 11:02
Lang-Akshay
Lang-Akshay previously approved these changes May 6, 2026
Copy link
Copy Markdown
Collaborator

@Lang-Akshay Lang-Akshay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR implementing service-level caching for get_user_by_email(), reducing database load during the authentication hot path by storing User objects in AuthCache (both L1 in-memory and L2 Redis)

@brian-hussey brian-hussey self-assigned this May 6, 2026
Suresh Kumar Moharajan added 10 commits May 6, 2026 17:49
Add get_user()/set_user() to AuthCache (L1+L2) and wire them into
EmailAuthService.get_user_by_email() to avoid repeated SELECT queries
on every service call. Add missing cache invalidation to update_user(),
activate_user(), and deactivate_user(). Include unit tests for cache
hit, miss, fallback, and invalidation paths.

Closes #3061

Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Cover previously untested error paths:
- AuthCache.get_user(): Redis exception handler (line 438)
- AuthCache.set_user(): Redis path with orjson import and setex (lines 463/466) and exception handler (line 468)
- _user_dict_to_obj(): _dt() passthrough for datetime objects (line 110)
- get_user_by_email(): set_user exception suppression (line 598)

Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Log debug messages on cache read/write failures instead of silently
passing, making Redis fallback behaviour observable.

Closes #3061

Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
- Remove password_hash from cache dict (CWE-312): _user_obj_to_dict no
  longer serialises the Argon2id hash into Redis; _user_dict_to_obj falls
  back to empty string for read-only paths
- is_active defaults to False on deserialisation (CWE-20): fail-closed
  instead of fail-open for missing/corrupted cache key
- Add _fetch_user_from_db() helper for mutation paths; wire
  authenticate_user and unlock_user_account to use it instead of the
  cached detached object, preventing silent ORM mutation discards
  (CWE-307 brute-force bypass, CWE-613 unlock bypass)
- Add _invalidate_user_auth_cache() call to unlock_user_account after
  DB commit so stale locked state is evicted immediately
- Hoist duplicate auth_cache import above both try blocks in
  get_user_by_email; move orjson module-level import in auth_cache.py
- Acquire self._lock before L1 dict read in get_user() to be consistent
  with the write path (CWE-362)

Tests: 5 new regression tests covering each security finding; updated
test_email_auth_basic.py to patch _fetch_user_from_db for unlock tests.
All 17286 unit tests pass.

Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Add test hitting the path where Redis is available but returns None,
incrementing _redis_miss_count. Previously untested, causing 97.4%
coverage on auth_cache.py.

Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Drop CWE-307 and CWE-613 tags from inline comments in
email_auth_service.py. Explanatory text retained intact.

Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
orjson already imported at module level (line 41). Local re-imports
at lines 309, 381, 817, 856 triggered W0621/W0404 pylint warnings.

Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
get_user_by_email can return a detached cached EmailUser with
an empty password_hash. Two regressions followed:

- CWE-610: mutations (password_hash, failed_login_attempts, etc.)
  written to the detached object were silently dropped on commit,
  leaving the old password valid after a reset.
- CWE-521: the password-reuse check compared against "" (cached
  hash placeholder), so reuse was never detected on cache hits.

_fetch_user_from_db always returns a session-attached object with
the real hash, matching the pattern already used in authenticate_user
and unlock_user_account.

Tests updated: mock _fetch_user_from_db (sync) instead of the
async get_user_by_email across all four reset_password_with_token
test cases.

Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
@brian-hussey brian-hussey force-pushed the fix/issue-3061-cache-get-user-by-email branch from 88670cc to e46c2e1 Compare May 6, 2026 16:50
@brian-hussey
Copy link
Copy Markdown
Member

All tests had passed prior to rebase.
Only conflict was .secrets.baseline - I resolved that and force pushed.

Previous state prior to force push:
image

Merging without waiting for checks to go through queue again.

@brian-hussey brian-hussey merged commit 4079dcc into main May 6, 2026
28 checks passed
@brian-hussey brian-hussey deleted the fix/issue-3061-cache-get-user-by-email branch May 6, 2026 16:52
jonpspri added a commit that referenced this pull request May 6, 2026
…ate 100%

Two CI failures triggered by main moving forward during this PR's review:

* alembic multi-head: PR #4501 ("add ondelete=CASCADE to FK constraints")
  introduced 9fb98535724d as a sibling of our w7x8y9z0a1b2, both parented
  to 926d3e07d098. Re-parents w7x8y9z0a1b2 -> 9fb98535724d so the chain
  has a single head again and bootstrap_db / SQLite+Postgres upgrade
  validation pass. Verified single head with alembic heads.

* interrogate --fail-under=100: three nested helper functions had no
  docstrings, dropping the package to 99.9%. One is mine
  (GrpcEndpoint.invoke._call introduced by the gRPC timeout fix in
  PR #3202 review B6); the other two (_user_obj_to_dict._iso,
  _user_dict_to_obj._dt in services/email_auth_service.py) landed
  on main via PR #4595 and were already failing main's CI. All three
  added one-line docstrings.

Cascade fixed by these two changes:
  - SQLite + PostgreSQL Fresh/Upgrade (alembic head check)
  - pytest (py3.12) (bootstrap_db on alembic head check)
  - playwright-ci-smoke (depends on bootstrap)
  - sql-sanitizer E2E (depends on bootstrap)
  - interrogate (mcpgateway) (docstring coverage)
  - Run pre-commit hooks (interrogate is a pre-commit hook)

The 2 email_auth_service docstrings are technically out of scope for
this PR but the local lint cycle scoping I had been using
(make interrogate TARGET=<my-files>) didn't catch them. Folding
them in here unblocks PR-level CI rather than waiting for a separate
fix to main.

Verified locally:
  alembic heads      -> single head w7x8y9z0a1b2
  interrogate        -> 100% on full mcpgateway/
  black --check      -> clean
  ruff check         -> clean
  pylint             -> 10.00/10
  targeted gRPC tests -> 541 passing

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>
jonpspri added a commit that referenced this pull request May 7, 2026
…ate 100%

Two CI failures triggered by main moving forward during this PR's review:

* alembic multi-head: PR #4501 ("add ondelete=CASCADE to FK constraints")
  introduced 9fb98535724d as a sibling of our w7x8y9z0a1b2, both parented
  to 926d3e07d098. Re-parents w7x8y9z0a1b2 -> 9fb98535724d so the chain
  has a single head again and bootstrap_db / SQLite+Postgres upgrade
  validation pass. Verified single head with alembic heads.

* interrogate --fail-under=100: three nested helper functions had no
  docstrings, dropping the package to 99.9%. One is mine
  (GrpcEndpoint.invoke._call introduced by the gRPC timeout fix in
  PR #3202 review B6); the other two (_user_obj_to_dict._iso,
  _user_dict_to_obj._dt in services/email_auth_service.py) landed
  on main via PR #4595 and were already failing main's CI. All three
  added one-line docstrings.

Cascade fixed by these two changes:
  - SQLite + PostgreSQL Fresh/Upgrade (alembic head check)
  - pytest (py3.12) (bootstrap_db on alembic head check)
  - playwright-ci-smoke (depends on bootstrap)
  - sql-sanitizer E2E (depends on bootstrap)
  - interrogate (mcpgateway) (docstring coverage)
  - Run pre-commit hooks (interrogate is a pre-commit hook)

The 2 email_auth_service docstrings are technically out of scope for
this PR but the local lint cycle scoping I had been using
(make interrogate TARGET=<my-files>) didn't catch them. Folding
them in here unblocks PR-level CI rather than waiting for a separate
fix to main.

Verified locally:
  alembic heads      -> single head w7x8y9z0a1b2
  interrogate        -> 100% on full mcpgateway/
  black --check      -> clean
  ruff check         -> clean
  pylint             -> 10.00/10
  targeted gRPC tests -> 541 passing

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>
jonpspri added a commit that referenced this pull request May 7, 2026
…ate 100%

Two CI failures triggered by main moving forward during this PR's review:

* alembic multi-head: PR #4501 ("add ondelete=CASCADE to FK constraints")
  introduced 9fb98535724d as a sibling of our w7x8y9z0a1b2, both parented
  to 926d3e07d098. Re-parents w7x8y9z0a1b2 -> 9fb98535724d so the chain
  has a single head again and bootstrap_db / SQLite+Postgres upgrade
  validation pass. Verified single head with alembic heads.

* interrogate --fail-under=100: three nested helper functions had no
  docstrings, dropping the package to 99.9%. One is mine
  (GrpcEndpoint.invoke._call introduced by the gRPC timeout fix in
  PR #3202 review B6); the other two (_user_obj_to_dict._iso,
  _user_dict_to_obj._dt in services/email_auth_service.py) landed
  on main via PR #4595 and were already failing main's CI. All three
  added one-line docstrings.

Cascade fixed by these two changes:
  - SQLite + PostgreSQL Fresh/Upgrade (alembic head check)
  - pytest (py3.12) (bootstrap_db on alembic head check)
  - playwright-ci-smoke (depends on bootstrap)
  - sql-sanitizer E2E (depends on bootstrap)
  - interrogate (mcpgateway) (docstring coverage)
  - Run pre-commit hooks (interrogate is a pre-commit hook)

The 2 email_auth_service docstrings are technically out of scope for
this PR but the local lint cycle scoping I had been using
(make interrogate TARGET=<my-files>) didn't catch them. Folding
them in here unblocks PR-level CI rather than waiting for a separate
fix to main.

Verified locally:
  alembic heads      -> single head w7x8y9z0a1b2
  interrogate        -> 100% on full mcpgateway/
  black --check      -> clean
  ruff check         -> clean
  pylint             -> 10.00/10
  targeted gRPC tests -> 541 passing

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>
jonpspri added a commit that referenced this pull request May 7, 2026
…ate 100%

Two CI failures triggered by main moving forward during this PR's review:

* alembic multi-head: PR #4501 ("add ondelete=CASCADE to FK constraints")
  introduced 9fb98535724d as a sibling of our w7x8y9z0a1b2, both parented
  to 926d3e07d098. Re-parents w7x8y9z0a1b2 -> 9fb98535724d so the chain
  has a single head again and bootstrap_db / SQLite+Postgres upgrade
  validation pass. Verified single head with alembic heads.

* interrogate --fail-under=100: three nested helper functions had no
  docstrings, dropping the package to 99.9%. One is mine
  (GrpcEndpoint.invoke._call introduced by the gRPC timeout fix in
  PR #3202 review B6); the other two (_user_obj_to_dict._iso,
  _user_dict_to_obj._dt in services/email_auth_service.py) landed
  on main via PR #4595 and were already failing main's CI. All three
  added one-line docstrings.

Cascade fixed by these two changes:
  - SQLite + PostgreSQL Fresh/Upgrade (alembic head check)
  - pytest (py3.12) (bootstrap_db on alembic head check)
  - playwright-ci-smoke (depends on bootstrap)
  - sql-sanitizer E2E (depends on bootstrap)
  - interrogate (mcpgateway) (docstring coverage)
  - Run pre-commit hooks (interrogate is a pre-commit hook)

The 2 email_auth_service docstrings are technically out of scope for
this PR but the local lint cycle scoping I had been using
(make interrogate TARGET=<my-files>) didn't catch them. Folding
them in here unblocks PR-level CI rather than waiting for a separate
fix to main.

Verified locally:
  alembic heads      -> single head w7x8y9z0a1b2
  interrogate        -> 100% on full mcpgateway/
  black --check      -> clean
  ruff check         -> clean
  pylint             -> 10.00/10
  targeted gRPC tests -> 541 passing

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>
jonpspri added a commit that referenced this pull request May 7, 2026
* feat: Add grpc_service_id to Tool schema

#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

AI-usage: full

* feat: Add tool registration and deletion for registering/deleting gRPC services

#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

AI-usage: full

* feat: Add load_file_descriptors for grpc serialized proto descriptor bytes

#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

AI-usage: full

* feat: Add gRPC tool invocation support

#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

AI-usage: full

* test: Unit tests for gRPC tool registration

#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

AI-usage: full

* style: make lint

#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* test: Fix missing grpc_service_id in tool unit test

#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

AI-usage: full

* test: Flesh out test coverage for new gRPC code paths

#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

AI-usage: full

* fix: Better error handling for protobuf descriptor pool adding

#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Inherit visibility from parent gRPC service when mapping to tools

#2854
Branch: GrpcMethodsAsTools-2854
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix alembic migration after rebase

#2854
Branch: GrpcMethodsAsTools-2854
AI-usage: full
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Repoint alembic migration parent to current head after rebase

Updates down_revision from x7h8i9j0k1l2 to aa1b2c3d4e5f, the current
head on origin/main. Several new migrations have landed since this PR
was last rebased, so the previous parent is no longer the head.

Verified single head: alembic heads -> w7x8y9z0a1b2 (head)

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* test: Add grpc_service_id=None to TestRustMcpExecutionPlan SimpleNamespace mocks

Two tests added to TestRustMcpExecutionPlan after this PR was last
rebased construct tool mocks via SimpleNamespace. The PR's change to
_build_tool_cache_payload reads tool.grpc_service_id, so those mocks
now AttributeError. Mirrors b576ec4's earlier fix for the same
pattern in MagicMock(spec=DbTool) helpers.

Affected:
- TestRustMcpExecutionPlan.test_prepare_rust_mcp_tool_execution_uses_live_gateway_auth_fields_for_loaded_tools
- TestRustMcpExecutionPlan.test_prepare_rust_mcp_tool_execution_uses_live_gateway_string_auth_values

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* style: make black isort

Tightens whitespace and import grouping per project lint config:
- mcpgateway/services/grpc_service.py: black removes a stray blank line
- tests/unit/mcpgateway/{services/test_grpc_service,services/test_tool_service,test_translate_grpc}.py: isort regroups stdlib/third-party/first-party/local imports

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(grpc): use protobuf 6.x kwarg in MessageToDict and isolate descriptor pool

Two issues addressed in mcpgateway/translate_grpc.py:

1. Pre-existing runtime bug: invoke()/invoke_streaming() called
   json_format.MessageToDict(..., including_default_value_fields=True),
   the parameter name from protobuf 4.x. The repo pins protobuf>=6.33.6
   (see pyproject.toml), where the parameter was renamed to
   always_print_fields_with_no_presence. Calls raised TypeError at
   runtime, breaking gRPC tool invocation. Renamed the kwarg and
   replaced the # pylint: disable=unexpected-keyword-arg suppression
   with an anti-regression comment.

2. Descriptor pool poisoning: GrpcEndpoint shared
   descriptor_pool.Default(), the process-wide singleton. Reflected
   FileDescriptorProto from one (untrusted) upstream service could
   leak across requests and cause symbol collisions or type confusion
   in subsequent calls. Switched to a per-endpoint
   descriptor_pool.DescriptorPool(); MessageFactory now binds to that
   private pool.

Updated tests/unit/mcpgateway/test_translate_grpc.py monkeypatches to
expose DescriptorPool() instead of Default() and MessageFactory(pool=...)
instead of MessageFactory().

Closes review: B2, B11

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(grpc): security and robustness hardening from review

Addresses BLOCKING findings from a multi-agent code review of PR #3202:

Security:
* Reject local-only gRPC schemes (unix:, unix-abstract:, vsock:, fd:)
  in _validate_grpc_target since they bypass the IP-based SSRF model.
  Also normalize dns:/// and ipv4:/ipv6: name-resolver prefixes and
  parse bracketed IPv6 hosts correctly so the host-extraction step
  cannot be tricked. (review B1)
* Add _enforce_descriptor_limits to bound reflected
  FileDescriptorProto count, per-blob size, and aggregate size before
  storing or loading. Hardcoded constants (not settings) so a config
  change cannot silently weaken the DoS defense. (review B3)
* Add _validate_reflected_tool_name that runs reflected tool names
  through SecurityValidator.validate_tool_name() before persisting.
  Reflected names from upstream gRPC servers no longer bypass the
  injection / length / character checks applied to user-registered
  tools. (review B4)
* Strip metadata pseudo-keys (anything starting with '_') when
  copying discovered_services into endpoint._services so the
  _file_descriptors entry can never be confused with a service.
* Decode stored descriptors with base64.b64decode(..., validate=True)
  to fail fast on tampered DB content.

Robustness / Layer 1 invariants:
* _sync_tools_from_reflection now runs each method through a
  per-tool try/except so a single bad method cannot abort the
  entire reflection sync. Counts created / updated / failed.
  (review B9)
* _sync_tools_from_reflection now propagates parent service
  visibility, team_id, and owner_email to existing reflected tools,
  not just to newly created ones. Closes a Layer 1 token-scoping gap
  where a public->team visibility change on the parent service was
  silently ignored on already-discovered tools. (review B5)

Operational:
* invoke_method gains a timeout parameter (default settings.tool_timeout)
  and wraps endpoint.start() / endpoint.invoke() in asyncio.wait_for so
  a slow upstream cannot tie up a worker indefinitely. (review B6)
* tool_service.invoke_tool gRPC branch wraps the call in
  asyncio.wait_for, surfaces ToolTimeoutError + post-invoke timeout
  hook (matches A2A branch), preserves CancelledError instead of
  swallowing it, and logs with %-style + exc_info=True. (review B6/B7/B8)
* invoke_method preserves CancelledError, logs with %-style +
  exc_info=True, re-raises GrpcServiceError/GrpcServiceNotFoundError
  unwrapped instead of double-wrapping them. (review B8)

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* refactor(grpc): drop redundant inline settings import

The inline 'from mcpgateway.config import settings' in
_validate_grpc_target was made redundant by hoisting the same import
to module top in the security hardening commit.

Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* test(grpc): cover security-hardening helpers and Layer 1 invariants

Adds two test classes exercising the helpers introduced by the
security/robustness hardening commit:

TestSecurityHardening:
* _validate_grpc_target rejects unix:/unix-abstract:/vsock:/fd: schemes
* _validate_grpc_target normalizes dns:///, dns://, dns:, ipv4:, ipv6:
  name-resolver prefixes
* _validate_grpc_target accepts bracketed IPv6 literals and rejects
  malformed bracket syntax
* _enforce_descriptor_limits enforces count, per-blob, and aggregate
  size caps; happy path passes within bounds
* _validate_reflected_tool_name rejects empty / over-length / injection
  patterns and accepts proto-style identifiers

TestVisibilityPropagation:
* _sync_tools_from_reflection propagates parent service visibility,
  team_id, and owner_email to existing reflected tools (Layer 1
  invariant).
* _sync_tools_from_reflection isolates per-method failures: a method
  with an invalid name is skipped while the rest of the sync proceeds
  (B9 regression test).

All 13 new tests pass; the broader gRPC + tool_service + translate_grpc
suite is green at 520 tests.

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(grpc): close remaining review gaps from iteration-2 oracle pass

Addresses BLOCKING items the second-pass review identified as still
present after the first round of fixes.

* B1 (DNS resolution): _validate_grpc_target now delegates the
  hostname/IP-network/DNS policy to SecurityValidator._validate_ssrf so
  hostnames like metadata.google.internal are resolved and the resolved
  IPs go through the same blocklist as HTTP targets. Removes the
  duplicate hand-rolled IP-network logic that previously lived inline.

* B5 (visibility on update): GrpcService.update_service now snapshots
  visibility / team_id / owner_email before applying the update and
  bulk-updates every child Tool with grpc_service_id == service.id in
  the same transaction whenever any of those fields change. Closes the
  Layer 1 token-scoping gap where a public->team change on the parent
  service was silently ignored on already-discovered tools.

* B6 (real gRPC deadlines): GrpcEndpoint.start, _discover_services,
  and _discover_service_details now accept a timeout kwarg that is
  threaded into ServerReflectionInfo, and GrpcEndpoint.invoke binds
  the deadline onto the underlying unary_unary call so a slow upstream
  cannot keep an executor thread alive after asyncio.wait_for cancels
  the wrapping coroutine. invoke_method passes the effective tool
  timeout into both endpoint.start and endpoint.invoke.

* B7 (cancellation propagation in outer except BaseException): adds
  except asyncio.CancelledError: raise immediately before the three
  outer except BaseException as e blocks in tool_service.invoke_tool
  so a cancellation is never wrapped as a ToolInvocationError.

* B12 (protobuf 6.x compatibility): MessageFactory.GetPrototype was
  removed in protobuf >= 5.x. Both the unary and streaming invocation
  paths in translate_grpc now call message_factory.GetMessageClass(...)
  instead. Updated the corresponding unit-test mocks.

* Cleanup: drops unused import ipaddress that became dead after
  the SecurityValidator delegation.

Test mocks updated to match the new signatures (timeout kwarg on the
populate_service / _populate callbacks; GetMessageClass at the module
level instead of MessageFactory.GetPrototype).

Lint clean (black, isort, ruff E3/E4/E7/E9/F/D1). 520 targeted tests
pass.

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(grpc): preserve timeout=0 fail-fast semantics in deadline ternaries

The 'if timeout' ternary in three reflection/unary call sites silently
drops timeout=0, but gRPC accepts timeout=0 as the explicit 'fail
immediately' deadline. Switched to 'if timeout is not None' so a
caller-supplied 0.0 reaches the underlying gRPC call with the right
semantics.

Found in iteration-3 oracle review (final pass, no new BLOCKING items).

Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(grpc): apply non-blocking review nits flagged across the three review passes

Bundle of low-risk, non-blocking improvements that surfaced during the
deep review iterations. Each item below was explicitly classified as
fix-now in the review summary; broader / design-dependent items were
filed as follow-ups under #4612 and #4613.

* Schema parity (#1): add grpc_service_id to ToolRead in schemas.py
  alongside the existing gateway_id field so API consumers can identify
  gRPC-discovered tools (review S2.4 / TD1).

* db.py cleanup (#2): drop redundant nullable=True from
  Tool.grpc_service_id; Mapped[Optional[str]] already implies it and
  the sibling gateway_id column omits it (review TD2).

* translate_grpc.load_file_descriptors (#3): widen the parameter type
  from List[bytes] to Sequence[bytes] and reject a single bytes object
  passed by mistake. Without the guard Python would silently iterate
  byte-by-byte (review TD3).

* translate_grpc descriptor pool conflict (#4): replace the inaccurate
  'no-op if already added' comment with an explicit TypeError handler.
  protobuf raises TypeError when a file with the same name has
  conflicting content; the existing descriptor stays authoritative
  (review S3.3).

* test_sync_tools_removes_stale_tools (#5): replace the brittle
  string-match assertion ('DELETE' in str(call)) with an explicit
  call_count == 4 (1 select + 3 deletes), which no longer depends on
  the SQLAlchemy Delete object repr (review S2.6).

* translate_grpc close() (#6): convert the f-string log to lazy
  %-style for consistency with the rest of this PR's logging
  (review N2.1).

* grpc_service _sync_tools_from_reflection (#7): document why
  input_schema['properties'] is empty by design — gRPC arg shape is
  validated at the protobuf invocation layer, not the MCP tool-call
  layer; the actual proto types live in the x-grpc-* extensions
  (review N2.2).

* TestInvokeMethodGuards (#8): add 5 edge-case tests covering
  invoke_method paths the previous suite did not exercise:
  - service-not-found  -> GrpcServiceNotFoundError
  - disabled service   -> GrpcServiceError('is disabled')
  - invalid method format (no dot) -> GrpcServiceError
  - _validate_grpc_target spy called with service.target
  - _validate_tls_path spy called for both cert and key paths

525 targeted tests pass (was 520). Lint clean.

Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* style: clean lint cycle (make autoflake/isort/black/ruff/bandit/interrogate/pylint)

Remaining fixes the project lint cycle surfaced after the fix-now bundle landed:

* isort: re-sort sibling imports in test_grpc_service.py
  (_enforce_descriptor_limits before _GRPC_MAX_*; case-insensitive order).

* ruff PLW0108 (unnecessary lambda) in test_translate_grpc.py: align the
  4 mocks I added with the 7 pre-existing patterns in the same file
  (lambda **_kw: MagicMock() / lambda *_a, **_kw: MagicMock()).

* ruff RET501 in test_grpc_service.py::tls_spy: drop the redundant
  trailing 'return None'.

* pylint no-member on the alembic migration: add the project-standard
  '# pylint: disable=no-member' header that 9 other migrations use to
  silence alembic.op's dynamic-attribute false positives.

* pylint try-except-raise on tool_service.invoke_tool gRPC branch:
  the 'except asyncio.CancelledError: raise' clause LOOKS like a no-op
  but exists specifically so the LATER 'except Exception' cannot swallow
  cancellation (PR #3202 review B7). Added an inline disable plus a
  comment citing B7 so the next lint sweep doesn't delete the
  anti-regression code.

Verified after fix:
  make autoflake / isort / black -> clean
  make ruff (E3,E4,E7,E9,F,D1)  -> All checks passed
  make bandit                   -> 0 issues, 18022 LoC
  make interrogate              -> 100% docstring coverage
  make pylint                   -> 10.00/10 on all 6 production files
  pytest (targeted gRPC suite)  -> 525/525 passing

Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix: Re-parent alembic migration to current head after second rebase

Updates down_revision from aa1b2c3d4e5f to 926d3e07d098. Origin/main
gained the CPEX plugin framework replacement (#3754) which introduced
two new migrations and a mergepoint, leaving our grpc_service_id
migration as a sibling head. Re-pointing to the new head returns the
chain to single-headed.

Verified single head:
  alembic heads -> w7x8y9z0a1b2 (head)
  alembic upgrade head -> clean on a fresh sqlite DB

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(grpc): restore reserved/multicast guard and update test_grpc_service_no_grpc

Two related fixes for the no-grpc test file that surfaced after rebase:

1. Restore is_reserved / is_multicast pre-check in _validate_grpc_target.
   The earlier delegation to SecurityValidator._validate_ssrf inadvertently
   dropped this guard because the shared validator only checks
   blocked-networks / localhost / private. IP literals like 224.0.0.1
   (multicast) or 0.0.0.0 (reserved) would have slipped past unless they
   were also in ssrf_blocked_networks. The restored check excludes
   loopback so ::1 (which Python flags as both is_loopback AND
   is_reserved) still passes through the localhost gate as intended.

2. Update test_grpc_service_no_grpc.py:
   - Add timeout=None kwarg to the three FakeEndpoint.start() and
     invoke() methods so they accept the per-RPC deadline added in the
     security/robustness hardening commit.
   - Update test_validate_grpc_target_enforces_ssrf_rules regex
     patterns to match the SecurityValidator messages reached via
     delegation (blocked hostname X, localhost address which is
     blocked, private network address which is blocked).

All 4 gRPC test files pass: 528 / 528.

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* test: increase diff coverage for gRPC PR from 92% to 100% on grpc_service+translate_grpc

Adds 11 targeted tests + a decision-record comment to address the diff-coverage
report that flagged 18 previously-uncovered lines:

  grpc_service.py:    95.7% -> 100%  (7/7 covered)
  translate_grpc.py:  91.7% -> 100%  (3/3 covered)
  tool_service.py:    69.2% -> 62%   (5/8 covered + 3 documented)

New tests in tests/unit/mcpgateway/services/test_grpc_service.py:

* test_validate_grpc_target_empty_string                        -> line 134
* test_invoke_method_propagates_cancelled_error                 -> line 1021
* test_invoke_method_re_raises_timeout                          -> lines 1023-1024
* test_invoke_method_re_raises_grpc_service_error_unwrapped     -> line 1026
* test_update_service_propagates_visibility_to_child_tools      -> lines 503-504

New tests in tests/unit/mcpgateway/test_translate_grpc.py:

* test_load_file_descriptors_rejects_single_bytes               -> line 409
* test_load_file_descriptors_skips_pool_conflict                -> lines 420, 425

New tests in tests/unit/mcpgateway/services/test_tool_service.py:

* test_invoke_grpc_tool_propagates_cancellation                 -> line 5847
* test_invoke_grpc_tool_timeout_raises_tool_timeout_error       -> lines 5849-5850, 5852

Restored a regression: the earlier B1 refactor (delegating to
SecurityValidator._validate_ssrf) inadvertently dropped the is_reserved /
is_multicast guard, since SecurityValidator only checks blocked-networks /
localhost / private. Restored the local guard with an is_loopback exclusion
so IPv6 ::1 (which Python flags as both is_loopback AND is_reserved)
still passes through the localhost gate.

Coverage decision-record left in test_tool_service.py for the 3 remaining
B7 anti-regression lines (5446 REST, 5632 MCP, 5851 gRPC timeout-with-pm):
they are byte-identical to the gRPC variant at line 5847 (which is fully
exercised). Equivalent REST/MCP tests require coaxing CancelledError
through asyncio.wait_for, which Python's event loop converts to TimeoutError
in some states. Protecting one branch is sufficient to detect a regression
that would affect all three.

541 tests passing, ruff/black/isort/pylint clean.

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix: re-parent alembic migration onto current head + restore interrogate 100%

Two CI failures triggered by main moving forward during this PR's review:

* alembic multi-head: PR #4501 ("add ondelete=CASCADE to FK constraints")
  introduced 9fb98535724d as a sibling of our w7x8y9z0a1b2, both parented
  to 926d3e07d098. Re-parents w7x8y9z0a1b2 -> 9fb98535724d so the chain
  has a single head again and bootstrap_db / SQLite+Postgres upgrade
  validation pass. Verified single head with alembic heads.

* interrogate --fail-under=100: three nested helper functions had no
  docstrings, dropping the package to 99.9%. One is mine
  (GrpcEndpoint.invoke._call introduced by the gRPC timeout fix in
  PR #3202 review B6); the other two (_user_obj_to_dict._iso,
  _user_dict_to_obj._dt in services/email_auth_service.py) landed
  on main via PR #4595 and were already failing main's CI. All three
  added one-line docstrings.

Cascade fixed by these two changes:
  - SQLite + PostgreSQL Fresh/Upgrade (alembic head check)
  - pytest (py3.12) (bootstrap_db on alembic head check)
  - playwright-ci-smoke (depends on bootstrap)
  - sql-sanitizer E2E (depends on bootstrap)
  - interrogate (mcpgateway) (docstring coverage)
  - Run pre-commit hooks (interrogate is a pre-commit hook)

The 2 email_auth_service docstrings are technically out of scope for
this PR but the local lint cycle scoping I had been using
(make interrogate TARGET=<my-files>) didn't catch them. Folding
them in here unblocks PR-level CI rather than waiting for a separate
fix to main.

Verified locally:
  alembic heads      -> single head w7x8y9z0a1b2
  interrogate        -> 100% on full mcpgateway/
  black --check      -> clean
  ruff check         -> clean
  pylint             -> 10.00/10
  targeted gRPC tests -> 541 passing

#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Jonathan Springer <jps@s390x.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Jonathan Springer <jps@s390x.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

COULD P3: Nice-to-have features with minimal impact if left out; included if time permits enhancement New feature or request performance Performance related items python Python / backend development (FastAPI)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE][PERFORMANCE]: Cache get_user_by_email at service level

3 participants