perf(auth): cache get_user_by_email at service level to reduce DB hot-path queries#4595
Conversation
c49c699 to
8f1d424
Compare
Lang-Akshay
left a comment
There was a problem hiding this comment.
Thanks for the PR - @msureshkumar88
Please address the following findings
- The PR stores a full user object (including
password_hash) in Redis and returns a SQLAlchemy-detached object from cache hits. Any method that mutates the returned object (e.g.authenticate_user,unlock_user_account) silently discards those changes because the ORM session does not track the detached instance. This enables two critical authentication-path bypasses.
| # | File | Line | Severity | CWE | Description |
|---|---|---|---|---|---|
| 1 | mcpgateway/services/email_auth_service.py |
897 | Critical | CWE-307 | Brute-force lockout bypass via detached ORM object. authenticate_user calls get_user_by_email → cache hit → _user_dict_to_obj returns a new, session-detached EmailUser. Then user.increment_failed_attempts() is called and self.db.commit() is issued — but the detached object is not tracked, so the DB counter is never updated. Every request within the 60 s TTL window sees failed_login_attempts = <stale> and failed attempts are silently discarded, enabling unlimited password guesses during the TTL. |
| 2 | mcpgateway/services/email_auth_service.py |
1162–1175 | Critical | CWE-613 | unlock_user_account mutations are silently dropped + missing invalidation. get_user_by_email on a cache-hit returns a detached object. Assignments user.failed_login_attempts = 0 and user.locked_until = None then self.db.commit() silently no-op. Account is never actually unlocked in DB. No _invalidate_user_auth_cache call either, so stale locked state persists in cache. |
| 3 | mcpgateway/services/email_auth_service.py |
76–91 | High | CWE-312 | password_hash serialised into Redis. _user_obj_to_dict includes the Argon2id hash in the JSON payload written to Redis. A Redis compromise (SSRF, misconfigured bind, key dump) directly exposes all cached hashes. The existing module docstring states "JWT payloads are NOT cached (security risk)" — the same principle applies here. |
| 4 | mcpgateway/services/email_auth_service.py |
121 | Medium | CWE-20 | is_active defaults to True on deserialization. d.get("is_active", True) means a corrupted/missing key produces an active user — fails open. Should be False to fail closed. |
| 5 | mcpgateway/services/email_auth_service.py |
579, 596 | Low | — | Duplicate import auth_cache inside get_user_by_email. from mcpgateway.cache.auth_cache import auth_cache appears in two separate try blocks within the same method call. |
| 6 | mcpgateway/cache/auth_cache.py |
~413 | Low | CWE-362 | L1 cache dict read outside lock. self._user_cache.get(email) at the top of get_user is done without acquiring self._lock, inconsistent with the write path inside the method. |
Redundant Code
| # | File | Line(s) | Type | Description | Suggestion |
|---|---|---|---|---|---|
| 1 | mcpgateway/services/email_auth_service.py |
579, 596 | Duplicated import | from mcpgateway.cache.auth_cache import auth_cache appears twice in consecutive try blocks inside get_user_by_email |
Import once at the top of the method or hoist above both try blocks |
| 2 | mcpgateway/cache/auth_cache.py |
~428, ~456 | Repeated lazy import | import orjson inside both get_user and set_user try-blocks |
Move to top of auth_cache.py alongside other third-party imports (or at class level) |
|
Thanks for the thorough review @Lang-Akshay — all six findings are addressed in the latest commits. Here's a summary of what changed: Critical / High
Medium / Low
Tests updated / added
All 17 286 unit tests pass. |
95c6b89 to
b2c7cbd
Compare
13ee4bb to
11b125e
Compare
Lang-Akshay
left a comment
There was a problem hiding this comment.
Security hardening
Scope includes cache integration for user lookups and mutation invalidation. The changes introduce two password-reset regressions tied to cached detached objects.
| # | File | Line | Severity | CWE | Description | Fix |
|---|---|---|---|---|---|---|
| 1 | mcpgateway/services/email_auth_service.py | reset_password_with_token | High | CWE-610 | reset_password_with_token() uses get_user_by_email(), which can return a detached cached user; writes to user.password_hash may not persist, leaving the previous password valid. |
Fetch a session-attached user in reset_password_with_token() via _fetch_user_from_db() (or merge the object into the session) before mutating and committing. |
| 2 | mcpgateway/services/email_auth_service.py | reuse check | Medium | CWE-521 | Password reuse checks can be bypassed on cache hits because cached users store an empty password_hash, making reuse comparisons always pass. |
Perform reuse checks using a DB-attached |
|
Thanks @Lang-Akshay — both findings from the second review are addressed in the latest commit. Finding 1 (High · CWE-610) — detached object in
Fixed by replacing Finding 2 (Medium · CWE-521) — reuse check bypassed via empty cached hash
Resolved by the same fix above — Tests updated Four |
68d7ba2 to
88670cc
Compare
Lang-Akshay
left a comment
There was a problem hiding this comment.
Thanks for the PR implementing service-level caching for get_user_by_email(), reducing database load during the authentication hot path by storing User objects in AuthCache (both L1 in-memory and L2 Redis)
Add get_user()/set_user() to AuthCache (L1+L2) and wire them into EmailAuthService.get_user_by_email() to avoid repeated SELECT queries on every service call. Add missing cache invalidation to update_user(), activate_user(), and deactivate_user(). Include unit tests for cache hit, miss, fallback, and invalidation paths. Closes #3061 Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Cover previously untested error paths: - AuthCache.get_user(): Redis exception handler (line 438) - AuthCache.set_user(): Redis path with orjson import and setex (lines 463/466) and exception handler (line 468) - _user_dict_to_obj(): _dt() passthrough for datetime objects (line 110) - get_user_by_email(): set_user exception suppression (line 598) Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Log debug messages on cache read/write failures instead of silently passing, making Redis fallback behaviour observable. Closes #3061 Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
- Remove password_hash from cache dict (CWE-312): _user_obj_to_dict no longer serialises the Argon2id hash into Redis; _user_dict_to_obj falls back to empty string for read-only paths - is_active defaults to False on deserialisation (CWE-20): fail-closed instead of fail-open for missing/corrupted cache key - Add _fetch_user_from_db() helper for mutation paths; wire authenticate_user and unlock_user_account to use it instead of the cached detached object, preventing silent ORM mutation discards (CWE-307 brute-force bypass, CWE-613 unlock bypass) - Add _invalidate_user_auth_cache() call to unlock_user_account after DB commit so stale locked state is evicted immediately - Hoist duplicate auth_cache import above both try blocks in get_user_by_email; move orjson module-level import in auth_cache.py - Acquire self._lock before L1 dict read in get_user() to be consistent with the write path (CWE-362) Tests: 5 new regression tests covering each security finding; updated test_email_auth_basic.py to patch _fetch_user_from_db for unlock tests. All 17286 unit tests pass. Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Add test hitting the path where Redis is available but returns None, incrementing _redis_miss_count. Previously untested, causing 97.4% coverage on auth_cache.py. Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Drop CWE-307 and CWE-613 tags from inline comments in email_auth_service.py. Explanatory text retained intact. Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
orjson already imported at module level (line 41). Local re-imports at lines 309, 381, 817, 856 triggered W0621/W0404 pylint warnings. Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
get_user_by_email can return a detached cached EmailUser with an empty password_hash. Two regressions followed: - CWE-610: mutations (password_hash, failed_login_attempts, etc.) written to the detached object were silently dropped on commit, leaving the old password valid after a reset. - CWE-521: the password-reuse check compared against "" (cached hash placeholder), so reuse was never detected on cache hits. _fetch_user_from_db always returns a session-attached object with the real hash, matching the pattern already used in authenticate_user and unlock_user_account. Tests updated: mock _fetch_user_from_db (sync) instead of the async get_user_by_email across all four reset_password_with_token test cases. Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Signed-off-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
88670cc to
e46c2e1
Compare
…ate 100% Two CI failures triggered by main moving forward during this PR's review: * alembic multi-head: PR #4501 ("add ondelete=CASCADE to FK constraints") introduced 9fb98535724d as a sibling of our w7x8y9z0a1b2, both parented to 926d3e07d098. Re-parents w7x8y9z0a1b2 -> 9fb98535724d so the chain has a single head again and bootstrap_db / SQLite+Postgres upgrade validation pass. Verified single head with alembic heads. * interrogate --fail-under=100: three nested helper functions had no docstrings, dropping the package to 99.9%. One is mine (GrpcEndpoint.invoke._call introduced by the gRPC timeout fix in PR #3202 review B6); the other two (_user_obj_to_dict._iso, _user_dict_to_obj._dt in services/email_auth_service.py) landed on main via PR #4595 and were already failing main's CI. All three added one-line docstrings. Cascade fixed by these two changes: - SQLite + PostgreSQL Fresh/Upgrade (alembic head check) - pytest (py3.12) (bootstrap_db on alembic head check) - playwright-ci-smoke (depends on bootstrap) - sql-sanitizer E2E (depends on bootstrap) - interrogate (mcpgateway) (docstring coverage) - Run pre-commit hooks (interrogate is a pre-commit hook) The 2 email_auth_service docstrings are technically out of scope for this PR but the local lint cycle scoping I had been using (make interrogate TARGET=<my-files>) didn't catch them. Folding them in here unblocks PR-level CI rather than waiting for a separate fix to main. Verified locally: alembic heads -> single head w7x8y9z0a1b2 interrogate -> 100% on full mcpgateway/ black --check -> clean ruff check -> clean pylint -> 10.00/10 targeted gRPC tests -> 541 passing #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com>
…ate 100% Two CI failures triggered by main moving forward during this PR's review: * alembic multi-head: PR #4501 ("add ondelete=CASCADE to FK constraints") introduced 9fb98535724d as a sibling of our w7x8y9z0a1b2, both parented to 926d3e07d098. Re-parents w7x8y9z0a1b2 -> 9fb98535724d so the chain has a single head again and bootstrap_db / SQLite+Postgres upgrade validation pass. Verified single head with alembic heads. * interrogate --fail-under=100: three nested helper functions had no docstrings, dropping the package to 99.9%. One is mine (GrpcEndpoint.invoke._call introduced by the gRPC timeout fix in PR #3202 review B6); the other two (_user_obj_to_dict._iso, _user_dict_to_obj._dt in services/email_auth_service.py) landed on main via PR #4595 and were already failing main's CI. All three added one-line docstrings. Cascade fixed by these two changes: - SQLite + PostgreSQL Fresh/Upgrade (alembic head check) - pytest (py3.12) (bootstrap_db on alembic head check) - playwright-ci-smoke (depends on bootstrap) - sql-sanitizer E2E (depends on bootstrap) - interrogate (mcpgateway) (docstring coverage) - Run pre-commit hooks (interrogate is a pre-commit hook) The 2 email_auth_service docstrings are technically out of scope for this PR but the local lint cycle scoping I had been using (make interrogate TARGET=<my-files>) didn't catch them. Folding them in here unblocks PR-level CI rather than waiting for a separate fix to main. Verified locally: alembic heads -> single head w7x8y9z0a1b2 interrogate -> 100% on full mcpgateway/ black --check -> clean ruff check -> clean pylint -> 10.00/10 targeted gRPC tests -> 541 passing #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com>
…ate 100% Two CI failures triggered by main moving forward during this PR's review: * alembic multi-head: PR #4501 ("add ondelete=CASCADE to FK constraints") introduced 9fb98535724d as a sibling of our w7x8y9z0a1b2, both parented to 926d3e07d098. Re-parents w7x8y9z0a1b2 -> 9fb98535724d so the chain has a single head again and bootstrap_db / SQLite+Postgres upgrade validation pass. Verified single head with alembic heads. * interrogate --fail-under=100: three nested helper functions had no docstrings, dropping the package to 99.9%. One is mine (GrpcEndpoint.invoke._call introduced by the gRPC timeout fix in PR #3202 review B6); the other two (_user_obj_to_dict._iso, _user_dict_to_obj._dt in services/email_auth_service.py) landed on main via PR #4595 and were already failing main's CI. All three added one-line docstrings. Cascade fixed by these two changes: - SQLite + PostgreSQL Fresh/Upgrade (alembic head check) - pytest (py3.12) (bootstrap_db on alembic head check) - playwright-ci-smoke (depends on bootstrap) - sql-sanitizer E2E (depends on bootstrap) - interrogate (mcpgateway) (docstring coverage) - Run pre-commit hooks (interrogate is a pre-commit hook) The 2 email_auth_service docstrings are technically out of scope for this PR but the local lint cycle scoping I had been using (make interrogate TARGET=<my-files>) didn't catch them. Folding them in here unblocks PR-level CI rather than waiting for a separate fix to main. Verified locally: alembic heads -> single head w7x8y9z0a1b2 interrogate -> 100% on full mcpgateway/ black --check -> clean ruff check -> clean pylint -> 10.00/10 targeted gRPC tests -> 541 passing #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com>
…ate 100% Two CI failures triggered by main moving forward during this PR's review: * alembic multi-head: PR #4501 ("add ondelete=CASCADE to FK constraints") introduced 9fb98535724d as a sibling of our w7x8y9z0a1b2, both parented to 926d3e07d098. Re-parents w7x8y9z0a1b2 -> 9fb98535724d so the chain has a single head again and bootstrap_db / SQLite+Postgres upgrade validation pass. Verified single head with alembic heads. * interrogate --fail-under=100: three nested helper functions had no docstrings, dropping the package to 99.9%. One is mine (GrpcEndpoint.invoke._call introduced by the gRPC timeout fix in PR #3202 review B6); the other two (_user_obj_to_dict._iso, _user_dict_to_obj._dt in services/email_auth_service.py) landed on main via PR #4595 and were already failing main's CI. All three added one-line docstrings. Cascade fixed by these two changes: - SQLite + PostgreSQL Fresh/Upgrade (alembic head check) - pytest (py3.12) (bootstrap_db on alembic head check) - playwright-ci-smoke (depends on bootstrap) - sql-sanitizer E2E (depends on bootstrap) - interrogate (mcpgateway) (docstring coverage) - Run pre-commit hooks (interrogate is a pre-commit hook) The 2 email_auth_service docstrings are technically out of scope for this PR but the local lint cycle scoping I had been using (make interrogate TARGET=<my-files>) didn't catch them. Folding them in here unblocks PR-level CI rather than waiting for a separate fix to main. Verified locally: alembic heads -> single head w7x8y9z0a1b2 interrogate -> 100% on full mcpgateway/ black --check -> clean ruff check -> clean pylint -> 10.00/10 targeted gRPC tests -> 541 passing #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com>
* feat: Add grpc_service_id to Tool schema #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> AI-usage: full * feat: Add tool registration and deletion for registering/deleting gRPC services #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> AI-usage: full * feat: Add load_file_descriptors for grpc serialized proto descriptor bytes #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> AI-usage: full * feat: Add gRPC tool invocation support #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> AI-usage: full * test: Unit tests for gRPC tool registration #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> AI-usage: full * style: make lint #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: Fix missing grpc_service_id in tool unit test #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> AI-usage: full * test: Flesh out test coverage for new gRPC code paths #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> AI-usage: full * fix: Better error handling for protobuf descriptor pool adding #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Inherit visibility from parent gRPC service when mapping to tools #2854 Branch: GrpcMethodsAsTools-2854 AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix alembic migration after rebase #2854 Branch: GrpcMethodsAsTools-2854 AI-usage: full Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Repoint alembic migration parent to current head after rebase Updates down_revision from x7h8i9j0k1l2 to aa1b2c3d4e5f, the current head on origin/main. Several new migrations have landed since this PR was last rebased, so the previous parent is no longer the head. Verified single head: alembic heads -> w7x8y9z0a1b2 (head) #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * test: Add grpc_service_id=None to TestRustMcpExecutionPlan SimpleNamespace mocks Two tests added to TestRustMcpExecutionPlan after this PR was last rebased construct tool mocks via SimpleNamespace. The PR's change to _build_tool_cache_payload reads tool.grpc_service_id, so those mocks now AttributeError. Mirrors b576ec4's earlier fix for the same pattern in MagicMock(spec=DbTool) helpers. Affected: - TestRustMcpExecutionPlan.test_prepare_rust_mcp_tool_execution_uses_live_gateway_auth_fields_for_loaded_tools - TestRustMcpExecutionPlan.test_prepare_rust_mcp_tool_execution_uses_live_gateway_string_auth_values #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * style: make black isort Tightens whitespace and import grouping per project lint config: - mcpgateway/services/grpc_service.py: black removes a stray blank line - tests/unit/mcpgateway/{services/test_grpc_service,services/test_tool_service,test_translate_grpc}.py: isort regroups stdlib/third-party/first-party/local imports #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(grpc): use protobuf 6.x kwarg in MessageToDict and isolate descriptor pool Two issues addressed in mcpgateway/translate_grpc.py: 1. Pre-existing runtime bug: invoke()/invoke_streaming() called json_format.MessageToDict(..., including_default_value_fields=True), the parameter name from protobuf 4.x. The repo pins protobuf>=6.33.6 (see pyproject.toml), where the parameter was renamed to always_print_fields_with_no_presence. Calls raised TypeError at runtime, breaking gRPC tool invocation. Renamed the kwarg and replaced the # pylint: disable=unexpected-keyword-arg suppression with an anti-regression comment. 2. Descriptor pool poisoning: GrpcEndpoint shared descriptor_pool.Default(), the process-wide singleton. Reflected FileDescriptorProto from one (untrusted) upstream service could leak across requests and cause symbol collisions or type confusion in subsequent calls. Switched to a per-endpoint descriptor_pool.DescriptorPool(); MessageFactory now binds to that private pool. Updated tests/unit/mcpgateway/test_translate_grpc.py monkeypatches to expose DescriptorPool() instead of Default() and MessageFactory(pool=...) instead of MessageFactory(). Closes review: B2, B11 #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(grpc): security and robustness hardening from review Addresses BLOCKING findings from a multi-agent code review of PR #3202: Security: * Reject local-only gRPC schemes (unix:, unix-abstract:, vsock:, fd:) in _validate_grpc_target since they bypass the IP-based SSRF model. Also normalize dns:/// and ipv4:/ipv6: name-resolver prefixes and parse bracketed IPv6 hosts correctly so the host-extraction step cannot be tricked. (review B1) * Add _enforce_descriptor_limits to bound reflected FileDescriptorProto count, per-blob size, and aggregate size before storing or loading. Hardcoded constants (not settings) so a config change cannot silently weaken the DoS defense. (review B3) * Add _validate_reflected_tool_name that runs reflected tool names through SecurityValidator.validate_tool_name() before persisting. Reflected names from upstream gRPC servers no longer bypass the injection / length / character checks applied to user-registered tools. (review B4) * Strip metadata pseudo-keys (anything starting with '_') when copying discovered_services into endpoint._services so the _file_descriptors entry can never be confused with a service. * Decode stored descriptors with base64.b64decode(..., validate=True) to fail fast on tampered DB content. Robustness / Layer 1 invariants: * _sync_tools_from_reflection now runs each method through a per-tool try/except so a single bad method cannot abort the entire reflection sync. Counts created / updated / failed. (review B9) * _sync_tools_from_reflection now propagates parent service visibility, team_id, and owner_email to existing reflected tools, not just to newly created ones. Closes a Layer 1 token-scoping gap where a public->team visibility change on the parent service was silently ignored on already-discovered tools. (review B5) Operational: * invoke_method gains a timeout parameter (default settings.tool_timeout) and wraps endpoint.start() / endpoint.invoke() in asyncio.wait_for so a slow upstream cannot tie up a worker indefinitely. (review B6) * tool_service.invoke_tool gRPC branch wraps the call in asyncio.wait_for, surfaces ToolTimeoutError + post-invoke timeout hook (matches A2A branch), preserves CancelledError instead of swallowing it, and logs with %-style + exc_info=True. (review B6/B7/B8) * invoke_method preserves CancelledError, logs with %-style + exc_info=True, re-raises GrpcServiceError/GrpcServiceNotFoundError unwrapped instead of double-wrapping them. (review B8) #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * refactor(grpc): drop redundant inline settings import The inline 'from mcpgateway.config import settings' in _validate_grpc_target was made redundant by hoisting the same import to module top in the security hardening commit. Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * test(grpc): cover security-hardening helpers and Layer 1 invariants Adds two test classes exercising the helpers introduced by the security/robustness hardening commit: TestSecurityHardening: * _validate_grpc_target rejects unix:/unix-abstract:/vsock:/fd: schemes * _validate_grpc_target normalizes dns:///, dns://, dns:, ipv4:, ipv6: name-resolver prefixes * _validate_grpc_target accepts bracketed IPv6 literals and rejects malformed bracket syntax * _enforce_descriptor_limits enforces count, per-blob, and aggregate size caps; happy path passes within bounds * _validate_reflected_tool_name rejects empty / over-length / injection patterns and accepts proto-style identifiers TestVisibilityPropagation: * _sync_tools_from_reflection propagates parent service visibility, team_id, and owner_email to existing reflected tools (Layer 1 invariant). * _sync_tools_from_reflection isolates per-method failures: a method with an invalid name is skipped while the rest of the sync proceeds (B9 regression test). All 13 new tests pass; the broader gRPC + tool_service + translate_grpc suite is green at 520 tests. #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(grpc): close remaining review gaps from iteration-2 oracle pass Addresses BLOCKING items the second-pass review identified as still present after the first round of fixes. * B1 (DNS resolution): _validate_grpc_target now delegates the hostname/IP-network/DNS policy to SecurityValidator._validate_ssrf so hostnames like metadata.google.internal are resolved and the resolved IPs go through the same blocklist as HTTP targets. Removes the duplicate hand-rolled IP-network logic that previously lived inline. * B5 (visibility on update): GrpcService.update_service now snapshots visibility / team_id / owner_email before applying the update and bulk-updates every child Tool with grpc_service_id == service.id in the same transaction whenever any of those fields change. Closes the Layer 1 token-scoping gap where a public->team change on the parent service was silently ignored on already-discovered tools. * B6 (real gRPC deadlines): GrpcEndpoint.start, _discover_services, and _discover_service_details now accept a timeout kwarg that is threaded into ServerReflectionInfo, and GrpcEndpoint.invoke binds the deadline onto the underlying unary_unary call so a slow upstream cannot keep an executor thread alive after asyncio.wait_for cancels the wrapping coroutine. invoke_method passes the effective tool timeout into both endpoint.start and endpoint.invoke. * B7 (cancellation propagation in outer except BaseException): adds except asyncio.CancelledError: raise immediately before the three outer except BaseException as e blocks in tool_service.invoke_tool so a cancellation is never wrapped as a ToolInvocationError. * B12 (protobuf 6.x compatibility): MessageFactory.GetPrototype was removed in protobuf >= 5.x. Both the unary and streaming invocation paths in translate_grpc now call message_factory.GetMessageClass(...) instead. Updated the corresponding unit-test mocks. * Cleanup: drops unused import ipaddress that became dead after the SecurityValidator delegation. Test mocks updated to match the new signatures (timeout kwarg on the populate_service / _populate callbacks; GetMessageClass at the module level instead of MessageFactory.GetPrototype). Lint clean (black, isort, ruff E3/E4/E7/E9/F/D1). 520 targeted tests pass. #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(grpc): preserve timeout=0 fail-fast semantics in deadline ternaries The 'if timeout' ternary in three reflection/unary call sites silently drops timeout=0, but gRPC accepts timeout=0 as the explicit 'fail immediately' deadline. Switched to 'if timeout is not None' so a caller-supplied 0.0 reaches the underlying gRPC call with the right semantics. Found in iteration-3 oracle review (final pass, no new BLOCKING items). Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(grpc): apply non-blocking review nits flagged across the three review passes Bundle of low-risk, non-blocking improvements that surfaced during the deep review iterations. Each item below was explicitly classified as fix-now in the review summary; broader / design-dependent items were filed as follow-ups under #4612 and #4613. * Schema parity (#1): add grpc_service_id to ToolRead in schemas.py alongside the existing gateway_id field so API consumers can identify gRPC-discovered tools (review S2.4 / TD1). * db.py cleanup (#2): drop redundant nullable=True from Tool.grpc_service_id; Mapped[Optional[str]] already implies it and the sibling gateway_id column omits it (review TD2). * translate_grpc.load_file_descriptors (#3): widen the parameter type from List[bytes] to Sequence[bytes] and reject a single bytes object passed by mistake. Without the guard Python would silently iterate byte-by-byte (review TD3). * translate_grpc descriptor pool conflict (#4): replace the inaccurate 'no-op if already added' comment with an explicit TypeError handler. protobuf raises TypeError when a file with the same name has conflicting content; the existing descriptor stays authoritative (review S3.3). * test_sync_tools_removes_stale_tools (#5): replace the brittle string-match assertion ('DELETE' in str(call)) with an explicit call_count == 4 (1 select + 3 deletes), which no longer depends on the SQLAlchemy Delete object repr (review S2.6). * translate_grpc close() (#6): convert the f-string log to lazy %-style for consistency with the rest of this PR's logging (review N2.1). * grpc_service _sync_tools_from_reflection (#7): document why input_schema['properties'] is empty by design — gRPC arg shape is validated at the protobuf invocation layer, not the MCP tool-call layer; the actual proto types live in the x-grpc-* extensions (review N2.2). * TestInvokeMethodGuards (#8): add 5 edge-case tests covering invoke_method paths the previous suite did not exercise: - service-not-found -> GrpcServiceNotFoundError - disabled service -> GrpcServiceError('is disabled') - invalid method format (no dot) -> GrpcServiceError - _validate_grpc_target spy called with service.target - _validate_tls_path spy called for both cert and key paths 525 targeted tests pass (was 520). Lint clean. Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * style: clean lint cycle (make autoflake/isort/black/ruff/bandit/interrogate/pylint) Remaining fixes the project lint cycle surfaced after the fix-now bundle landed: * isort: re-sort sibling imports in test_grpc_service.py (_enforce_descriptor_limits before _GRPC_MAX_*; case-insensitive order). * ruff PLW0108 (unnecessary lambda) in test_translate_grpc.py: align the 4 mocks I added with the 7 pre-existing patterns in the same file (lambda **_kw: MagicMock() / lambda *_a, **_kw: MagicMock()). * ruff RET501 in test_grpc_service.py::tls_spy: drop the redundant trailing 'return None'. * pylint no-member on the alembic migration: add the project-standard '# pylint: disable=no-member' header that 9 other migrations use to silence alembic.op's dynamic-attribute false positives. * pylint try-except-raise on tool_service.invoke_tool gRPC branch: the 'except asyncio.CancelledError: raise' clause LOOKS like a no-op but exists specifically so the LATER 'except Exception' cannot swallow cancellation (PR #3202 review B7). Added an inline disable plus a comment citing B7 so the next lint sweep doesn't delete the anti-regression code. Verified after fix: make autoflake / isort / black -> clean make ruff (E3,E4,E7,E9,F,D1) -> All checks passed make bandit -> 0 issues, 18022 LoC make interrogate -> 100% docstring coverage make pylint -> 10.00/10 on all 6 production files pytest (targeted gRPC suite) -> 525/525 passing Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: Re-parent alembic migration to current head after second rebase Updates down_revision from aa1b2c3d4e5f to 926d3e07d098. Origin/main gained the CPEX plugin framework replacement (#3754) which introduced two new migrations and a mergepoint, leaving our grpc_service_id migration as a sibling head. Re-pointing to the new head returns the chain to single-headed. Verified single head: alembic heads -> w7x8y9z0a1b2 (head) alembic upgrade head -> clean on a fresh sqlite DB #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(grpc): restore reserved/multicast guard and update test_grpc_service_no_grpc Two related fixes for the no-grpc test file that surfaced after rebase: 1. Restore is_reserved / is_multicast pre-check in _validate_grpc_target. The earlier delegation to SecurityValidator._validate_ssrf inadvertently dropped this guard because the shared validator only checks blocked-networks / localhost / private. IP literals like 224.0.0.1 (multicast) or 0.0.0.0 (reserved) would have slipped past unless they were also in ssrf_blocked_networks. The restored check excludes loopback so ::1 (which Python flags as both is_loopback AND is_reserved) still passes through the localhost gate as intended. 2. Update test_grpc_service_no_grpc.py: - Add timeout=None kwarg to the three FakeEndpoint.start() and invoke() methods so they accept the per-RPC deadline added in the security/robustness hardening commit. - Update test_validate_grpc_target_enforces_ssrf_rules regex patterns to match the SecurityValidator messages reached via delegation (blocked hostname X, localhost address which is blocked, private network address which is blocked). All 4 gRPC test files pass: 528 / 528. #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * test: increase diff coverage for gRPC PR from 92% to 100% on grpc_service+translate_grpc Adds 11 targeted tests + a decision-record comment to address the diff-coverage report that flagged 18 previously-uncovered lines: grpc_service.py: 95.7% -> 100% (7/7 covered) translate_grpc.py: 91.7% -> 100% (3/3 covered) tool_service.py: 69.2% -> 62% (5/8 covered + 3 documented) New tests in tests/unit/mcpgateway/services/test_grpc_service.py: * test_validate_grpc_target_empty_string -> line 134 * test_invoke_method_propagates_cancelled_error -> line 1021 * test_invoke_method_re_raises_timeout -> lines 1023-1024 * test_invoke_method_re_raises_grpc_service_error_unwrapped -> line 1026 * test_update_service_propagates_visibility_to_child_tools -> lines 503-504 New tests in tests/unit/mcpgateway/test_translate_grpc.py: * test_load_file_descriptors_rejects_single_bytes -> line 409 * test_load_file_descriptors_skips_pool_conflict -> lines 420, 425 New tests in tests/unit/mcpgateway/services/test_tool_service.py: * test_invoke_grpc_tool_propagates_cancellation -> line 5847 * test_invoke_grpc_tool_timeout_raises_tool_timeout_error -> lines 5849-5850, 5852 Restored a regression: the earlier B1 refactor (delegating to SecurityValidator._validate_ssrf) inadvertently dropped the is_reserved / is_multicast guard, since SecurityValidator only checks blocked-networks / localhost / private. Restored the local guard with an is_loopback exclusion so IPv6 ::1 (which Python flags as both is_loopback AND is_reserved) still passes through the localhost gate. Coverage decision-record left in test_tool_service.py for the 3 remaining B7 anti-regression lines (5446 REST, 5632 MCP, 5851 gRPC timeout-with-pm): they are byte-identical to the gRPC variant at line 5847 (which is fully exercised). Equivalent REST/MCP tests require coaxing CancelledError through asyncio.wait_for, which Python's event loop converts to TimeoutError in some states. Protecting one branch is sufficient to detect a regression that would affect all three. 541 tests passing, ruff/black/isort/pylint clean. #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: re-parent alembic migration onto current head + restore interrogate 100% Two CI failures triggered by main moving forward during this PR's review: * alembic multi-head: PR #4501 ("add ondelete=CASCADE to FK constraints") introduced 9fb98535724d as a sibling of our w7x8y9z0a1b2, both parented to 926d3e07d098. Re-parents w7x8y9z0a1b2 -> 9fb98535724d so the chain has a single head again and bootstrap_db / SQLite+Postgres upgrade validation pass. Verified single head with alembic heads. * interrogate --fail-under=100: three nested helper functions had no docstrings, dropping the package to 99.9%. One is mine (GrpcEndpoint.invoke._call introduced by the gRPC timeout fix in PR #3202 review B6); the other two (_user_obj_to_dict._iso, _user_dict_to_obj._dt in services/email_auth_service.py) landed on main via PR #4595 and were already failing main's CI. All three added one-line docstrings. Cascade fixed by these two changes: - SQLite + PostgreSQL Fresh/Upgrade (alembic head check) - pytest (py3.12) (bootstrap_db on alembic head check) - playwright-ci-smoke (depends on bootstrap) - sql-sanitizer E2E (depends on bootstrap) - interrogate (mcpgateway) (docstring coverage) - Run pre-commit hooks (interrogate is a pre-commit hook) The 2 email_auth_service docstrings are technically out of scope for this PR but the local lint cycle scoping I had been using (make interrogate TARGET=<my-files>) didn't catch them. Folding them in here unblocks PR-level CI rather than waiting for a separate fix to main. Verified locally: alembic heads -> single head w7x8y9z0a1b2 interrogate -> 100% on full mcpgateway/ black --check -> clean ruff check -> clean pylint -> 10.00/10 targeted gRPC tests -> 541 passing #2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Signed-off-by: Jonathan Springer <jps@s390x.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Jonathan Springer <jps@s390x.com>

Summary
EmailAuthService.get_user_by_email()is called on every service operation that reads or verifies a user (login, password change, SSO upsert, OAuth token introspection, profile fetch, admin bootstrap, and more — 12+ internal call sites). Each call issued a synchronousSELECT … WHERE email = ?with no caching, even though the existingAuthCachealready had an allocated_user_cachedict and the Redis keymcpgw:auth:user:{email}wired intoinvalidate_user()but never populated.get_user()/set_user()toAuthCacheusing the same L1 (in-memory) → L2 (Redis) write-through pattern asget_user_role()/set_user_role()EmailAuthService.get_user_by_email()so every call checks cache first and populates on a DB hitupdate_user(),activate_user(), anddeactivate_user()— three mutation methods that previously committed user changes without busting the cacheuser_cache_sizeinAuthCache.stats()(was always 0 because the cache was never written)try/except; Redis outage falls through to DB transparentlyNo new config knobs. Existing
auth_cache_user_ttl(default 60 s) andauth_cache_enabledgovern this path.Closes #3061
Sub-issue of #2692 (DB query hot-path optimisations)
Files Changed
mcpgateway/cache/auth_cache.pyget_user()andset_user()methods; addeduser_cache_sizetostats()mcpgateway/services/email_auth_service.py_user_obj_to_dict()/_user_dict_to_obj()helpers; updatedget_user_by_email()to use cache; added_invalidate_user_auth_cache()calls inupdate_user(),activate_user(),deactivate_user()tests/unit/mcpgateway/cache/test_auth_cache_user.pytests/unit/mcpgateway/services/test_email_auth_get_user_cache.pyInvalidation Coverage After This PR
reset_password()change_password()update_user()activate_user()deactivate_user()delete_user()create_user()