[WIP] Update AI model details in documentation#1
Closed
Conversation
…1939) MCP list endpoints (tools/list, resources/list, prompts/list) were returning only ~50 items due to pagination default limit being applied. MCP clients expect all items returned without pagination. Fixed by passing limit=0 to service methods, which returns all items: - streamablehttp_transport.py: tools/list, prompts/list, resources/list - main.py handle_rpc(): tools/list, list_tools, resources/list, prompts/list Also fixed incorrect positional argument passing in streamablehttp transport where request_headers was passed to gateway_id parameter. Closes IBM#1937 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* Cache tool lookups by name closes IBM#1940 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Cache tool lookups by name closes IBM#1940 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Cache tool lookups by name closes IBM#1940 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Cache tool lookups by name closes IBM#1940 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* perf: Add TEMPLATES_AUTO_RELOAD setting and nginx admin caching IBM#1944 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * perf: Add TEMPLATES_AUTO_RELOAD setting and nginx admin caching IBM#1944 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * revert: Remove nginx admin caching (moved to IBM#1946) The nginx caching changes for admin pages have been moved to a separate issue (IBM#1946) due to multi-tenant security concerns that require careful review. This PR now only contains the TEMPLATES_AUTO_RELOAD changes. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…BM#1947) Enable short-TTL (5s) caching for admin pages to reduce CPU load from Jinja2 template rendering under high concurrency. Security measures for multi-tenant safety: - Cache key includes all auth methods ($http_authorization, $cookie_jwt_token, $cookie_access_token) to isolate users - Cache-Control: private prevents shared caches (CDNs/proxies) from storing user-specific content - Set-Cookie header ignored for cache decisions (safe due to short TTL and read-only GET requests) Performance impact (4000 concurrent users): - /admin/ response time: 5414ms → 199ms (96% improvement) - Throughput: ~2400 → ~4000 RPS (67% improvement) Closes IBM#1946 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Optimizes AuthCache by checking in-memory L1 cache before Redis L2 cache: - Implements L1-first lookup pattern in get_auth_context, get_user_role, get_user_teams, is_token_revoked, and get_team_membership_valid - Adds write-through from L2 to L1 to populate local cache on Redis hits - Removes Redis round-trip latency for hot authentication data - Adds comprehensive unit tests for L1/L2 behavior and write-through logic - Updates docstrings to accurately reflect L1-first architecture Closes: IBM#1881 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…1 queries (IBM#1957) Add email_team relationship with lazy="joined" to Tool model, replacing manual team name lookups with a single JOIN query. This eliminates N+1 query patterns when fetching tools with their team names. Changes: - Add email_team relationship to Tool class with lazy="joined" - Add primaryjoin condition to filter only active teams (is_active == True) - Add team property to Tool class for convenient team name access - Document the team property in Tool class docstring - Remove _get_team_name helper method from tool_service.py - Remove manual team_map batch loading in list_tools - Remove manual outerjoin/add_columns in list_server_tools and list_tools_for_user - Update tests to use new query pattern (.scalars().all()) - Fix flake8 issues in test file Signed-off-by: Jonathan Springer <jonpspri@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…eir testing in Locust (IBM#1954) * Multi gateway in benchmark Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * register benchmark at start in locust Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * refactor benchmark register Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Add benchmarks servers to locustfile Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Add benchmark to generate_docker_compose Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Use ports from container for benchmark Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Add benchmark, time and test servers in generate_docker_compose Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Update fast-test-server port Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Fix fast-test-server port Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Use port 8080 for nginx in performance tests Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Add benchmark and fast_test_server to docker-compose Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Use staging profile as default for benchmark consistency Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Reduce limits for benchmark_server Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Fix tests Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * fix: correct health check port and style issues - Fix fast-time-server health check port (8888 → 8002) in generate_docker_compose.py - Add missing blank line before function definition in locustfile.py (PEP8 E302) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * refactor: simplify benchmark server setup with dedicated profile - Add dedicated "benchmark" profile with auto-registration at startup - Add make benchmark-up/down/status/logs targets - Reduce to 10 servers (ports 9000-9009) with auto-registration - Remove complex auto-detection and dynamic registration from locustfile - Benchmark servers registered via register_benchmark service at compose startup - Remove benchmark from monitoring profile for cleaner separation Usage: make benchmark-up Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: remove healthcheck from benchmark_server (scratch image has no shell) - Remove CMD-SHELL healthcheck (scratch image lacks /bin/sh) - Change dependency to service_started with 5s sleep - Servers verified healthy via HTTP endpoint Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * refactor: locust no longer registers gateways, add benchmark-clean and env vars - Remove all gateway registration from locustfile.py (handled by compose services) - Add benchmark-clean target to remove volumes - Add BENCHMARK_SERVER_COUNT and BENCHMARK_START_PORT environment variables - Support up to 100 benchmark servers (9000-9099 port range) - Increased resource limits for larger deployments Usage: make benchmark-up # Default: 10 servers BENCHMARK_SERVER_COUNT=50 make benchmark-up # Custom: 50 servers make benchmark-clean # Stop and remove volumes Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: add Benchmark Server Stack documentation Add comprehensive documentation for the benchmark server stack including: - Quick start guide - Make targets reference - Environment variable configuration - Architecture overview - Resource limits - Standalone usage instructions - Performance characteristics Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* Add more partial endpoints Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Add ids and search APIs for gw, server, a2a Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Clean up APIs Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Add templates for gw, server and agent Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Update gw, agent to use htmx Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Fix bugs in gw loading Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Fix bugs in partial templates Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Fix agent fields in partial Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Fix flake8 issues Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Fix eslint issues Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Fix counts in server list Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Fix update and deletion of a2a tools Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * Fix test Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * pylint fixes Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * refactor: remove commented code and debug statements - Remove commented-out admin_list_gateway_ids endpoint (replaced by new implementation) - Remove console.log debug statements from admin.js tab switching logic - Add standard encoding header to migration file Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: correct route paths, element IDs, and restore null gateway option - Fix A2A route mismatch: /admin/agents/partial → /admin/a2a/partial - Fix A2A element ID mismatch: #a2a-agents-table-body → #agents-table-body - Fix gateway selector class: .gateway-item → .tool-item for JS compatibility - Restore REST/A2A null gateway checkbox in gateway selector (page 1 only) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: add eager loading and transaction cleanup for partial endpoints - Add selectinload for server relationships (tools, resources, prompts, a2a_agents) to prevent N+1 queries in convert_server_to_read - Add db.commit() before template rendering in servers, gateways, and a2a partial endpoints to release transactions and avoid idle-in-transaction timeouts under load Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: preserve tool visibility on A2A agent updates and handle sync failures - Preserve existing tool visibility and team_id when updating tool from A2A agent to prevent unintentionally making private/team tools public (ToolUpdate defaults to "public" which was overriding existing settings) - Wrap tool sync in try/except to handle failures gracefully - the agent update is the primary operation and should succeed even if tool sync fails Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: handle datetime string formatting and A2A tool auth sync - Fix gateway.last_seen template to handle both datetime objects and ISO strings (after jsonable_encoder serialization) - Fix A2A tool auth sync to use auth=AuthenticationValues(...) instead of auth_type/auth_value which are not fields on ToolUpdate schema Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Rebase Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…ool creation (IBM#1967) * fix(IBM#1956): propagate team_id, owner_email and visibility to a2a tool creation When creating tools from A2A agents, the team_id, owner_email, and visibility fields from the agent were not being propagated to the tool registration. This caused A2A agent tools to appear without team association in the admin UI, and made them inaccessible via team-scoped API queries. This change ensures that tools created from A2A agents inherit the same team_id, owner_email, and visibility as the parent agent, consistent with how gateways propagate these fields to their federated tools. Closes IBM#1956 Signed-off-by: Jerry Liu <liutao@us.ibm.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: add unit test for A2A tool scoping field propagation Add test_create_tool_from_a2a_agent_passes_scope_fields to verify that team_id, owner_email, and visibility are correctly forwarded from the A2A agent to register_tool when creating tools from agents. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Jerry Liu <liutao@us.ibm.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…indows (IBM#1941) This change introduces a batched aggregation path for custom log search windows to avoid per-window recomputation. Instead of scanning log data separately for each window, the implementation computes metrics for all requested windows in a single pass over the full time range. For PostgreSQL, a single SQL query using generate_series and ordered-set aggregates computes per-window percentiles efficiently. For other databases, logs are fetched once per component/operation and bucketed into windows in Python. Key changes: - Add aggregate_all_components_batch() method to LogAggregator service - Modify _aggregate_custom_windows() to collect all windows first and delegate to batch method - Implement proper fallback to per-window aggregation if batch fails - Include db.rollback() before fallback to handle PostgreSQL failed state - Filter duration_ms IS NOT NULL in batch SQL JOIN for consistent error_count semantics with per-window path - Filter component/operation_type IS NOT NULL and non-empty in both PostgreSQL and Python paths for consistency with per-window behavior - Add warning log for large ranges in non-PostgreSQL fallback path - Add warning when sparse window_starts detected (batch generates full range) - Limit window list to 10000 entries, keeping most recent windows - Preserve existing aggregation semantics and filters - Add comprehensive tests for batch aggregation parity Performance impact: - Before: N windows × M component/operation pairs = N×M database queries - After PostgreSQL: 1 SQL query for all windows and pairs - After SQLite: M queries regardless of window count Closes IBM#1826 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* Optimize doctest Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Optimize doctest Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * change parallel text executiong Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* Optimize tools Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Optimize MCP UI Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Jonathan Springer <jspringer@us.ibm.com>
…BM#1981) * Add UI Overview page Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Add UI Overview page Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Add UI Overview page Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
perf(ci): remove redundant steps from full-build-pipeline
…IBM#1968) * percentiles Signed-off-by: rakdutta <rakhibiswas@yahoo.com> * Revert "percentiles" This reverts commit d3a7933. * percentiles Signed-off-by: rakdutta <rakhibiswas@yahoo.com> * testing Signed-off-by: rakdutta <rakhibiswas@yahoo.com> * USE_POSTGRESDB_PERCENTILES Signed-off-by: rakdutta <rakhibiswas@yahoo.com> * test Signed-off-by: rakdutta <rakhibiswas@yahoo.com> * observability_service Signed-off-by: rakdutta <rakhibiswas@yahoo.com> * docstring Signed-off-by: rakdutta <rakhibiswas@yahoo.com> * val Signed-off-by: rakdutta <rakhibiswas@yahoo.com> * flake8 Signed-off-by: rakdutta <rakhibiswas@yahoo.com> * flake Signed-off-by: rakdutta <rakhibiswas@yahoo.com> * fix: move json_key validation for defense-in-depth Move the json_key validation outside the PostgreSQL-specific block to protect both PostgreSQL and SQLite code paths. While current callers always use safe hardcoded values, this provides defense-in-depth. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: address review findings for percentile performance PR - Apply limit to Python fallback path to prevent unbounded response sizes - Exclude tests/performance from default pytest runs (explicit opt-in) - Skip test_use_postgresdb_percentiles_toggle which cannot properly test Python fallback on PostgreSQL due to global backend variable usage Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: address additional review findings - Make extract_json_field() dialect-aware with optional dialect_name param - Pass session dialect to extract_json_field in Python fallback path - Add commit=False param to start_span/end_span to avoid double commits - Update tool_service.py to use commit=False with fresh_db_session() - Re-enable test_use_postgresdb_percentiles_toggle now that dialect is passed Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: add docstring to nested percentile function Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: use timezone-aware cutoff for PostgreSQL fallback path - Use cutoff_time (aware) for PostgreSQL to avoid timezone drift - Use cutoff_time_naive for SQLite which doesn't support timezones - Soften timing assertions in benchmark tests to warnings (non-flaky) - Expand percentile function docstring for flake8 compliance Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: rakdutta <rakhibiswas@yahoo.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* Avoid full scan in ResourceCache cleanup loop Signed-off-by: NAYANAR <nayana.r7813@gmail.com> * fixing ruff Signed-off-by: NAYANAR <nayana.r7813@gmail.com> * Fixed — tb renamed to _tb Signed-off-by: NAYANAR <nayana.r7813@gmail.com> * fxing pylint Signed-off-by: NAYANAR <nayana.r7813@gmail.com> * fix: apply black formatting Add blank line before nested function per black standards. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: address thread-safety issues in ResourceCache - Remove DualLock async context manager to avoid RLock thread mismatch - Add locking to get() method to prevent race conditions with cleanup - Add locking to __len__() for consistency - Replace async cleanup test with sync _cleanup_once() tests - Add tests for heap-based cleanup edge cases (updated/deleted entries) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: add docstring to _run_once inner function Ensures 100% docstring coverage for resource_cache.py. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: add docstring to _align_to_window_local Ensures 100% docstring coverage. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: bound heap memory growth with compaction - Add _compact_heap() to rebuild heap with only valid entries - Trigger compaction when heap size exceeds 2x max_size - Fix docstring: _lock is threading.Lock, not async lock - Add tests for heap compaction behavior Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * perf: minimize lock contention during heap compaction Move O(n) heapify work outside the lock to avoid blocking get/set: 1. Snapshot cache entries under lock (fast) 2. Build heap outside lock (O(n) but doesn't block) 3. Swap back under lock, preserving entries added during compaction Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: NAYANAR <nayana.r7813@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ing (IBM#1972) * Migrate DCR and OAUTH from aiohttp to httpx Signed-off-by: Mohan Lakshmaiah <mohalaks@in.ibm.com> * Fixing Flake8 issues Signed-off-by: Mohan Lakshmaiah <mohalaks@in.ibm.com> * Fix httpx migration issues: redirects, timeouts, and test mocking - Add follow_redirects=True to all httpx.AsyncClient instances to match aiohttp's default behavior (dcr_service.py, oauth_manager.py, cli_export_import.py) - Add explicit 300s timeout to CLI export/import operations to prevent premature timeouts for slow operations - Update integration tests to mock httpx instead of aiohttp Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mohan Lakshmaiah <mohalaks@in.ibm.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mohan Lakshmaiah <mohalaks@in.ibm.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* fix for redis missing module Signed-off-by: Shriti Priya <shritip@ibm.com> * Apply redis availability check consistency fix with warning logging Apply the same try/except wrapper for importlib.util.find_spec in both event_service.py and version.py to handle ModuleNotFoundError gracefully when the redis module is not installed. Additionally, log a warning when the exception is caught to surface the issue rather than silently falling back. This helps operators identify when redis is present but broken vs simply not installed. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Shriti Priya <shritip@ibm.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* Add an error message to form field validation * fix: correct data attribute selectors and add null safety in form validation - Fix data attribute mismatch: JS selectors now match HTML attributes (data-error-message-for instead of data-error-for) - Fix typo in admin.html: data-error-messsage-for → data-error-message-for - Add null safety for inputLabel?.innerText in URL validation - Clear setCustomValidity on valid URL to prevent blocking form submission Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: address validation edge cases and security concerns - Skip URL validation for empty optional fields (e.g., OAuth URLs) - Re-add setCustomValidity for name fields to maintain client-side security - Fix prompt error message selector to use data-error-message-for="name" - Remove dead prompt-specific validation code (now handled by general validation) - Update modal reset to clear inline error messages and red border styling Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…sync (IBM#1950) Add configurable feature to automatically sync gateway tools, resources, and prompts with the database during health checks. When enabled: - Fetches latest tools/resources/prompts from MCP servers - Compares against database state - Performs add/update/delete operations only when changes detected - Tracks updates per-type (tools/resources/prompts) for precise cache invalidation - Only deletes MCP-discovered items (health_check, oauth, federation, update, MCP) - Preserves user-created entries (api, ui, None/legacy) during cleanup - Allows empty server responses to clear stale items (except for auth_code OAuth) - Passes pre-authenticated headers from health check to avoid duplicate OAuth token fetch - Invalidates relevant caches per-gateway for efficiency - Uses fresh_db_session() to avoid holding DB connections during HTTP calls Configuration: - AUTO_REFRESH_SERVERS=false (disabled by default) - Reduces load on MCP servers when disabled Closes IBM#1910, IBM#172 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Clarified note on arm64 support for production and MacOS. Signed-off-by: Gabriel Costa <gabrielcg@proton.me>
* perf: eager load EmailTeam relationship in Server model to eliminate N+1 queries (IBM#1957) Add email_team relationship and team property to Server model, replacing the _get_team_name helper method. Use joinedload for list operations to fetch team names in a single query instead of per-server lookups. Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(IBM#1962): resolve N+1 query fixes breaking tests - Fix convert_server_to_read to explicitly build server dict from attributes instead of using __dict__.copy(), which can return empty dict with SQLAlchemy's scalars().all() eager loading pattern - Add missing joinedload(DbServer.email_team) to get_server and update_server methods for consistent team name loading - Update test mocks to accept new options parameter for db.get() calls - Update mock_server fixture with explicit None values for optional tracking fields to prevent MagicMock auto-creation Closes IBM#1962 Signed-off-by: Jonathan Springer <jps@s390x.com> * perf(IBM#1962): add eager loading to toggle_server_status Add selectinload/joinedload options to db.get() call in toggle_server_status to prevent N+1 queries when accessing server relationships (tools, resources, prompts, a2a_agents, email_team). This completes the N+1 fix for issue IBM#1962 which listed toggle_server_status as an affected function with medium impact. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(IBM#1962): fix team property assignment and export N+1 query - admin.py: Remove manual team assignment loop that would fail since Server.team is now a read-only @Property. Add joinedload for email_team relationship instead. - export_service.py: Add selectinload for tools relationship in _export_selected_servers to prevent N+1 queries when accessing db_server.tools in a loop. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* test: enhance Playwright UI testing Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: improve Playwright recordings Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: harden Playwright UI checks Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: expand Playwright UI coverage Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
* docs: refresh documentation formatting and links Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: remove unused snakefood diagram Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: align api auth and readiness examples Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Docs update - diagram review Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
* test(loadtest): expand Locust API coverage from 45% to 70% Add 11 new user classes to improve REST API load test coverage: Batch 1 - Core CRUD: - TeamsCRUDUser: Teams API operations - TokenCatalogCRUDUser: JWT token management - RBACCRUDUser: Role/permission CRUD - CancellationAPIUser: Request cancellation Batch 3 - Extended Operations: - RootsExtendedUser: Root CRUD operations - TagsExtendedUser: Tag-based entity discovery - LogSearchExtendedUser: Log search and trace - MetricsMaintenanceUser: Metrics cleanup/rollup - AuthExtendedUser: Auth login and user info - EntityToggleUser: Toggle operations for all entities - EntityUpdateUser: PUT/Update operations Also adds `make load-test-cli` target for headless testing with identical configuration to `make load-test-ui`. Note: LLM-related classes (LLMConfigCRUDUser, LLMChatUser, LLMProxyUser) and ProtocolExtendedUser were implemented but removed as they require external LLM provider configuration to function properly. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(loadtest): resolve test failures in RPC and Roots endpoints Fix three categories of failures: 1. /rpc tools/call DNS errors (560+ failures): - Add VIRTUAL_TOOL_PREFIXES to exclude test-api-tool-* and loadtest-tool-* - These virtual tools have no backing MCP server and fail on invocation 2. /roots/changes invalid JSON (157+ failures): - Remove this test - endpoint returns SSE stream, not JSON - Replace with simple /roots list endpoint 3. /roots/[uri] [delete] 500 errors (97+ failures): - Use catch_response to properly handle delete responses - Accept 200, 204, 404, 500 as valid responses (server-side issues) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
* Increase playwright coverage Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(tests): improve playwright tool modal tests reliability - Use admin_page fixture consistently for authenticated access - Add explicit waits for modal visibility with :not(.hidden) selector - Skip tests properly when no tools are available instead of silent pass - Increase timeout to 10s for modal operations Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(tests): improve playwright test reliability and idempotency - Add _wait_for_codemirror() to wait for CodeMirror editor initialization before interacting with promptArgsEditor - Remove redundant navigation in test_admin_panel_loads since admin_page fixture already handles authentication and navigation - Add cleanup to all entity create tests (prompts, resources, servers, tools) to delete created entities after test completion - Fix _prepare_tools_table() to use state="attached" instead of requiring visibility, preventing timeouts on empty tables - Apply black/isort formatting fixes Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(tests): improve CodeMirror wait reliability in prompts test - Wait for CodeMirror library to load before checking editor instance - Increase timeout from 10s to 30s for slower CI environments - Add null check to editor wait condition Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
Signed-off-by: mintzo20 <adirmintz@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
* test: improve cache coverage Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: improve coverage for cli and runtime paths Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: fix toolops permission stubs Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: expand coverage for tool helpers and admin servers Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: extend coverage for low-coverage services Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: extend coverage for services Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: expand coverage for grpc oauth metrics Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: expand unit coverage for admin and services Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: expand observability and oauth coverage Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Fix flaky test Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * 80% threshold Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Docs update for testing Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: expand coverage for transports, plugins, wrapper Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Fix tests Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Fix tests Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Fix tests Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Test improvements Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Increase coverage Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: expand coverage for observability and services * test: expand bulk registration coverage Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Increase coverage Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Increase coverage Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
* chore: unignore documentation files in .gitignore * chore: unignore FEATURES.md documentation files * docs: update oauth design and remove empty blog index * docs: cleanup placeholders, update statuses, and fix navigation * typo Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Documentation review & update Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
…BM#2549) Replace long-lived database sessions in RBAC middleware with fresh_db_session() context manager to prevent session accumulation under high concurrent load. Changes: - Remove db parameter from get_current_user_with_permissions() - Use fresh_db_session() context manager for short-lived DB access - Keep "db": None in user context for backward compatibility - Add deprecation warnings to get_db() and get_permission_service() - Update all permission decorators to use fresh_db_session() fallback - Update PermissionChecker to use fresh_db_session() pattern - Simplify db.py by reusing get_db() generator for fresh_db_session Security fixes: - Use named kwargs (user, _user, current_user, current_user_ctx) for user context extraction instead of scanning all dicts for "email" to prevent request body injection attacks Performance fixes: - PermissionChecker.has_any_permission now uses single session for all permission checks instead of opening N sessions This prevents idle-in-transaction bottlenecks where sessions were held for entire request duration instead of milliseconds. Closes IBM#2340 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
* feat: unified PDP plugin for issue IBM#2223 Adds a single plugin entry point that orchestrates access-control decisions across multiple policy engines (Native RBAC, MAC, OPA, Cedar). - plugins/unified_pdp/unified_pdp.py — Plugin class, hooks into tool_pre_invoke and resource_pre_fetch - plugins/unified_pdp/pdp.py — PolicyDecisionPoint orchestrator - plugins/unified_pdp/pdp_models.py — Pydantic models (Subject, Resource, Context, AccessDecision, config types) - plugins/unified_pdp/adapter.py — Abstract engine adapter base class - plugins/unified_pdp/cache.py — TTL-aware decision cache - plugins/unified_pdp/engines/ — Four engine adapters: native_engine, mac_engine, opa_engine, cedar_engine - plugins/unified_pdp/default_rules.json — Starter RBAC ruleset - tests/unit/plugins/test_unified_pdp.py — 46 unit tests - plugins/config.yaml — Plugin registration (mode: disabled) - MANIFEST.in — Added recursive-include plugins *.json Combination modes: all_must_allow | any_allow | first_match Native RBAC and MAC work out of the box. OPA and Cedar require their respective sidecars (see README). Closes IBM#2223 Signed-off-by: yiannis2804 <yiannis2804@gmail.com> * test: add plugin class unit tests, coverage 86% 13 tests covering UnifiedPDPPlugin hook methods (tool_pre_invoke, resource_pre_fetch), subject extraction (dict/string/None user), action string formatting, resource type mapping, and _build_pdp. unified_pdp.py now at 100% coverage. Remaining gaps are in OPA and Cedar engine adapters which require external sidecars to test. Signed-off-by: yiannis2804 <yiannis2804@gmail.com> * docs: add detailed README for unified PDP plugin Signed-off-by: yiannis2804 <yiannis2804@gmail.com> * fix(unified-pdp): fix bugs and improve tests - Fix undefined variable eng_type in pdp.py:get_effective_permissions() - Add shutdown() lifecycle method to UnifiedPDPPlugin to properly close HTTP clients for OPA/Cedar engines - Convert tests from respx to pytest-httpx (project standard) - Add test for shutdown() method Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * chore(unified-pdp): fix linting issues - Remove unused import List from mac_engine.py - Remove unused variable first_deny from pdp.py Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(unified-pdp): address review findings from additional security review - Cache key now includes user_agent and context.extra to prevent incorrect cached decisions when policies depend on these fields (MAC operation override, OPA/Cedar context-based rules) - Plugin now extracts IP and user_agent from HTTP headers and passes to PDP context for policy evaluation - Plugin passes tool args to context.extra and resource metadata to resource.annotations for fine-grained policy checks - Exception handling in _evaluate_parallel/_evaluate_sequential now catches all exceptions (not just TimeoutError/PolicyEvaluationError) to prevent crashing the whole request on unexpected errors - Native RBAC docstring corrected: only JSON files are supported (not YAML) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(unified-pdp): extract classification_level for MAC engine Extract classification_level from tool args and resource metadata so MAC engine can make proper Bell-LaPadula decisions instead of always denying due to missing classification. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: add docstrings for 100% interrogate coverage Add missing docstrings to all public functions and methods in the unified_pdp plugin to satisfy the project's 100% docstring coverage requirement enforced by interrogate. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: add comprehensive Google-style docstrings to unified_pdp Add complete Args, Returns, Raises, and Attributes documentation to all public functions and methods in the unified_pdp plugin, matching the project's docstring style with full parameter descriptions. Files updated: - adapter.py: PolicyEvaluationError, PolicyEngineAdapter methods - cache.py: _build_cache_key, _CacheEntry, DecisionCache methods - pdp.py: PolicyDecisionPoint and all evaluation/combination methods - engines/cedar_engine.py: CedarEngineAdapter and all methods - engines/mac_engine.py: MACEngineAdapter and all methods - engines/native_engine.py: NativeRBACAdapter and all methods - engines/opa_engine.py: OPAEngineAdapter and all methods - unified_pdp.py: shutdown lifecycle method Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: add __init__ docstring to PolicyEvaluationError Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: yiannis2804 <yiannis2804@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
* feat(api): standardize gateway response format - Set *_unmasked fields to null in GatewayRead.masked() - Apply masking consistently across all gateway return paths - Mask credentials on cache reads - Update admin UI to indicate stored secrets are write-only - Update tests to verify masking behavior Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * delete artifact sbom Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * feat(gateway): add configurable URL validation for gateway endpoints Add comprehensive URL validation with configurable network access controls for gateway and tool URL endpoints. This allows operators to control which network ranges are accessible based on their deployment environment. New configuration options: - SSRF_PROTECTION_ENABLED: Master switch for URL validation (default: true) - SSRF_ALLOW_LOCALHOST: Allow localhost/loopback (default: true for dev) - SSRF_ALLOW_PRIVATE_NETWORKS: Allow RFC 1918 ranges (default: true) - SSRF_DNS_FAIL_CLOSED: Reject unresolvable hostnames (default: false) - SSRF_BLOCKED_NETWORKS: CIDR ranges to always block - SSRF_BLOCKED_HOSTS: Hostnames to always block Features: - Validates all resolved IP addresses (A and AAAA records) - Normalizes hostnames (case-insensitive, trailing dot handling) - Blocks cloud metadata endpoints by default (169.254.169.254, etc.) - Dev-friendly defaults with strict mode available for production - Full documentation and Helm chart support Also includes minor admin UI formatting improvements. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * feat(auth): add token-scoped filtering for list endpoints and gateway forwarding - Add token_teams parameter to list_servers and list_gateways endpoints for proper scoping based on JWT token team claims - Update server_service.list_servers() and gateway_service.list_gateways() to filter results by token scope (public-only, team-scoped, or unrestricted) - Skip caching for token-scoped queries to prevent cross-user data leakage - Update gateway forwarding (_forward_request_to_all) to respect token team scope - Fix public-only token handling in create endpoints (tools, resources, prompts, servers, gateways, A2A agents) to reject team/private visibility - Preserve None vs [] distinction in SSE/WebSocket for proper admin bypass - Update get_team_from_token to distinguish missing teams (legacy fallback) from explicit empty teams (public-only access) - Add request.state.token_teams storage in all auth paths for downstream access Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * feat(auth): add normalize_token_teams for consistent token scoping Introduces a centralized `normalize_token_teams()` function in auth.py that provides consistent token team normalization across all code paths: - Missing teams key → empty list (public-only access) - Explicit null teams + admin flag → None (admin bypass) - Explicit null teams without admin → empty list (public-only) - Empty teams array → empty list (public-only) - Team list → normalized string IDs (team-scoped) Additional changes: - Update _get_token_teams_from_request() to use normalized teams - Fix caching in server/gateway services to only cache public-only queries - Fix server creation visibility parameter precedence - Update token_scoping middleware to use normalize_token_teams() - Add comprehensive unit tests for token normalization behavior Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * feat(websocket): forward auth credentials to /rpc endpoint The WebSocket /ws endpoint now propagates authentication credentials when making internal requests to /rpc: - Forward JWT token as Authorization header when present - Forward proxy user header when trust_proxy_auth is enabled - Enables WebSocket transport to work with AUTH_REQUIRED=true Also adds unit tests to verify auth credential forwarding behavior. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * feat(rbac): add granular permission checks to all admin routes - Add @require_permission decorators to all 177 admin routes with allow_admin_bypass=False to enforce explicit permission checks - Add allow_admin_bypass parameter to require_permission and require_any_permission decorators for configurable admin bypass - Add has_admin_permission() method to PermissionService for checking admin-level access (is_admin, *, or admin.* permissions) - Update AdminAuthMiddleware to use has_admin_permission() for coarse-grained admin UI access control - Create shared test fixtures in tests/unit/mcpgateway/conftest.py for mocking PermissionService across unit tests - Update test files to use proper user context dict format Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs(rbac): comprehensive update to authentication and RBAC documentation Update documentation to accurately reflect the two-layer security model (Token Scoping + RBAC) and correct token scoping behavior. rbac.md: - Rewrite overview with two-layer security model explanation - Fix token scoping matrix (missing teams key = PUBLIC-ONLY, not UNRESTRICTED) - Add admin bypass requirements warning (requires BOTH teams:null AND is_admin:true) - Add public-only token limitations (cannot access private resources even if owned) - Add Permission System section with categories and fallback permissions - Add Configuration Safety section (AUTH_REQUIRED, TRUST_PROXY_AUTH warnings) - Update enforcement points matrix with Token Scoping and RBAC columns multitenancy.md: - Add Token Scoping Model section with secure-first defaults - Add Two-Layer Security Model section with request flow diagram - Add Enforcement Points Matrix - Add Token Scoping Invariants - Document multi-team token behavior (first team used for request.state.team_id) oauth-design.md & oauth-authorization-code-ui-design.md: - Add scope clarification notes (gateway OAuth delegation vs user auth) - Add Token Verification section - Add cross-references to RBAC and multitenancy docs AGENTS.md: - Add Authentication & RBAC Overview section with quick reference llms/mcpgateway.md & llms/api.md: - Add token scoping quick reference and examples - Add links to full documentation Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(rbac): add explicit db dependency to RBAC-protected routes Address load test findings from RCA #1 and IBM#2: - Add `db: Session = Depends(get_db)` to routes in email_auth.py, llm_config_router.py, and teams.py that use @require_permission - Fix test files to pass mock_db parameter after signature changes - Add shm_size: 256m to PostgreSQL in docker-compose.yml - Remove non-serializable content from resource update events - Disable CircuitBreaker plugin for consistent load testing These changes fix the NoneType errors (~33,700) observed under 4000 concurrent users where current_user_ctx["db"] was always None. Remaining critical issue: Transaction leak in streamablehttp_transport.py causing idle-in-transaction connections (see todo/rca2.md for details). Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(db): resolve transaction leak and connection pool exhaustion Critical fixes for load test failures at 4000 concurrent users: Issue #1 - Transaction leak in streamablehttp_transport.py (CRITICAL): - Add explicit asyncio.CancelledError handling in get_db() context manager - When MCP handlers are cancelled (client disconnect, timeout), the finally block may not execute properly, leaving transactions "idle in transaction" - Now explicitly rollback and close before re-raising CancelledError - Add rollback in direct SessionLocal usage at line ~1425 Issue IBM#2 - Missing db parameter in admin routes (HIGH): - Add `db: Session = Depends(get_db)` to 73 remaining admin routes - Routes with @require_permission but no db param caused decorator to create fresh session via fresh_db_session() for EVERY permission check - This doubled connection usage for affected routes under load Issue IBM#3 - Slow recovery from transaction leaks (MEDIUM): - Reduce IDLE_TRANSACTION_TIMEOUT from 300s to 30s in docker-compose.yml - Reduce CLIENT_IDLE_TIMEOUT from 300s to 60s - Leaked transactions now killed faster, preventing pool exhaustion Root cause confirmed: list_resources() MCP handler was primary source, with 155+ connections stuck on `SELECT resources.*` for up to 273 seconds. See todo/rca2.md for full analysis including live test data showing connection leak progression and 606+ idle transaction timeout errors. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(teams): use consistent user context format across all endpoints - Update request_to_join_team and leave_team to use dict-based user context - Fix teams router to use get_current_user_with_permissions consistently - Move /discover route before /{team_id} to prevent route shadowing - Update test fixtures to use mock_user_context dict format - Add transaction commits in resource_service to prevent connection leaks - Add missing docstring parameters for flake8 compliance Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(db): add explicit db.commit/close to prevent transaction leaks Add explicit db.commit(); db.close() calls to 100+ endpoints across all routers to prevent PostgreSQL connection leaks under high load. Problem: Under high concurrency, FastAPI's Depends(get_db) cleanup runs after response serialization, causing transactions to remain in 'idle in transaction' state for 20-30+ seconds, exhausting the connection pool. Solution: Explicitly commit and close database sessions immediately after database operations complete, before response serialization. Routers fixed: - tokens.py: 10 endpoints (create, list, get, update, revoke, usage, admin, team tokens) - llm_config_router.py: 14 endpoints (provider/model CRUD, health, gateway models) - sso.py: 5 endpoints (SSO provider CRUD) - email_auth.py: 3 endpoints (user create/update/delete) - oauth_router.py: 1 endpoint (delete_registered_client) - teams.py: 18 endpoints (team CRUD, members, invitations, join requests) - rbac.py: 12 endpoints (roles, user roles, permissions) - main.py: 14 CUD + 3 list + 7 RPC handlers Also fixes: - admin.py: Rename 21 unused db params to _db (pylint W0613) - test_teams*.py: Add mock_db fixture to tests calling router functions directly - Add llms/audit-db-transaction-management.md for future audits Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * ci(coverage): lower doctest coverage threshold to 30% Reduce the required doctest coverage from 34% to 30% to accommodate current coverage levels (32.17%). Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(rpc): fix list_gateways tuple unpacking and add token scoping The RPC list_gateways handler had two bugs: 1. Did not unpack the tuple (gateways, next_cursor) returned by gateway_service.list_gateways(), causing 'list' object has no attribute 'model_dump' error 2. Was missing token scoping via _get_rpc_filter_context(), which was the original R-02 security fix Also fixed all callers of list_gateways that expected a list but now receive a tuple: - mcpgateway/admin.py: get_gateways_section() - mcpgateway/services/import_service.py: 3 call sites Updated test mocks to return (list, None) tuples instead of lists. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(teams): build response before db.close() to avoid lazy-load errors The teams router was calling db.commit(); db.close() before building the TeamResponse, but TeamResponse includes team.get_member_count() which needs an active session. When the session is closed, the fallback in get_member_count() tries to access self.members (lazy-loaded), causing "Parent instance is not bound to a Session" errors. Fixed by building TeamResponse BEFORE calling db.close() in: - create_team - get_team - update_team Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(teams): fix update_team expecting team object but getting bool The service's update_team() returns bool, but the router was treating the return value as a team object and trying to access .id, .name, etc. Fixed by: 1. Checking the boolean return value for success 2. Fetching the team again after successful update to build the response Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(teams): fix update_member_role return type mismatch The service's update_member_role() returns bool, but the router treated it as a member object. Fixed by: 1. Checking the boolean success 2. Added get_member() method to TeamManagementService 3. Fetching the updated member to build the response Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Fix teams return Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
Removed unreleased security changes regarding gateway credentials from CHANGELOG. Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
Copilot stopped work on behalf of
hughhennelly due to an error
February 12, 2026 16:00
hughhennelly
added a commit
that referenced
this pull request
Feb 12, 2026
1. Fix broken imports (Issue #1): - Change from ..database to ..db - Fix unified_pdp imports to use plugins.unified_pdp - Update in routes, services, schemas, and tests 2. Register sandbox router in main.py (Issue IBM#2): - Add import and app.include_router call 3. Fix XSS vulnerability (Issue IBM#3): - Replace f-string HTML with Jinja2 template - Create sandbox_simulate_results.html template - Add Request parameter for template access 4. Add authentication (Issue IBM#4): - Add Depends(get_current_user) to simulate endpoint 5. Remove scratch files (Issue IBM#5): - Delete sandbox_header.txt and sandbox_new_header.txt 6. Resolve schemas conflict (Issue IBM#6): - Merge schemas/sandbox.py into schemas.py - Remove conflicting schemas/ directory - Update imports in routes and services All changes tested and ready for review. Related to IBM#2226 Signed-off-by: hughhennelly <hughhennelly06@gmail.com>
jonpspri
pushed a commit
that referenced
this pull request
May 8, 2026
…or handling (IBM#3159) * fix(api): change apijsonpath from Body to Query parameter for GET endpoints - Move apijsonpath parameter from request body to query string for list_tools and get_tool endpoints - Add _parse_apijsonpath() helper to handle both JSON string and JsonPathModifier inputs - Return ORJSONResponse to bypass FastAPI response_model validation when using JSONPath modifiers - Add comprehensive error handling with proper HTTPException status codes - Add test coverage for Query parameter handling and error cases This change resolves issue IBM#2848 by making GET requests RESTful-compliant (no request body). Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com> * fix(security): improve error handling and reduce information disclosure in apijsonpath parsing Address code review feedback with three security and performance improvements: 1. Specific Exception Handling (Issue #1) - Catch specific exceptions (json.JSONDecodeError, ValueError, ValidationError) - Separate user errors (400) from system errors (500) - Log unexpected errors with full context for debugging 2. Sanitize Exception Messages (Issue IBM#2) - Return generic error messages in production (non-DEBUG mode) - Prevent information disclosure about internal types and stack traces - Detailed messages only shown when LOG_LEVEL=DEBUG for troubleshooting 3. Optimize Debug Logging (Issue IBM#3) - Check if debug logging is enabled before building log message - Avoid string formatting overhead in production - Improves performance in high-traffic scenarios Updated test assertions to verify generic error messages (security improvement). All tests pass (18 jsonpath tests verified). Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com> * doc Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com> * except Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com> * test coverage Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com> * pagination Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com> * review Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com> * head remove Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com> * flake Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com> * fix: validate apijsonpath before DB query and add missing test coverage - Move _parse_apijsonpath() call before the database query in list_tools to fail fast on invalid input without wasting a DB round-trip - Add tests for the ValidationError path (extra fields rejected by extra="forbid" on JsonPathModifier) - Run black/isort on test file Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: correct nextCursor wire key, sanitize JSONPath in debug log, add e2e tests - Fix pagination+JSONPath response to use "nextCursor" (camelCase) matching the CursorPaginatedToolsResponse alias contract. The hand-built dict bypassed Pydantic aliasing via ORJSONResponse, emitting "next_cursor" which broke any client relying on the documented field name. Confirmed regression against live server at localhost:8080. - Sanitize user-supplied JSONPath expression in debug log via SecurityValidator.sanitize_log_message() to prevent CWE-117 log forging (newlines, ANSI escapes in crafted jsonpath values). - Add end-to-end HTTP tests (TestApijsonpathHTTP) exercising the apijsonpath query parameter through FastAPI TestClient, covering the full HTTP binding path that __wrapped__() unit tests skip. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Rakhi Dutta <rakhibiswas@yahoo.com>
jonpspri
pushed a commit
that referenced
this pull request
May 8, 2026
* fix(api): MCP Tool Validation Fix Signed-off-by: Gabriel Costa <gabrielcg@proton.me> * test: improve differential test coverage for validation_strict gating - Parameterize non-strict test to verify ALL 7 forbidden patterns pass through (was only testing "> ") - Add test for break behavior: single warning logged even with multiple matching patterns in one description - Move inline `import logging` to top-level imports - Covers issue IBM#3711 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * linting Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(docs): replace misleading <placeholder> example in validation docs The forbidden-pattern list matches "< " (with trailing space), not bare "<". Strings like "<placeholder>" never hit the forbidden-pattern check; they are stripped by sanitize_display_text() as HTML-like content regardless of VALIDATION_STRICT. Replace with "< input" and "cmd | grep" which actually match the gated patterns. Addresses Codex review finding #1. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: add regression tests for empty exception str() fallback in gateway init Cover the `raw_error = str(root_cause) or type(root_cause).__name__` guard in _initialize_gateway: when an exception's str() is empty, the GatewayConnectionError message now includes the type name instead of a blank suffix. Tests both plain exceptions and ExceptionGroup unwrapping. Addresses Codex review finding IBM#2. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Gabriel Costa <gabrielcg@proton.me> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
jonpspri
pushed a commit
that referenced
this pull request
May 8, 2026
…dation (IBM#3750) * test(rate-limiter): harden rate limiter plugin — gaps #1-IBM#8 Closes IBM#3740 ## What changed ### Plugin fixes (plugins/rate_limiter/rate_limiter.py) - Config validation at __init__: _validate_config() parses all rate strings at startup — bad config raises immediately, not mid-request - Graceful degradation: both hooks wrapped in try/except; unexpected errors are logged and the request is allowed through (permissive) - prompt_pre_fetch now enforces by_tool limits using prompt_id as key - MemoryBackend: asyncio.Lock makes counter increments atomic - MemoryBackend: background TTL sweep evicts expired windows (0.5s interval) - RedisBackend: atomic INCR+EXPIRE via Lua script; shared state across all gateway instances; native TTL expiry; falls back to memory on error ### Test additions (tests/unit/.../test_rate_limiter.py) - Gap tests: 4 xfail -> pass (shared state, eviction, prompt by_tool, graceful degradation); 1 xfail remains (fixed window burst, deferred) - Edge case tests: malformed/unsupported config raises at init (not request time); runtime errors degrade gracefully via mock injection - Redis backend test uses injected FakeRedis — no live server required ### Config changes - plugins/config.yaml: RateLimiterPlugin enabled with enforce mode - tests/performance/plugins/config.yaml: RateLimiterPlugin set to permissive for inclusion in cProfile benchmark runs Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com> * chore(config): enable Redis backend for RateLimiterPlugin in plugins/config.yaml Switch the default stack config from in-memory to Redis-backed rate limiting. This ensures the 30/m per-user limit is enforced as a true shared limit across all gateway instances rather than 30/m per process. Validated via Redis MONITOR: all 3 gateway instances atomically increment the same rl:user:<id>:60 counter via the Lua INCR+EXPIRE script. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com> * test(rate-limiter): add benchmark-rate-limiter load test for multi-instance correctness Adds locustfile_rate_limiter.py and a make benchmark-rate-limiter target to demonstrate the multi-instance rate limit enforcement gap and its fix. The test sends 1 req/s (60 req/min = 2x the 30/m limit) through 3 gateway instances. With a memory backend each instance only sees ~20 req/min and never fires the limiter (~0% failures). With the Redis backend the shared counter reaches 30/min and blocks ~50% of requests — clearly showing the fix works across instances. Expected results: Memory backend: ~0% blocked (each instance sees 20 req/min < 30/m limit) Redis backend: ~50% blocked (shared counter: 60 req/min > 30/m limit) Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com> * test(rate-limiter): add hardening tests, bypass resistance, PII fix, and updated docs - Add 22 new unit tests (70 passed total, 4 xfailed): - Permissive vs enforce mode through PluginExecutor - Redis fallback: memory takeover when Redis is down, limit still enforced, no-fallback graceful degradation - Cross-tenant isolation: independent counters, no counter bleed between tenants - Header accuracy: Retry-After bounds, X-RateLimit-Reset future/consistency, Remaining decrement - Bypass resistance: None/whitespace user identity, tool name case sensitivity and whitespace (documented as xfail gaps) - PII: violation description must not contain user or tenant identifiers - Fix PII leak in violation description: remove user/tenant from description string in both prompt_pre_fetch and tool_pre_invoke — identifiers appeared in log output via permissive-mode manager warning and enforce-mode PluginViolationError message - Rewrite plugins/rate_limiter/README.md: was describing the old pre-fix implementation (in-memory only, no Redis, Redis as TODO). Now documents both backends, full config reference, response headers, examples, and accurate limitations table - Update plugin-manifest.yaml description to reflect Redis backend support Closes IBM#3740 Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com> * fix(rate-limiter): review fixes — dead code, test correctness, config validation - Remove unused _allow() module-level function (dead code — plugin uses self._rate_backend.allow() directly) - Fix test_graceful_degradation test: was patching _allow() which is never called by the plugin; now patches backend.allow() via patch.object so the try/except error path is actually exercised - Add prompt_pre_fetch graceful degradation test (was only tested for tool_pre_invoke) - Fix inconsistent by_tool lookup in tool_pre_invoke: remove unnecessary hasattr(__contains__) guard, align with prompt_pre_fetch pattern - Add backend validation to _validate_config(): typo like 'reddis' now raises ValueError at startup instead of silently falling back to memory - Add test for malformed by_tool rate string validation - Add test for invalid backend name validation - Change default config mode from enforce to permissive for safety (consistent with all other security plugins in the default config) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
jonpspri
pushed a commit
that referenced
this pull request
May 8, 2026
…fixes IBM#538 US-1) (IBM#3251) * feat(security): enforce content size limits for resources and prompts Implement configurable content size validation to prevent DoS attacks via oversized content submissions. Resources are limited to 100KB and prompt templates to 10KB by default (configurable via environment variables CONTENT_MAX_RESOURCE_SIZE and CONTENT_MAX_PROMPT_SIZE). Validation occurs at the service layer before database operations, returning 413 Payload Too Large with structured error details. Closes IBM#538 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(security): review fixes for content size limits PR - Fix ResourceCreate.validate_content to remove MAX_CONTENT_LENGTH schema-level check (now handled by service layer, consistent with ResourceUpdate) - Fix broken test assertions checking raw bytes in formatted messages (TestErrorMessageClarity was asserting "200000" in human-readable "195.3 KB" strings) - Fix broken test_update_prompt_validation_and_integrity_errors which had its body accidentally deleted by the original PR - Restore accidentally deleted assertions in test_list_resources and test_create_resource_endpoint - Add global exception handler for ContentSizeError (defense-in-depth, consistent with existing handlers for ValidationError/IntegrityError) - Add reset_content_security_service() for test isolation - Remove "Made with Bob" watermark from test file - Remove duplicate pytest import in integration tests - Fix broken README link to non-existent migration guide - Fix double-commented lines in .env.example Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: resolve flake8 lint issues in content size limits PR - Move noqa:DUO138 to the line flake8 reports (validators.py) - Add Args/Returns to content_size_exception_handler docstring (DAR101/DAR201) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(tests): update security tests for service-layer size validation Content size validation moved from Pydantic schema to service layer (returns 413 instead of 422). Update test_resource_create_content_validation and test_zip_bomb_prevention to reflect that ResourceCreate no longer rejects oversized content at schema level. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(security): address codex review findings for content size limits 1. Remove MAX_TEMPLATE_LENGTH check from SecurityValidator.validate_template so the configurable CONTENT_MAX_PROMPT_SIZE is not silently capped at 64KB by the schema-level validator (finding #1) 2. Add db.rollback() to ContentSizeError handlers in resource_service to prevent partial metadata mutations from being committed when content size validation fails after ORM fields are already updated (finding IBM#2) 3. Include actual_size and max_size in admin route 413 responses to match the structured error contract used by API routes (finding IBM#3) 4. Update security and validator tests that expected schema-level length rejection — size enforcement now happens at service layer Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * chore: fix trailing whitespace in test files Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: add content size limits to configuration docs, securing guide, and Helm values - docs/docs/manage/configuration.md: add Content Security section with env var table, scope note, and error response example - docs/docs/manage/securing.md: add section 9 (Content Size Limits) to production security checklist, renumber subsequent sections - charts/mcp-stack/values.yaml: add CONTENT_MAX_RESOURCE_SIZE and CONTENT_MAX_PROMPT_SIZE to gateway env config Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
jonpspri
pushed a commit
that referenced
this pull request
May 8, 2026
…ons (IBM#4050) * feat: implement separate observability session pattern (best-effort) Observability write operations now use independent database sessions that commit immediately on a best-effort basis, separate from main request transactions. This ensures observability data persists even when the main request fails, providing visibility into partial failures. Implementation details: - Observability write methods (start_trace, end_trace, start_span, end_span, add_event, record_metric, record_token_usage, delete_old_traces) now create their own independent sessions using SessionLocal() instead of accepting a db parameter from the caller - Context managers (trace_span, trace_tool_invocation, trace_a2a_request) create and manage a single session for the full span lifecycle - Query methods still accept db: Session parameter for RBAC/token scoping - ObservabilityMiddleware no longer creates or manages request.state.db - Each observability operation creates its own short-lived session that commits immediately in a try/except/finally block - Updated all callers across resource_service, prompt_service, tool_service, and instrumentation/sqlalchemy to use new API (removed db parameters) - Added comprehensive test coverage verifying separate session behavior - Fixed pre-existing bugs in add_event() and record_metric() where db.refresh() was called after commit failure Benefits: - Observability data persists independently of main request transaction success - Provides visibility into partial failures and error states - Follows pattern established in instrumentation/sqlalchemy.py (lines 58-87) - Reduces transaction contention between observability and business logic Trade-offs: - NOT atomic with main request transaction (intentional design choice) - Traces may show "in progress" or partial states for failed requests - Provides eventual consistency rather than strict consistency Documentation: - Added "Observability Transaction Behavior" section to AGENTS.md explaining the separate session pattern, implementation details, and security notes - Updated module docstrings with session management details Tests: 96 observability tests passing, full suite 16,673 tests passing Pylint: 10.00/10 rating maintained Closes IBM#3883 Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com> * fix: add type annotations to observability context managers Add proper return type annotations to context manager methods: - trace_span() -> Generator[str, None, None] - trace_tool_invocation() -> Generator[Tuple[Optional[str], Dict[str, Any]], None, None] - trace_a2a_request() -> Generator[Tuple[Optional[str], Dict[str, Any]], None, None] These annotations improve type safety and make the return values explicit. The trace_span context manager yields just the span_id string, while trace_tool_invocation and trace_a2a_request yield a tuple of (span_id, result_dict) for capturing operation results. Tests: 96 observability tests passing Pylint: 10.00/10 rating maintained Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com> * test: Add coverage tests for observability session close exception handling Add 11 new test functions to achieve 100% coverage of exception handling paths in observability_service.py. These tests verify that all write methods and context managers handle session close failures gracefully: - test_start_trace_session_close_failure - test_end_trace_session_close_failure - test_start_span_session_close_failure - test_start_span_with_commit_true - test_end_span_session_close_failure - test_add_event_session_close_failure - test_record_metric_session_close_failure - test_delete_old_traces_session_close_failure - test_trace_span_context_manager_session_close_failure - test_trace_tool_invocation_context_manager_session_close_failure - test_trace_a2a_request_context_manager_session_close_failure Each test mocks the session close() method to raise an exception and verifies that the operation completes successfully despite the close failure. This covers the previously uncovered lines in finally blocks where session close exceptions are caught and logged. Resolves CI/CD coverage requirement for observability_service.py. Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com> * fix: complete separate session migration for delete_old_traces and record_transport_activity - Fix runtime TypeError in cleanup_old_traces router: delete_old_traces() no longer accepts db parameter (missed call site from IBM#3883) - Remove unused db parameter from record_transport_activity() for API consistency with other write methods - Update docstring and doctest for delete_old_traces call in router - Update test calls to match new record_transport_activity signature - Add 3 tests for 100% differential coverage: error-path exception handlers in trace_tool_invocation, trace_a2a_request, and session close failure in record_token_usage Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: address review feedback for separate observability session pattern This commit addresses valid concerns raised in PR review: 1. Fix class docstring showing old API (madhu-mohan-jaishankar) - Updated ObservabilityService class docstring to show correct API - Removed db parameter from examples (old: start_trace(db, name)) - Added clarification about write vs query methods - Examples now demonstrate correct usage with proper parameters 2. Fix context manager atomicity (msureshkumar88 issue IBM#5) - Added optional obs_db parameter to add_event() method - Context managers now pass their session to add_event() in error paths - Ensures atomic commit of span updates and error events together - Prevents orphaned events when span update fails - Updated all three context managers: * trace_span() at line 644 * trace_tool_invocation() at line 757 * trace_a2a_request() at line 1112 3. Document connection pool sizing (msureshkumar88 issue IBM#7) - Added Connection Pool Sizing section to AGENTS.md - Explains 4-6 sessions per traced request pattern - Documents default pool (200+10 = 210 connections) - Supports ~35 concurrent traced requests by default - Guidance for high-traffic deployments (>50 req/sec) - Note about SQLite 50-connection cap Implementation details: - add_event() now accepts optional obs_db: Optional[Session] = None - When obs_db is provided, uses caller's session (owned = False) - When obs_db is None, creates independent session (owned = True) - Only closes session if owned is True - All existing tests pass (87 observability service tests, 23 middleware tests) Tests: All 87 observability service tests passing All 23 observability middleware tests passing 100% differential coverage maintained Addresses review feedback from: - @madhu-mohan-jaishankar (approved with comments) - @msureshkumar88 (issues IBM#5 and IBM#7 from detailed review) Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com> * fix: address remaining review concerns with safety improvements This commit addresses all remaining valid reviewer concerns: 1. Document dead code guard design choice (madhu-mohan-jaishankar issue #1) - Added Note section to _get_or_create_observability_session() docstring - Explains tuple return pattern enables future session reuse optimization - Guards remain for forward compatibility (maintainer approved pattern) 2. Move session creation inside try blocks (msureshkumar88 issue #1) - Initialize obs_db = None and owned = False before try blocks - Create session inside try to prevent theoretical resource leak - Update finally blocks to check `if owned and obs_db is not None:` - Applied to all 8 write methods: * start_trace() (line 317) * end_trace() (line 385) * trace_span() context manager (line 619) * trace_tool_invocation() context manager (line 719) * add_event() (line 836) * record_token_usage() (line 926) * trace_a2a_request() context manager (line 1087) * record_metric() (line 1306) * delete_old_traces() (line 1718) 3. Add header sanitization (msureshkumar88 issue IBM#10) - Added sanitize_header_for_storage() helper function - Removes control characters (prevents log injection) - Truncates to max_length (prevents DoS via large headers) - Applied to user_agent (max 500 chars) and http_url (max 2000 chars) - Defense-in-depth security improvement Implementation details: - sanitize_header_for_storage() removes non-printable chars except space/tab - Returns "unknown" for None/empty values - All 87 observability service tests passing - All 23 observability middleware tests passing Deferred to future PR (with justification): - Code duplication refactoring (issue IBM#9): Too complex and risky for this PR. The three context managers share patterns but have significant domain-specific differences (attributes, yield signatures, sanitization). Refactoring would require extensive testing and carries risk of subtle bugs. Current code is correct and well-tested. Recommend separate focused PR for refactoring. Tests: All 110 observability tests passing (87 service + 23 middleware) 100% differential coverage maintained Addresses remaining review feedback from: - @madhu-mohan-jaishankar (issue #1 - dead code guards) - @msureshkumar88 (issues #1, IBM#10 - resource leak, header sanitization) Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com> * fix: correct doctest examples in sanitize_header_for_storage The function removes control characters completely (doesn't replace with spaces) and truncates without adding ellipsis. Updated doctest examples to match actual behavior: - Control chars removed: 'Evil\x00\nInjection' -> 'EvilInjection' - Truncation verified by length check instead of expecting '...' - Added test for None input returning 'unknown' All doctests now passing. Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com> * Update secrets following rebase Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com> --------- Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Brian Hussey <brian.hussey@ie.ibm.com>
jonpspri
added a commit
that referenced
this pull request
May 8, 2026
…IBM#4296) * feat(runtime): runtime-mutable MCP and A2A mode via admin API Add PATCH /admin/runtime/mcp-mode and PATCH /admin/runtime/a2a-mode that flip the public /mcp ingress and registered-A2A invocation path between shadow and edge at runtime without a restart. Boot env vars (RUST_MCP_MODE, RUST_A2A_MODE) still select the initial mode; overrides are in-memory only. When Redis is configured the override propagates cluster-wide via pub/sub on contextforge:runtime:mode with a per-runtime monotonic INCR counter and a 24h hint key per runtime so freshly-started pods reconcile to the cluster's current desired override on boot. Off and full boot modes are intentionally not flippable (409): off has no Rust sidecar to swap to; full keeps Rust as the owner of MCP session/ event-store/resume cores and flipping to shadow would orphan that state. The reverse-proxy caveat (admin flips don't propagate past an upstream nginx until the proxy is reconfigured) is documented and tracked separately in IBM#4278. Closes IBM#4273 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(runtime): preserve session-auth-reuse safety invariant under override Codex stop-time review caught that `override="edge"` on a shadow-boot deployment silently bypassed the documented safety invariant for routing public traffic to Rust: `_should_mount_public_rust_transport` had been loosened to require only `experimental_rust_mcp_runtime_enabled`, skipping `experimental_rust_mcp_session_auth_reuse_enabled` (and the analogous `delegate_enabled` for A2A). A boot=shadow deployment with an admin-issued edge override would have started routing public MCP to a Rust sidecar that never opted into session-auth-reuse. Fix: - `_should_mount_public_rust_transport` keeps the original invariant unchanged. Override="shadow" still forces Python; override="edge" now matches the default (honors the invariant). Same shape for `_should_delegate_a2a_to_rust`. - Narrow `_FLIPPABLE_BOOT_MODES` to `{edge}`. Only boot=edge deployments have both flags true and can safely flip. boot=shadow PATCHes now return 409 with an operator-facing explanation of the invariant. - Two belt-and-braces unit tests exercise the read side directly (even if a stale hint somehow landed override=edge in state, the transport function refuses to route to Rust without the safety flags). - Two router tests cover the new 409 for shadow boot (MCP + A2A). - Docs updated: architecture doc, ADR 043 addendum, crate README/STATUS, .env.example, docker-compose.yml. Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(runtime): allow shadow override clearance and discard incompatible hints Codex stop-time review caught a new dead-end introduced by the previous safety-invariant fix: after restricting _FLIPPABLE_BOOT_MODES to {edge}, a shadow-boot deployment that inherited a stale override=edge via a Redis hint (left by a prior edge-boot deploy) had no API path to clear state — every PATCH returned 409. The operator's only recourse was flushing Redis by hand. Fix is two-sided: Router: - Gate on target mode, not boot mode. mode=shadow accepted from any boot that has a dispatcher mounted (shadow or edge). mode=edge still requires boot=edge (safety invariant). off and full still 409 for any PATCH (no dispatcher). - Replaced _FLIPPABLE_BOOT_MODES with _DISPATCHER_BOOT_MODES and _EDGE_OVERRIDE_BOOT_MODES to make the two gates explicit. Coordinator: - _reconcile_from_hint now checks whether the hint's target mode can safely take effect on this deployment. Incompatible hints (e.g. a former edge-boot pod's hint replayed on a shadow-boot deploy) are discarded and boot_reconcile_status is set to the new INCOMPATIBLE_HINT value, surfaced via /health. Without this guard, shadow-boot pods would boot with override=edge in state while the transport layer refused to honor it — misleading diagnostics that required the escape-hatch PATCH to clear. - Added _deployment_allows_mode helper that mirrors the safety-invariant checks in version.py (lazy-imported to avoid the version→settings cycle). Tests: - test_patch_mcp_mode_shadow_clears_stale_override_on_shadow_boot exercises the escape hatch end-to-end. - test_reconcile_from_hint_discards_incompatible_mode verifies the coordinator refuses to apply a shadow-boot-incompatible hint. - test_reconcile_from_hint_accepts_shadow_mode_on_shadow_boot covers the positive case (shadow is always compatible). Docs updated to explain the per-target-mode gating and the INCOMPATIBLE_HINT status. Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(runtime): discard stranded hints on boot=off and boot=full pods Codex stop-time review caught that the previous _deployment_allows_mode check returned True for shadow unconditionally — but boot=off has no Rust sidecar at all and boot=full mounts a plain RustMCPRuntimeProxy with no dispatcher, so neither boot mode has a mechanism that could observe an override. A shadow hint replayed onto either pod would strand in state (effective_mode=shadow in diagnostics; transport keeps routing to its boot-determined target), and the router 409s for any PATCH on those boots so the operator would have no API path to clear it. Tighten _deployment_allows_mode to match the router's admit logic: shadow accepts only when a dispatcher is mounted (boot=shadow or edge for MCP; boot=shadow or edge for A2A), and edge still requires the session-auth-reuse / delegate-enabled invariant. Three new tests: - test_reconcile_from_hint_discards_shadow_on_full_boot - test_reconcile_from_hint_discards_any_mode_on_off_boot - test_reconcile_from_hint_discards_a2a_hint_on_a2a_off_boot Architecture doc updated to spell out all three INCOMPATIBLE_HINT cases. Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(runtime): gate live pub/sub on compat check + structured rejection reasons Third-pass review identified a critical and several important issues: C1: _listen_loop applied remote pub/sub messages without the same compatibility check _reconcile_from_hint enforces. A remote pod's incompatible flip would strand in local state. Mirror the same gate. I2: _listen_loop's recovery branch fired on any non-exception return, including get_message returning None. A pubsub broken in a way that returns None reliably without raising would falsely re-promote cluster_propagation from DEGRADED to REDIS. Gate the recovery on message is not None — the existing idle-timeout branch already handles the legit no-message case. I1: Document the deliberate "discarded hint left in Redis" choice in _reconcile_from_hint's comment. S4+S5+S6 (combined refactor): - Promote _deployment_allows_mode to return MoveCompatibility enum with structured rejection reasons (NO_DISPATCHER, BOOT_FULL_STRANDS, EDGE_NEEDS_SAFETY_FLAG, OK). - Split BootReconcileStatus.INCOMPATIBLE_HINT into three granular values (INCOMPATIBLE_NO_DISPATCHER, INCOMPATIBLE_BOOT_FULL, INCOMPATIBLE_SAFETY_FLAG) so /health distinguishes the cases. - Move the helper to version.py next to the safety-invariant functions with a new public deployment_allows_override_mode wrapper. - Router now calls deployment_allows_override_mode directly, replacing the parallel _DISPATCHER_BOOT_MODES/_EDGE_OVERRIDE_BOOT_MODES sets with a single source of truth. The 409 detail string is generated from the MoveCompatibility reason and names the exact env var or flag the operator needs to flip. Tests added (I3 + I4 + I5 + S1 + S2): - test_reconcile_from_hint_discards_edge_on_full_boot - test_reconcile_from_hint_discards_edge_on_off_boot - test_incompatible_hint_does_not_delete_redis_key_or_downgrade_propagation (S1 + S2: assert hint key untouched AND cluster_propagation stays REDIS) - test_patch_mcp_mode_edge_409_when_boot_mode_full - test_deployment_allows_override_mode_mcp_table (8 parametrized rows) - test_deployment_allows_override_mode_a2a_table (6 parametrized rows) Test fixtures (edge_boot, off_boot, full_boot, shadow_boot) now patch the underlying settings instead of just monkeypatching the boot-mode helpers — the new deployment_allows_override_mode inspects raw settings, so fixture drift would have masked real bugs. Architecture doc updated to enumerate the three INCOMPATIBLE_* values and document the live pub/sub compat check. Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(runtime): drop self-contradicting 'full' recommendation from NO_DISPATCHER 409 Codex stop-time review caught that the NO_DISPATCHER 409 detail told operators "Boot with RUST_..._MODE='shadow' or 'edge' (or 'full' for MCP) to enable overrides." But boot=full returns BOOT_FULL_STRANDS for the same reason (no dispatcher mounted) — switching to full would just trade one 409 for another. Worse, A2A has no 'full' mode at all, so the parenthetical was incoherent for A2A 409 responses. Drop the "(or 'full' for MCP)" clause. Add an inline comment explaining why full is intentionally excluded so a future maintainer doesn't re-add it. Two existing tests now also assert "'shadow' or 'edge'" is present and "'full'" is absent — pins the contract. Signed-off-by: Jonathan Springer <jps@s390x.com> * chore(secrets): refresh detect-secrets baseline after rebase onto main Line numbers in .secrets.baseline are tied to file content. After rebasing this branch onto main (which itself touched .secrets.baseline in upstream commits), the baseline needed a re-scan to match the post-rebase final file state. No new secrets detected. Signed-off-by: Jonathan Springer <jps@s390x.com> * docs(adr): defer generic cluster-wide settings propagation framework (ADR-050) Records the decision to keep the runtime-mutable mode override implementation purpose-built for MCP/A2A rather than extracting a generic ClusterStatePrimitive framework today. Captures the alternative abstraction sketch for future reference and enumerates the trigger conditions that would justify revisiting (second concrete consumer in flight, third independently-tracked use case on the horizon, or a need for cross-consumer coordination like atomic multi-setting flips). This is N=1 — CLAUDE.md's "three similar lines is better than a premature abstraction" applies. The right shape will emerge from copy-modifying for the second consumer and diffing the result. Signed-off-by: Jonathan Springer <jps@s390x.com> * docs(adr): scrub AI-assistant name from ADR-050 Repo guideline forbids naming AI assistants in committed diffs. The rationale section in ADR-050 cited 'Codex' as the source of stop-time review iterations; rephrase to describe the review process without naming the tool. Also drop the explicit 'CLAUDE.md' file citation in favor of 'repo convention' since the supporting quote stands on its own. Signed-off-by: Jonathan Springer <jps@s390x.com> * docs(adr): tighten ADR-050 claims, deliver defenses checklist, add 0049 row Comment-analyzer and code-reviewer review of ADR-050 surfaced several claims that would rot, one promised section that didn't exist, and a missing index row that visually framed ADR-050 as having skipped a number. ADR-050 fixes: - Correct the "≈900 LOC together" claim to ~1.5 kLOC. Actual line counts are runtime_state.py 977 + runtime_admin_router.py 489 = 1,466. - Replace "six stop-time review iterations" with the verifiable "five follow-up fix(runtime): commits on top of the initial feat(runtime):" so the count is grep-checkable from git log. - Drop the "repo convention is explicit: three similar lines beats a premature abstraction" framing — that line is from contributor tooling, not anything written in the repo. The N=1 argument now stands on its own merits. - Replace "the most-reviewed code on the branch" superlative with the timeless "received unusually heavy review during initial development." - Caption the ClusterStatePrimitive code block as a hypothetical sketch so readers who grep-arrive mid-document don't think the class exists. - Move the reverse-proxy WARN attribution from runtime_state.py to runtime_admin_router.py (it lives in the router). - Name LISTEN_LOOP_DEGRADE_THRESHOLD = 5 and RUNTIME_STATE_HINT_TTL_SECONDS = 24 * 60 * 60 in the context section so future drift in those constants is detectable from this ADR. - Reconcile trigger-#1's "build both consumers side-by-side" with the rationale's "copy-modify" guidance: copy-modify first, then extract the framework from the resulting two-implementation diff. - Deliver the "non-obvious defenses" checklist that the Negative consequence promised but never wrote. The new "Defenses checklist for the copy-modify path" section enumerates 12 defenses, each tied to a bug class caught during this PR's review iterations. Index fix: - Add the missing 0049 row for "Multi-Protocol Virtual Servers" (Accepted, Architecture, 2026-04-16). The 048→050 jump made ADR-050 look like it had skipped a number when ADR-049 was just un-indexed. - Other duplicate ADR numbers on disk (005, 034, 038, 048) and other index gaps are pre-existing and tracked separately in IBM#4290. Signed-off-by: Jonathan Springer <jps@s390x.com> * feat(transports): swappable MCPIngressMount + nginx-style Rust public proxy (ADR-051) Replace MCPStreamableHTTPModeDispatcher (closed-set, two transports, policy baked into dispatch) with MCPIngressMount: a thin ASGI indirection that holds a registry of named ingresses + a swappable selector. Adding a new ingress is one register() call; selection policy is one function. Three ingresses registered today: - "python" — the existing MCPRuntimeHeaderTransportWrapper. Always registered. - "rust-internal" — the existing RustMCPRuntimeProxy that forwards via httpx to MCP_RUST_LISTEN_HTTP / UDS (trusted-internal listener). Registered for boot=shadow/edge. - "rust-public" — new RustMCPPublicProxyApp that does nginx-style reverse-proxy to MCP_RUST_PUBLIC_LISTEN_HTTP. Adds X-Forwarded-*, RFC 7239 Forwarded, X-Real-IP. Streams both directions, preserves Authorization/Cookie. Registered only for boot=edge. Selector chooses between rust-internal and rust-public based on the new settings.mcp_rust_ingress (default "internal" preserves prior behavior bit-for-bit). Selector preserves all existing safety invariants — _should_mount_public_rust_transport() is consulted first, shadow override / boot=full / boot=off behave identically to before, the runtime-mode override admin API requires no changes. Test results: 1839 passing across the touched modules. Lint clean. ADR-051 documents the design decision, alternatives considered, and migration notes from MCPStreamableHTTPModeDispatcher. Index updated. Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(transports): close XFF spoof + harden Rust public proxy ingress Response to review of 1967770 (MCPIngressMount + nginx-style Rust public proxy). Closes a security issue and several diagnostic gaps in the public proxy and ingress selector. Security fix — RustMCPPublicProxyApp: - _build_forwarded_headers now drops client-supplied Forwarded / X-Forwarded-* / X-Real-IP from the inbound iteration before re-deriving them from request.client. The prior code appended to the client-supplied X-Forwarded-For, letting any caller spoof their own source IP to Rust's /_internal/mcp/authenticate callback in the no-nginx-in-front deployment this proxy targets. New _FORWARDED_CHAIN_HEADERS frozenset enumerates the strip set. - _format_forwarded_for emits RFC 7239 §6.3-conformant `for=unknown` (no synthetic `:0`) for absent clients, brackets IPv6 addresses exactly once, and tolerates already-bracketed input. Selector hardening — _select_mcp_ingress / config: - mcp_rust_ingress is now Literal["internal", "public"] so Pydantic rejects typos at boot. - Selector only returns "rust-public" when boot_mcp_runtime_mode() == "edge" — the public listener isn't bound on shadow boot, so a stale edge override + mcp_rust_ingress=public combination would otherwise return an unregistered name and silently fall back to Python at debug-only log severity. - _build_mcp_transport_app emits a logger.error (not warning) when mcp_rust_ingress=public is set on shadow boot, so the misconfig survives the project-default LOG_LEVEL=ERROR. Error handling — RustMCPPublicProxyApp: - __call__ now distinguishes asyncio.CancelledError (re-raised), httpx.HTTPError (logged with type name + 502), and any other exception (logger.exception + 500), so a non-HTTPError no longer bypasses the handler and produces a silent Starlette 500. - _body_iter logs CancelledError at debug (common for SSE client disconnect) and other exceptions at error (rare upstream death), before re-raising. The finally still calls aclose to return the upstream connection to the pool. - Auth-relevant or server-error upstream statuses (401/403/5xx) now log a warning with method+path+status so a credential-rotation flood is visible without raising the global log level. MCPIngressMount diagnostics: - dispatch's selector-miss paths (fallback used, 503-on-no-fallback) now log warning with the unknown name AND self.names(), instead of debug-only on the fallback path with no log on the 503 path. A misrouted ingress should never be silent. Tests: - New tests/unit/mcpgateway/transports/test_rust_mcp_public_proxy.py (24 tests). Covers _format_forwarded_for parametrized over IPv4 with/without port, IPv6, already-bracketed IPv6, None/empty host; _build_forwarded_headers spoof rejection + hop-by-hop strip + auth preservation + unknown-client emission; proxy ASGI behavior across status passthrough, 502 on httpx.HTTPError, 500 on unexpected, parametrized warning on (401, 403, 500, 503), non-HTTP scope → 404, CancelledError re-raise on send and mid-stream debug-branch, lazy client lifecycle, multi-chunk SSE streaming, and aclose-on-exit. - Two new tests in test_mcp_ingress_mount.py pin the warning-level log severity on selector-miss + 503 paths. - Two new tests in test_main_extended.py pin the boot-time error log for the shadow + mcp_rust_ingress=public misconfig and the selector downgrade to rust-internal under the same condition. Doc rot fixes: - mcp_ingress_mount.py module docstring uses the full ADR filename (was an ellipsis). - version.py docstring references MCPIngressMount instead of the removed MCPStreamableHTTPModeDispatcher. - ADR-051 migration notes use test-group descriptors instead of pinned line numbers / test names. Test results: 652 passing across the touched modules. Lint clean (black, ruff, interrogate 100%). Signed-off-by: Jonathan Springer <jps@s390x.com> * chore(quality): fix pylint W0706/W2301/R0401 and close coverage gaps PR review feedback: Pylint findings: - W0706 (try-except-raise) at runtime_state.py:933 and rust_mcp_public_proxy.py:237: removed the redundant ``except asyncio.CancelledError: raise`` blocks. CancelledError is BaseException in Python 3.8+, so the surrounding ``except Exception`` / ``except httpx.HTTPError`` arms never caught it; the explicit re-raise was no-op and made pylint complain. Behavior is unchanged (asyncio still cancels both coroutines naturally), and the intent is preserved as a comment. - W2301 (unnecessary-ellipsis) at mcp_ingress_mount.py:37: dropped the ``...`` from the ASGIApp Protocol's ``__call__`` body — the docstring already counts as a body for Protocol methods. - R0401 (cyclic-import) on the runtime_state ↔ version pair: added per-line ``# pylint: disable=cyclic-import`` to the two lazy imports in runtime_state.py. Pylint follows function-body imports for cycle detection; the lazy imports break the runtime cycle but pylint still flags the static graph. The full extract (move MoveCompatibility + coercers to a shared module) is bigger refactor than this PR warrants. Coverage gaps closed (per-file delta): - routers/runtime_admin_router.py 97.3% → 100%. Added three tests for ``_warn_if_behind_reverse_proxy``: header detection emits warning with header names, no warning when no forwarded headers, defensive early return when ``request.headers`` is None. - transports/mcp_ingress_mount.py 94.7% → 100%. Added test for ``set_fallback`` covering swap mid-life and ``None`` restoring 503-on-miss behavior. - transports/rust_mcp_public_proxy.py 96.5% → 100%. Added test for ``_body_iter`` taking the ``except Exception`` arm on a non-cancel mid-stream error: asserts ``logger.exception`` fires with traceback. - runtime_state.py 89.9% → 99%. Added 9 tests covering: defensive ValueError branches in 4 accessor methods (parametrized), idempotent ``coordinator.start()``, ``_reconcile_from_hint`` no-redis and malformed-mode paths, ``_cleanup_pubsub`` timeout / exception / fallback-to-close paths, listen-loop bytes-payload decode + idle recovery + malformed-payload + empty-data + uncoerceable-fields branches. - version.py: targeted user-noted gaps (379, 434, 509, 570, 623) all closed. Added tests for last_change population in _mcp_runtime_status_payload / _a2a_runtime_status_payload, A2A shadow override forcing _should_delegate_a2a_to_rust False, and the boot_mcp_transport_mount public wrapper. Line 509 (NO_DISPATCHER fallthrough — _coerce_runtime rejects unknown kinds upstream) marked ``# pragma: no cover`` as a defensive default. Other defensive defaults marked ``# pragma: no cover``: - runtime_admin_router.py:99 (OK fallthrough — caller already returned for OK upstream). - runtime_state.py:164, 171 in ``_move_compat_to_reconcile_status`` (OK case shouldn't reach here per docstring; final return is a defensive default for future MoveCompatibility variants). Test results: 1272 passing across the touched modules. Lint clean (black, ruff, pylint 10.00/10, interrogate 100%). Signed-off-by: Jonathan Springer <jps@s390x.com> --------- Signed-off-by: Jonathan Springer <jps@s390x.com>
jonpspri
added a commit
that referenced
this pull request
May 8, 2026
* fix: update scope['modified_path'] for server_id extraction with root_path (#4270)
Fixes #4266
The /mcp endpoint was returning ALL tools instead of server-scoped tools
when accessed via /servers/{id}/mcp in reverse proxy deployments with
root_path configured. The /sse endpoint worked correctly.
Root cause: MCPPathRewriteMiddleware computed app_path by stripping
root_path prefix (line 3026) but never updated scope['modified_path'].
When streamablehttp_transport.py:2898 reads modified_path to extract
server_id via regex (r'^/servers/(?P<server_id>[^/]+)/mcp'), the regex
fails to match paths like '/prefix/servers/123/mcp', causing server_id
to be None and tools/list to return all tools instead of server-scoped.
Solution: Update scope['modified_path'] = app_path after computing the
app-relative path. This ensures streamablehttp_transport receives the
path without root_path prefix for correct server_id extraction.
Changes:
- Set scope['modified_path'] to app_path after stripping root_path
- Add regression tests verifying modified_path is app-relative
- All existing MCPPathRewriteMiddleware tests pass
Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com>
Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>
* fix(transport): return 405 on session-less GET /mcp (#4205) (#4284)
* fix(docker): install wget in a2a-echo-agent image for healthcheck
The compose healthcheck for a2a_echo_agent runs `wget -qO- http://localhost:9100/health`,
but the debian:bookworm-slim runtime stage only installed ca-certificates and tzdata —
so every probe failed with "exec: wget: executable file not found in $PATH" and the
container stayed perpetually unhealthy despite the agent itself listening correctly.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* fix(transport): return 405 on session-less GET /mcp (#4205)
A strict MCP client opens a passive SSE stream via GET right after initialize
to receive server-initiated messages. When ContextForge cannot anchor that
stream to a session (stateful sessions disabled, session-core disabled in
Rust, or no Mcp-Session-Id presented), it previously either committed to an
empty event stream (Rust live-stream relay) or handed off to the SDK which
held the GET open without emitting events. Strict clients then timed out on
the empty stream and reported "Failed to initialize server session".
The MCP Streamable HTTP spec allows the server to return 405 for such GETs;
this change takes that route in both gateways.
Python (mcpgateway/transports/streamablehttp_transport.py):
- SessionManagerWrapper.handle_streamable_http now returns 405 with
Allow: POST, DELETE for GET when either settings.use_stateful_sessions
is false or no Mcp-Session-Id was presented.
- The check runs after _validate_server_id() and the RBAC gate so bogus
server IDs still 404 and unauthorized callers still 403, preserving the
"RBAC before processing" invariant.
- Branch-specific detail messages distinguish "stateful sessions disabled"
from "no session id presented" so clients don't retry in a loop.
- logger.info on rejection with path + flags for ops visibility.
Rust (crates/mcp_runtime/src/lib.rs):
- forward_transport_request returns 405 for GET when either
session_core_enabled is false or no session id is presented in header
or session_id query parameter. Without session_core, Rust cannot anchor
an SSE stream; without a session id, there is nothing to anchor to.
- The 405 response carries the same capability headers as every other
terminal response in the function.
- tracing::info! on rejection and a RuntimeStats.transport.get_rejected_no_session
counter so ops can distinguish "no traffic" from "every client rejected".
- One pre-existing test (server_scoped_transport_wrappers_inject_server_header)
updated to provide a session id + record so it still exercises the
server-id header injection path.
Tests:
- Python: 3 new parametrized tests covering stateless/no-session/pass-through
and both Mcp-Session-Id header aliases; server-scoped variant confirms
405 fires only after server-id validation (and that 404 still wins for
nonexistent servers).
- Rust: 3 new tests covering 405 without session id, 405 when session_core
disabled even with header, and live-stream branch still reached when a
valid session id is presented.
- Also un-hollows two pre-existing header-parsing tests
(test_handle_streamable_http_non_tuple_header_skipped,
test_handle_streamable_http_non_bytes_header_skipped) that would have
been silently short-circuited by the new 405 path.
Note: This fixes the downstream GET-SSE cascade described in the comment
on #4205. The upstream per-client session-isolation problem from the
original reproducer (shared pooled upstream session leaking state between
downstream clients) is a separate architectural change and will be
addressed in a follow-up.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* fix(transport): address review feedback on GET /mcp 405 path (#4205)
Follow-up to 252e8c228. Addresses review feedback surfaced by the
PR-review agents:
- Rust `runtime_session_id_from_request` now also accepts
`x-mcp-session-id`, restoring parity with the Python gateway so a
client isn't accepted by one runtime and 405'd by the other.
- Split the Rust 405 counter: `get_rejected_no_session` stays as the
total and a new `get_rejected_session_core_disabled` subcounter
isolates the deployment-config branch, so ops can distinguish client
misuse from session-core being off.
- Added a Python Prometheus counter (`transport_get_rejected_total`)
with matching `no_session_id` / `stateful_disabled` outcomes to close
the observability gap between runtimes.
- Split log levels on both sides: `warn` for the operator-facing
disabled-session / stateful-disabled branch, `debug` for the
no-session-id branch (routine strict-client probing).
- Added a regression guard test that GET with RBAC denial returns 403
(not 405) and emits no `Allow` header.
- Session-affinity `except Exception` in the Python gateway now warns
instead of silently rendering infrastructure failures as 405s.
- 404 on unknown server no longer allowed to leak an `Allow` header.
- GET OAuth-enabled server test asserts 401 before the 405 path.
- Rust happy-path tests snapshot counters to assert no 405 tick.
- "Why 405 (not 404/400)" rationale added to both comment blocks.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* fix(transport): gate Rust 405 inside live-stream branch (#4205)
Integration tests `get_and_delete_mcp_routes_forward_to_internal_transport_bridge`
and `live_stream_core_disabled_falls_back_to_python_transport_get` were
failing in CI: the standalone 405 block in `forward_transport_request`
pre-empted the Python transport-bridge fallback path, even for GETs the
runtime could have cleanly proxied.
The empty-stream cascade the 405 was defending against only arises when
the live-stream relay is entered without a session anchor. Move the 405
guard inside that branch so GETs that don't request SSE (or where
live_stream_core is disabled) fall through to the existing Python-proxy
fallback, which has its own 405 logic. Update the matching unit test to
set `Accept: text/event-stream` so it still reaches the gate.
No change to the counter semantics or Python-side behaviour. Full
workspace test suite (Rust + Python) passes.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* chore(rust): apply clippy 1.95.0 lint fixes in non-PR crates
Rustc stable moved from 1.94.1 to 1.95.0 between the last main Rust CI
run (#3704, 2026-04-16) and PR #4284, enabling two additional lints at
workspace `-D warnings` level. The new lints flag pre-existing code in
crates outside this PR's surface area, blocking Rust clippy CI for every
branch that touches any Rust file.
Apply clippy's exact suggestions:
- `crates/wrapper/src/json_rpc_id_fast.rs`: collapse a `FieldName`
match arm's nested `if` into a match guard (`collapsible_match`).
- `crates/a2a_runtime/src/event_store.rs`: drop a redundant
`.into_iter()` in a `.zip()` argument (`useless_conversion`).
- `crates/a2a_runtime/src/server.rs`: collapse two match-arm nested
`if`s (`get`|`cancel` and `list`) into match guards.
Verified with `cargo clippy --workspace --all-targets -- -D warnings`
and `cargo test --workspace` on rustc 1.95.0. No behaviour change.
Signed-off-by: Jonathan Springer <jps@s390x.com>
---------
Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>
* feat(runtime): runtime-mutable RUST_MCP_MODE + A2A_MODE via admin API (#4296)
* feat(runtime): runtime-mutable MCP and A2A mode via admin API
Add PATCH /admin/runtime/mcp-mode and PATCH /admin/runtime/a2a-mode that
flip the public /mcp ingress and registered-A2A invocation path between
shadow and edge at runtime without a restart. Boot env vars
(RUST_MCP_MODE, RUST_A2A_MODE) still select the initial mode; overrides
are in-memory only. When Redis is configured the override propagates
cluster-wide via pub/sub on contextforge:runtime:mode with a per-runtime
monotonic INCR counter and a 24h hint key per runtime so freshly-started
pods reconcile to the cluster's current desired override on boot.
Off and full boot modes are intentionally not flippable (409): off has
no Rust sidecar to swap to; full keeps Rust as the owner of MCP session/
event-store/resume cores and flipping to shadow would orphan that state.
The reverse-proxy caveat (admin flips don't propagate past an upstream
nginx until the proxy is reconfigured) is documented and tracked
separately in #4278.
Closes #4273
Signed-off-by: Jonathan Springer <jps@s390x.com>
* fix(runtime): preserve session-auth-reuse safety invariant under override
Codex stop-time review caught that `override="edge"` on a shadow-boot
deployment silently bypassed the documented safety invariant for routing
public traffic to Rust: `_should_mount_public_rust_transport` had been
loosened to require only `experimental_rust_mcp_runtime_enabled`, skipping
`experimental_rust_mcp_session_auth_reuse_enabled` (and the analogous
`delegate_enabled` for A2A). A boot=shadow deployment with an admin-issued
edge override would have started routing public MCP to a Rust sidecar
that never opted into session-auth-reuse.
Fix:
- `_should_mount_public_rust_transport` keeps the original invariant
unchanged. Override="shadow" still forces Python; override="edge" now
matches the default (honors the invariant). Same shape for
`_should_delegate_a2a_to_rust`.
- Narrow `_FLIPPABLE_BOOT_MODES` to `{edge}`. Only boot=edge deployments
have both flags true and can safely flip. boot=shadow PATCHes now
return 409 with an operator-facing explanation of the invariant.
- Two belt-and-braces unit tests exercise the read side directly (even
if a stale hint somehow landed override=edge in state, the transport
function refuses to route to Rust without the safety flags).
- Two router tests cover the new 409 for shadow boot (MCP + A2A).
- Docs updated: architecture doc, ADR 043 addendum, crate README/STATUS,
.env.example, docker-compose.yml.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* fix(runtime): allow shadow override clearance and discard incompatible hints
Codex stop-time review caught a new dead-end introduced by the previous
safety-invariant fix: after restricting _FLIPPABLE_BOOT_MODES to {edge},
a shadow-boot deployment that inherited a stale override=edge via a
Redis hint (left by a prior edge-boot deploy) had no API path to clear
state — every PATCH returned 409. The operator's only recourse was
flushing Redis by hand.
Fix is two-sided:
Router:
- Gate on target mode, not boot mode. mode=shadow accepted from any
boot that has a dispatcher mounted (shadow or edge). mode=edge still
requires boot=edge (safety invariant). off and full still 409 for any
PATCH (no dispatcher).
- Replaced _FLIPPABLE_BOOT_MODES with _DISPATCHER_BOOT_MODES and
_EDGE_OVERRIDE_BOOT_MODES to make the two gates explicit.
Coordinator:
- _reconcile_from_hint now checks whether the hint's target mode can
safely take effect on this deployment. Incompatible hints (e.g. a
former edge-boot pod's hint replayed on a shadow-boot deploy) are
discarded and boot_reconcile_status is set to the new INCOMPATIBLE_HINT
value, surfaced via /health. Without this guard, shadow-boot pods
would boot with override=edge in state while the transport layer
refused to honor it — misleading diagnostics that required the
escape-hatch PATCH to clear.
- Added _deployment_allows_mode helper that mirrors the safety-invariant
checks in version.py (lazy-imported to avoid the version→settings cycle).
Tests:
- test_patch_mcp_mode_shadow_clears_stale_override_on_shadow_boot
exercises the escape hatch end-to-end.
- test_reconcile_from_hint_discards_incompatible_mode verifies the
coordinator refuses to apply a shadow-boot-incompatible hint.
- test_reconcile_from_hint_accepts_shadow_mode_on_shadow_boot covers
the positive case (shadow is always compatible).
Docs updated to explain the per-target-mode gating and the
INCOMPATIBLE_HINT status.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* fix(runtime): discard stranded hints on boot=off and boot=full pods
Codex stop-time review caught that the previous _deployment_allows_mode
check returned True for shadow unconditionally — but boot=off has no
Rust sidecar at all and boot=full mounts a plain RustMCPRuntimeProxy
with no dispatcher, so neither boot mode has a mechanism that could
observe an override. A shadow hint replayed onto either pod would
strand in state (effective_mode=shadow in diagnostics; transport keeps
routing to its boot-determined target), and the router 409s for any
PATCH on those boots so the operator would have no API path to clear it.
Tighten _deployment_allows_mode to match the router's admit logic:
shadow accepts only when a dispatcher is mounted (boot=shadow or edge
for MCP; boot=shadow or edge for A2A), and edge still requires the
session-auth-reuse / delegate-enabled invariant.
Three new tests:
- test_reconcile_from_hint_discards_shadow_on_full_boot
- test_reconcile_from_hint_discards_any_mode_on_off_boot
- test_reconcile_from_hint_discards_a2a_hint_on_a2a_off_boot
Architecture doc updated to spell out all three INCOMPATIBLE_HINT cases.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* fix(runtime): gate live pub/sub on compat check + structured rejection reasons
Third-pass review identified a critical and several important issues:
C1: _listen_loop applied remote pub/sub messages without the same
compatibility check _reconcile_from_hint enforces. A remote pod's
incompatible flip would strand in local state. Mirror the same gate.
I2: _listen_loop's recovery branch fired on any non-exception return,
including get_message returning None. A pubsub broken in a way that
returns None reliably without raising would falsely re-promote
cluster_propagation from DEGRADED to REDIS. Gate the recovery on
message is not None — the existing idle-timeout branch already handles
the legit no-message case.
I1: Document the deliberate "discarded hint left in Redis" choice in
_reconcile_from_hint's comment.
S4+S5+S6 (combined refactor):
- Promote _deployment_allows_mode to return MoveCompatibility enum
with structured rejection reasons (NO_DISPATCHER, BOOT_FULL_STRANDS,
EDGE_NEEDS_SAFETY_FLAG, OK).
- Split BootReconcileStatus.INCOMPATIBLE_HINT into three granular
values (INCOMPATIBLE_NO_DISPATCHER, INCOMPATIBLE_BOOT_FULL,
INCOMPATIBLE_SAFETY_FLAG) so /health distinguishes the cases.
- Move the helper to version.py next to the safety-invariant functions
with a new public deployment_allows_override_mode wrapper.
- Router now calls deployment_allows_override_mode directly, replacing
the parallel _DISPATCHER_BOOT_MODES/_EDGE_OVERRIDE_BOOT_MODES sets
with a single source of truth. The 409 detail string is generated
from the MoveCompatibility reason and names the exact env var or
flag the operator needs to flip.
Tests added (I3 + I4 + I5 + S1 + S2):
- test_reconcile_from_hint_discards_edge_on_full_boot
- test_reconcile_from_hint_discards_edge_on_off_boot
- test_incompatible_hint_does_not_delete_redis_key_or_downgrade_propagation
(S1 + S2: assert hint key untouched AND cluster_propagation stays REDIS)
- test_patch_mcp_mode_edge_409_when_boot_mode_full
- test_deployment_allows_override_mode_mcp_table (8 parametrized rows)
- test_deployment_allows_override_mode_a2a_table (6 parametrized rows)
Test fixtures (edge_boot, off_boot, full_boot, shadow_boot) now patch
the underlying settings instead of just monkeypatching the boot-mode
helpers — the new deployment_allows_override_mode inspects raw
settings, so fixture drift would have masked real bugs.
Architecture doc updated to enumerate the three INCOMPATIBLE_* values
and document the live pub/sub compat check.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* fix(runtime): drop self-contradicting 'full' recommendation from NO_DISPATCHER 409
Codex stop-time review caught that the NO_DISPATCHER 409 detail told
operators "Boot with RUST_..._MODE='shadow' or 'edge' (or 'full' for
MCP) to enable overrides." But boot=full returns BOOT_FULL_STRANDS
for the same reason (no dispatcher mounted) — switching to full would
just trade one 409 for another. Worse, A2A has no 'full' mode at all,
so the parenthetical was incoherent for A2A 409 responses.
Drop the "(or 'full' for MCP)" clause. Add an inline comment explaining
why full is intentionally excluded so a future maintainer doesn't
re-add it. Two existing tests now also assert "'shadow' or 'edge'" is
present and "'full'" is absent — pins the contract.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* chore(secrets): refresh detect-secrets baseline after rebase onto main
Line numbers in .secrets.baseline are tied to file content. After
rebasing this branch onto main (which itself touched .secrets.baseline
in upstream commits), the baseline needed a re-scan to match the
post-rebase final file state. No new secrets detected.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* docs(adr): defer generic cluster-wide settings propagation framework (ADR-050)
Records the decision to keep the runtime-mutable mode override
implementation purpose-built for MCP/A2A rather than extracting a
generic ClusterStatePrimitive framework today. Captures the alternative
abstraction sketch for future reference and enumerates the trigger
conditions that would justify revisiting (second concrete consumer in
flight, third independently-tracked use case on the horizon, or a need
for cross-consumer coordination like atomic multi-setting flips).
This is N=1 — CLAUDE.md's "three similar lines is better than a
premature abstraction" applies. The right shape will emerge from
copy-modifying for the second consumer and diffing the result.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* docs(adr): scrub AI-assistant name from ADR-050
Repo guideline forbids naming AI assistants in committed diffs. The
rationale section in ADR-050 cited 'Codex' as the source of stop-time
review iterations; rephrase to describe the review process without
naming the tool. Also drop the explicit 'CLAUDE.md' file citation in
favor of 'repo convention' since the supporting quote stands on its
own.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* docs(adr): tighten ADR-050 claims, deliver defenses checklist, add 0049 row
Comment-analyzer and code-reviewer review of ADR-050 surfaced several
claims that would rot, one promised section that didn't exist, and a
missing index row that visually framed ADR-050 as having skipped a
number.
ADR-050 fixes:
- Correct the "≈900 LOC together" claim to ~1.5 kLOC. Actual line counts
are runtime_state.py 977 + runtime_admin_router.py 489 = 1,466.
- Replace "six stop-time review iterations" with the verifiable "five
follow-up fix(runtime): commits on top of the initial feat(runtime):"
so the count is grep-checkable from git log.
- Drop the "repo convention is explicit: three similar lines beats a
premature abstraction" framing — that line is from contributor tooling,
not anything written in the repo. The N=1 argument now stands on its
own merits.
- Replace "the most-reviewed code on the branch" superlative with the
timeless "received unusually heavy review during initial development."
- Caption the ClusterStatePrimitive code block as a hypothetical sketch
so readers who grep-arrive mid-document don't think the class exists.
- Move the reverse-proxy WARN attribution from runtime_state.py to
runtime_admin_router.py (it lives in the router).
- Name LISTEN_LOOP_DEGRADE_THRESHOLD = 5 and
RUNTIME_STATE_HINT_TTL_SECONDS = 24 * 60 * 60 in the context section
so future drift in those constants is detectable from this ADR.
- Reconcile trigger-#1's "build both consumers side-by-side" with the
rationale's "copy-modify" guidance: copy-modify first, then extract
the framework from the resulting two-implementation diff.
- Deliver the "non-obvious defenses" checklist that the Negative
consequence promised but never wrote. The new "Defenses checklist for
the copy-modify path" section enumerates 12 defenses, each tied to a
bug class caught during this PR's review iterations.
Index fix:
- Add the missing 0049 row for "Multi-Protocol Virtual Servers"
(Accepted, Architecture, 2026-04-16). The 048→050 jump made ADR-050
look like it had skipped a number when ADR-049 was just un-indexed.
- Other duplicate ADR numbers on disk (005, 034, 038, 048) and other
index gaps are pre-existing and tracked separately in #4290.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* feat(transports): swappable MCPIngressMount + nginx-style Rust public proxy (ADR-051)
Replace MCPStreamableHTTPModeDispatcher (closed-set, two transports,
policy baked into dispatch) with MCPIngressMount: a thin ASGI
indirection that holds a registry of named ingresses + a swappable
selector. Adding a new ingress is one register() call; selection
policy is one function.
Three ingresses registered today:
- "python" — the existing MCPRuntimeHeaderTransportWrapper. Always
registered.
- "rust-internal" — the existing RustMCPRuntimeProxy that forwards
via httpx to MCP_RUST_LISTEN_HTTP / UDS (trusted-internal listener).
Registered for boot=shadow/edge.
- "rust-public" — new RustMCPPublicProxyApp that does nginx-style
reverse-proxy to MCP_RUST_PUBLIC_LISTEN_HTTP. Adds X-Forwarded-*,
RFC 7239 Forwarded, X-Real-IP. Streams both directions, preserves
Authorization/Cookie. Registered only for boot=edge.
Selector chooses between rust-internal and rust-public based on the
new settings.mcp_rust_ingress (default "internal" preserves prior
behavior bit-for-bit). Selector preserves all existing safety
invariants — _should_mount_public_rust_transport() is consulted first,
shadow override / boot=full / boot=off behave identically to before,
the runtime-mode override admin API requires no changes.
Test results: 1839 passing across the touched modules. Lint clean.
ADR-051 documents the design decision, alternatives considered, and
migration notes from MCPStreamableHTTPModeDispatcher. Index updated.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* fix(transports): close XFF spoof + harden Rust public proxy ingress
Response to review of 19677706b (MCPIngressMount + nginx-style Rust
public proxy). Closes a security issue and several diagnostic gaps in
the public proxy and ingress selector.
Security fix — RustMCPPublicProxyApp:
- _build_forwarded_headers now drops client-supplied Forwarded /
X-Forwarded-* / X-Real-IP from the inbound iteration before
re-deriving them from request.client. The prior code appended to
the client-supplied X-Forwarded-For, letting any caller spoof their
own source IP to Rust's /_internal/mcp/authenticate callback in the
no-nginx-in-front deployment this proxy targets. New
_FORWARDED_CHAIN_HEADERS frozenset enumerates the strip set.
- _format_forwarded_for emits RFC 7239 §6.3-conformant `for=unknown`
(no synthetic `:0`) for absent clients, brackets IPv6 addresses
exactly once, and tolerates already-bracketed input.
Selector hardening — _select_mcp_ingress / config:
- mcp_rust_ingress is now Literal["internal", "public"] so Pydantic
rejects typos at boot.
- Selector only returns "rust-public" when boot_mcp_runtime_mode() ==
"edge" — the public listener isn't bound on shadow boot, so a stale
edge override + mcp_rust_ingress=public combination would otherwise
return an unregistered name and silently fall back to Python at
debug-only log severity.
- _build_mcp_transport_app emits a logger.error (not warning) when
mcp_rust_ingress=public is set on shadow boot, so the misconfig
survives the project-default LOG_LEVEL=ERROR.
Error handling — RustMCPPublicProxyApp:
- __call__ now distinguishes asyncio.CancelledError (re-raised),
httpx.HTTPError (logged with type name + 502), and any other
exception (logger.exception + 500), so a non-HTTPError no longer
bypasses the handler and produces a silent Starlette 500.
- _body_iter logs CancelledError at debug (common for SSE client
disconnect) and other exceptions at error (rare upstream death),
before re-raising. The finally still calls aclose to return the
upstream connection to the pool.
- Auth-relevant or server-error upstream statuses (401/403/5xx) now
log a warning with method+path+status so a credential-rotation flood
is visible without raising the global log level.
MCPIngressMount diagnostics:
- dispatch's selector-miss paths (fallback used, 503-on-no-fallback)
now log warning with the unknown name AND self.names(), instead of
debug-only on the fallback path with no log on the 503 path. A
misrouted ingress should never be silent.
Tests:
- New tests/unit/mcpgateway/transports/test_rust_mcp_public_proxy.py
(24 tests). Covers _format_forwarded_for parametrized over IPv4
with/without port, IPv6, already-bracketed IPv6, None/empty host;
_build_forwarded_headers spoof rejection + hop-by-hop strip + auth
preservation + unknown-client emission; proxy ASGI behavior across
status passthrough, 502 on httpx.HTTPError, 500 on unexpected,
parametrized warning on (401, 403, 500, 503), non-HTTP scope → 404,
CancelledError re-raise on send and mid-stream debug-branch, lazy
client lifecycle, multi-chunk SSE streaming, and aclose-on-exit.
- Two new tests in test_mcp_ingress_mount.py pin the warning-level
log severity on selector-miss + 503 paths.
- Two new tests in test_main_extended.py pin the boot-time error log
for the shadow + mcp_rust_ingress=public misconfig and the selector
downgrade to rust-internal under the same condition.
Doc rot fixes:
- mcp_ingress_mount.py module docstring uses the full ADR filename
(was an ellipsis).
- version.py docstring references MCPIngressMount instead of the
removed MCPStreamableHTTPModeDispatcher.
- ADR-051 migration notes use test-group descriptors instead of
pinned line numbers / test names.
Test results: 652 passing across the touched modules. Lint clean
(black, ruff, interrogate 100%).
Signed-off-by: Jonathan Springer <jps@s390x.com>
* chore(quality): fix pylint W0706/W2301/R0401 and close coverage gaps
PR review feedback:
Pylint findings:
- W0706 (try-except-raise) at runtime_state.py:933 and
rust_mcp_public_proxy.py:237: removed the redundant
``except asyncio.CancelledError: raise`` blocks. CancelledError is
BaseException in Python 3.8+, so the surrounding ``except Exception``
/ ``except httpx.HTTPError`` arms never caught it; the explicit
re-raise was no-op and made pylint complain. Behavior is unchanged
(asyncio still cancels both coroutines naturally), and the intent is
preserved as a comment.
- W2301 (unnecessary-ellipsis) at mcp_ingress_mount.py:37: dropped the
``...`` from the ASGIApp Protocol's ``__call__`` body — the
docstring already counts as a body for Protocol methods.
- R0401 (cyclic-import) on the runtime_state ↔ version pair: added
per-line ``# pylint: disable=cyclic-import`` to the two lazy imports
in runtime_state.py. Pylint follows function-body imports for cycle
detection; the lazy imports break the runtime cycle but pylint still
flags the static graph. The full extract (move MoveCompatibility +
coercers to a shared module) is bigger refactor than this PR
warrants.
Coverage gaps closed (per-file delta):
- routers/runtime_admin_router.py 97.3% → 100%. Added three tests for
``_warn_if_behind_reverse_proxy``: header detection emits warning
with header names, no warning when no forwarded headers, defensive
early return when ``request.headers`` is None.
- transports/mcp_ingress_mount.py 94.7% → 100%. Added test for
``set_fallback`` covering swap mid-life and ``None`` restoring
503-on-miss behavior.
- transports/rust_mcp_public_proxy.py 96.5% → 100%. Added test for
``_body_iter`` taking the ``except Exception`` arm on a non-cancel
mid-stream error: asserts ``logger.exception`` fires with traceback.
- runtime_state.py 89.9% → 99%. Added 9 tests covering: defensive
ValueError branches in 4 accessor methods (parametrized), idempotent
``coordinator.start()``, ``_reconcile_from_hint`` no-redis and
malformed-mode paths, ``_cleanup_pubsub`` timeout / exception /
fallback-to-close paths, listen-loop bytes-payload decode + idle
recovery + malformed-payload + empty-data + uncoerceable-fields
branches.
- version.py: targeted user-noted gaps (379, 434, 509, 570, 623) all
closed. Added tests for last_change population in
_mcp_runtime_status_payload / _a2a_runtime_status_payload, A2A
shadow override forcing _should_delegate_a2a_to_rust False, and
the boot_mcp_transport_mount public wrapper. Line 509
(NO_DISPATCHER fallthrough — _coerce_runtime rejects unknown kinds
upstream) marked ``# pragma: no cover`` as a defensive default.
Other defensive defaults marked ``# pragma: no cover``:
- runtime_admin_router.py:99 (OK fallthrough — caller already returned
for OK upstream).
- runtime_state.py:164, 171 in ``_move_compat_to_reconcile_status``
(OK case shouldn't reach here per docstring; final return is a
defensive default for future MoveCompatibility variants).
Test results: 1272 passing across the touched modules. Lint clean
(black, ruff, pylint 10.00/10, interrogate 100%).
Signed-off-by: Jonathan Springer <jps@s390x.com>
---------
Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>
* build(container): consolidate on Containerfile.lite; deprecate others (#4297)
Closes #2613.
The top-level `Containerfile` duplicated the .lite build with weaker layer
caching and shipped build tooling in the runtime image. `Containerfile.scratch`
had no frontend-builder stage (admin UI bundle absent), failed under QEMU
(amd64-only), and had no consumer in CI, docs, or the Makefile. Both are
removed; `Containerfile.lite` is now the only canonical gateway image.
Apply the same layer-consolidation and caching pass across `Containerfile.lite`
and the per-component Containerfiles:
- Merge adjacent RUN steps (microdnf + ln + useradd; profile.d setup + chmod;
Rust MCP + A2A cargo builds; ...) to reduce layer churn.
- Reorder COPYs from stable to volatile so frontend/Rust-binary changes no
longer invalidate the heavy venv-install layer.
- Use COPY --chown=1001:0 in the UBI nodejs frontend-builder stage — fixes
EACCES on package-lock.json under the non-root UBI default user — and
COPY --chmod=0755 on run scripts to drop follow-up chmod layers.
- Replace \`npm install --frozen-lockfile\` (a yarn/pnpm flag npm silently
ignores) with \`npm ci\`.
- Silence pre-existing hadolint DL3013/DL3041 findings with targeted ignore
comments consistent with the file-header policy.
Update every touchpoint of the deprecated images: Makefile podman/docker[-dev]
targets, fly.toml, docker-compose commented migration examples, the
docker-scan workflow matrix and trigger paths, .travis.yml,
.pre-commit-config.yaml, .bumpversion.cfg, MANIFEST.in, and user-facing docs
(packaging, release-management, compose, openshift, container, fly-io, argocd
tutorial, MULTIPLATFORM, roadmap). Update the docker-scan unit test to match
the trimmed workflow. Incidentally fix the stale \$(CONTAINERFILE) typo in
the snyk-iac Makefile target.
Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>
* feat: add HCS-14 UAID support for A2A agents (#4125)
* add HCS-14 UAID support for A2A agents
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* migration
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* lint error fix
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* test case fix
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* pyporject.toml
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* test coverage
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* feat: add base58 dependency and UAID routing config
- Add base58>=2.1.1 dependency for UAID generation support
- Add uaid_allowed_domains config setting for cross-gateway routing security
- Document UAID_ALLOWED_DOMAINS in .env.example
Related to #3956
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* a2a_agent_id column size change
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* feat: add UAID support to Admin UI
- Add UAID generation checkbox to create/edit A2A agent forms
- Add UAID registry and protocol configuration fields
- Display UAID vs UUID badge in agents table
- Show UAID metadata section in agent view modal
- Add JavaScript toggle logic for UAID fields
- Update form population for editing UAID agents
- Rebuild UI bundle with UAID changes
Completes HCS-14 UAID feature with full UI support.
Related to #3956
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* uaid ui change
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* revert id column length
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* uaid in admin ui create form
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* edit form UAID
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* test
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* length change
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* Add security documentation for UAID-based cross-gateway routing feature
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* documentation
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* test case
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* test: Add coverage for UAID cross-gateway routing edge cases
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* alembic head
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* test coverage
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* base58 and its deps resolved
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* security: fix SSRF and domain allowlist bypass in UAID cross-gateway routing
Fixes two critical security vulnerabilities in UAID-based cross-gateway routing:
1. Domain Allowlist Bypass (CVE-level)
- Previous: endpoint.endswith(d) allowed "evilexample.com" to bypass "example.com" allowlist
- Fixed: Require exact match OR proper subdomain prefix (.example.com)
- Now correctly blocks: "evilexample.com", "notexample.com"
- Still allows: "gateway.example.com", "example.com", "sub.gateway.example.com"
2. SSRF Injection via nativeId (CVE-level)
- Previous: nativeId field directly interpolated into URL without validation
- Attack vectors blocked:
* Protocol injection: file:///etc/passwd, gopher://internal
* User-info bypass: evil@127.0.0.1 (routes to internal IP)
* Path injection: gateway.com/admin (access internal endpoints)
- Legitimate ports still allowed: gateway.example.com:8443
Security validations added before URL construction:
- Reject protocol prefixes (://)
- Reject @ character (user-info bypass)
- Reject path components (/)
- Validate hostname format via urlparse
- Extract domain from endpoint:port for allowlist checking
Test coverage (8 new tests):
- test_invoke_remote_agent_domain_allowlist_bypass_attack
- test_invoke_remote_agent_domain_allowlist_subdomain_valid
- test_invoke_remote_agent_domain_allowlist_exact_match
- test_invoke_remote_agent_ssrf_protocol_injection
- test_invoke_remote_agent_ssrf_userinfo_bypass
- test_invoke_remote_agent_ssrf_path_injection
- test_invoke_remote_agent_ssrf_internal_ip (documents behavior)
- test_invoke_remote_agent_port_allowed
Related: #3956
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* style: apply black and isort formatting to schemas and UAID tests
- black: Fix line length in PluginPolicyItem config field description
- isort: Move First-Party imports to correct section in test files
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* security
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* uaid split
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* ruff
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* test cases
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* security test cases
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* varibale correct
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* ruff
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* rebase fix
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
* feat(a2a): A2A-1.0 UAID path in Rust sidecar + federation-loop guard
Splits UAID cross-gateway routing between the two runtimes by intent:
the Python gateway keeps the ContextForge-to-ContextForge federation
shape (target `/a2a/{uaid}/invoke` on a peer CF gateway), and the Rust
a2a_runtime gets a first-class A2A-1.0-compliant path that POSTs the
original JSON-RPC 2.0 body directly to the agent URL encoded in the
UAID's `nativeId`. The two paths are intentionally not reverse-compatible.
Key pieces:
- `crates/a2a_runtime/src/uaid.rs` (new): HCS-14 parser with SSRF
rejection (user-info, path injection, unsupported scheme),
case-insensitive domain allowlist (exact or `.suffix`), DoS length
cap, sealed `RoutingTarget` with private fields and a `Proto` enum
(`FromStr` + exhaustiveness test), stable `UaidError::code()` for
structured logs.
- `crates/a2a_runtime/src/server.rs`: UAID dispatch in
`handle_a2a_invoke` with local-resolve-first and cross-gateway
fall-through; SSE proxy (`handle_uaid_cross_gateway_streaming`)
with Last-Event-ID replay and a bounded JSON fallback that enforces
`max_response_body_bytes` via a streaming `read_bounded_body` helper
(per-chunk + cumulative cap); startup validation and rate-limited
warning for an empty allowlist; explicit one-time security warning
for the unforwarded-auth tradeoff.
- `mcpgateway/main.py` + `mcpgateway/services/a2a_service.py`: kind-
dispatch resolve (name/id/uaid) with 403 for visibility-denial so
the sidecar can distinguish hidden-from-absent; URL-encoded UAID in
the CF-federation target path; dropped redundant `agent_id` body
field; structured error logging without leaking 5xx internals.
Federation-loop prevention covers both A->B->A and self-referential
`endpoint_url` loops:
- `X-Contextforge-UAID-Hop: N` counter replaces the earlier binary
marker. Every outbound (cross-gateway, local-agent, and the
backward-compat `POST /invoke`) stamps `N+1`; every inbound rejects
at `uaid_max_federation_hops` (default 5).
- Strict parser with byte-identical rules in both runtimes: pure
ASCII digits only (no Unicode digits), no whitespace on single
values, saturating at `2**31 - 1`; RFC 7230 section 3.2.2 coalesced
form `"0, 10"` parses each element with OWS trim and takes the max
so smuggled low values can't mask a real high value.
- Max-of-variants defense against header-case smuggling: `HashMap`
iteration order can't be used to select a low value paired with a
high one, and axum's multi-valued `HeaderMap` is drained via
`get_all` + max.
- Named-agent paths translate resolve's 403 to 404 at the axum edge
so private-agent existence never leaks through the status code or
the error text (`map_agent_resolve_err` helper).
Config additions:
- Python `uaid_allowed_domains` (list), `uaid_max_length` (2048),
`uaid_max_federation_hops` (default 5).
- Rust `A2A_RUST_UAID_ALLOWED_DOMAINS`, `A2A_RUST_UAID_MAX_LENGTH`,
`A2A_RUST_UAID_MAX_FEDERATION_HOPS` (same defaults).
Tests:
- Python unit/integration: kind-dispatch resolve (uaid/id/name fall-
back), parse_hop_count parametric cases including Unicode digits,
whitespace, overflow, coalesced RFC 7230 form and smuggling, local-
agent hop stamping on `_execute_agent_request`, 3-hop chain locking
in the stamp-N+1 vs reject-at-max contract, TestClient-level
`/a2a/{agent_name}/invoke` header plumbing.
- Rust integration: 12+ UAID dispatch tests (local hit, cross-gateway
forward, streaming SSE proxy, body-size cap on chunked responses,
per-chunk cap, hop-limit at max for UAID and plain names,
multi-variant smuggling, coalesced header, 403->404 on named paths,
case-insensitive `/invoke` hop guard).
- Rust uaid.rs: 16 unit tests covering parse/validate/SSRF/allowlist
round-trip plus `Proto` exhaustiveness and `UaidError::code()`
stability contracts.
Closes #3956
Signed-off-by: Jonathan Springer <jps@s390x.com>
---------
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Jonathan Springer <jps@s390x.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>
* fix(#4205): isolate upstream MCP sessions per downstream session (#4299)
* feat(services): scaffold UpstreamSessionRegistry for 1:1 upstream binding (#4205)
Introduces UpstreamSessionRegistry, which maps
(downstream_session_id, gateway_id) to a single upstream MCP ClientSession.
Replaces the identity-keyed MCPSessionPool's sharing semantics: two
downstream MCP sessions — even carrying the same user identity — can no
longer receive the same pooled upstream session. This is the core
isolation change #4205's reproducer needs.
Design:
- 1:1 per (downstream_session_id, gateway_id). Never shared.
- Connection reuse preserved WITHIN one downstream session: a client
making many tool calls against gateway G through session S still
hits the upstream's initialize exactly once.
- Health probe on reuse after idle_validation_seconds (60s default),
chain is ping → list_tools → list_prompts → list_resources → skip.
Failed probe recreates the upstream session.
- Transport + ClientSession live inside a dedicated asyncio.Task whose
anyio cancel scope is bound to that task, not to the request handler
(#3737). Shutdown is signal-driven via an asyncio.Event; cancellation
only as a last resort with a timeout.
- No configurable settings. The tuning surface of the old pool
(max_per_key, TTL, etc.) collapses under the 1:1 constraint.
- Purely in-process. Multi-worker stickiness for a downstream session
is the session-affinity layer's concern (to be extracted next) —
each worker's registry only sees requests affinity has already
routed to it.
API surface:
- acquire() async context manager keyed by downstream_session_id,
gateway_id, url, headers, transport_type. Rejects empty ids.
- evict_session(downstream_session_id) for DELETE /mcp paths.
- evict_gateway(gateway_id) for gateway rotation/removal.
- close_all() for app shutdown.
- snapshot() -> RegistrySnapshot for /admin and logs.
- Module-level init/get/shutdown singleton accessors matching the
shape of get_mcp_session_pool() so call-sites migrate cleanly.
Tests (16, all passing):
isolation across downstream sessions (the #4205 invariant), reuse
within a session, distinct upstreams per gateway for one session,
concurrent acquires collapse to one create via the per-key lock,
idle probe + reuse, failed probe + recreate, evict_session /
evict_gateway / close_all teardown, dead owner-task detection,
gateway-internal header stripping, and the singleton accessors.
A FakeClientSession + fake SessionFactory keep the tests hermetic.
Not yet wired: nothing in the codebase calls the registry yet. The
follow-up commits will extract session-affinity to its own module,
wire startup/shutdown + DELETE eviction, migrate tool/prompt/resource
services, refactor gateway health checks, delete MCPSessionPool, and
remove the associated feature flags.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* feat(services): wire UpstreamSessionRegistry into app lifecycle (#4205)
Adds startup init, shutdown drain, and DELETE-triggered eviction for
the registry introduced in the previous commit. All additive; the old
MCPSessionPool still runs alongside for now.
main.py:
- init_upstream_session_registry() at startup, unconditionally (no
feature flag — the registry is always on).
- shutdown_upstream_session_registry() in the teardown block,
ordered between pool shutdown and SharedHttpClient shutdown to
ensure upstream sessions close before their HTTP transports go.
cache/session_registry.py (downstream session registry, distinct from
the upstream one this PR introduces):
- remove_session() now calls get_upstream_session_registry().evict_session(id)
as its last step. Fires on every path that drops a downstream session:
explicit DELETE /mcp, internal /_internal/mcp/session DELETE, SSE
disconnect housekeeping, database-backed session expiry. Wrapped so
a missing singleton (tests, early shutdown) or an eviction exception
never masks downstream teardown.
Tests (5 new, all passing):
remove_session → evict_session forwarding; remove_session tolerating
an uninitialized singleton; remove_session surviving an evict_session
that raises; shutdown close_all() call + singleton clear; re-init
after shutdown returns a fresh instance. Existing session_registry
coverage tests still green (72 tests, no regressions).
Signed-off-by: Jonathan Springer <jps@s390x.com>
* feat(tool_service): route MCP tool calls through UpstreamSessionRegistry (#4205)
Both MCP call-sites in tool_service.invoke_tool (SSE at ~L5048, StreamableHTTP
at ~L5230) now acquire their upstream ClientSession from the registry when a
downstream Mcp-Session-Id is in scope. This is the first visible behavioural
change of the #4205 fix: two downstream MCP sessions served by the same user
now build SEPARATE upstream sessions, so stateful upstream servers (counter,
etc.) no longer leak state between downstream clients.
Changes:
- Replace the conditional pool path with an unconditional registry path,
gated on the presence of a downstream Mcp-Session-Id (read from the
transport's request_headers_var via a new _downstream_session_id_from_request
helper) AND a gateway_id AND not tracing_active.
- Drop the `settings.mcp_session_pool_enabled` check at the call-sites.
The registry is always on; its applicability is determined by whether a
downstream session id is in scope.
- Keep the per-call fallback path for callers without a downstream
session id: admin UI test-invoke, internal /rpc, and anything that
drives the tool_service outside the streamable-http transport.
- Preserve the tracing trade-off: when tracing_active, skip the registry
to allow per-request traceparent/X-Correlation-ID injection.
- Remove the now-unused `get_mcp_session_pool, TransportType` import from
mcp_session_pool; TransportType is now imported from the registry module.
- _downstream_session_id_from_request uses a lazy import of
streamablehttp_transport.request_headers_var to avoid a circular
dependency at module load time.
Tests:
- NEW test_invoke_tool_mcp_two_downstream_sessions_hit_registry_with_distinct_ids:
the direct-consequence #4205 test — two invocations with different
downstream session ids must produce two acquire() calls with distinct
(session_id, gateway_id) keys, same gateway, different sessions.
- Rewrote test_invoke_tool_mcp_pooled_path_does_not_inject_trace_headers
as the equivalent registry-path test (same invariant: reused upstream
transports must not receive per-request trace headers).
- Rewrote 4 pool-hit tests in test_tool_service_coverage.py to use the
registry API (and set request_headers_var with a session id).
- 870 related tests pass; no regressions.
This migration leaves the old pool in place — it simply isn't called from
tool_service anymore. Prompt/resource/gateway call-sites still point at the
pool and will migrate in the next commits.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* feat(prompts, resources): route MCP calls through UpstreamSessionRegistry (#4205)
Mirrors the tool_service migration: prompt_service and resource_service acquire
their upstream MCP ClientSession from the registry when a downstream Mcp-Session-Id
is in scope, falling back to per-call sessions otherwise. Same 1:1 isolation,
same connection reuse within a session, same trace-header trade-off.
Changes:
- prompt_service._fetch_prompt_from_gateway: replace pool path with registry
path; drop the `settings.mcp_session_pool_enabled` gate; drop the now-unused
`pool_user_identity` local.
- resource_service.invoke_resource SSE + StreamableHTTP helpers: same
rewrite. Also delete the `pool_user_identity` normalization block at
line ~1708 (no longer referenced).
- upstream_session_registry: add `downstream_session_id_from_request_context()`
so the three services share one implementation. tool_service now aliases
the shared helper rather than carrying its own copy.
- TransportType is imported from upstream_session_registry in all three
services; the pool's copy becomes unused and disappears in the
hollow-and-rename step.
Tests:
- Rewrote 4 pool-hit tests in test_resource_service.py as registry-path
tests (set request_headers_var with a downstream session id, patch
get_upstream_session_registry). Renamed for accuracy:
test_sse_session_pool_used_and_signature_validated →
test_sse_registry_used_and_signature_validated
test_invoke_resource_streamablehttp_uses_session_pool_when_available →
test_invoke_resource_streamablehttp_uses_registry_when_available
The two "pool not initialized falls back" tests now simulate an
uninitialized registry via RuntimeError from get_upstream_session_registry.
- 1173 service-layer tests pass; no regressions.
Two holdouts still touch the pool: gateway_service's health check (task #19)
and the pool itself (task #23 deletes its upstream-session code and renames
the file to session_affinity.py).
Signed-off-by: Jonathan Springer <jps@s390x.com>
* refactor(gateway_service): drop pool from health checks (#4205)
Gateway health checks are system operations — no downstream MCP session id
is in scope — so they can't key the UpstreamSessionRegistry. They also
don't benefit meaningfully from connection reuse: each probe is a one-shot
initialize round-trip to detect reachability. Use a fresh per-call session
unconditionally.
Changes:
- _check_single_gateway_health (streamablehttp branch): replaces the
pool-or-fallback block with a single streamablehttp_client +
ClientSession per probe.
- Drop the `mcp_session_pool_explicit_health_rpc` feature flag usage
(the setting itself disappears in the config cleanup commit). The
initialize() round-trip is the probe; no optional list_tools() needed.
- Imports: drop `get_mcp_session_pool, TransportType` from the pool
import; keep `register_gateway_capabilities_for_notifications`
(still used by three other call-sites in gateway_service). Also drop
the now-unused `anyio` import (was used only for the pool branch's
fail_after on list_tools).
Tests:
- Rewrote test_streamablehttp_pool_not_initialized_falls_back_to_per_call_session
as test_streamablehttp_health_uses_per_call_session (since per-call is
now unconditional, that's the whole behavior the test pins).
- Deleted test_streamablehttp_pool_used_and_explicit_health_rpc_calls_list_tools
(exercised a code path that no longer exists).
- Deleted tests/unit/mcpgateway/services/test_gateway_explicit_health_rpc.py
entirely — it only tested the MCP_SESSION_POOL_EXPLICIT_HEALTH_RPC
feature flag's on/off branches.
- 168 gateway_service tests pass; no regressions.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* refactor(services): rename mcp_session_pool.py → session_affinity.py (#4205)
Pure file rename + import updates. No behavioural changes. The
MCPSessionPool class, its methods, and the module-level init/get/close
helpers keep their names for this commit — those renames happen in the
follow-up once the pool machinery inside the file is hollowed out.
Rationale: the file's true cargo is cluster affinity (Redis-backed
session→worker mapping, heartbeat, SET NX ownership claim, Lua CAS
reclaim, session-owner forwarding, RPC listener). The MCP upstream-session
pooling part — which the UpstreamSessionRegistry has replaced — is now
dead weight, and naming the file "session pool" no longer reflects what
this module actually does for the codebase.
Changes:
- git mv mcpgateway/services/mcp_session_pool.py → session_affinity.py
(history-preserving; git blame / log --follow still work).
- Bulk-update every `mcpgateway.services.mcp_session_pool` import across
production (7 files) and tests (12 files) to
`mcpgateway.services.session_affinity`. Mechanical sed, reviewable
as one pass.
- No class/method/function renames in this commit — that's the next
one, after the hollow.
Tests: 1481 pass, 2 skipped (pre-existing). No regressions.
Next: hollow out the pool-only code (PooledSession, acquire/release/
session() context manager, pool queue, health chain, identity hashing,
max-per-key semaphore, circuit breaker) from session_affinity.py. The
affinity surface stays. After that, a final commit renames MCPSessionPool
→ SessionAffinity and updates method names to drop the pool/streamablehttp
prefixes where they're now misleading.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* refactor(services): hollow out pool machinery from session_affinity.py (#4205)
Deletes the upstream-MCP session-pooling code that no service-layer
caller uses anymore — the UpstreamSessionRegistry replaced it four
commits back. What remains in session_affinity.py is the multi-worker
cluster-affinity surface that streamablehttp_transport.py still relies
on: Redis-backed downstream-session→worker mapping, worker heartbeat,
atomic SET NX ownership claim, Lua CAS reclaim from dead workers,
cross-worker session-owner HTTP/RPC forwarding, the pub/sub RPC
listener, and the is_valid_mcp_session_id validator.
Net: session_affinity.py goes from 2417 → 1030 lines. Class name,
method names, and module-level accessor names are unchanged in this
commit so the diff is purely "delete dead code"; the rename is the
next commit.
Deleted from the file:
- class TransportType (moved to upstream_session_registry earlier)
- class PooledSession
- _get_cleanup_timeout helper (read an obsolete setting)
- PoolKey / HttpxClientFactory / IdentityExtractor type aliases
- DEFAULT_IDENTITY_HEADERS frozenset
- MCPSessionPool.__init__'s 16 pool-specific parameters and their
corresponding self._* fields (pools, active sets, locks, semaphores,
circuit breaker state, pool_last_used, eviction throttling, pool-hit
metrics, identity_headers / identity_extractor, health-check
config, max_total_keys / max_total_sessions)
- __aenter__ / __aexit__ (pool session context manager)
- _compute_identity_hash, _make_pool_key
- _get_or_create_lock, _get_or_create_pool
- _is_circuit_open / _record_failure / _record_success (circuit breaker)
- acquire(), release()
- _maybe_evict_idle_pool_keys
- _validate_session, _run_health_check_chain (pool health chain)
- _session_owner_coro, _create_session, _close_session
- get_metrics() (pool-level stats)
- session() context manager
Rewritten:
- close_all() — now stops the heartbeat and RPC-listener tasks and
clears the in-memory session mapping. No pool state to drain.
- drain_all() — clears the in-memory session mapping only. Kept for
the SIGHUP handler contract; upstream-session lifetime is owned by
the registry now.
- init_mcp_session_pool() — signature collapses from 18 parameters
to 3 (message_handler_factory, enable_notifications,
notification_debounce_seconds). All pool tunables are gone.
Callers updated:
- main.py startup: single init_mcp_session_pool() with no args,
gated only on settings.mcpgateway_session_affinity_enabled (the
`or settings.mcp_session_pool_enabled` fork is redundant now).
- main.py post-startup: notification service start moves under the
affinity guard alongside heartbeat and RPC-listener startup
(they share lifecycle).
- main.py shutdown: same simplification.
- _create_jwt_identity_extractor in main.py is now unused; it'll
disappear in the config-cleanup commit (#20) along with the
mcp_session_pool_jwt_identity_extraction setting.
Tests: 2064 pass across service + transport + cache suites. 2 pre-existing
skips. No regressions.
Next: the mechanical class/method rename (MCPSessionPool → SessionAffinity,
plus the corresponding method rename-downs of the `pool_` / `streamable_http_`
prefixes that no longer reflect what this module does).
Signed-off-by: Jonathan Springer <jps@s390x.com>
* refactor(services): rename MCPSessionPool → SessionAffinity (#4205)
Mechanical rename across 22 files. No behavioural changes. Follows the
hollow commit that removed the pool machinery; what survives is
affinity-only, so the class name and public API now match the module name.
Renames applied verbatim (order matters — longer substrings first):
Class:
MCPSessionPool → SessionAffinity
Module-level accessors:
get_mcp_session_pool → get_session_affinity
init_mcp_session_pool → init_session_affinity
close_mcp_session_pool → close_session_affinity
drain_mcp_session_pool → drain_session_affinity
start_pool_notification_service → start_affinity_notification_service
_mcp_session_pool (module global) → _session_affinity
Methods (public surface that callers use):
register_pool_session_owner → register_session_owner
cleanup_streamable_http_session_owner → cleanup_session_owner
get_streamable_http_session_owner → get_session_owner
forward_streamable_http_to_owner → forward_to_owner
Internal helpers:
_get_pool_session_owner → _get_session_owner
_cleanup_pool_session_owner → _cleanup_session_owner
_pool_owner_key → _session_owner_key
Kept as-is (not misleading after the hollow):
- is_valid_mcp_session_id (validates the downstream Mcp-Session-Id)
- forward_request_to_owner (already transport-agnostic)
- register_gateway_capabilities_for_notifications
- unregister_gateway_from_notifications
- _session_mapping_redis_key, _worker_heartbeat_key
- Redis key string literals (mcpgw:pool_owner:*, mcpgw:worker_heartbeat:*,
mcpgw:session_mapping:*) — left untouched so a worker rolling between
this branch and main's tip stays interoperable with in-flight Redis
state. The Python identifiers move; the wire format does not.
Files touched:
mcpgateway/services/session_affinity.py, upstream_session_registry.py,
notification_service.py, server_classification_service.py,
mcpgateway/main.py, admin.py, cache/session_registry.py,
handlers/signal_handlers.py, transports/streamablehttp_transport.py,
plus 13 corresponding test files.
Tests: 2064 pass, 2 pre-existing skips (unchanged). No regressions.
Closes task #23 (hollow-and-rename). Next up: task #20 (delete the 18
obsolete mcp_session_pool_* settings from config.py now that nothing
reads them), task #21 (delete orphan pool tests), task #22 (the #4205
counter-server e2e reproducer that lets the issue close).
Signed-off-by: Jonathan Springer <jps@s390x.com>
* fix(gateway): evict upstream sessions on gateway delete and connect-field update (#4205)
Codex stop-time review caught a gap in the upstream-session isolation
work: after an admin deletes a gateway, or updates its URL / auth
fields, the UpstreamSessionRegistry still holds ClientSessions pinned
to the stale gateway_id. Subsequent acquire() calls keep returning
them, so in-flight downstream sessions keep talking to the old URL
with the old credentials until the downstream session itself ends.
Fix:
- New module-level helper _evict_upstream_sessions_for_gateway(id)
in gateway_service.py. Forwards to registry.evict_gateway(id).
Best-effort: tolerates an uninitialized registry (tests, early
startup) and swallows unexpected registry errors so gateway
mutations never block on upstream teardown failures.
- GatewayService.delete_gateway — calls the helper right after
db.commit(), before cache invalidation / notification. Captures
every upstream session bound to the now-gone gateway.
- GatewayService.update_gateway — captures original_auth_value,
original_auth_query_params, original_oauth_config alongside the
existing original_url / original_auth_type. After db.commit(),
if ANY connect-affecting field changed, calls the helper. Non-
connect changes (name, description, tags, passthrough_headers,
visibility, etc.) deliberately leave upstream sessions alone so
the 1:1 downstream-session connection-reuse benefit survives
cosmetic edits.
Tests (3 new in test_upstream_session_registry_lifecycle.py):
- helper forwards gateway_id to registry.evict_gateway and returns
the eviction count.
- helper returns 0 and does not raise when the registry singleton
is not initialized.
- helper swallows unexpected registry exceptions (e.g. Redis down)
so gateway mutation paths stay robust.
1619 service-layer tests pass. 2 pre-existing skips unchanged.
Signed-off-by: Jonathan Springer <jps@s390x.com>
* fix(gateway): evict upstream sessions on TLS / mTLS field update (#4205)
Codex stop-time review flagged that the previous eviction-on-update
commit handled url + auth fields but missed the TLS/mTLS material that
equally changes the upstream connection envelope. Admin rotating a CA
bundle, updating its signature, switching the signing algorithm, or
pushing new mTLS client cert/key would leave upstream sessions pinned
to the pre-rotation TLS context — the next acquire would hand the
stale ClientSession back and keep using the old cert material until
the downstream session died.
Change:
- update_gateway now snapshots original_{ca_certificate,
ca_certificate_sig, signing_algorithm, client_cert, client_key}
alongside the existing URL/auth originals, and adds them to the
"did any connect-affecting field change?" disjunction that decides
whether to call _evict_upstream_sessions_for_gateway(gateway.id).
- Non-connect fields (name, description, tags, passthrough_headers,
visibility) still skip eviction, so cosmetic edits keep the 1:1
connection-reuse benefit.
Plus one contract test:
- test_connect_field_inventory_matches_gateway_model ties
_CONNECT_FIELD_NAMES to both (a) the source of update_gateway (via
a grep for "original_<field>" and the bare field name) and (b) the
Gateway ORM columns. Adding a new TLS / auth / URL column to the
Gateway model without wiring it through the eviction check — or
renaming one of the existing originals — now fails this test with
a specific message rather than silently regressing #4205's intent.
317 service-layer tests pass (+1 over the previous commit's 316).
Signed-off-by: Jonathan Springer <jps@s390x.com>
* fix(gateway): evict upstream sessions on transport update (#4205)
Codex stop-time review flagged one more connect-affecting field the
previous two eviction commits missed: gateway.transport. Switching a
gateway between SSE and STREAMABLE_HTTP pins a different upstream MCP
client class in the registry — tool_service/prompt_service/resource_service
map gateway.transport into the TransportType passed to registry.acquire,
so stale sessions returned for the wrong transport would continue to
speak the old protocol against a server now expecting the new one.
Changes:
- Capture original_transport alongside the other connect originals.
- Add it to the change-detection check.
- The 11-expression disjunction tripped ruff's PLR0916 (too-many-bool);
refactored the comparison into a tuple of (current, original) pairs
evaluated via any(), which is also easier to extend next time a new
connect-affecting column is added to Gateway.
- Contract test's _CONNECT_FIELD_NAMES grows to 11, now including
"transport". The grep-and-ORM-…
jonpspri
added a commit
that referenced
this pull request
May 8, 2026
* feat: Add grpc_service_id to Tool schema IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> AI-usage: full * feat: Add tool registration and deletion for registering/deleting gRPC services IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> AI-usage: full * feat: Add load_file_descriptors for grpc serialized proto descriptor bytes IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> AI-usage: full * feat: Add gRPC tool invocation support IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> AI-usage: full * test: Unit tests for gRPC tool registration IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> AI-usage: full * style: make lint IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: Fix missing grpc_service_id in tool unit test IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> AI-usage: full * test: Flesh out test coverage for new gRPC code paths IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> AI-usage: full * fix: Better error handling for protobuf descriptor pool adding IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Inherit visibility from parent gRPC service when mapping to tools IBM#2854 Branch: GrpcMethodsAsTools-2854 AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix alembic migration after rebase IBM#2854 Branch: GrpcMethodsAsTools-2854 AI-usage: full Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Repoint alembic migration parent to current head after rebase Updates down_revision from x7h8i9j0k1l2 to aa1b2c3d4e5f, the current head on origin/main. Several new migrations have landed since this PR was last rebased, so the previous parent is no longer the head. Verified single head: alembic heads -> w7x8y9z0a1b2 (head) IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * test: Add grpc_service_id=None to TestRustMcpExecutionPlan SimpleNamespace mocks Two tests added to TestRustMcpExecutionPlan after this PR was last rebased construct tool mocks via SimpleNamespace. The PR's change to _build_tool_cache_payload reads tool.grpc_service_id, so those mocks now AttributeError. Mirrors b576ec4's earlier fix for the same pattern in MagicMock(spec=DbTool) helpers. Affected: - TestRustMcpExecutionPlan.test_prepare_rust_mcp_tool_execution_uses_live_gateway_auth_fields_for_loaded_tools - TestRustMcpExecutionPlan.test_prepare_rust_mcp_tool_execution_uses_live_gateway_string_auth_values IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * style: make black isort Tightens whitespace and import grouping per project lint config: - mcpgateway/services/grpc_service.py: black removes a stray blank line - tests/unit/mcpgateway/{services/test_grpc_service,services/test_tool_service,test_translate_grpc}.py: isort regroups stdlib/third-party/first-party/local imports IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(grpc): use protobuf 6.x kwarg in MessageToDict and isolate descriptor pool Two issues addressed in mcpgateway/translate_grpc.py: 1. Pre-existing runtime bug: invoke()/invoke_streaming() called json_format.MessageToDict(..., including_default_value_fields=True), the parameter name from protobuf 4.x. The repo pins protobuf>=6.33.6 (see pyproject.toml), where the parameter was renamed to always_print_fields_with_no_presence. Calls raised TypeError at runtime, breaking gRPC tool invocation. Renamed the kwarg and replaced the # pylint: disable=unexpected-keyword-arg suppression with an anti-regression comment. 2. Descriptor pool poisoning: GrpcEndpoint shared descriptor_pool.Default(), the process-wide singleton. Reflected FileDescriptorProto from one (untrusted) upstream service could leak across requests and cause symbol collisions or type confusion in subsequent calls. Switched to a per-endpoint descriptor_pool.DescriptorPool(); MessageFactory now binds to that private pool. Updated tests/unit/mcpgateway/test_translate_grpc.py monkeypatches to expose DescriptorPool() instead of Default() and MessageFactory(pool=...) instead of MessageFactory(). Closes review: B2, B11 IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(grpc): security and robustness hardening from review Addresses BLOCKING findings from a multi-agent code review of PR IBM#3202: Security: * Reject local-only gRPC schemes (unix:, unix-abstract:, vsock:, fd:) in _validate_grpc_target since they bypass the IP-based SSRF model. Also normalize dns:/// and ipv4:/ipv6: name-resolver prefixes and parse bracketed IPv6 hosts correctly so the host-extraction step cannot be tricked. (review B1) * Add _enforce_descriptor_limits to bound reflected FileDescriptorProto count, per-blob size, and aggregate size before storing or loading. Hardcoded constants (not settings) so a config change cannot silently weaken the DoS defense. (review B3) * Add _validate_reflected_tool_name that runs reflected tool names through SecurityValidator.validate_tool_name() before persisting. Reflected names from upstream gRPC servers no longer bypass the injection / length / character checks applied to user-registered tools. (review B4) * Strip metadata pseudo-keys (anything starting with '_') when copying discovered_services into endpoint._services so the _file_descriptors entry can never be confused with a service. * Decode stored descriptors with base64.b64decode(..., validate=True) to fail fast on tampered DB content. Robustness / Layer 1 invariants: * _sync_tools_from_reflection now runs each method through a per-tool try/except so a single bad method cannot abort the entire reflection sync. Counts created / updated / failed. (review B9) * _sync_tools_from_reflection now propagates parent service visibility, team_id, and owner_email to existing reflected tools, not just to newly created ones. Closes a Layer 1 token-scoping gap where a public->team visibility change on the parent service was silently ignored on already-discovered tools. (review B5) Operational: * invoke_method gains a timeout parameter (default settings.tool_timeout) and wraps endpoint.start() / endpoint.invoke() in asyncio.wait_for so a slow upstream cannot tie up a worker indefinitely. (review B6) * tool_service.invoke_tool gRPC branch wraps the call in asyncio.wait_for, surfaces ToolTimeoutError + post-invoke timeout hook (matches A2A branch), preserves CancelledError instead of swallowing it, and logs with %-style + exc_info=True. (review B6/B7/B8) * invoke_method preserves CancelledError, logs with %-style + exc_info=True, re-raises GrpcServiceError/GrpcServiceNotFoundError unwrapped instead of double-wrapping them. (review B8) IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * refactor(grpc): drop redundant inline settings import The inline 'from mcpgateway.config import settings' in _validate_grpc_target was made redundant by hoisting the same import to module top in the security hardening commit. Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * test(grpc): cover security-hardening helpers and Layer 1 invariants Adds two test classes exercising the helpers introduced by the security/robustness hardening commit: TestSecurityHardening: * _validate_grpc_target rejects unix:/unix-abstract:/vsock:/fd: schemes * _validate_grpc_target normalizes dns:///, dns://, dns:, ipv4:, ipv6: name-resolver prefixes * _validate_grpc_target accepts bracketed IPv6 literals and rejects malformed bracket syntax * _enforce_descriptor_limits enforces count, per-blob, and aggregate size caps; happy path passes within bounds * _validate_reflected_tool_name rejects empty / over-length / injection patterns and accepts proto-style identifiers TestVisibilityPropagation: * _sync_tools_from_reflection propagates parent service visibility, team_id, and owner_email to existing reflected tools (Layer 1 invariant). * _sync_tools_from_reflection isolates per-method failures: a method with an invalid name is skipped while the rest of the sync proceeds (B9 regression test). All 13 new tests pass; the broader gRPC + tool_service + translate_grpc suite is green at 520 tests. IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(grpc): close remaining review gaps from iteration-2 oracle pass Addresses BLOCKING items the second-pass review identified as still present after the first round of fixes. * B1 (DNS resolution): _validate_grpc_target now delegates the hostname/IP-network/DNS policy to SecurityValidator._validate_ssrf so hostnames like metadata.google.internal are resolved and the resolved IPs go through the same blocklist as HTTP targets. Removes the duplicate hand-rolled IP-network logic that previously lived inline. * B5 (visibility on update): GrpcService.update_service now snapshots visibility / team_id / owner_email before applying the update and bulk-updates every child Tool with grpc_service_id == service.id in the same transaction whenever any of those fields change. Closes the Layer 1 token-scoping gap where a public->team change on the parent service was silently ignored on already-discovered tools. * B6 (real gRPC deadlines): GrpcEndpoint.start, _discover_services, and _discover_service_details now accept a timeout kwarg that is threaded into ServerReflectionInfo, and GrpcEndpoint.invoke binds the deadline onto the underlying unary_unary call so a slow upstream cannot keep an executor thread alive after asyncio.wait_for cancels the wrapping coroutine. invoke_method passes the effective tool timeout into both endpoint.start and endpoint.invoke. * B7 (cancellation propagation in outer except BaseException): adds except asyncio.CancelledError: raise immediately before the three outer except BaseException as e blocks in tool_service.invoke_tool so a cancellation is never wrapped as a ToolInvocationError. * B12 (protobuf 6.x compatibility): MessageFactory.GetPrototype was removed in protobuf >= 5.x. Both the unary and streaming invocation paths in translate_grpc now call message_factory.GetMessageClass(...) instead. Updated the corresponding unit-test mocks. * Cleanup: drops unused import ipaddress that became dead after the SecurityValidator delegation. Test mocks updated to match the new signatures (timeout kwarg on the populate_service / _populate callbacks; GetMessageClass at the module level instead of MessageFactory.GetPrototype). Lint clean (black, isort, ruff E3/E4/E7/E9/F/D1). 520 targeted tests pass. IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(grpc): preserve timeout=0 fail-fast semantics in deadline ternaries The 'if timeout' ternary in three reflection/unary call sites silently drops timeout=0, but gRPC accepts timeout=0 as the explicit 'fail immediately' deadline. Switched to 'if timeout is not None' so a caller-supplied 0.0 reaches the underlying gRPC call with the right semantics. Found in iteration-3 oracle review (final pass, no new BLOCKING items). Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(grpc): apply non-blocking review nits flagged across the three review passes Bundle of low-risk, non-blocking improvements that surfaced during the deep review iterations. Each item below was explicitly classified as fix-now in the review summary; broader / design-dependent items were filed as follow-ups under IBM#4612 and IBM#4613. * Schema parity (#1): add grpc_service_id to ToolRead in schemas.py alongside the existing gateway_id field so API consumers can identify gRPC-discovered tools (review S2.4 / TD1). * db.py cleanup (IBM#2): drop redundant nullable=True from Tool.grpc_service_id; Mapped[Optional[str]] already implies it and the sibling gateway_id column omits it (review TD2). * translate_grpc.load_file_descriptors (IBM#3): widen the parameter type from List[bytes] to Sequence[bytes] and reject a single bytes object passed by mistake. Without the guard Python would silently iterate byte-by-byte (review TD3). * translate_grpc descriptor pool conflict (IBM#4): replace the inaccurate 'no-op if already added' comment with an explicit TypeError handler. protobuf raises TypeError when a file with the same name has conflicting content; the existing descriptor stays authoritative (review S3.3). * test_sync_tools_removes_stale_tools (IBM#5): replace the brittle string-match assertion ('DELETE' in str(call)) with an explicit call_count == 4 (1 select + 3 deletes), which no longer depends on the SQLAlchemy Delete object repr (review S2.6). * translate_grpc close() (IBM#6): convert the f-string log to lazy %-style for consistency with the rest of this PR's logging (review N2.1). * grpc_service _sync_tools_from_reflection (IBM#7): document why input_schema['properties'] is empty by design — gRPC arg shape is validated at the protobuf invocation layer, not the MCP tool-call layer; the actual proto types live in the x-grpc-* extensions (review N2.2). * TestInvokeMethodGuards (IBM#8): add 5 edge-case tests covering invoke_method paths the previous suite did not exercise: - service-not-found -> GrpcServiceNotFoundError - disabled service -> GrpcServiceError('is disabled') - invalid method format (no dot) -> GrpcServiceError - _validate_grpc_target spy called with service.target - _validate_tls_path spy called for both cert and key paths 525 targeted tests pass (was 520). Lint clean. Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * style: clean lint cycle (make autoflake/isort/black/ruff/bandit/interrogate/pylint) Remaining fixes the project lint cycle surfaced after the fix-now bundle landed: * isort: re-sort sibling imports in test_grpc_service.py (_enforce_descriptor_limits before _GRPC_MAX_*; case-insensitive order). * ruff PLW0108 (unnecessary lambda) in test_translate_grpc.py: align the 4 mocks I added with the 7 pre-existing patterns in the same file (lambda **_kw: MagicMock() / lambda *_a, **_kw: MagicMock()). * ruff RET501 in test_grpc_service.py::tls_spy: drop the redundant trailing 'return None'. * pylint no-member on the alembic migration: add the project-standard '# pylint: disable=no-member' header that 9 other migrations use to silence alembic.op's dynamic-attribute false positives. * pylint try-except-raise on tool_service.invoke_tool gRPC branch: the 'except asyncio.CancelledError: raise' clause LOOKS like a no-op but exists specifically so the LATER 'except Exception' cannot swallow cancellation (PR IBM#3202 review B7). Added an inline disable plus a comment citing B7 so the next lint sweep doesn't delete the anti-regression code. Verified after fix: make autoflake / isort / black -> clean make ruff (E3,E4,E7,E9,F,D1) -> All checks passed make bandit -> 0 issues, 18022 LoC make interrogate -> 100% docstring coverage make pylint -> 10.00/10 on all 6 production files pytest (targeted gRPC suite) -> 525/525 passing Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: Re-parent alembic migration to current head after second rebase Updates down_revision from aa1b2c3d4e5f to 926d3e07d098. Origin/main gained the CPEX plugin framework replacement (IBM#3754) which introduced two new migrations and a mergepoint, leaving our grpc_service_id migration as a sibling head. Re-pointing to the new head returns the chain to single-headed. Verified single head: alembic heads -> w7x8y9z0a1b2 (head) alembic upgrade head -> clean on a fresh sqlite DB IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(grpc): restore reserved/multicast guard and update test_grpc_service_no_grpc Two related fixes for the no-grpc test file that surfaced after rebase: 1. Restore is_reserved / is_multicast pre-check in _validate_grpc_target. The earlier delegation to SecurityValidator._validate_ssrf inadvertently dropped this guard because the shared validator only checks blocked-networks / localhost / private. IP literals like 224.0.0.1 (multicast) or 0.0.0.0 (reserved) would have slipped past unless they were also in ssrf_blocked_networks. The restored check excludes loopback so ::1 (which Python flags as both is_loopback AND is_reserved) still passes through the localhost gate as intended. 2. Update test_grpc_service_no_grpc.py: - Add timeout=None kwarg to the three FakeEndpoint.start() and invoke() methods so they accept the per-RPC deadline added in the security/robustness hardening commit. - Update test_validate_grpc_target_enforces_ssrf_rules regex patterns to match the SecurityValidator messages reached via delegation (blocked hostname X, localhost address which is blocked, private network address which is blocked). All 4 gRPC test files pass: 528 / 528. IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * test: increase diff coverage for gRPC PR from 92% to 100% on grpc_service+translate_grpc Adds 11 targeted tests + a decision-record comment to address the diff-coverage report that flagged 18 previously-uncovered lines: grpc_service.py: 95.7% -> 100% (7/7 covered) translate_grpc.py: 91.7% -> 100% (3/3 covered) tool_service.py: 69.2% -> 62% (5/8 covered + 3 documented) New tests in tests/unit/mcpgateway/services/test_grpc_service.py: * test_validate_grpc_target_empty_string -> line 134 * test_invoke_method_propagates_cancelled_error -> line 1021 * test_invoke_method_re_raises_timeout -> lines 1023-1024 * test_invoke_method_re_raises_grpc_service_error_unwrapped -> line 1026 * test_update_service_propagates_visibility_to_child_tools -> lines 503-504 New tests in tests/unit/mcpgateway/test_translate_grpc.py: * test_load_file_descriptors_rejects_single_bytes -> line 409 * test_load_file_descriptors_skips_pool_conflict -> lines 420, 425 New tests in tests/unit/mcpgateway/services/test_tool_service.py: * test_invoke_grpc_tool_propagates_cancellation -> line 5847 * test_invoke_grpc_tool_timeout_raises_tool_timeout_error -> lines 5849-5850, 5852 Restored a regression: the earlier B1 refactor (delegating to SecurityValidator._validate_ssrf) inadvertently dropped the is_reserved / is_multicast guard, since SecurityValidator only checks blocked-networks / localhost / private. Restored the local guard with an is_loopback exclusion so IPv6 ::1 (which Python flags as both is_loopback AND is_reserved) still passes through the localhost gate. Coverage decision-record left in test_tool_service.py for the 3 remaining B7 anti-regression lines (5446 REST, 5632 MCP, 5851 gRPC timeout-with-pm): they are byte-identical to the gRPC variant at line 5847 (which is fully exercised). Equivalent REST/MCP tests require coaxing CancelledError through asyncio.wait_for, which Python's event loop converts to TimeoutError in some states. Protecting one branch is sufficient to detect a regression that would affect all three. 541 tests passing, ruff/black/isort/pylint clean. IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: re-parent alembic migration onto current head + restore interrogate 100% Two CI failures triggered by main moving forward during this PR's review: * alembic multi-head: PR IBM#4501 ("add ondelete=CASCADE to FK constraints") introduced 9fb98535724d as a sibling of our w7x8y9z0a1b2, both parented to 926d3e07d098. Re-parents w7x8y9z0a1b2 -> 9fb98535724d so the chain has a single head again and bootstrap_db / SQLite+Postgres upgrade validation pass. Verified single head with alembic heads. * interrogate --fail-under=100: three nested helper functions had no docstrings, dropping the package to 99.9%. One is mine (GrpcEndpoint.invoke._call introduced by the gRPC timeout fix in PR IBM#3202 review B6); the other two (_user_obj_to_dict._iso, _user_dict_to_obj._dt in services/email_auth_service.py) landed on main via PR IBM#4595 and were already failing main's CI. All three added one-line docstrings. Cascade fixed by these two changes: - SQLite + PostgreSQL Fresh/Upgrade (alembic head check) - pytest (py3.12) (bootstrap_db on alembic head check) - playwright-ci-smoke (depends on bootstrap) - sql-sanitizer E2E (depends on bootstrap) - interrogate (mcpgateway) (docstring coverage) - Run pre-commit hooks (interrogate is a pre-commit hook) The 2 email_auth_service docstrings are technically out of scope for this PR but the local lint cycle scoping I had been using (make interrogate TARGET=<my-files>) didn't catch them. Folding them in here unblocks PR-level CI rather than waiting for a separate fix to main. Verified locally: alembic heads -> single head w7x8y9z0a1b2 interrogate -> 100% on full mcpgateway/ black --check -> clean ruff check -> clean pylint -> 10.00/10 targeted gRPC tests -> 541 passing IBM#2854 Branch: GrpcMethodsAsTools-2854 Signed-off-by: Jonathan Springer <jps@s390x.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Signed-off-by: Jonathan Springer <jps@s390x.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Jonathan Springer <jps@s390x.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Thanks for asking me to work on this. I will get started on it and keep this PR's description up to date as I form a plan and make progress.
💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.