Skip to content

[WIP] Update AI model details in documentation#1

Closed
Copilot wants to merge 2076 commits intomainfrom
copilot/update-ai-model-information
Closed

[WIP] Update AI model details in documentation#1
Copilot wants to merge 2076 commits intomainfrom
copilot/update-ai-model-information

Conversation

Copy link
Copy Markdown

Copilot AI commented Feb 12, 2026

Thanks for asking me to work on this. I will get started on it and keep this PR's description up to date as I form a plan and make progress.


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

crivetimihai and others added 30 commits January 7, 2026 05:16
…1939)

MCP list endpoints (tools/list, resources/list, prompts/list) were
returning only ~50 items due to pagination default limit being applied.
MCP clients expect all items returned without pagination.

Fixed by passing limit=0 to service methods, which returns all items:
- streamablehttp_transport.py: tools/list, prompts/list, resources/list
- main.py handle_rpc(): tools/list, list_tools, resources/list, prompts/list

Also fixed incorrect positional argument passing in streamablehttp
transport where request_headers was passed to gateway_id parameter.

Closes IBM#1937

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* Cache tool lookups by name closes IBM#1940

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Cache tool lookups by name closes IBM#1940

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Cache tool lookups by name closes IBM#1940

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Cache tool lookups by name closes IBM#1940

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…concurrent users IBM#1806 (IBM#1943)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* perf: Add TEMPLATES_AUTO_RELOAD setting and nginx admin caching IBM#1944

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* perf: Add TEMPLATES_AUTO_RELOAD setting and nginx admin caching IBM#1944

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* revert: Remove nginx admin caching (moved to IBM#1946)

The nginx caching changes for admin pages have been moved to a
separate issue (IBM#1946) due to multi-tenant security concerns that
require careful review.

This PR now only contains the TEMPLATES_AUTO_RELOAD changes.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…BM#1947)

Enable short-TTL (5s) caching for admin pages to reduce CPU load from
Jinja2 template rendering under high concurrency.

Security measures for multi-tenant safety:
- Cache key includes all auth methods ($http_authorization,
  $cookie_jwt_token, $cookie_access_token) to isolate users
- Cache-Control: private prevents shared caches (CDNs/proxies)
  from storing user-specific content
- Set-Cookie header ignored for cache decisions (safe due to
  short TTL and read-only GET requests)

Performance impact (4000 concurrent users):
- /admin/ response time: 5414ms → 199ms (96% improvement)
- Throughput: ~2400 → ~4000 RPS (67% improvement)

Closes IBM#1946

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Optimizes AuthCache by checking in-memory L1 cache before Redis L2 cache:
- Implements L1-first lookup pattern in get_auth_context, get_user_role,
  get_user_teams, is_token_revoked, and get_team_membership_valid
- Adds write-through from L2 to L1 to populate local cache on Redis hits
- Removes Redis round-trip latency for hot authentication data
- Adds comprehensive unit tests for L1/L2 behavior and write-through logic
- Updates docstrings to accurately reflect L1-first architecture

Closes: IBM#1881

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…1 queries (IBM#1957)

Add email_team relationship with lazy="joined" to Tool model, replacing
manual team name lookups with a single JOIN query. This eliminates N+1
query patterns when fetching tools with their team names.

Changes:
- Add email_team relationship to Tool class with lazy="joined"
- Add primaryjoin condition to filter only active teams (is_active == True)
- Add team property to Tool class for convenient team name access
- Document the team property in Tool class docstring
- Remove _get_team_name helper method from tool_service.py
- Remove manual team_map batch loading in list_tools
- Remove manual outerjoin/add_columns in list_server_tools and list_tools_for_user
- Update tests to use new query pattern (.scalars().all())
- Fix flake8 issues in test file


Signed-off-by: Jonathan Springer <jonpspri@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…eir testing in Locust (IBM#1954)

* Multi gateway in benchmark
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* register benchmark at start in locust
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* refactor benchmark register
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Add benchmarks servers to locustfile
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Add benchmark to generate_docker_compose
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Use ports from container for benchmark
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Add benchmark, time and test servers in generate_docker_compose
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Update fast-test-server port
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Fix fast-test-server port
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Use port 8080 for nginx in performance tests
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Add benchmark and fast_test_server to docker-compose
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Use staging profile as default for benchmark consistency
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Reduce limits for benchmark_server
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Fix tests
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* fix: correct health check port and style issues

- Fix fast-time-server health check port (8888 → 8002) in generate_docker_compose.py
- Add missing blank line before function definition in locustfile.py (PEP8 E302)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* refactor: simplify benchmark server setup with dedicated profile

- Add dedicated "benchmark" profile with auto-registration at startup
- Add make benchmark-up/down/status/logs targets
- Reduce to 10 servers (ports 9000-9009) with auto-registration
- Remove complex auto-detection and dynamic registration from locustfile
- Benchmark servers registered via register_benchmark service at compose startup
- Remove benchmark from monitoring profile for cleaner separation

Usage: make benchmark-up
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: remove healthcheck from benchmark_server (scratch image has no shell)

- Remove CMD-SHELL healthcheck (scratch image lacks /bin/sh)
- Change dependency to service_started with 5s sleep
- Servers verified healthy via HTTP endpoint

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* refactor: locust no longer registers gateways, add benchmark-clean and env vars

- Remove all gateway registration from locustfile.py (handled by compose services)
- Add benchmark-clean target to remove volumes
- Add BENCHMARK_SERVER_COUNT and BENCHMARK_START_PORT environment variables
- Support up to 100 benchmark servers (9000-9099 port range)
- Increased resource limits for larger deployments

Usage:
  make benchmark-up                           # Default: 10 servers
  BENCHMARK_SERVER_COUNT=50 make benchmark-up # Custom: 50 servers
  make benchmark-clean                        # Stop and remove volumes
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add Benchmark Server Stack documentation

Add comprehensive documentation for the benchmark server stack including:
- Quick start guide
- Make targets reference
- Environment variable configuration
- Architecture overview
- Resource limits
- Standalone usage instructions
- Performance characteristics

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* Add more partial endpoints
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Add ids and search APIs for gw, server, a2a
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Clean up APIs
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Add templates for gw, server and agent
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Update gw, agent to use htmx
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Fix bugs in gw loading
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Fix bugs in partial templates
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Fix agent fields in partial
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Fix flake8 issues
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Fix eslint issues
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Fix counts in server list
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Fix update and deletion of a2a tools
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* Fix test
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* pylint fixes
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* refactor: remove commented code and debug statements

- Remove commented-out admin_list_gateway_ids endpoint (replaced by new implementation)
- Remove console.log debug statements from admin.js tab switching logic
- Add standard encoding header to migration file

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: correct route paths, element IDs, and restore null gateway option

- Fix A2A route mismatch: /admin/agents/partial → /admin/a2a/partial
- Fix A2A element ID mismatch: #a2a-agents-table-body → #agents-table-body
- Fix gateway selector class: .gateway-item → .tool-item for JS compatibility
- Restore REST/A2A null gateway checkbox in gateway selector (page 1 only)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: add eager loading and transaction cleanup for partial endpoints

- Add selectinload for server relationships (tools, resources, prompts,
  a2a_agents) to prevent N+1 queries in convert_server_to_read
- Add db.commit() before template rendering in servers, gateways, and
  a2a partial endpoints to release transactions and avoid idle-in-transaction
  timeouts under load

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: preserve tool visibility on A2A agent updates and handle sync failures

- Preserve existing tool visibility and team_id when updating tool from
  A2A agent to prevent unintentionally making private/team tools public
  (ToolUpdate defaults to "public" which was overriding existing settings)
- Wrap tool sync in try/except to handle failures gracefully - the agent
  update is the primary operation and should succeed even if tool sync fails

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: handle datetime string formatting and A2A tool auth sync

- Fix gateway.last_seen template to handle both datetime objects and
  ISO strings (after jsonable_encoder serialization)
- Fix A2A tool auth sync to use auth=AuthenticationValues(...) instead
  of auth_type/auth_value which are not fields on ToolUpdate schema

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Rebase

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…ool creation (IBM#1967)

* fix(IBM#1956): propagate team_id, owner_email and visibility to a2a tool creation

When creating tools from A2A agents, the team_id, owner_email, and
visibility fields from the agent were not being propagated to the
tool registration. This caused A2A agent tools to appear without team
association in the admin UI, and made them inaccessible via team-scoped
API queries.

This change ensures that tools created from A2A agents inherit the same
team_id, owner_email, and visibility as the parent agent, consistent
with how gateways propagate these fields to their federated tools.

Closes IBM#1956

Signed-off-by: Jerry Liu <liutao@us.ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: add unit test for A2A tool scoping field propagation

Add test_create_tool_from_a2a_agent_passes_scope_fields to verify that
team_id, owner_email, and visibility are correctly forwarded from the
A2A agent to register_tool when creating tools from agents.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Jerry Liu <liutao@us.ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…indows (IBM#1941)

This change introduces a batched aggregation path for custom log search
windows to avoid per-window recomputation. Instead of scanning log data
separately for each window, the implementation computes metrics for all
requested windows in a single pass over the full time range.

For PostgreSQL, a single SQL query using generate_series and ordered-set
aggregates computes per-window percentiles efficiently. For other
databases, logs are fetched once per component/operation and bucketed
into windows in Python.

Key changes:
- Add aggregate_all_components_batch() method to LogAggregator service
- Modify _aggregate_custom_windows() to collect all windows first and
  delegate to batch method
- Implement proper fallback to per-window aggregation if batch fails
- Include db.rollback() before fallback to handle PostgreSQL failed state
- Filter duration_ms IS NOT NULL in batch SQL JOIN for consistent
  error_count semantics with per-window path
- Filter component/operation_type IS NOT NULL and non-empty in both
  PostgreSQL and Python paths for consistency with per-window behavior
- Add warning log for large ranges in non-PostgreSQL fallback path
- Add warning when sparse window_starts detected (batch generates full range)
- Limit window list to 10000 entries, keeping most recent windows
- Preserve existing aggregation semantics and filters
- Add comprehensive tests for batch aggregation parity

Performance impact:
- Before: N windows × M component/operation pairs = N×M database queries
- After PostgreSQL: 1 SQL query for all windows and pairs
- After SQLite: M queries regardless of window count

Closes IBM#1826

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* Optimize doctest

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Optimize doctest

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* change parallel text executiong

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* Optimize tools

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Optimize MCP UI

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Jonathan Springer <jspringer@us.ibm.com>
…BM#1981)

* Add UI Overview page

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Add UI Overview page

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Add UI Overview page

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
perf(ci): remove redundant steps from full-build-pipeline
…IBM#1968)

* percentiles

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>

* Revert "percentiles"

This reverts commit d3a7933.

* percentiles

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>

* testing

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>

* USE_POSTGRESDB_PERCENTILES

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>

* test

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>

* observability_service

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>

* docstring

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>

* val

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>

* flake8

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>

* flake

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>

* fix: move json_key validation for defense-in-depth

Move the json_key validation outside the PostgreSQL-specific block
to protect both PostgreSQL and SQLite code paths. While current
callers always use safe hardcoded values, this provides defense-in-depth.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: address review findings for percentile performance PR

- Apply limit to Python fallback path to prevent unbounded response sizes
- Exclude tests/performance from default pytest runs (explicit opt-in)
- Skip test_use_postgresdb_percentiles_toggle which cannot properly test
  Python fallback on PostgreSQL due to global backend variable usage

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: address additional review findings

- Make extract_json_field() dialect-aware with optional dialect_name param
- Pass session dialect to extract_json_field in Python fallback path
- Add commit=False param to start_span/end_span to avoid double commits
- Update tool_service.py to use commit=False with fresh_db_session()
- Re-enable test_use_postgresdb_percentiles_toggle now that dialect is passed

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add docstring to nested percentile function

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: use timezone-aware cutoff for PostgreSQL fallback path

- Use cutoff_time (aware) for PostgreSQL to avoid timezone drift
- Use cutoff_time_naive for SQLite which doesn't support timezones
- Soften timing assertions in benchmark tests to warnings (non-flaky)
- Expand percentile function docstring for flake8 compliance

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* Avoid full scan in ResourceCache cleanup loop

Signed-off-by: NAYANAR <nayana.r7813@gmail.com>

* fixing ruff

Signed-off-by: NAYANAR <nayana.r7813@gmail.com>

* Fixed — tb renamed to _tb

Signed-off-by: NAYANAR <nayana.r7813@gmail.com>

* fxing pylint

Signed-off-by: NAYANAR <nayana.r7813@gmail.com>

* fix: apply black formatting

Add blank line before nested function per black standards.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: address thread-safety issues in ResourceCache

- Remove DualLock async context manager to avoid RLock thread mismatch
- Add locking to get() method to prevent race conditions with cleanup
- Add locking to __len__() for consistency
- Replace async cleanup test with sync _cleanup_once() tests
- Add tests for heap-based cleanup edge cases (updated/deleted entries)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add docstring to _run_once inner function

Ensures 100% docstring coverage for resource_cache.py.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add docstring to _align_to_window_local

Ensures 100% docstring coverage.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: bound heap memory growth with compaction

- Add _compact_heap() to rebuild heap with only valid entries
- Trigger compaction when heap size exceeds 2x max_size
- Fix docstring: _lock is threading.Lock, not async lock
- Add tests for heap compaction behavior

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* perf: minimize lock contention during heap compaction

Move O(n) heapify work outside the lock to avoid blocking get/set:
1. Snapshot cache entries under lock (fast)
2. Build heap outside lock (O(n) but doesn't block)
3. Swap back under lock, preserving entries added during compaction

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: NAYANAR <nayana.r7813@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ing (IBM#1972)

* Migrate DCR and OAUTH from aiohttp to httpx

Signed-off-by: Mohan Lakshmaiah <mohalaks@in.ibm.com>

* Fixing Flake8 issues

Signed-off-by: Mohan Lakshmaiah <mohalaks@in.ibm.com>

* Fix httpx migration issues: redirects, timeouts, and test mocking

- Add follow_redirects=True to all httpx.AsyncClient instances to match
  aiohttp's default behavior (dcr_service.py, oauth_manager.py, cli_export_import.py)
- Add explicit 300s timeout to CLI export/import operations to prevent
  premature timeouts for slow operations
- Update integration tests to mock httpx instead of aiohttp

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mohan Lakshmaiah <mohalaks@in.ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mohan Lakshmaiah <mohalaks@in.ibm.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* fix for redis missing module

Signed-off-by: Shriti Priya <shritip@ibm.com>

* Apply redis availability check consistency fix with warning logging

Apply the same try/except wrapper for importlib.util.find_spec in both
event_service.py and version.py to handle ModuleNotFoundError gracefully
when the redis module is not installed.

Additionally, log a warning when the exception is caught to surface the
issue rather than silently falling back. This helps operators identify
when redis is present but broken vs simply not installed.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Shriti Priya <shritip@ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* Add an error message to form field validation

* fix: correct data attribute selectors and add null safety in form validation

- Fix data attribute mismatch: JS selectors now match HTML attributes
  (data-error-message-for instead of data-error-for)
- Fix typo in admin.html: data-error-messsage-for → data-error-message-for
- Add null safety for inputLabel?.innerText in URL validation
- Clear setCustomValidity on valid URL to prevent blocking form submission

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: address validation edge cases and security concerns

- Skip URL validation for empty optional fields (e.g., OAuth URLs)
- Re-add setCustomValidity for name fields to maintain client-side security
- Fix prompt error message selector to use data-error-message-for="name"
- Remove dead prompt-specific validation code (now handled by general validation)
- Update modal reset to clear inline error messages and red border styling

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…sync (IBM#1950)

Add configurable feature to automatically sync gateway tools, resources,
and prompts with the database during health checks. When enabled:

- Fetches latest tools/resources/prompts from MCP servers
- Compares against database state
- Performs add/update/delete operations only when changes detected
- Tracks updates per-type (tools/resources/prompts) for precise cache invalidation
- Only deletes MCP-discovered items (health_check, oauth, federation, update, MCP)
- Preserves user-created entries (api, ui, None/legacy) during cleanup
- Allows empty server responses to clear stale items (except for auth_code OAuth)
- Passes pre-authenticated headers from health check to avoid duplicate OAuth token fetch
- Invalidates relevant caches per-gateway for efficiency
- Uses fresh_db_session() to avoid holding DB connections during HTTP calls

Configuration:
- AUTO_REFRESH_SERVERS=false (disabled by default)
- Reduces load on MCP servers when disabled

Closes IBM#1910, IBM#172

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Clarified note on arm64 support for production and MacOS.

Signed-off-by: Gabriel Costa <gabrielcg@proton.me>
* perf: eager load EmailTeam relationship in Server model to eliminate N+1 queries (IBM#1957)

Add email_team relationship and team property to Server model, replacing
the _get_team_name helper method. Use joinedload for list operations to
fetch team names in a single query instead of per-server lookups.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(IBM#1962): resolve N+1 query fixes breaking tests

- Fix convert_server_to_read to explicitly build server dict from
  attributes instead of using __dict__.copy(), which can return empty
  dict with SQLAlchemy's scalars().all() eager loading pattern

- Add missing joinedload(DbServer.email_team) to get_server and
  update_server methods for consistent team name loading

- Update test mocks to accept new options parameter for db.get() calls

- Update mock_server fixture with explicit None values for optional
  tracking fields to prevent MagicMock auto-creation

Closes IBM#1962

Signed-off-by: Jonathan Springer <jps@s390x.com>

* perf(IBM#1962): add eager loading to toggle_server_status

Add selectinload/joinedload options to db.get() call in
toggle_server_status to prevent N+1 queries when accessing
server relationships (tools, resources, prompts, a2a_agents,
email_team).

This completes the N+1 fix for issue IBM#1962 which listed
toggle_server_status as an affected function with medium impact.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(IBM#1962): fix team property assignment and export N+1 query

- admin.py: Remove manual team assignment loop that would fail since
  Server.team is now a read-only @Property. Add joinedload for
  email_team relationship instead.

- export_service.py: Add selectinload for tools relationship in
  _export_selected_servers to prevent N+1 queries when accessing
  db_server.tools in a loop.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
crivetimihai and others added 14 commits February 8, 2026 19:51
* test: enhance Playwright UI testing

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: improve Playwright recordings

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: harden Playwright UI checks

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: expand Playwright UI coverage

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
* docs: refresh documentation formatting and links

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: remove unused snakefood diagram

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: align api auth and readiness examples

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Docs update - diagram review

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
* test(loadtest): expand Locust API coverage from 45% to 70%

Add 11 new user classes to improve REST API load test coverage:

Batch 1 - Core CRUD:
- TeamsCRUDUser: Teams API operations
- TokenCatalogCRUDUser: JWT token management
- RBACCRUDUser: Role/permission CRUD
- CancellationAPIUser: Request cancellation

Batch 3 - Extended Operations:
- RootsExtendedUser: Root CRUD operations
- TagsExtendedUser: Tag-based entity discovery
- LogSearchExtendedUser: Log search and trace
- MetricsMaintenanceUser: Metrics cleanup/rollup
- AuthExtendedUser: Auth login and user info
- EntityToggleUser: Toggle operations for all entities
- EntityUpdateUser: PUT/Update operations

Also adds `make load-test-cli` target for headless testing with
identical configuration to `make load-test-ui`.

Note: LLM-related classes (LLMConfigCRUDUser, LLMChatUser, LLMProxyUser)
and ProtocolExtendedUser were implemented but removed as they require
external LLM provider configuration to function properly.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(loadtest): resolve test failures in RPC and Roots endpoints

Fix three categories of failures:

1. /rpc tools/call DNS errors (560+ failures):
   - Add VIRTUAL_TOOL_PREFIXES to exclude test-api-tool-* and loadtest-tool-*
   - These virtual tools have no backing MCP server and fail on invocation

2. /roots/changes invalid JSON (157+ failures):
   - Remove this test - endpoint returns SSE stream, not JSON
   - Replace with simple /roots list endpoint

3. /roots/[uri] [delete] 500 errors (97+ failures):
   - Use catch_response to properly handle delete responses
   - Accept 200, 204, 404, 500 as valid responses (server-side issues)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
* Increase playwright coverage

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(tests): improve playwright tool modal tests reliability

- Use admin_page fixture consistently for authenticated access
- Add explicit waits for modal visibility with :not(.hidden) selector
- Skip tests properly when no tools are available instead of silent pass
- Increase timeout to 10s for modal operations

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(tests): improve playwright test reliability and idempotency

- Add _wait_for_codemirror() to wait for CodeMirror editor initialization
  before interacting with promptArgsEditor
- Remove redundant navigation in test_admin_panel_loads since admin_page
  fixture already handles authentication and navigation
- Add cleanup to all entity create tests (prompts, resources, servers,
  tools) to delete created entities after test completion
- Fix _prepare_tools_table() to use state="attached" instead of requiring
  visibility, preventing timeouts on empty tables
- Apply black/isort formatting fixes

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(tests): improve CodeMirror wait reliability in prompts test

- Wait for CodeMirror library to load before checking editor instance
- Increase timeout from 10s to 30s for slower CI environments
- Add null check to editor wait condition

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
Signed-off-by: mintzo20 <adirmintz@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
* test: improve cache coverage

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: improve coverage for cli and runtime paths

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: fix toolops permission stubs

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: expand coverage for tool helpers and admin servers

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: extend coverage for low-coverage services

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: extend coverage for services

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: expand coverage for grpc oauth metrics

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: expand unit coverage for admin and services

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: expand observability and oauth coverage

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Fix flaky test

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* 80% threshold

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Docs update for testing

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: expand coverage for transports, plugins, wrapper

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Fix tests

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Fix tests

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Fix tests

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Test improvements

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Increase coverage

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: expand coverage for observability and services

* test: expand bulk registration coverage

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Increase coverage

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Increase coverage

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
* chore: unignore documentation files in .gitignore

* chore: unignore FEATURES.md documentation files

* docs: update oauth design and remove empty blog index

* docs: cleanup placeholders, update statuses, and fix navigation

* typo

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Documentation review & update

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
…BM#2549)

Replace long-lived database sessions in RBAC middleware with
fresh_db_session() context manager to prevent session accumulation
under high concurrent load.

Changes:
- Remove db parameter from get_current_user_with_permissions()
- Use fresh_db_session() context manager for short-lived DB access
- Keep "db": None in user context for backward compatibility
- Add deprecation warnings to get_db() and get_permission_service()
- Update all permission decorators to use fresh_db_session() fallback
- Update PermissionChecker to use fresh_db_session() pattern
- Simplify db.py by reusing get_db() generator for fresh_db_session

Security fixes:
- Use named kwargs (user, _user, current_user, current_user_ctx) for
  user context extraction instead of scanning all dicts for "email"
  to prevent request body injection attacks

Performance fixes:
- PermissionChecker.has_any_permission now uses single session for
  all permission checks instead of opening N sessions

This prevents idle-in-transaction bottlenecks where sessions were
held for entire request duration instead of milliseconds.

Closes IBM#2340

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
* feat: unified PDP plugin for issue IBM#2223

Adds a single plugin entry point that orchestrates access-control
decisions across multiple policy engines (Native RBAC, MAC, OPA, Cedar).

- plugins/unified_pdp/unified_pdp.py  — Plugin class, hooks into
  tool_pre_invoke and resource_pre_fetch
- plugins/unified_pdp/pdp.py          — PolicyDecisionPoint orchestrator
- plugins/unified_pdp/pdp_models.py   — Pydantic models (Subject, Resource,
  Context, AccessDecision, config types)
- plugins/unified_pdp/adapter.py      — Abstract engine adapter base class
- plugins/unified_pdp/cache.py        — TTL-aware decision cache
- plugins/unified_pdp/engines/        — Four engine adapters: native_engine,
  mac_engine, opa_engine, cedar_engine
- plugins/unified_pdp/default_rules.json — Starter RBAC ruleset
- tests/unit/plugins/test_unified_pdp.py — 46 unit tests
- plugins/config.yaml                 — Plugin registration (mode: disabled)
- MANIFEST.in                         — Added recursive-include plugins *.json

Combination modes: all_must_allow | any_allow | first_match
Native RBAC and MAC work out of the box. OPA and Cedar require their
respective sidecars (see README).

Closes IBM#2223

Signed-off-by: yiannis2804 <yiannis2804@gmail.com>

* test: add plugin class unit tests, coverage 86%

13 tests covering UnifiedPDPPlugin hook methods (tool_pre_invoke,
resource_pre_fetch), subject extraction (dict/string/None user),
action string formatting, resource type mapping, and _build_pdp.

unified_pdp.py now at 100% coverage. Remaining gaps are in OPA and
Cedar engine adapters which require external sidecars to test.

Signed-off-by: yiannis2804 <yiannis2804@gmail.com>

* docs: add detailed README for unified PDP plugin

Signed-off-by: yiannis2804 <yiannis2804@gmail.com>

* fix(unified-pdp): fix bugs and improve tests

- Fix undefined variable eng_type in pdp.py:get_effective_permissions()
- Add shutdown() lifecycle method to UnifiedPDPPlugin to properly close
  HTTP clients for OPA/Cedar engines
- Convert tests from respx to pytest-httpx (project standard)
- Add test for shutdown() method

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* chore(unified-pdp): fix linting issues

- Remove unused import List from mac_engine.py
- Remove unused variable first_deny from pdp.py

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(unified-pdp): address review findings from additional security review

- Cache key now includes user_agent and context.extra to prevent incorrect
  cached decisions when policies depend on these fields (MAC operation
  override, OPA/Cedar context-based rules)
- Plugin now extracts IP and user_agent from HTTP headers and passes to
  PDP context for policy evaluation
- Plugin passes tool args to context.extra and resource metadata to
  resource.annotations for fine-grained policy checks
- Exception handling in _evaluate_parallel/_evaluate_sequential now
  catches all exceptions (not just TimeoutError/PolicyEvaluationError)
  to prevent crashing the whole request on unexpected errors
- Native RBAC docstring corrected: only JSON files are supported (not YAML)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(unified-pdp): extract classification_level for MAC engine

Extract classification_level from tool args and resource metadata
so MAC engine can make proper Bell-LaPadula decisions instead of
always denying due to missing classification.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add docstrings for 100% interrogate coverage

Add missing docstrings to all public functions and methods in the
unified_pdp plugin to satisfy the project's 100% docstring coverage
requirement enforced by interrogate.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add comprehensive Google-style docstrings to unified_pdp

Add complete Args, Returns, Raises, and Attributes documentation to all
public functions and methods in the unified_pdp plugin, matching the
project's docstring style with full parameter descriptions.

Files updated:
- adapter.py: PolicyEvaluationError, PolicyEngineAdapter methods
- cache.py: _build_cache_key, _CacheEntry, DecisionCache methods
- pdp.py: PolicyDecisionPoint and all evaluation/combination methods
- engines/cedar_engine.py: CedarEngineAdapter and all methods
- engines/mac_engine.py: MACEngineAdapter and all methods
- engines/native_engine.py: NativeRBACAdapter and all methods
- engines/opa_engine.py: OPAEngineAdapter and all methods
- unified_pdp.py: shutdown lifecycle method

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add __init__ docstring to PolicyEvaluationError

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
* feat(api): standardize gateway response format

- Set *_unmasked fields to null in GatewayRead.masked()
- Apply masking consistently across all gateway return paths
- Mask credentials on cache reads
- Update admin UI to indicate stored secrets are write-only
- Update tests to verify masking behavior

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* delete artifact sbom

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* feat(gateway): add configurable URL validation for gateway endpoints

Add comprehensive URL validation with configurable network access controls
for gateway and tool URL endpoints. This allows operators to control which
network ranges are accessible based on their deployment environment.

New configuration options:
- SSRF_PROTECTION_ENABLED: Master switch for URL validation (default: true)
- SSRF_ALLOW_LOCALHOST: Allow localhost/loopback (default: true for dev)
- SSRF_ALLOW_PRIVATE_NETWORKS: Allow RFC 1918 ranges (default: true)
- SSRF_DNS_FAIL_CLOSED: Reject unresolvable hostnames (default: false)
- SSRF_BLOCKED_NETWORKS: CIDR ranges to always block
- SSRF_BLOCKED_HOSTS: Hostnames to always block

Features:
- Validates all resolved IP addresses (A and AAAA records)
- Normalizes hostnames (case-insensitive, trailing dot handling)
- Blocks cloud metadata endpoints by default (169.254.169.254, etc.)
- Dev-friendly defaults with strict mode available for production
- Full documentation and Helm chart support

Also includes minor admin UI formatting improvements.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* feat(auth): add token-scoped filtering for list endpoints and gateway forwarding

- Add token_teams parameter to list_servers and list_gateways endpoints
  for proper scoping based on JWT token team claims
- Update server_service.list_servers() and gateway_service.list_gateways()
  to filter results by token scope (public-only, team-scoped, or unrestricted)
- Skip caching for token-scoped queries to prevent cross-user data leakage
- Update gateway forwarding (_forward_request_to_all) to respect token team scope
- Fix public-only token handling in create endpoints (tools, resources, prompts,
  servers, gateways, A2A agents) to reject team/private visibility
- Preserve None vs [] distinction in SSE/WebSocket for proper admin bypass
- Update get_team_from_token to distinguish missing teams (legacy fallback)
  from explicit empty teams (public-only access)
- Add request.state.token_teams storage in all auth paths for downstream access

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* feat(auth): add normalize_token_teams for consistent token scoping

Introduces a centralized `normalize_token_teams()` function in auth.py
that provides consistent token team normalization across all code paths:

- Missing teams key → empty list (public-only access)
- Explicit null teams + admin flag → None (admin bypass)
- Explicit null teams without admin → empty list (public-only)
- Empty teams array → empty list (public-only)
- Team list → normalized string IDs (team-scoped)

Additional changes:
- Update _get_token_teams_from_request() to use normalized teams
- Fix caching in server/gateway services to only cache public-only queries
- Fix server creation visibility parameter precedence
- Update token_scoping middleware to use normalize_token_teams()
- Add comprehensive unit tests for token normalization behavior

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* feat(websocket): forward auth credentials to /rpc endpoint

The WebSocket /ws endpoint now propagates authentication credentials
when making internal requests to /rpc:

- Forward JWT token as Authorization header when present
- Forward proxy user header when trust_proxy_auth is enabled
- Enables WebSocket transport to work with AUTH_REQUIRED=true

Also adds unit tests to verify auth credential forwarding behavior.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* feat(rbac): add granular permission checks to all admin routes

- Add @require_permission decorators to all 177 admin routes with
  allow_admin_bypass=False to enforce explicit permission checks
- Add allow_admin_bypass parameter to require_permission and
  require_any_permission decorators for configurable admin bypass
- Add has_admin_permission() method to PermissionService for checking
  admin-level access (is_admin, *, or admin.* permissions)
- Update AdminAuthMiddleware to use has_admin_permission() for
  coarse-grained admin UI access control
- Create shared test fixtures in tests/unit/mcpgateway/conftest.py
  for mocking PermissionService across unit tests
- Update test files to use proper user context dict format

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs(rbac): comprehensive update to authentication and RBAC documentation

Update documentation to accurately reflect the two-layer security model
(Token Scoping + RBAC) and correct token scoping behavior.

rbac.md:
- Rewrite overview with two-layer security model explanation
- Fix token scoping matrix (missing teams key = PUBLIC-ONLY, not UNRESTRICTED)
- Add admin bypass requirements warning (requires BOTH teams:null AND is_admin:true)
- Add public-only token limitations (cannot access private resources even if owned)
- Add Permission System section with categories and fallback permissions
- Add Configuration Safety section (AUTH_REQUIRED, TRUST_PROXY_AUTH warnings)
- Update enforcement points matrix with Token Scoping and RBAC columns

multitenancy.md:
- Add Token Scoping Model section with secure-first defaults
- Add Two-Layer Security Model section with request flow diagram
- Add Enforcement Points Matrix
- Add Token Scoping Invariants
- Document multi-team token behavior (first team used for request.state.team_id)

oauth-design.md & oauth-authorization-code-ui-design.md:
- Add scope clarification notes (gateway OAuth delegation vs user auth)
- Add Token Verification section
- Add cross-references to RBAC and multitenancy docs

AGENTS.md:
- Add Authentication & RBAC Overview section with quick reference

llms/mcpgateway.md & llms/api.md:
- Add token scoping quick reference and examples
- Add links to full documentation

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(rbac): add explicit db dependency to RBAC-protected routes

Address load test findings from RCA #1 and IBM#2:

- Add `db: Session = Depends(get_db)` to routes in email_auth.py,
  llm_config_router.py, and teams.py that use @require_permission
- Fix test files to pass mock_db parameter after signature changes
- Add shm_size: 256m to PostgreSQL in docker-compose.yml
- Remove non-serializable content from resource update events
- Disable CircuitBreaker plugin for consistent load testing

These changes fix the NoneType errors (~33,700) observed under 4000
concurrent users where current_user_ctx["db"] was always None.

Remaining critical issue: Transaction leak in streamablehttp_transport.py
causing idle-in-transaction connections (see todo/rca2.md for details).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(db): resolve transaction leak and connection pool exhaustion

Critical fixes for load test failures at 4000 concurrent users:

Issue #1 - Transaction leak in streamablehttp_transport.py (CRITICAL):
- Add explicit asyncio.CancelledError handling in get_db() context manager
- When MCP handlers are cancelled (client disconnect, timeout), the finally
  block may not execute properly, leaving transactions "idle in transaction"
- Now explicitly rollback and close before re-raising CancelledError
- Add rollback in direct SessionLocal usage at line ~1425

Issue IBM#2 - Missing db parameter in admin routes (HIGH):
- Add `db: Session = Depends(get_db)` to 73 remaining admin routes
- Routes with @require_permission but no db param caused decorator to
  create fresh session via fresh_db_session() for EVERY permission check
- This doubled connection usage for affected routes under load

Issue IBM#3 - Slow recovery from transaction leaks (MEDIUM):
- Reduce IDLE_TRANSACTION_TIMEOUT from 300s to 30s in docker-compose.yml
- Reduce CLIENT_IDLE_TIMEOUT from 300s to 60s
- Leaked transactions now killed faster, preventing pool exhaustion

Root cause confirmed: list_resources() MCP handler was primary source,
with 155+ connections stuck on `SELECT resources.*` for up to 273 seconds.

See todo/rca2.md for full analysis including live test data showing
connection leak progression and 606+ idle transaction timeout errors.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(teams): use consistent user context format across all endpoints

- Update request_to_join_team and leave_team to use dict-based user context
- Fix teams router to use get_current_user_with_permissions consistently
- Move /discover route before /{team_id} to prevent route shadowing
- Update test fixtures to use mock_user_context dict format
- Add transaction commits in resource_service to prevent connection leaks
- Add missing docstring parameters for flake8 compliance

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(db): add explicit db.commit/close to prevent transaction leaks

Add explicit db.commit(); db.close() calls to 100+ endpoints across
all routers to prevent PostgreSQL connection leaks under high load.

Problem: Under high concurrency, FastAPI's Depends(get_db) cleanup
runs after response serialization, causing transactions to remain
in 'idle in transaction' state for 20-30+ seconds, exhausting the
connection pool.

Solution: Explicitly commit and close database sessions immediately
after database operations complete, before response serialization.

Routers fixed:
- tokens.py: 10 endpoints (create, list, get, update, revoke, usage, admin, team tokens)
- llm_config_router.py: 14 endpoints (provider/model CRUD, health, gateway models)
- sso.py: 5 endpoints (SSO provider CRUD)
- email_auth.py: 3 endpoints (user create/update/delete)
- oauth_router.py: 1 endpoint (delete_registered_client)
- teams.py: 18 endpoints (team CRUD, members, invitations, join requests)
- rbac.py: 12 endpoints (roles, user roles, permissions)
- main.py: 14 CUD + 3 list + 7 RPC handlers

Also fixes:
- admin.py: Rename 21 unused db params to _db (pylint W0613)
- test_teams*.py: Add mock_db fixture to tests calling router functions directly
- Add llms/audit-db-transaction-management.md for future audits

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* ci(coverage): lower doctest coverage threshold to 30%

Reduce the required doctest coverage from 34% to 30% to accommodate
current coverage levels (32.17%).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(rpc): fix list_gateways tuple unpacking and add token scoping

The RPC list_gateways handler had two bugs:
1. Did not unpack the tuple (gateways, next_cursor) returned by
   gateway_service.list_gateways(), causing 'list' object has no
   attribute 'model_dump' error
2. Was missing token scoping via _get_rpc_filter_context(), which
   was the original R-02 security fix

Also fixed all callers of list_gateways that expected a list but
now receive a tuple:
- mcpgateway/admin.py: get_gateways_section()
- mcpgateway/services/import_service.py: 3 call sites

Updated test mocks to return (list, None) tuples instead of lists.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(teams): build response before db.close() to avoid lazy-load errors

The teams router was calling db.commit(); db.close() before building
the TeamResponse, but TeamResponse includes team.get_member_count()
which needs an active session. When the session is closed, the fallback
in get_member_count() tries to access self.members (lazy-loaded),
causing "Parent instance is not bound to a Session" errors.

Fixed by building TeamResponse BEFORE calling db.close() in:
- create_team
- get_team
- update_team

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(teams): fix update_team expecting team object but getting bool

The service's update_team() returns bool, but the router was treating
the return value as a team object and trying to access .id, .name, etc.

Fixed by:
1. Checking the boolean return value for success
2. Fetching the team again after successful update to build the response

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(teams): fix update_member_role return type mismatch

The service's update_member_role() returns bool, but the router
treated it as a member object. Fixed by:
1. Checking the boolean success
2. Added get_member() method to TeamManagementService
3. Fetching the updated member to build the response

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Fix teams return

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
Removed unreleased security changes regarding gateway credentials from CHANGELOG.

Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
Copilot AI requested a review from hughhennelly February 12, 2026 16:00
Copilot stopped work on behalf of hughhennelly due to an error February 12, 2026 16:00
hughhennelly added a commit that referenced this pull request Feb 12, 2026
- Fix database import: from ..database to from ..db
- Fix unified_pdp imports: from plugins.unified_pdp (Issue #1)
- Remove scratch header files (Issue IBM#5)

Related to PR feedback on Issue IBM#2226

Signed-off-by: hughhennelly <hughhennelly06@gmail.com>
hughhennelly added a commit that referenced this pull request Feb 12, 2026
1. Fix broken imports (Issue #1):
   - Change from ..database to ..db
   - Fix unified_pdp imports to use plugins.unified_pdp
   - Update in routes, services, schemas, and tests

2. Register sandbox router in main.py (Issue IBM#2):
   - Add import and app.include_router call

3. Fix XSS vulnerability (Issue IBM#3):
   - Replace f-string HTML with Jinja2 template
   - Create sandbox_simulate_results.html template
   - Add Request parameter for template access

4. Add authentication (Issue IBM#4):
   - Add Depends(get_current_user) to simulate endpoint

5. Remove scratch files (Issue IBM#5):
   - Delete sandbox_header.txt and sandbox_new_header.txt

6. Resolve schemas conflict (Issue IBM#6):
   - Merge schemas/sandbox.py into schemas.py
   - Remove conflicting schemas/ directory
   - Update imports in routes and services

All changes tested and ready for review.

Related to IBM#2226

Signed-off-by: hughhennelly <hughhennelly06@gmail.com>
jonpspri pushed a commit that referenced this pull request May 8, 2026
…or handling (IBM#3159)

* fix(api): change apijsonpath from Body to Query parameter for GET endpoints

- Move apijsonpath parameter from request body to query string for list_tools and get_tool endpoints
- Add _parse_apijsonpath() helper to handle both JSON string and JsonPathModifier inputs
- Return ORJSONResponse to bypass FastAPI response_model validation when using JSONPath modifiers
- Add comprehensive error handling with proper HTTPException status codes
- Add test coverage for Query parameter handling and error cases

This change resolves issue IBM#2848 by making GET requests RESTful-compliant (no request body).

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* fix(security): improve error handling and reduce information disclosure in apijsonpath parsing

Address code review feedback with three security and performance improvements:

1. Specific Exception Handling (Issue #1)
   - Catch specific exceptions (json.JSONDecodeError, ValueError, ValidationError)
   - Separate user errors (400) from system errors (500)
   - Log unexpected errors with full context for debugging

2. Sanitize Exception Messages (Issue IBM#2)
   - Return generic error messages in production (non-DEBUG mode)
   - Prevent information disclosure about internal types and stack traces
   - Detailed messages only shown when LOG_LEVEL=DEBUG for troubleshooting

3. Optimize Debug Logging (Issue IBM#3)
   - Check if debug logging is enabled before building log message
   - Avoid string formatting overhead in production
   - Improves performance in high-traffic scenarios

Updated test assertions to verify generic error messages (security improvement).
All tests pass (18 jsonpath tests verified).

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* doc

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* except

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* test coverage

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* pagination

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* review

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* head remove

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* flake

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* fix: validate apijsonpath before DB query and add missing test coverage

- Move _parse_apijsonpath() call before the database query in list_tools
  to fail fast on invalid input without wasting a DB round-trip
- Add tests for the ValidationError path (extra fields rejected by
  extra="forbid" on JsonPathModifier)
- Run black/isort on test file

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: correct nextCursor wire key, sanitize JSONPath in debug log, add e2e tests

- Fix pagination+JSONPath response to use "nextCursor" (camelCase) matching
  the CursorPaginatedToolsResponse alias contract. The hand-built dict
  bypassed Pydantic aliasing via ORJSONResponse, emitting "next_cursor"
  which broke any client relying on the documented field name.
  Confirmed regression against live server at localhost:8080.

- Sanitize user-supplied JSONPath expression in debug log via
  SecurityValidator.sanitize_log_message() to prevent CWE-117 log
  forging (newlines, ANSI escapes in crafted jsonpath values).

- Add end-to-end HTTP tests (TestApijsonpathHTTP) exercising the
  apijsonpath query parameter through FastAPI TestClient, covering
  the full HTTP binding path that __wrapped__() unit tests skip.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Rakhi Dutta <rakhibiswas@yahoo.com>
jonpspri pushed a commit that referenced this pull request May 8, 2026
* fix(api): MCP Tool Validation Fix

Signed-off-by: Gabriel Costa <gabrielcg@proton.me>

* test: improve differential test coverage for validation_strict gating

- Parameterize non-strict test to verify ALL 7 forbidden patterns pass
  through (was only testing "> ")
- Add test for break behavior: single warning logged even with multiple
  matching patterns in one description
- Move inline `import logging` to top-level imports
- Covers issue IBM#3711

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* linting

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(docs): replace misleading <placeholder> example in validation docs

The forbidden-pattern list matches "< " (with trailing space), not bare
"<".  Strings like "<placeholder>" never hit the forbidden-pattern check;
they are stripped by sanitize_display_text() as HTML-like content
regardless of VALIDATION_STRICT.  Replace with "< input" and "cmd | grep"
which actually match the gated patterns.

Addresses Codex review finding #1.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: add regression tests for empty exception str() fallback in gateway init

Cover the `raw_error = str(root_cause) or type(root_cause).__name__`
guard in _initialize_gateway: when an exception's str() is empty, the
GatewayConnectionError message now includes the type name instead of
a blank suffix.  Tests both plain exceptions and ExceptionGroup unwrapping.

Addresses Codex review finding IBM#2.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Gabriel Costa <gabrielcg@proton.me>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
jonpspri pushed a commit that referenced this pull request May 8, 2026
…dation (IBM#3750)

* test(rate-limiter): harden rate limiter plugin — gaps #1-IBM#8

Closes IBM#3740

## What changed

### Plugin fixes (plugins/rate_limiter/rate_limiter.py)
- Config validation at __init__: _validate_config() parses all rate
  strings at startup — bad config raises immediately, not mid-request
- Graceful degradation: both hooks wrapped in try/except; unexpected
  errors are logged and the request is allowed through (permissive)
- prompt_pre_fetch now enforces by_tool limits using prompt_id as key
- MemoryBackend: asyncio.Lock makes counter increments atomic
- MemoryBackend: background TTL sweep evicts expired windows (0.5s interval)
- RedisBackend: atomic INCR+EXPIRE via Lua script; shared state across
  all gateway instances; native TTL expiry; falls back to memory on error

### Test additions (tests/unit/.../test_rate_limiter.py)
- Gap tests: 4 xfail -> pass (shared state, eviction, prompt by_tool,
  graceful degradation); 1 xfail remains (fixed window burst, deferred)
- Edge case tests: malformed/unsupported config raises at init (not
  request time); runtime errors degrade gracefully via mock injection
- Redis backend test uses injected FakeRedis — no live server required

### Config changes
- plugins/config.yaml: RateLimiterPlugin enabled with enforce mode
- tests/performance/plugins/config.yaml: RateLimiterPlugin set to
  permissive for inclusion in cProfile benchmark runs

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

* chore(config): enable Redis backend for RateLimiterPlugin in plugins/config.yaml

Switch the default stack config from in-memory to Redis-backed rate limiting.
This ensures the 30/m per-user limit is enforced as a true shared limit across
all gateway instances rather than 30/m per process.

Validated via Redis MONITOR: all 3 gateway instances atomically increment the
same rl:user:<id>:60 counter via the Lua INCR+EXPIRE script.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

* test(rate-limiter): add benchmark-rate-limiter load test for multi-instance correctness

Adds locustfile_rate_limiter.py and a make benchmark-rate-limiter target to
demonstrate the multi-instance rate limit enforcement gap and its fix.

The test sends 1 req/s (60 req/min = 2x the 30/m limit) through 3 gateway
instances. With a memory backend each instance only sees ~20 req/min and
never fires the limiter (~0% failures). With the Redis backend the shared
counter reaches 30/min and blocks ~50% of requests — clearly showing the fix
works across instances.

Expected results:
  Memory backend:  ~0% blocked  (each instance sees 20 req/min < 30/m limit)
  Redis backend:  ~50% blocked  (shared counter: 60 req/min > 30/m limit)

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

* test(rate-limiter): add hardening tests, bypass resistance, PII fix, and updated docs

- Add 22 new unit tests (70 passed total, 4 xfailed):
  - Permissive vs enforce mode through PluginExecutor
  - Redis fallback: memory takeover when Redis is down, limit still enforced, no-fallback graceful degradation
  - Cross-tenant isolation: independent counters, no counter bleed between tenants
  - Header accuracy: Retry-After bounds, X-RateLimit-Reset future/consistency, Remaining decrement
  - Bypass resistance: None/whitespace user identity, tool name case sensitivity and whitespace (documented as xfail gaps)
  - PII: violation description must not contain user or tenant identifiers

- Fix PII leak in violation description: remove user/tenant from description string
  in both prompt_pre_fetch and tool_pre_invoke — identifiers appeared in log output
  via permissive-mode manager warning and enforce-mode PluginViolationError message

- Rewrite plugins/rate_limiter/README.md: was describing the old pre-fix
  implementation (in-memory only, no Redis, Redis as TODO). Now documents
  both backends, full config reference, response headers, examples, and
  accurate limitations table

- Update plugin-manifest.yaml description to reflect Redis backend support

Closes IBM#3740

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

* fix(rate-limiter): review fixes — dead code, test correctness, config validation

- Remove unused _allow() module-level function (dead code — plugin uses
  self._rate_backend.allow() directly)
- Fix test_graceful_degradation test: was patching _allow() which is never
  called by the plugin; now patches backend.allow() via patch.object so the
  try/except error path is actually exercised
- Add prompt_pre_fetch graceful degradation test (was only tested for
  tool_pre_invoke)
- Fix inconsistent by_tool lookup in tool_pre_invoke: remove unnecessary
  hasattr(__contains__) guard, align with prompt_pre_fetch pattern
- Add backend validation to _validate_config(): typo like 'reddis' now
  raises ValueError at startup instead of silently falling back to memory
- Add test for malformed by_tool rate string validation
- Add test for invalid backend name validation
- Change default config mode from enforce to permissive for safety
  (consistent with all other security plugins in the default config)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
jonpspri pushed a commit that referenced this pull request May 8, 2026
…fixes IBM#538 US-1) (IBM#3251)

* feat(security): enforce content size limits for resources and prompts

Implement configurable content size validation to prevent DoS attacks
via oversized content submissions. Resources are limited to 100KB and
prompt templates to 10KB by default (configurable via environment
variables CONTENT_MAX_RESOURCE_SIZE and CONTENT_MAX_PROMPT_SIZE).

Validation occurs at the service layer before database operations,
returning 413 Payload Too Large with structured error details.

Closes IBM#538

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(security): review fixes for content size limits PR

- Fix ResourceCreate.validate_content to remove MAX_CONTENT_LENGTH
  schema-level check (now handled by service layer, consistent with
  ResourceUpdate)
- Fix broken test assertions checking raw bytes in formatted messages
  (TestErrorMessageClarity was asserting "200000" in human-readable
  "195.3 KB" strings)
- Fix broken test_update_prompt_validation_and_integrity_errors which
  had its body accidentally deleted by the original PR
- Restore accidentally deleted assertions in test_list_resources and
  test_create_resource_endpoint
- Add global exception handler for ContentSizeError (defense-in-depth,
  consistent with existing handlers for ValidationError/IntegrityError)
- Add reset_content_security_service() for test isolation
- Remove "Made with Bob" watermark from test file
- Remove duplicate pytest import in integration tests
- Fix broken README link to non-existent migration guide
- Fix double-commented lines in .env.example

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: resolve flake8 lint issues in content size limits PR

- Move noqa:DUO138 to the line flake8 reports (validators.py)
- Add Args/Returns to content_size_exception_handler docstring (DAR101/DAR201)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(tests): update security tests for service-layer size validation

Content size validation moved from Pydantic schema to service layer
(returns 413 instead of 422). Update test_resource_create_content_validation
and test_zip_bomb_prevention to reflect that ResourceCreate no longer
rejects oversized content at schema level.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(security): address codex review findings for content size limits

1. Remove MAX_TEMPLATE_LENGTH check from SecurityValidator.validate_template
   so the configurable CONTENT_MAX_PROMPT_SIZE is not silently capped at 64KB
   by the schema-level validator (finding #1)

2. Add db.rollback() to ContentSizeError handlers in resource_service to
   prevent partial metadata mutations from being committed when content
   size validation fails after ORM fields are already updated (finding IBM#2)

3. Include actual_size and max_size in admin route 413 responses to match
   the structured error contract used by API routes (finding IBM#3)

4. Update security and validator tests that expected schema-level length
   rejection — size enforcement now happens at service layer

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* chore: fix trailing whitespace in test files

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add content size limits to configuration docs, securing guide, and Helm values

- docs/docs/manage/configuration.md: add Content Security section with
  env var table, scope note, and error response example
- docs/docs/manage/securing.md: add section 9 (Content Size Limits)
  to production security checklist, renumber subsequent sections
- charts/mcp-stack/values.yaml: add CONTENT_MAX_RESOURCE_SIZE and
  CONTENT_MAX_PROMPT_SIZE to gateway env config

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Suresh Kumar Moharajan <suresh.kumar.m@ibm.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
jonpspri pushed a commit that referenced this pull request May 8, 2026
…ons (IBM#4050)

* feat: implement separate observability session pattern (best-effort)

Observability write operations now use independent database sessions that
commit immediately on a best-effort basis, separate from main request
transactions. This ensures observability data persists even when the main
request fails, providing visibility into partial failures.

Implementation details:
- Observability write methods (start_trace, end_trace, start_span, end_span,
  add_event, record_metric, record_token_usage, delete_old_traces) now create
  their own independent sessions using SessionLocal() instead of accepting a
  db parameter from the caller
- Context managers (trace_span, trace_tool_invocation, trace_a2a_request)
  create and manage a single session for the full span lifecycle
- Query methods still accept db: Session parameter for RBAC/token scoping
- ObservabilityMiddleware no longer creates or manages request.state.db
- Each observability operation creates its own short-lived session that
  commits immediately in a try/except/finally block
- Updated all callers across resource_service, prompt_service, tool_service,
  and instrumentation/sqlalchemy to use new API (removed db parameters)
- Added comprehensive test coverage verifying separate session behavior
- Fixed pre-existing bugs in add_event() and record_metric() where db.refresh()
  was called after commit failure

Benefits:
- Observability data persists independently of main request transaction success
- Provides visibility into partial failures and error states
- Follows pattern established in instrumentation/sqlalchemy.py (lines 58-87)
- Reduces transaction contention between observability and business logic

Trade-offs:
- NOT atomic with main request transaction (intentional design choice)
- Traces may show "in progress" or partial states for failed requests
- Provides eventual consistency rather than strict consistency

Documentation:
- Added "Observability Transaction Behavior" section to AGENTS.md explaining
  the separate session pattern, implementation details, and security notes
- Updated module docstrings with session management details

Tests: 96 observability tests passing, full suite 16,673 tests passing
Pylint: 10.00/10 rating maintained

Closes IBM#3883

Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com>

* fix: add type annotations to observability context managers

Add proper return type annotations to context manager methods:
- trace_span() -> Generator[str, None, None]
- trace_tool_invocation() -> Generator[Tuple[Optional[str], Dict[str, Any]], None, None]
- trace_a2a_request() -> Generator[Tuple[Optional[str], Dict[str, Any]], None, None]

These annotations improve type safety and make the return values explicit.
The trace_span context manager yields just the span_id string, while
trace_tool_invocation and trace_a2a_request yield a tuple of (span_id, result_dict)
for capturing operation results.

Tests: 96 observability tests passing
Pylint: 10.00/10 rating maintained
Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com>

* test: Add coverage tests for observability session close exception handling

Add 11 new test functions to achieve 100% coverage of exception handling
paths in observability_service.py. These tests verify that all write
methods and context managers handle session close failures gracefully:

- test_start_trace_session_close_failure
- test_end_trace_session_close_failure
- test_start_span_session_close_failure
- test_start_span_with_commit_true
- test_end_span_session_close_failure
- test_add_event_session_close_failure
- test_record_metric_session_close_failure
- test_delete_old_traces_session_close_failure
- test_trace_span_context_manager_session_close_failure
- test_trace_tool_invocation_context_manager_session_close_failure
- test_trace_a2a_request_context_manager_session_close_failure

Each test mocks the session close() method to raise an exception and
verifies that the operation completes successfully despite the close
failure. This covers the previously uncovered lines in finally blocks
where session close exceptions are caught and logged.

Resolves CI/CD coverage requirement for observability_service.py.

Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com>

* fix: complete separate session migration for delete_old_traces and record_transport_activity

- Fix runtime TypeError in cleanup_old_traces router: delete_old_traces()
  no longer accepts db parameter (missed call site from IBM#3883)
- Remove unused db parameter from record_transport_activity() for API
  consistency with other write methods
- Update docstring and doctest for delete_old_traces call in router
- Update test calls to match new record_transport_activity signature
- Add 3 tests for 100% differential coverage: error-path exception
  handlers in trace_tool_invocation, trace_a2a_request, and session
  close failure in record_token_usage

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: address review feedback for separate observability session pattern

This commit addresses valid concerns raised in PR review:

1. Fix class docstring showing old API (madhu-mohan-jaishankar)
   - Updated ObservabilityService class docstring to show correct API
   - Removed db parameter from examples (old: start_trace(db, name))
   - Added clarification about write vs query methods
   - Examples now demonstrate correct usage with proper parameters

2. Fix context manager atomicity (msureshkumar88 issue IBM#5)
   - Added optional obs_db parameter to add_event() method
   - Context managers now pass their session to add_event() in error paths
   - Ensures atomic commit of span updates and error events together
   - Prevents orphaned events when span update fails
   - Updated all three context managers:
     * trace_span() at line 644
     * trace_tool_invocation() at line 757
     * trace_a2a_request() at line 1112

3. Document connection pool sizing (msureshkumar88 issue IBM#7)
   - Added Connection Pool Sizing section to AGENTS.md
   - Explains 4-6 sessions per traced request pattern
   - Documents default pool (200+10 = 210 connections)
   - Supports ~35 concurrent traced requests by default
   - Guidance for high-traffic deployments (>50 req/sec)
   - Note about SQLite 50-connection cap

Implementation details:
- add_event() now accepts optional obs_db: Optional[Session] = None
- When obs_db is provided, uses caller's session (owned = False)
- When obs_db is None, creates independent session (owned = True)
- Only closes session if owned is True
- All existing tests pass (87 observability service tests, 23 middleware tests)

Tests: All 87 observability service tests passing
       All 23 observability middleware tests passing
       100% differential coverage maintained

Addresses review feedback from:
- @madhu-mohan-jaishankar (approved with comments)
- @msureshkumar88 (issues IBM#5 and IBM#7 from detailed review)

Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com>

* fix: address remaining review concerns with safety improvements

This commit addresses all remaining valid reviewer concerns:

1. Document dead code guard design choice (madhu-mohan-jaishankar issue #1)
   - Added Note section to _get_or_create_observability_session() docstring
   - Explains tuple return pattern enables future session reuse optimization
   - Guards remain for forward compatibility (maintainer approved pattern)

2. Move session creation inside try blocks (msureshkumar88 issue #1)
   - Initialize obs_db = None and owned = False before try blocks
   - Create session inside try to prevent theoretical resource leak
   - Update finally blocks to check `if owned and obs_db is not None:`
   - Applied to all 8 write methods:
     * start_trace() (line 317)
     * end_trace() (line 385)
     * trace_span() context manager (line 619)
     * trace_tool_invocation() context manager (line 719)
     * add_event() (line 836)
     * record_token_usage() (line 926)
     * trace_a2a_request() context manager (line 1087)
     * record_metric() (line 1306)
     * delete_old_traces() (line 1718)

3. Add header sanitization (msureshkumar88 issue IBM#10)
   - Added sanitize_header_for_storage() helper function
   - Removes control characters (prevents log injection)
   - Truncates to max_length (prevents DoS via large headers)
   - Applied to user_agent (max 500 chars) and http_url (max 2000 chars)
   - Defense-in-depth security improvement

Implementation details:
- sanitize_header_for_storage() removes non-printable chars except space/tab
- Returns "unknown" for None/empty values
- All 87 observability service tests passing
- All 23 observability middleware tests passing

Deferred to future PR (with justification):
- Code duplication refactoring (issue IBM#9): Too complex and risky for this PR.
  The three context managers share patterns but have significant domain-specific
  differences (attributes, yield signatures, sanitization). Refactoring would
  require extensive testing and carries risk of subtle bugs. Current code is
  correct and well-tested. Recommend separate focused PR for refactoring.

Tests: All 110 observability tests passing (87 service + 23 middleware)
       100% differential coverage maintained

Addresses remaining review feedback from:
- @madhu-mohan-jaishankar (issue #1 - dead code guards)
- @msureshkumar88 (issues #1, IBM#10 - resource leak, header sanitization)

Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com>

* fix: correct doctest examples in sanitize_header_for_storage

The function removes control characters completely (doesn't replace with spaces)
and truncates without adding ellipsis. Updated doctest examples to match
actual behavior:
- Control chars removed: 'Evil\x00\nInjection' -> 'EvilInjection'
- Truncation verified by length check instead of expecting '...'
- Added test for None input returning 'unknown'

All doctests now passing.

Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com>

* Update secrets following rebase

Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>

---------

Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Brian Hussey <brian.hussey@ie.ibm.com>
jonpspri added a commit that referenced this pull request May 8, 2026
…IBM#4296)

* feat(runtime): runtime-mutable MCP and A2A mode via admin API

Add PATCH /admin/runtime/mcp-mode and PATCH /admin/runtime/a2a-mode that
flip the public /mcp ingress and registered-A2A invocation path between
shadow and edge at runtime without a restart. Boot env vars
(RUST_MCP_MODE, RUST_A2A_MODE) still select the initial mode; overrides
are in-memory only. When Redis is configured the override propagates
cluster-wide via pub/sub on contextforge:runtime:mode with a per-runtime
monotonic INCR counter and a 24h hint key per runtime so freshly-started
pods reconcile to the cluster's current desired override on boot.

Off and full boot modes are intentionally not flippable (409): off has
no Rust sidecar to swap to; full keeps Rust as the owner of MCP session/
event-store/resume cores and flipping to shadow would orphan that state.

The reverse-proxy caveat (admin flips don't propagate past an upstream
nginx until the proxy is reconfigured) is documented and tracked
separately in IBM#4278.

Closes IBM#4273

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(runtime): preserve session-auth-reuse safety invariant under override

Codex stop-time review caught that `override="edge"` on a shadow-boot
deployment silently bypassed the documented safety invariant for routing
public traffic to Rust: `_should_mount_public_rust_transport` had been
loosened to require only `experimental_rust_mcp_runtime_enabled`, skipping
`experimental_rust_mcp_session_auth_reuse_enabled` (and the analogous
`delegate_enabled` for A2A). A boot=shadow deployment with an admin-issued
edge override would have started routing public MCP to a Rust sidecar
that never opted into session-auth-reuse.

Fix:
- `_should_mount_public_rust_transport` keeps the original invariant
  unchanged. Override="shadow" still forces Python; override="edge" now
  matches the default (honors the invariant). Same shape for
  `_should_delegate_a2a_to_rust`.
- Narrow `_FLIPPABLE_BOOT_MODES` to `{edge}`. Only boot=edge deployments
  have both flags true and can safely flip. boot=shadow PATCHes now
  return 409 with an operator-facing explanation of the invariant.
- Two belt-and-braces unit tests exercise the read side directly (even
  if a stale hint somehow landed override=edge in state, the transport
  function refuses to route to Rust without the safety flags).
- Two router tests cover the new 409 for shadow boot (MCP + A2A).
- Docs updated: architecture doc, ADR 043 addendum, crate README/STATUS,
  .env.example, docker-compose.yml.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(runtime): allow shadow override clearance and discard incompatible hints

Codex stop-time review caught a new dead-end introduced by the previous
safety-invariant fix: after restricting _FLIPPABLE_BOOT_MODES to {edge},
a shadow-boot deployment that inherited a stale override=edge via a
Redis hint (left by a prior edge-boot deploy) had no API path to clear
state — every PATCH returned 409. The operator's only recourse was
flushing Redis by hand.

Fix is two-sided:

Router:
- Gate on target mode, not boot mode. mode=shadow accepted from any
  boot that has a dispatcher mounted (shadow or edge). mode=edge still
  requires boot=edge (safety invariant). off and full still 409 for any
  PATCH (no dispatcher).
- Replaced _FLIPPABLE_BOOT_MODES with _DISPATCHER_BOOT_MODES and
  _EDGE_OVERRIDE_BOOT_MODES to make the two gates explicit.

Coordinator:
- _reconcile_from_hint now checks whether the hint's target mode can
  safely take effect on this deployment. Incompatible hints (e.g. a
  former edge-boot pod's hint replayed on a shadow-boot deploy) are
  discarded and boot_reconcile_status is set to the new INCOMPATIBLE_HINT
  value, surfaced via /health. Without this guard, shadow-boot pods
  would boot with override=edge in state while the transport layer
  refused to honor it — misleading diagnostics that required the
  escape-hatch PATCH to clear.
- Added _deployment_allows_mode helper that mirrors the safety-invariant
  checks in version.py (lazy-imported to avoid the version→settings cycle).

Tests:
- test_patch_mcp_mode_shadow_clears_stale_override_on_shadow_boot
  exercises the escape hatch end-to-end.
- test_reconcile_from_hint_discards_incompatible_mode verifies the
  coordinator refuses to apply a shadow-boot-incompatible hint.
- test_reconcile_from_hint_accepts_shadow_mode_on_shadow_boot covers
  the positive case (shadow is always compatible).

Docs updated to explain the per-target-mode gating and the
INCOMPATIBLE_HINT status.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(runtime): discard stranded hints on boot=off and boot=full pods

Codex stop-time review caught that the previous _deployment_allows_mode
check returned True for shadow unconditionally — but boot=off has no
Rust sidecar at all and boot=full mounts a plain RustMCPRuntimeProxy
with no dispatcher, so neither boot mode has a mechanism that could
observe an override. A shadow hint replayed onto either pod would
strand in state (effective_mode=shadow in diagnostics; transport keeps
routing to its boot-determined target), and the router 409s for any
PATCH on those boots so the operator would have no API path to clear it.

Tighten _deployment_allows_mode to match the router's admit logic:
shadow accepts only when a dispatcher is mounted (boot=shadow or edge
for MCP; boot=shadow or edge for A2A), and edge still requires the
session-auth-reuse / delegate-enabled invariant.

Three new tests:
- test_reconcile_from_hint_discards_shadow_on_full_boot
- test_reconcile_from_hint_discards_any_mode_on_off_boot
- test_reconcile_from_hint_discards_a2a_hint_on_a2a_off_boot

Architecture doc updated to spell out all three INCOMPATIBLE_HINT cases.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(runtime): gate live pub/sub on compat check + structured rejection reasons

Third-pass review identified a critical and several important issues:

C1: _listen_loop applied remote pub/sub messages without the same
compatibility check _reconcile_from_hint enforces. A remote pod's
incompatible flip would strand in local state. Mirror the same gate.

I2: _listen_loop's recovery branch fired on any non-exception return,
including get_message returning None. A pubsub broken in a way that
returns None reliably without raising would falsely re-promote
cluster_propagation from DEGRADED to REDIS. Gate the recovery on
message is not None — the existing idle-timeout branch already handles
the legit no-message case.

I1: Document the deliberate "discarded hint left in Redis" choice in
_reconcile_from_hint's comment.

S4+S5+S6 (combined refactor):
- Promote _deployment_allows_mode to return MoveCompatibility enum
  with structured rejection reasons (NO_DISPATCHER, BOOT_FULL_STRANDS,
  EDGE_NEEDS_SAFETY_FLAG, OK).
- Split BootReconcileStatus.INCOMPATIBLE_HINT into three granular
  values (INCOMPATIBLE_NO_DISPATCHER, INCOMPATIBLE_BOOT_FULL,
  INCOMPATIBLE_SAFETY_FLAG) so /health distinguishes the cases.
- Move the helper to version.py next to the safety-invariant functions
  with a new public deployment_allows_override_mode wrapper.
- Router now calls deployment_allows_override_mode directly, replacing
  the parallel _DISPATCHER_BOOT_MODES/_EDGE_OVERRIDE_BOOT_MODES sets
  with a single source of truth. The 409 detail string is generated
  from the MoveCompatibility reason and names the exact env var or
  flag the operator needs to flip.

Tests added (I3 + I4 + I5 + S1 + S2):
- test_reconcile_from_hint_discards_edge_on_full_boot
- test_reconcile_from_hint_discards_edge_on_off_boot
- test_incompatible_hint_does_not_delete_redis_key_or_downgrade_propagation
  (S1 + S2: assert hint key untouched AND cluster_propagation stays REDIS)
- test_patch_mcp_mode_edge_409_when_boot_mode_full
- test_deployment_allows_override_mode_mcp_table (8 parametrized rows)
- test_deployment_allows_override_mode_a2a_table (6 parametrized rows)

Test fixtures (edge_boot, off_boot, full_boot, shadow_boot) now patch
the underlying settings instead of just monkeypatching the boot-mode
helpers — the new deployment_allows_override_mode inspects raw
settings, so fixture drift would have masked real bugs.

Architecture doc updated to enumerate the three INCOMPATIBLE_* values
and document the live pub/sub compat check.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(runtime): drop self-contradicting 'full' recommendation from NO_DISPATCHER 409

Codex stop-time review caught that the NO_DISPATCHER 409 detail told
operators "Boot with RUST_..._MODE='shadow' or 'edge' (or 'full' for
MCP) to enable overrides." But boot=full returns BOOT_FULL_STRANDS
for the same reason (no dispatcher mounted) — switching to full would
just trade one 409 for another. Worse, A2A has no 'full' mode at all,
so the parenthetical was incoherent for A2A 409 responses.

Drop the "(or 'full' for MCP)" clause. Add an inline comment explaining
why full is intentionally excluded so a future maintainer doesn't
re-add it. Two existing tests now also assert "'shadow' or 'edge'" is
present and "'full'" is absent — pins the contract.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* chore(secrets): refresh detect-secrets baseline after rebase onto main

Line numbers in .secrets.baseline are tied to file content. After
rebasing this branch onto main (which itself touched .secrets.baseline
in upstream commits), the baseline needed a re-scan to match the
post-rebase final file state. No new secrets detected.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* docs(adr): defer generic cluster-wide settings propagation framework (ADR-050)

Records the decision to keep the runtime-mutable mode override
implementation purpose-built for MCP/A2A rather than extracting a
generic ClusterStatePrimitive framework today. Captures the alternative
abstraction sketch for future reference and enumerates the trigger
conditions that would justify revisiting (second concrete consumer in
flight, third independently-tracked use case on the horizon, or a need
for cross-consumer coordination like atomic multi-setting flips).

This is N=1 — CLAUDE.md's "three similar lines is better than a
premature abstraction" applies. The right shape will emerge from
copy-modifying for the second consumer and diffing the result.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* docs(adr): scrub AI-assistant name from ADR-050

Repo guideline forbids naming AI assistants in committed diffs. The
rationale section in ADR-050 cited 'Codex' as the source of stop-time
review iterations; rephrase to describe the review process without
naming the tool. Also drop the explicit 'CLAUDE.md' file citation in
favor of 'repo convention' since the supporting quote stands on its
own.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* docs(adr): tighten ADR-050 claims, deliver defenses checklist, add 0049 row

Comment-analyzer and code-reviewer review of ADR-050 surfaced several
claims that would rot, one promised section that didn't exist, and a
missing index row that visually framed ADR-050 as having skipped a
number.

ADR-050 fixes:

- Correct the "≈900 LOC together" claim to ~1.5 kLOC. Actual line counts
  are runtime_state.py 977 + runtime_admin_router.py 489 = 1,466.
- Replace "six stop-time review iterations" with the verifiable "five
  follow-up fix(runtime): commits on top of the initial feat(runtime):"
  so the count is grep-checkable from git log.
- Drop the "repo convention is explicit: three similar lines beats a
  premature abstraction" framing — that line is from contributor tooling,
  not anything written in the repo. The N=1 argument now stands on its
  own merits.
- Replace "the most-reviewed code on the branch" superlative with the
  timeless "received unusually heavy review during initial development."
- Caption the ClusterStatePrimitive code block as a hypothetical sketch
  so readers who grep-arrive mid-document don't think the class exists.
- Move the reverse-proxy WARN attribution from runtime_state.py to
  runtime_admin_router.py (it lives in the router).
- Name LISTEN_LOOP_DEGRADE_THRESHOLD = 5 and
  RUNTIME_STATE_HINT_TTL_SECONDS = 24 * 60 * 60 in the context section
  so future drift in those constants is detectable from this ADR.
- Reconcile trigger-#1's "build both consumers side-by-side" with the
  rationale's "copy-modify" guidance: copy-modify first, then extract
  the framework from the resulting two-implementation diff.
- Deliver the "non-obvious defenses" checklist that the Negative
  consequence promised but never wrote. The new "Defenses checklist for
  the copy-modify path" section enumerates 12 defenses, each tied to a
  bug class caught during this PR's review iterations.

Index fix:

- Add the missing 0049 row for "Multi-Protocol Virtual Servers"
  (Accepted, Architecture, 2026-04-16). The 048→050 jump made ADR-050
  look like it had skipped a number when ADR-049 was just un-indexed.
- Other duplicate ADR numbers on disk (005, 034, 038, 048) and other
  index gaps are pre-existing and tracked separately in IBM#4290.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* feat(transports): swappable MCPIngressMount + nginx-style Rust public proxy (ADR-051)

Replace MCPStreamableHTTPModeDispatcher (closed-set, two transports,
policy baked into dispatch) with MCPIngressMount: a thin ASGI
indirection that holds a registry of named ingresses + a swappable
selector. Adding a new ingress is one register() call; selection
policy is one function.

Three ingresses registered today:

- "python" — the existing MCPRuntimeHeaderTransportWrapper. Always
  registered.
- "rust-internal" — the existing RustMCPRuntimeProxy that forwards
  via httpx to MCP_RUST_LISTEN_HTTP / UDS (trusted-internal listener).
  Registered for boot=shadow/edge.
- "rust-public" — new RustMCPPublicProxyApp that does nginx-style
  reverse-proxy to MCP_RUST_PUBLIC_LISTEN_HTTP. Adds X-Forwarded-*,
  RFC 7239 Forwarded, X-Real-IP. Streams both directions, preserves
  Authorization/Cookie. Registered only for boot=edge.

Selector chooses between rust-internal and rust-public based on the
new settings.mcp_rust_ingress (default "internal" preserves prior
behavior bit-for-bit). Selector preserves all existing safety
invariants — _should_mount_public_rust_transport() is consulted first,
shadow override / boot=full / boot=off behave identically to before,
the runtime-mode override admin API requires no changes.

Test results: 1839 passing across the touched modules. Lint clean.

ADR-051 documents the design decision, alternatives considered, and
migration notes from MCPStreamableHTTPModeDispatcher. Index updated.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(transports): close XFF spoof + harden Rust public proxy ingress

Response to review of 1967770 (MCPIngressMount + nginx-style Rust
public proxy). Closes a security issue and several diagnostic gaps in
the public proxy and ingress selector.

Security fix — RustMCPPublicProxyApp:

- _build_forwarded_headers now drops client-supplied Forwarded /
  X-Forwarded-* / X-Real-IP from the inbound iteration before
  re-deriving them from request.client. The prior code appended to
  the client-supplied X-Forwarded-For, letting any caller spoof their
  own source IP to Rust's /_internal/mcp/authenticate callback in the
  no-nginx-in-front deployment this proxy targets. New
  _FORWARDED_CHAIN_HEADERS frozenset enumerates the strip set.
- _format_forwarded_for emits RFC 7239 §6.3-conformant `for=unknown`
  (no synthetic `:0`) for absent clients, brackets IPv6 addresses
  exactly once, and tolerates already-bracketed input.

Selector hardening — _select_mcp_ingress / config:

- mcp_rust_ingress is now Literal["internal", "public"] so Pydantic
  rejects typos at boot.
- Selector only returns "rust-public" when boot_mcp_runtime_mode() ==
  "edge" — the public listener isn't bound on shadow boot, so a stale
  edge override + mcp_rust_ingress=public combination would otherwise
  return an unregistered name and silently fall back to Python at
  debug-only log severity.
- _build_mcp_transport_app emits a logger.error (not warning) when
  mcp_rust_ingress=public is set on shadow boot, so the misconfig
  survives the project-default LOG_LEVEL=ERROR.

Error handling — RustMCPPublicProxyApp:

- __call__ now distinguishes asyncio.CancelledError (re-raised),
  httpx.HTTPError (logged with type name + 502), and any other
  exception (logger.exception + 500), so a non-HTTPError no longer
  bypasses the handler and produces a silent Starlette 500.
- _body_iter logs CancelledError at debug (common for SSE client
  disconnect) and other exceptions at error (rare upstream death),
  before re-raising. The finally still calls aclose to return the
  upstream connection to the pool.
- Auth-relevant or server-error upstream statuses (401/403/5xx) now
  log a warning with method+path+status so a credential-rotation flood
  is visible without raising the global log level.

MCPIngressMount diagnostics:

- dispatch's selector-miss paths (fallback used, 503-on-no-fallback)
  now log warning with the unknown name AND self.names(), instead of
  debug-only on the fallback path with no log on the 503 path. A
  misrouted ingress should never be silent.

Tests:

- New tests/unit/mcpgateway/transports/test_rust_mcp_public_proxy.py
  (24 tests). Covers _format_forwarded_for parametrized over IPv4
  with/without port, IPv6, already-bracketed IPv6, None/empty host;
  _build_forwarded_headers spoof rejection + hop-by-hop strip + auth
  preservation + unknown-client emission; proxy ASGI behavior across
  status passthrough, 502 on httpx.HTTPError, 500 on unexpected,
  parametrized warning on (401, 403, 500, 503), non-HTTP scope → 404,
  CancelledError re-raise on send and mid-stream debug-branch, lazy
  client lifecycle, multi-chunk SSE streaming, and aclose-on-exit.
- Two new tests in test_mcp_ingress_mount.py pin the warning-level
  log severity on selector-miss + 503 paths.
- Two new tests in test_main_extended.py pin the boot-time error log
  for the shadow + mcp_rust_ingress=public misconfig and the selector
  downgrade to rust-internal under the same condition.

Doc rot fixes:

- mcp_ingress_mount.py module docstring uses the full ADR filename
  (was an ellipsis).
- version.py docstring references MCPIngressMount instead of the
  removed MCPStreamableHTTPModeDispatcher.
- ADR-051 migration notes use test-group descriptors instead of
  pinned line numbers / test names.

Test results: 652 passing across the touched modules. Lint clean
(black, ruff, interrogate 100%).

Signed-off-by: Jonathan Springer <jps@s390x.com>

* chore(quality): fix pylint W0706/W2301/R0401 and close coverage gaps

PR review feedback:

Pylint findings:

- W0706 (try-except-raise) at runtime_state.py:933 and
  rust_mcp_public_proxy.py:237: removed the redundant
  ``except asyncio.CancelledError: raise`` blocks. CancelledError is
  BaseException in Python 3.8+, so the surrounding ``except Exception``
  / ``except httpx.HTTPError`` arms never caught it; the explicit
  re-raise was no-op and made pylint complain. Behavior is unchanged
  (asyncio still cancels both coroutines naturally), and the intent is
  preserved as a comment.
- W2301 (unnecessary-ellipsis) at mcp_ingress_mount.py:37: dropped the
  ``...`` from the ASGIApp Protocol's ``__call__`` body — the
  docstring already counts as a body for Protocol methods.
- R0401 (cyclic-import) on the runtime_state ↔ version pair: added
  per-line ``# pylint: disable=cyclic-import`` to the two lazy imports
  in runtime_state.py. Pylint follows function-body imports for cycle
  detection; the lazy imports break the runtime cycle but pylint still
  flags the static graph. The full extract (move MoveCompatibility +
  coercers to a shared module) is bigger refactor than this PR
  warrants.

Coverage gaps closed (per-file delta):

- routers/runtime_admin_router.py 97.3% → 100%. Added three tests for
  ``_warn_if_behind_reverse_proxy``: header detection emits warning
  with header names, no warning when no forwarded headers, defensive
  early return when ``request.headers`` is None.
- transports/mcp_ingress_mount.py 94.7% → 100%. Added test for
  ``set_fallback`` covering swap mid-life and ``None`` restoring
  503-on-miss behavior.
- transports/rust_mcp_public_proxy.py 96.5% → 100%. Added test for
  ``_body_iter`` taking the ``except Exception`` arm on a non-cancel
  mid-stream error: asserts ``logger.exception`` fires with traceback.
- runtime_state.py 89.9% → 99%. Added 9 tests covering: defensive
  ValueError branches in 4 accessor methods (parametrized), idempotent
  ``coordinator.start()``, ``_reconcile_from_hint`` no-redis and
  malformed-mode paths, ``_cleanup_pubsub`` timeout / exception /
  fallback-to-close paths, listen-loop bytes-payload decode + idle
  recovery + malformed-payload + empty-data + uncoerceable-fields
  branches.
- version.py: targeted user-noted gaps (379, 434, 509, 570, 623) all
  closed. Added tests for last_change population in
  _mcp_runtime_status_payload / _a2a_runtime_status_payload, A2A
  shadow override forcing _should_delegate_a2a_to_rust False, and
  the boot_mcp_transport_mount public wrapper. Line 509
  (NO_DISPATCHER fallthrough — _coerce_runtime rejects unknown kinds
  upstream) marked ``# pragma: no cover`` as a defensive default.

Other defensive defaults marked ``# pragma: no cover``:

- runtime_admin_router.py:99 (OK fallthrough — caller already returned
  for OK upstream).
- runtime_state.py:164, 171 in ``_move_compat_to_reconcile_status``
  (OK case shouldn't reach here per docstring; final return is a
  defensive default for future MoveCompatibility variants).

Test results: 1272 passing across the touched modules. Lint clean
(black, ruff, pylint 10.00/10, interrogate 100%).

Signed-off-by: Jonathan Springer <jps@s390x.com>

---------

Signed-off-by: Jonathan Springer <jps@s390x.com>
jonpspri added a commit that referenced this pull request May 8, 2026
* fix: update scope['modified_path'] for server_id extraction with root_path (#4270)

Fixes #4266

The /mcp endpoint was returning ALL tools instead of server-scoped tools
when accessed via /servers/{id}/mcp in reverse proxy deployments with
root_path configured. The /sse endpoint worked correctly.

Root cause: MCPPathRewriteMiddleware computed app_path by stripping
root_path prefix (line 3026) but never updated scope['modified_path'].
When streamablehttp_transport.py:2898 reads modified_path to extract
server_id via regex (r'^/servers/(?P<server_id>[^/]+)/mcp'), the regex
fails to match paths like '/prefix/servers/123/mcp', causing server_id
to be None and tools/list to return all tools instead of server-scoped.

Solution: Update scope['modified_path'] = app_path after computing the
app-relative path. This ensures streamablehttp_transport receives the
path without root_path prefix for correct server_id extraction.

Changes:
- Set scope['modified_path'] to app_path after stripping root_path
- Add regression tests verifying modified_path is app-relative
- All existing MCPPathRewriteMiddleware tests pass

Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com>
Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>

* fix(transport): return 405 on session-less GET /mcp (#4205) (#4284)

* fix(docker): install wget in a2a-echo-agent image for healthcheck

The compose healthcheck for a2a_echo_agent runs `wget -qO- http://localhost:9100/health`,
but the debian:bookworm-slim runtime stage only installed ca-certificates and tzdata —
so every probe failed with "exec: wget: executable file not found in $PATH" and the
container stayed perpetually unhealthy despite the agent itself listening correctly.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(transport): return 405 on session-less GET /mcp (#4205)

A strict MCP client opens a passive SSE stream via GET right after initialize
to receive server-initiated messages. When ContextForge cannot anchor that
stream to a session (stateful sessions disabled, session-core disabled in
Rust, or no Mcp-Session-Id presented), it previously either committed to an
empty event stream (Rust live-stream relay) or handed off to the SDK which
held the GET open without emitting events. Strict clients then timed out on
the empty stream and reported "Failed to initialize server session".

The MCP Streamable HTTP spec allows the server to return 405 for such GETs;
this change takes that route in both gateways.

Python (mcpgateway/transports/streamablehttp_transport.py):
  - SessionManagerWrapper.handle_streamable_http now returns 405 with
    Allow: POST, DELETE for GET when either settings.use_stateful_sessions
    is false or no Mcp-Session-Id was presented.
  - The check runs after _validate_server_id() and the RBAC gate so bogus
    server IDs still 404 and unauthorized callers still 403, preserving the
    "RBAC before processing" invariant.
  - Branch-specific detail messages distinguish "stateful sessions disabled"
    from "no session id presented" so clients don't retry in a loop.
  - logger.info on rejection with path + flags for ops visibility.

Rust (crates/mcp_runtime/src/lib.rs):
  - forward_transport_request returns 405 for GET when either
    session_core_enabled is false or no session id is presented in header
    or session_id query parameter. Without session_core, Rust cannot anchor
    an SSE stream; without a session id, there is nothing to anchor to.
  - The 405 response carries the same capability headers as every other
    terminal response in the function.
  - tracing::info! on rejection and a RuntimeStats.transport.get_rejected_no_session
    counter so ops can distinguish "no traffic" from "every client rejected".
  - One pre-existing test (server_scoped_transport_wrappers_inject_server_header)
    updated to provide a session id + record so it still exercises the
    server-id header injection path.

Tests:
  - Python: 3 new parametrized tests covering stateless/no-session/pass-through
    and both Mcp-Session-Id header aliases; server-scoped variant confirms
    405 fires only after server-id validation (and that 404 still wins for
    nonexistent servers).
  - Rust: 3 new tests covering 405 without session id, 405 when session_core
    disabled even with header, and live-stream branch still reached when a
    valid session id is presented.
  - Also un-hollows two pre-existing header-parsing tests
    (test_handle_streamable_http_non_tuple_header_skipped,
    test_handle_streamable_http_non_bytes_header_skipped) that would have
    been silently short-circuited by the new 405 path.

Note: This fixes the downstream GET-SSE cascade described in the comment
on #4205. The upstream per-client session-isolation problem from the
original reproducer (shared pooled upstream session leaking state between
downstream clients) is a separate architectural change and will be
addressed in a follow-up.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(transport): address review feedback on GET /mcp 405 path (#4205)

Follow-up to 252e8c228. Addresses review feedback surfaced by the
PR-review agents:

- Rust `runtime_session_id_from_request` now also accepts
  `x-mcp-session-id`, restoring parity with the Python gateway so a
  client isn't accepted by one runtime and 405'd by the other.
- Split the Rust 405 counter: `get_rejected_no_session` stays as the
  total and a new `get_rejected_session_core_disabled` subcounter
  isolates the deployment-config branch, so ops can distinguish client
  misuse from session-core being off.
- Added a Python Prometheus counter (`transport_get_rejected_total`)
  with matching `no_session_id` / `stateful_disabled` outcomes to close
  the observability gap between runtimes.
- Split log levels on both sides: `warn` for the operator-facing
  disabled-session / stateful-disabled branch, `debug` for the
  no-session-id branch (routine strict-client probing).
- Added a regression guard test that GET with RBAC denial returns 403
  (not 405) and emits no `Allow` header.
- Session-affinity `except Exception` in the Python gateway now warns
  instead of silently rendering infrastructure failures as 405s.
- 404 on unknown server no longer allowed to leak an `Allow` header.
- GET OAuth-enabled server test asserts 401 before the 405 path.
- Rust happy-path tests snapshot counters to assert no 405 tick.
- "Why 405 (not 404/400)" rationale added to both comment blocks.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(transport): gate Rust 405 inside live-stream branch (#4205)

Integration tests `get_and_delete_mcp_routes_forward_to_internal_transport_bridge`
and `live_stream_core_disabled_falls_back_to_python_transport_get` were
failing in CI: the standalone 405 block in `forward_transport_request`
pre-empted the Python transport-bridge fallback path, even for GETs the
runtime could have cleanly proxied.

The empty-stream cascade the 405 was defending against only arises when
the live-stream relay is entered without a session anchor. Move the 405
guard inside that branch so GETs that don't request SSE (or where
live_stream_core is disabled) fall through to the existing Python-proxy
fallback, which has its own 405 logic. Update the matching unit test to
set `Accept: text/event-stream` so it still reaches the gate.

No change to the counter semantics or Python-side behaviour. Full
workspace test suite (Rust + Python) passes.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* chore(rust): apply clippy 1.95.0 lint fixes in non-PR crates

Rustc stable moved from 1.94.1 to 1.95.0 between the last main Rust CI
run (#3704, 2026-04-16) and PR #4284, enabling two additional lints at
workspace `-D warnings` level. The new lints flag pre-existing code in
crates outside this PR's surface area, blocking Rust clippy CI for every
branch that touches any Rust file.

Apply clippy's exact suggestions:

- `crates/wrapper/src/json_rpc_id_fast.rs`: collapse a `FieldName`
  match arm's nested `if` into a match guard (`collapsible_match`).
- `crates/a2a_runtime/src/event_store.rs`: drop a redundant
  `.into_iter()` in a `.zip()` argument (`useless_conversion`).
- `crates/a2a_runtime/src/server.rs`: collapse two match-arm nested
  `if`s (`get`|`cancel` and `list`) into match guards.

Verified with `cargo clippy --workspace --all-targets -- -D warnings`
and `cargo test --workspace` on rustc 1.95.0. No behaviour change.

Signed-off-by: Jonathan Springer <jps@s390x.com>

---------

Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>

* feat(runtime): runtime-mutable RUST_MCP_MODE + A2A_MODE via admin API (#4296)

* feat(runtime): runtime-mutable MCP and A2A mode via admin API

Add PATCH /admin/runtime/mcp-mode and PATCH /admin/runtime/a2a-mode that
flip the public /mcp ingress and registered-A2A invocation path between
shadow and edge at runtime without a restart. Boot env vars
(RUST_MCP_MODE, RUST_A2A_MODE) still select the initial mode; overrides
are in-memory only. When Redis is configured the override propagates
cluster-wide via pub/sub on contextforge:runtime:mode with a per-runtime
monotonic INCR counter and a 24h hint key per runtime so freshly-started
pods reconcile to the cluster's current desired override on boot.

Off and full boot modes are intentionally not flippable (409): off has
no Rust sidecar to swap to; full keeps Rust as the owner of MCP session/
event-store/resume cores and flipping to shadow would orphan that state.

The reverse-proxy caveat (admin flips don't propagate past an upstream
nginx until the proxy is reconfigured) is documented and tracked
separately in #4278.

Closes #4273

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(runtime): preserve session-auth-reuse safety invariant under override

Codex stop-time review caught that `override="edge"` on a shadow-boot
deployment silently bypassed the documented safety invariant for routing
public traffic to Rust: `_should_mount_public_rust_transport` had been
loosened to require only `experimental_rust_mcp_runtime_enabled`, skipping
`experimental_rust_mcp_session_auth_reuse_enabled` (and the analogous
`delegate_enabled` for A2A). A boot=shadow deployment with an admin-issued
edge override would have started routing public MCP to a Rust sidecar
that never opted into session-auth-reuse.

Fix:
- `_should_mount_public_rust_transport` keeps the original invariant
  unchanged. Override="shadow" still forces Python; override="edge" now
  matches the default (honors the invariant). Same shape for
  `_should_delegate_a2a_to_rust`.
- Narrow `_FLIPPABLE_BOOT_MODES` to `{edge}`. Only boot=edge deployments
  have both flags true and can safely flip. boot=shadow PATCHes now
  return 409 with an operator-facing explanation of the invariant.
- Two belt-and-braces unit tests exercise the read side directly (even
  if a stale hint somehow landed override=edge in state, the transport
  function refuses to route to Rust without the safety flags).
- Two router tests cover the new 409 for shadow boot (MCP + A2A).
- Docs updated: architecture doc, ADR 043 addendum, crate README/STATUS,
  .env.example, docker-compose.yml.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(runtime): allow shadow override clearance and discard incompatible hints

Codex stop-time review caught a new dead-end introduced by the previous
safety-invariant fix: after restricting _FLIPPABLE_BOOT_MODES to {edge},
a shadow-boot deployment that inherited a stale override=edge via a
Redis hint (left by a prior edge-boot deploy) had no API path to clear
state — every PATCH returned 409. The operator's only recourse was
flushing Redis by hand.

Fix is two-sided:

Router:
- Gate on target mode, not boot mode. mode=shadow accepted from any
  boot that has a dispatcher mounted (shadow or edge). mode=edge still
  requires boot=edge (safety invariant). off and full still 409 for any
  PATCH (no dispatcher).
- Replaced _FLIPPABLE_BOOT_MODES with _DISPATCHER_BOOT_MODES and
  _EDGE_OVERRIDE_BOOT_MODES to make the two gates explicit.

Coordinator:
- _reconcile_from_hint now checks whether the hint's target mode can
  safely take effect on this deployment. Incompatible hints (e.g. a
  former edge-boot pod's hint replayed on a shadow-boot deploy) are
  discarded and boot_reconcile_status is set to the new INCOMPATIBLE_HINT
  value, surfaced via /health. Without this guard, shadow-boot pods
  would boot with override=edge in state while the transport layer
  refused to honor it — misleading diagnostics that required the
  escape-hatch PATCH to clear.
- Added _deployment_allows_mode helper that mirrors the safety-invariant
  checks in version.py (lazy-imported to avoid the version→settings cycle).

Tests:
- test_patch_mcp_mode_shadow_clears_stale_override_on_shadow_boot
  exercises the escape hatch end-to-end.
- test_reconcile_from_hint_discards_incompatible_mode verifies the
  coordinator refuses to apply a shadow-boot-incompatible hint.
- test_reconcile_from_hint_accepts_shadow_mode_on_shadow_boot covers
  the positive case (shadow is always compatible).

Docs updated to explain the per-target-mode gating and the
INCOMPATIBLE_HINT status.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(runtime): discard stranded hints on boot=off and boot=full pods

Codex stop-time review caught that the previous _deployment_allows_mode
check returned True for shadow unconditionally — but boot=off has no
Rust sidecar at all and boot=full mounts a plain RustMCPRuntimeProxy
with no dispatcher, so neither boot mode has a mechanism that could
observe an override. A shadow hint replayed onto either pod would
strand in state (effective_mode=shadow in diagnostics; transport keeps
routing to its boot-determined target), and the router 409s for any
PATCH on those boots so the operator would have no API path to clear it.

Tighten _deployment_allows_mode to match the router's admit logic:
shadow accepts only when a dispatcher is mounted (boot=shadow or edge
for MCP; boot=shadow or edge for A2A), and edge still requires the
session-auth-reuse / delegate-enabled invariant.

Three new tests:
- test_reconcile_from_hint_discards_shadow_on_full_boot
- test_reconcile_from_hint_discards_any_mode_on_off_boot
- test_reconcile_from_hint_discards_a2a_hint_on_a2a_off_boot

Architecture doc updated to spell out all three INCOMPATIBLE_HINT cases.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(runtime): gate live pub/sub on compat check + structured rejection reasons

Third-pass review identified a critical and several important issues:

C1: _listen_loop applied remote pub/sub messages without the same
compatibility check _reconcile_from_hint enforces. A remote pod's
incompatible flip would strand in local state. Mirror the same gate.

I2: _listen_loop's recovery branch fired on any non-exception return,
including get_message returning None. A pubsub broken in a way that
returns None reliably without raising would falsely re-promote
cluster_propagation from DEGRADED to REDIS. Gate the recovery on
message is not None — the existing idle-timeout branch already handles
the legit no-message case.

I1: Document the deliberate "discarded hint left in Redis" choice in
_reconcile_from_hint's comment.

S4+S5+S6 (combined refactor):
- Promote _deployment_allows_mode to return MoveCompatibility enum
  with structured rejection reasons (NO_DISPATCHER, BOOT_FULL_STRANDS,
  EDGE_NEEDS_SAFETY_FLAG, OK).
- Split BootReconcileStatus.INCOMPATIBLE_HINT into three granular
  values (INCOMPATIBLE_NO_DISPATCHER, INCOMPATIBLE_BOOT_FULL,
  INCOMPATIBLE_SAFETY_FLAG) so /health distinguishes the cases.
- Move the helper to version.py next to the safety-invariant functions
  with a new public deployment_allows_override_mode wrapper.
- Router now calls deployment_allows_override_mode directly, replacing
  the parallel _DISPATCHER_BOOT_MODES/_EDGE_OVERRIDE_BOOT_MODES sets
  with a single source of truth. The 409 detail string is generated
  from the MoveCompatibility reason and names the exact env var or
  flag the operator needs to flip.

Tests added (I3 + I4 + I5 + S1 + S2):
- test_reconcile_from_hint_discards_edge_on_full_boot
- test_reconcile_from_hint_discards_edge_on_off_boot
- test_incompatible_hint_does_not_delete_redis_key_or_downgrade_propagation
  (S1 + S2: assert hint key untouched AND cluster_propagation stays REDIS)
- test_patch_mcp_mode_edge_409_when_boot_mode_full
- test_deployment_allows_override_mode_mcp_table (8 parametrized rows)
- test_deployment_allows_override_mode_a2a_table (6 parametrized rows)

Test fixtures (edge_boot, off_boot, full_boot, shadow_boot) now patch
the underlying settings instead of just monkeypatching the boot-mode
helpers — the new deployment_allows_override_mode inspects raw
settings, so fixture drift would have masked real bugs.

Architecture doc updated to enumerate the three INCOMPATIBLE_* values
and document the live pub/sub compat check.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(runtime): drop self-contradicting 'full' recommendation from NO_DISPATCHER 409

Codex stop-time review caught that the NO_DISPATCHER 409 detail told
operators "Boot with RUST_..._MODE='shadow' or 'edge' (or 'full' for
MCP) to enable overrides." But boot=full returns BOOT_FULL_STRANDS
for the same reason (no dispatcher mounted) — switching to full would
just trade one 409 for another. Worse, A2A has no 'full' mode at all,
so the parenthetical was incoherent for A2A 409 responses.

Drop the "(or 'full' for MCP)" clause. Add an inline comment explaining
why full is intentionally excluded so a future maintainer doesn't
re-add it. Two existing tests now also assert "'shadow' or 'edge'" is
present and "'full'" is absent — pins the contract.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* chore(secrets): refresh detect-secrets baseline after rebase onto main

Line numbers in .secrets.baseline are tied to file content. After
rebasing this branch onto main (which itself touched .secrets.baseline
in upstream commits), the baseline needed a re-scan to match the
post-rebase final file state. No new secrets detected.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* docs(adr): defer generic cluster-wide settings propagation framework (ADR-050)

Records the decision to keep the runtime-mutable mode override
implementation purpose-built for MCP/A2A rather than extracting a
generic ClusterStatePrimitive framework today. Captures the alternative
abstraction sketch for future reference and enumerates the trigger
conditions that would justify revisiting (second concrete consumer in
flight, third independently-tracked use case on the horizon, or a need
for cross-consumer coordination like atomic multi-setting flips).

This is N=1 — CLAUDE.md's "three similar lines is better than a
premature abstraction" applies. The right shape will emerge from
copy-modifying for the second consumer and diffing the result.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* docs(adr): scrub AI-assistant name from ADR-050

Repo guideline forbids naming AI assistants in committed diffs. The
rationale section in ADR-050 cited 'Codex' as the source of stop-time
review iterations; rephrase to describe the review process without
naming the tool. Also drop the explicit 'CLAUDE.md' file citation in
favor of 'repo convention' since the supporting quote stands on its
own.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* docs(adr): tighten ADR-050 claims, deliver defenses checklist, add 0049 row

Comment-analyzer and code-reviewer review of ADR-050 surfaced several
claims that would rot, one promised section that didn't exist, and a
missing index row that visually framed ADR-050 as having skipped a
number.

ADR-050 fixes:

- Correct the "≈900 LOC together" claim to ~1.5 kLOC. Actual line counts
  are runtime_state.py 977 + runtime_admin_router.py 489 = 1,466.
- Replace "six stop-time review iterations" with the verifiable "five
  follow-up fix(runtime): commits on top of the initial feat(runtime):"
  so the count is grep-checkable from git log.
- Drop the "repo convention is explicit: three similar lines beats a
  premature abstraction" framing — that line is from contributor tooling,
  not anything written in the repo. The N=1 argument now stands on its
  own merits.
- Replace "the most-reviewed code on the branch" superlative with the
  timeless "received unusually heavy review during initial development."
- Caption the ClusterStatePrimitive code block as a hypothetical sketch
  so readers who grep-arrive mid-document don't think the class exists.
- Move the reverse-proxy WARN attribution from runtime_state.py to
  runtime_admin_router.py (it lives in the router).
- Name LISTEN_LOOP_DEGRADE_THRESHOLD = 5 and
  RUNTIME_STATE_HINT_TTL_SECONDS = 24 * 60 * 60 in the context section
  so future drift in those constants is detectable from this ADR.
- Reconcile trigger-#1's "build both consumers side-by-side" with the
  rationale's "copy-modify" guidance: copy-modify first, then extract
  the framework from the resulting two-implementation diff.
- Deliver the "non-obvious defenses" checklist that the Negative
  consequence promised but never wrote. The new "Defenses checklist for
  the copy-modify path" section enumerates 12 defenses, each tied to a
  bug class caught during this PR's review iterations.

Index fix:

- Add the missing 0049 row for "Multi-Protocol Virtual Servers"
  (Accepted, Architecture, 2026-04-16). The 048→050 jump made ADR-050
  look like it had skipped a number when ADR-049 was just un-indexed.
- Other duplicate ADR numbers on disk (005, 034, 038, 048) and other
  index gaps are pre-existing and tracked separately in #4290.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* feat(transports): swappable MCPIngressMount + nginx-style Rust public proxy (ADR-051)

Replace MCPStreamableHTTPModeDispatcher (closed-set, two transports,
policy baked into dispatch) with MCPIngressMount: a thin ASGI
indirection that holds a registry of named ingresses + a swappable
selector. Adding a new ingress is one register() call; selection
policy is one function.

Three ingresses registered today:

- "python" — the existing MCPRuntimeHeaderTransportWrapper. Always
  registered.
- "rust-internal" — the existing RustMCPRuntimeProxy that forwards
  via httpx to MCP_RUST_LISTEN_HTTP / UDS (trusted-internal listener).
  Registered for boot=shadow/edge.
- "rust-public" — new RustMCPPublicProxyApp that does nginx-style
  reverse-proxy to MCP_RUST_PUBLIC_LISTEN_HTTP. Adds X-Forwarded-*,
  RFC 7239 Forwarded, X-Real-IP. Streams both directions, preserves
  Authorization/Cookie. Registered only for boot=edge.

Selector chooses between rust-internal and rust-public based on the
new settings.mcp_rust_ingress (default "internal" preserves prior
behavior bit-for-bit). Selector preserves all existing safety
invariants — _should_mount_public_rust_transport() is consulted first,
shadow override / boot=full / boot=off behave identically to before,
the runtime-mode override admin API requires no changes.

Test results: 1839 passing across the touched modules. Lint clean.

ADR-051 documents the design decision, alternatives considered, and
migration notes from MCPStreamableHTTPModeDispatcher. Index updated.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(transports): close XFF spoof + harden Rust public proxy ingress

Response to review of 19677706b (MCPIngressMount + nginx-style Rust
public proxy). Closes a security issue and several diagnostic gaps in
the public proxy and ingress selector.

Security fix — RustMCPPublicProxyApp:

- _build_forwarded_headers now drops client-supplied Forwarded /
  X-Forwarded-* / X-Real-IP from the inbound iteration before
  re-deriving them from request.client. The prior code appended to
  the client-supplied X-Forwarded-For, letting any caller spoof their
  own source IP to Rust's /_internal/mcp/authenticate callback in the
  no-nginx-in-front deployment this proxy targets. New
  _FORWARDED_CHAIN_HEADERS frozenset enumerates the strip set.
- _format_forwarded_for emits RFC 7239 §6.3-conformant `for=unknown`
  (no synthetic `:0`) for absent clients, brackets IPv6 addresses
  exactly once, and tolerates already-bracketed input.

Selector hardening — _select_mcp_ingress / config:

- mcp_rust_ingress is now Literal["internal", "public"] so Pydantic
  rejects typos at boot.
- Selector only returns "rust-public" when boot_mcp_runtime_mode() ==
  "edge" — the public listener isn't bound on shadow boot, so a stale
  edge override + mcp_rust_ingress=public combination would otherwise
  return an unregistered name and silently fall back to Python at
  debug-only log severity.
- _build_mcp_transport_app emits a logger.error (not warning) when
  mcp_rust_ingress=public is set on shadow boot, so the misconfig
  survives the project-default LOG_LEVEL=ERROR.

Error handling — RustMCPPublicProxyApp:

- __call__ now distinguishes asyncio.CancelledError (re-raised),
  httpx.HTTPError (logged with type name + 502), and any other
  exception (logger.exception + 500), so a non-HTTPError no longer
  bypasses the handler and produces a silent Starlette 500.
- _body_iter logs CancelledError at debug (common for SSE client
  disconnect) and other exceptions at error (rare upstream death),
  before re-raising. The finally still calls aclose to return the
  upstream connection to the pool.
- Auth-relevant or server-error upstream statuses (401/403/5xx) now
  log a warning with method+path+status so a credential-rotation flood
  is visible without raising the global log level.

MCPIngressMount diagnostics:

- dispatch's selector-miss paths (fallback used, 503-on-no-fallback)
  now log warning with the unknown name AND self.names(), instead of
  debug-only on the fallback path with no log on the 503 path. A
  misrouted ingress should never be silent.

Tests:

- New tests/unit/mcpgateway/transports/test_rust_mcp_public_proxy.py
  (24 tests). Covers _format_forwarded_for parametrized over IPv4
  with/without port, IPv6, already-bracketed IPv6, None/empty host;
  _build_forwarded_headers spoof rejection + hop-by-hop strip + auth
  preservation + unknown-client emission; proxy ASGI behavior across
  status passthrough, 502 on httpx.HTTPError, 500 on unexpected,
  parametrized warning on (401, 403, 500, 503), non-HTTP scope → 404,
  CancelledError re-raise on send and mid-stream debug-branch, lazy
  client lifecycle, multi-chunk SSE streaming, and aclose-on-exit.
- Two new tests in test_mcp_ingress_mount.py pin the warning-level
  log severity on selector-miss + 503 paths.
- Two new tests in test_main_extended.py pin the boot-time error log
  for the shadow + mcp_rust_ingress=public misconfig and the selector
  downgrade to rust-internal under the same condition.

Doc rot fixes:

- mcp_ingress_mount.py module docstring uses the full ADR filename
  (was an ellipsis).
- version.py docstring references MCPIngressMount instead of the
  removed MCPStreamableHTTPModeDispatcher.
- ADR-051 migration notes use test-group descriptors instead of
  pinned line numbers / test names.

Test results: 652 passing across the touched modules. Lint clean
(black, ruff, interrogate 100%).

Signed-off-by: Jonathan Springer <jps@s390x.com>

* chore(quality): fix pylint W0706/W2301/R0401 and close coverage gaps

PR review feedback:

Pylint findings:

- W0706 (try-except-raise) at runtime_state.py:933 and
  rust_mcp_public_proxy.py:237: removed the redundant
  ``except asyncio.CancelledError: raise`` blocks. CancelledError is
  BaseException in Python 3.8+, so the surrounding ``except Exception``
  / ``except httpx.HTTPError`` arms never caught it; the explicit
  re-raise was no-op and made pylint complain. Behavior is unchanged
  (asyncio still cancels both coroutines naturally), and the intent is
  preserved as a comment.
- W2301 (unnecessary-ellipsis) at mcp_ingress_mount.py:37: dropped the
  ``...`` from the ASGIApp Protocol's ``__call__`` body — the
  docstring already counts as a body for Protocol methods.
- R0401 (cyclic-import) on the runtime_state ↔ version pair: added
  per-line ``# pylint: disable=cyclic-import`` to the two lazy imports
  in runtime_state.py. Pylint follows function-body imports for cycle
  detection; the lazy imports break the runtime cycle but pylint still
  flags the static graph. The full extract (move MoveCompatibility +
  coercers to a shared module) is bigger refactor than this PR
  warrants.

Coverage gaps closed (per-file delta):

- routers/runtime_admin_router.py 97.3% → 100%. Added three tests for
  ``_warn_if_behind_reverse_proxy``: header detection emits warning
  with header names, no warning when no forwarded headers, defensive
  early return when ``request.headers`` is None.
- transports/mcp_ingress_mount.py 94.7% → 100%. Added test for
  ``set_fallback`` covering swap mid-life and ``None`` restoring
  503-on-miss behavior.
- transports/rust_mcp_public_proxy.py 96.5% → 100%. Added test for
  ``_body_iter`` taking the ``except Exception`` arm on a non-cancel
  mid-stream error: asserts ``logger.exception`` fires with traceback.
- runtime_state.py 89.9% → 99%. Added 9 tests covering: defensive
  ValueError branches in 4 accessor methods (parametrized), idempotent
  ``coordinator.start()``, ``_reconcile_from_hint`` no-redis and
  malformed-mode paths, ``_cleanup_pubsub`` timeout / exception /
  fallback-to-close paths, listen-loop bytes-payload decode + idle
  recovery + malformed-payload + empty-data + uncoerceable-fields
  branches.
- version.py: targeted user-noted gaps (379, 434, 509, 570, 623) all
  closed. Added tests for last_change population in
  _mcp_runtime_status_payload / _a2a_runtime_status_payload, A2A
  shadow override forcing _should_delegate_a2a_to_rust False, and
  the boot_mcp_transport_mount public wrapper. Line 509
  (NO_DISPATCHER fallthrough — _coerce_runtime rejects unknown kinds
  upstream) marked ``# pragma: no cover`` as a defensive default.

Other defensive defaults marked ``# pragma: no cover``:

- runtime_admin_router.py:99 (OK fallthrough — caller already returned
  for OK upstream).
- runtime_state.py:164, 171 in ``_move_compat_to_reconcile_status``
  (OK case shouldn't reach here per docstring; final return is a
  defensive default for future MoveCompatibility variants).

Test results: 1272 passing across the touched modules. Lint clean
(black, ruff, pylint 10.00/10, interrogate 100%).

Signed-off-by: Jonathan Springer <jps@s390x.com>

---------

Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>

* build(container): consolidate on Containerfile.lite; deprecate others (#4297)

Closes #2613.

The top-level `Containerfile` duplicated the .lite build with weaker layer
caching and shipped build tooling in the runtime image. `Containerfile.scratch`
had no frontend-builder stage (admin UI bundle absent), failed under QEMU
(amd64-only), and had no consumer in CI, docs, or the Makefile. Both are
removed; `Containerfile.lite` is now the only canonical gateway image.

Apply the same layer-consolidation and caching pass across `Containerfile.lite`
and the per-component Containerfiles:

- Merge adjacent RUN steps (microdnf + ln + useradd; profile.d setup + chmod;
  Rust MCP + A2A cargo builds; ...) to reduce layer churn.
- Reorder COPYs from stable to volatile so frontend/Rust-binary changes no
  longer invalidate the heavy venv-install layer.
- Use COPY --chown=1001:0 in the UBI nodejs frontend-builder stage — fixes
  EACCES on package-lock.json under the non-root UBI default user — and
  COPY --chmod=0755 on run scripts to drop follow-up chmod layers.
- Replace \`npm install --frozen-lockfile\` (a yarn/pnpm flag npm silently
  ignores) with \`npm ci\`.
- Silence pre-existing hadolint DL3013/DL3041 findings with targeted ignore
  comments consistent with the file-header policy.

Update every touchpoint of the deprecated images: Makefile podman/docker[-dev]
targets, fly.toml, docker-compose commented migration examples, the
docker-scan workflow matrix and trigger paths, .travis.yml,
.pre-commit-config.yaml, .bumpversion.cfg, MANIFEST.in, and user-facing docs
(packaging, release-management, compose, openshift, container, fly-io, argocd
tutorial, MULTIPLATFORM, roadmap). Update the docker-scan unit test to match
the trimmed workflow. Incidentally fix the stale \$(CONTAINERFILE) typo in
the snyk-iac Makefile target.

Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>

* feat: add HCS-14 UAID support for A2A agents (#4125)

* add HCS-14 UAID support for A2A agents

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* migration

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* lint error fix

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* test case fix

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* pyporject.toml

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* test coverage

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* feat: add base58 dependency and UAID routing config
  - Add base58>=2.1.1 dependency for UAID generation support
  - Add uaid_allowed_domains config setting for cross-gateway routing security
  - Document UAID_ALLOWED_DOMAINS in .env.example
  Related to #3956

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* a2a_agent_id column size change

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* feat: add UAID support to Admin UI
  - Add UAID generation checkbox to create/edit A2A agent forms
  - Add UAID registry and protocol configuration fields
  - Display UAID vs UUID badge in agents table
  - Show UAID metadata section in agent view modal
  - Add JavaScript toggle logic for UAID fields
  - Update form population for editing UAID agents
  - Rebuild UI bundle with UAID changes
  Completes HCS-14 UAID feature with full UI support.
  Related to #3956

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* uaid ui change

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* revert id column length

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* uaid in admin ui create form

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* edit form UAID

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* test

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* length change

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* Add security documentation for UAID-based cross-gateway routing feature

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* documentation

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* test case

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* test: Add coverage for UAID cross-gateway routing edge cases

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* alembic head

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* test coverage

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* base58 and its deps resolved

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* security: fix SSRF and domain allowlist bypass in UAID cross-gateway routing

Fixes two critical security vulnerabilities in UAID-based cross-gateway routing:

1. Domain Allowlist Bypass (CVE-level)
   - Previous: endpoint.endswith(d) allowed "evilexample.com" to bypass "example.com" allowlist
   - Fixed: Require exact match OR proper subdomain prefix (.example.com)
   - Now correctly blocks: "evilexample.com", "notexample.com"
   - Still allows: "gateway.example.com", "example.com", "sub.gateway.example.com"

2. SSRF Injection via nativeId (CVE-level)
   - Previous: nativeId field directly interpolated into URL without validation
   - Attack vectors blocked:
     * Protocol injection: file:///etc/passwd, gopher://internal
     * User-info bypass: evil@127.0.0.1 (routes to internal IP)
     * Path injection: gateway.com/admin (access internal endpoints)
   - Legitimate ports still allowed: gateway.example.com:8443

Security validations added before URL construction:
- Reject protocol prefixes (://)
- Reject @ character (user-info bypass)
- Reject path components (/)
- Validate hostname format via urlparse
- Extract domain from endpoint:port for allowlist checking

Test coverage (8 new tests):
- test_invoke_remote_agent_domain_allowlist_bypass_attack
- test_invoke_remote_agent_domain_allowlist_subdomain_valid
- test_invoke_remote_agent_domain_allowlist_exact_match
- test_invoke_remote_agent_ssrf_protocol_injection
- test_invoke_remote_agent_ssrf_userinfo_bypass
- test_invoke_remote_agent_ssrf_path_injection
- test_invoke_remote_agent_ssrf_internal_ip (documents behavior)
- test_invoke_remote_agent_port_allowed

Related: #3956

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* style: apply black and isort formatting to schemas and UAID tests

- black: Fix line length in PluginPolicyItem config field description
- isort: Move First-Party imports to correct section in test files

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* security

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* uaid split

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* ruff

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* test cases

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* security test cases

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* varibale correct

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* ruff

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* rebase fix

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

* feat(a2a): A2A-1.0 UAID path in Rust sidecar + federation-loop guard

Splits UAID cross-gateway routing between the two runtimes by intent:
the Python gateway keeps the ContextForge-to-ContextForge federation
shape (target `/a2a/{uaid}/invoke` on a peer CF gateway), and the Rust
a2a_runtime gets a first-class A2A-1.0-compliant path that POSTs the
original JSON-RPC 2.0 body directly to the agent URL encoded in the
UAID's `nativeId`. The two paths are intentionally not reverse-compatible.

Key pieces:

- `crates/a2a_runtime/src/uaid.rs` (new): HCS-14 parser with SSRF
  rejection (user-info, path injection, unsupported scheme),
  case-insensitive domain allowlist (exact or `.suffix`), DoS length
  cap, sealed `RoutingTarget` with private fields and a `Proto` enum
  (`FromStr` + exhaustiveness test), stable `UaidError::code()` for
  structured logs.
- `crates/a2a_runtime/src/server.rs`: UAID dispatch in
  `handle_a2a_invoke` with local-resolve-first and cross-gateway
  fall-through; SSE proxy (`handle_uaid_cross_gateway_streaming`)
  with Last-Event-ID replay and a bounded JSON fallback that enforces
  `max_response_body_bytes` via a streaming `read_bounded_body` helper
  (per-chunk + cumulative cap); startup validation and rate-limited
  warning for an empty allowlist; explicit one-time security warning
  for the unforwarded-auth tradeoff.
- `mcpgateway/main.py` + `mcpgateway/services/a2a_service.py`: kind-
  dispatch resolve (name/id/uaid) with 403 for visibility-denial so
  the sidecar can distinguish hidden-from-absent; URL-encoded UAID in
  the CF-federation target path; dropped redundant `agent_id` body
  field; structured error logging without leaking 5xx internals.

Federation-loop prevention covers both A->B->A and self-referential
`endpoint_url` loops:

- `X-Contextforge-UAID-Hop: N` counter replaces the earlier binary
  marker. Every outbound (cross-gateway, local-agent, and the
  backward-compat `POST /invoke`) stamps `N+1`; every inbound rejects
  at `uaid_max_federation_hops` (default 5).
- Strict parser with byte-identical rules in both runtimes: pure
  ASCII digits only (no Unicode digits), no whitespace on single
  values, saturating at `2**31 - 1`; RFC 7230 section 3.2.2 coalesced
  form `"0, 10"` parses each element with OWS trim and takes the max
  so smuggled low values can't mask a real high value.
- Max-of-variants defense against header-case smuggling: `HashMap`
  iteration order can't be used to select a low value paired with a
  high one, and axum's multi-valued `HeaderMap` is drained via
  `get_all` + max.
- Named-agent paths translate resolve's 403 to 404 at the axum edge
  so private-agent existence never leaks through the status code or
  the error text (`map_agent_resolve_err` helper).

Config additions:

- Python `uaid_allowed_domains` (list), `uaid_max_length` (2048),
  `uaid_max_federation_hops` (default 5).
- Rust `A2A_RUST_UAID_ALLOWED_DOMAINS`, `A2A_RUST_UAID_MAX_LENGTH`,
  `A2A_RUST_UAID_MAX_FEDERATION_HOPS` (same defaults).

Tests:

- Python unit/integration: kind-dispatch resolve (uaid/id/name fall-
  back), parse_hop_count parametric cases including Unicode digits,
  whitespace, overflow, coalesced RFC 7230 form and smuggling, local-
  agent hop stamping on `_execute_agent_request`, 3-hop chain locking
  in the stamp-N+1 vs reject-at-max contract, TestClient-level
  `/a2a/{agent_name}/invoke` header plumbing.
- Rust integration: 12+ UAID dispatch tests (local hit, cross-gateway
  forward, streaming SSE proxy, body-size cap on chunked responses,
  per-chunk cap, hop-limit at max for UAID and plain names,
  multi-variant smuggling, coalesced header, 403->404 on named paths,
  case-insensitive `/invoke` hop guard).
- Rust uaid.rs: 16 unit tests covering parse/validate/SSRF/allowlist
  round-trip plus `Proto` exhaustiveness and `UaidError::code()`
  stability contracts.

Closes #3956

Signed-off-by: Jonathan Springer <jps@s390x.com>

---------

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Jonathan Springer <jps@s390x.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>

* fix(#4205): isolate upstream MCP sessions per downstream session (#4299)

* feat(services): scaffold UpstreamSessionRegistry for 1:1 upstream binding (#4205)

Introduces UpstreamSessionRegistry, which maps
(downstream_session_id, gateway_id) to a single upstream MCP ClientSession.
Replaces the identity-keyed MCPSessionPool's sharing semantics: two
downstream MCP sessions — even carrying the same user identity — can no
longer receive the same pooled upstream session. This is the core
isolation change #4205's reproducer needs.

Design:
  - 1:1 per (downstream_session_id, gateway_id). Never shared.
  - Connection reuse preserved WITHIN one downstream session: a client
    making many tool calls against gateway G through session S still
    hits the upstream's initialize exactly once.
  - Health probe on reuse after idle_validation_seconds (60s default),
    chain is ping → list_tools → list_prompts → list_resources → skip.
    Failed probe recreates the upstream session.
  - Transport + ClientSession live inside a dedicated asyncio.Task whose
    anyio cancel scope is bound to that task, not to the request handler
    (#3737). Shutdown is signal-driven via an asyncio.Event; cancellation
    only as a last resort with a timeout.
  - No configurable settings. The tuning surface of the old pool
    (max_per_key, TTL, etc.) collapses under the 1:1 constraint.
  - Purely in-process. Multi-worker stickiness for a downstream session
    is the session-affinity layer's concern (to be extracted next) —
    each worker's registry only sees requests affinity has already
    routed to it.

API surface:
  - acquire() async context manager keyed by downstream_session_id,
    gateway_id, url, headers, transport_type. Rejects empty ids.
  - evict_session(downstream_session_id) for DELETE /mcp paths.
  - evict_gateway(gateway_id) for gateway rotation/removal.
  - close_all() for app shutdown.
  - snapshot() -> RegistrySnapshot for /admin and logs.
  - Module-level init/get/shutdown singleton accessors matching the
    shape of get_mcp_session_pool() so call-sites migrate cleanly.

Tests (16, all passing):
  isolation across downstream sessions (the #4205 invariant), reuse
  within a session, distinct upstreams per gateway for one session,
  concurrent acquires collapse to one create via the per-key lock,
  idle probe + reuse, failed probe + recreate, evict_session /
  evict_gateway / close_all teardown, dead owner-task detection,
  gateway-internal header stripping, and the singleton accessors.
  A FakeClientSession + fake SessionFactory keep the tests hermetic.

Not yet wired: nothing in the codebase calls the registry yet. The
follow-up commits will extract session-affinity to its own module,
wire startup/shutdown + DELETE eviction, migrate tool/prompt/resource
services, refactor gateway health checks, delete MCPSessionPool, and
remove the associated feature flags.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* feat(services): wire UpstreamSessionRegistry into app lifecycle (#4205)

Adds startup init, shutdown drain, and DELETE-triggered eviction for
the registry introduced in the previous commit. All additive; the old
MCPSessionPool still runs alongside for now.

main.py:
  - init_upstream_session_registry() at startup, unconditionally (no
    feature flag — the registry is always on).
  - shutdown_upstream_session_registry() in the teardown block,
    ordered between pool shutdown and SharedHttpClient shutdown to
    ensure upstream sessions close before their HTTP transports go.

cache/session_registry.py (downstream session registry, distinct from
the upstream one this PR introduces):
  - remove_session() now calls get_upstream_session_registry().evict_session(id)
    as its last step. Fires on every path that drops a downstream session:
    explicit DELETE /mcp, internal /_internal/mcp/session DELETE, SSE
    disconnect housekeeping, database-backed session expiry. Wrapped so
    a missing singleton (tests, early shutdown) or an eviction exception
    never masks downstream teardown.

Tests (5 new, all passing):
  remove_session → evict_session forwarding; remove_session tolerating
  an uninitialized singleton; remove_session surviving an evict_session
  that raises; shutdown close_all() call + singleton clear; re-init
  after shutdown returns a fresh instance. Existing session_registry
  coverage tests still green (72 tests, no regressions).

Signed-off-by: Jonathan Springer <jps@s390x.com>

* feat(tool_service): route MCP tool calls through UpstreamSessionRegistry (#4205)

Both MCP call-sites in tool_service.invoke_tool (SSE at ~L5048, StreamableHTTP
at ~L5230) now acquire their upstream ClientSession from the registry when a
downstream Mcp-Session-Id is in scope. This is the first visible behavioural
change of the #4205 fix: two downstream MCP sessions served by the same user
now build SEPARATE upstream sessions, so stateful upstream servers (counter,
etc.) no longer leak state between downstream clients.

Changes:
  - Replace the conditional pool path with an unconditional registry path,
    gated on the presence of a downstream Mcp-Session-Id (read from the
    transport's request_headers_var via a new _downstream_session_id_from_request
    helper) AND a gateway_id AND not tracing_active.
  - Drop the `settings.mcp_session_pool_enabled` check at the call-sites.
    The registry is always on; its applicability is determined by whether a
    downstream session id is in scope.
  - Keep the per-call fallback path for callers without a downstream
    session id: admin UI test-invoke, internal /rpc, and anything that
    drives the tool_service outside the streamable-http transport.
  - Preserve the tracing trade-off: when tracing_active, skip the registry
    to allow per-request traceparent/X-Correlation-ID injection.
  - Remove the now-unused `get_mcp_session_pool, TransportType` import from
    mcp_session_pool; TransportType is now imported from the registry module.
  - _downstream_session_id_from_request uses a lazy import of
    streamablehttp_transport.request_headers_var to avoid a circular
    dependency at module load time.

Tests:
  - NEW test_invoke_tool_mcp_two_downstream_sessions_hit_registry_with_distinct_ids:
    the direct-consequence #4205 test — two invocations with different
    downstream session ids must produce two acquire() calls with distinct
    (session_id, gateway_id) keys, same gateway, different sessions.
  - Rewrote test_invoke_tool_mcp_pooled_path_does_not_inject_trace_headers
    as the equivalent registry-path test (same invariant: reused upstream
    transports must not receive per-request trace headers).
  - Rewrote 4 pool-hit tests in test_tool_service_coverage.py to use the
    registry API (and set request_headers_var with a session id).
  - 870 related tests pass; no regressions.

This migration leaves the old pool in place — it simply isn't called from
tool_service anymore. Prompt/resource/gateway call-sites still point at the
pool and will migrate in the next commits.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* feat(prompts, resources): route MCP calls through UpstreamSessionRegistry (#4205)

Mirrors the tool_service migration: prompt_service and resource_service acquire
their upstream MCP ClientSession from the registry when a downstream Mcp-Session-Id
is in scope, falling back to per-call sessions otherwise. Same 1:1 isolation,
same connection reuse within a session, same trace-header trade-off.

Changes:
  - prompt_service._fetch_prompt_from_gateway: replace pool path with registry
    path; drop the `settings.mcp_session_pool_enabled` gate; drop the now-unused
    `pool_user_identity` local.
  - resource_service.invoke_resource SSE + StreamableHTTP helpers: same
    rewrite. Also delete the `pool_user_identity` normalization block at
    line ~1708 (no longer referenced).
  - upstream_session_registry: add `downstream_session_id_from_request_context()`
    so the three services share one implementation. tool_service now aliases
    the shared helper rather than carrying its own copy.
  - TransportType is imported from upstream_session_registry in all three
    services; the pool's copy becomes unused and disappears in the
    hollow-and-rename step.

Tests:
  - Rewrote 4 pool-hit tests in test_resource_service.py as registry-path
    tests (set request_headers_var with a downstream session id, patch
    get_upstream_session_registry). Renamed for accuracy:
    test_sse_session_pool_used_and_signature_validated →
      test_sse_registry_used_and_signature_validated
    test_invoke_resource_streamablehttp_uses_session_pool_when_available →
      test_invoke_resource_streamablehttp_uses_registry_when_available
    The two "pool not initialized falls back" tests now simulate an
    uninitialized registry via RuntimeError from get_upstream_session_registry.
  - 1173 service-layer tests pass; no regressions.

Two holdouts still touch the pool: gateway_service's health check (task #19)
and the pool itself (task #23 deletes its upstream-session code and renames
the file to session_affinity.py).

Signed-off-by: Jonathan Springer <jps@s390x.com>

* refactor(gateway_service): drop pool from health checks (#4205)

Gateway health checks are system operations — no downstream MCP session id
is in scope — so they can't key the UpstreamSessionRegistry. They also
don't benefit meaningfully from connection reuse: each probe is a one-shot
initialize round-trip to detect reachability. Use a fresh per-call session
unconditionally.

Changes:
  - _check_single_gateway_health (streamablehttp branch): replaces the
    pool-or-fallback block with a single streamablehttp_client +
    ClientSession per probe.
  - Drop the `mcp_session_pool_explicit_health_rpc` feature flag usage
    (the setting itself disappears in the config cleanup commit). The
    initialize() round-trip is the probe; no optional list_tools() needed.
  - Imports: drop `get_mcp_session_pool, TransportType` from the pool
    import; keep `register_gateway_capabilities_for_notifications`
    (still used by three other call-sites in gateway_service). Also drop
    the now-unused `anyio` import (was used only for the pool branch's
    fail_after on list_tools).

Tests:
  - Rewrote test_streamablehttp_pool_not_initialized_falls_back_to_per_call_session
    as test_streamablehttp_health_uses_per_call_session (since per-call is
    now unconditional, that's the whole behavior the test pins).
  - Deleted test_streamablehttp_pool_used_and_explicit_health_rpc_calls_list_tools
    (exercised a code path that no longer exists).
  - Deleted tests/unit/mcpgateway/services/test_gateway_explicit_health_rpc.py
    entirely — it only tested the MCP_SESSION_POOL_EXPLICIT_HEALTH_RPC
    feature flag's on/off branches.
  - 168 gateway_service tests pass; no regressions.
Signed-off-by: Jonathan Springer <jps@s390x.com>

* refactor(services): rename mcp_session_pool.py → session_affinity.py (#4205)

Pure file rename + import updates. No behavioural changes. The
MCPSessionPool class, its methods, and the module-level init/get/close
helpers keep their names for this commit — those renames happen in the
follow-up once the pool machinery inside the file is hollowed out.

Rationale: the file's true cargo is cluster affinity (Redis-backed
session→worker mapping, heartbeat, SET NX ownership claim, Lua CAS
reclaim, session-owner forwarding, RPC listener). The MCP upstream-session
pooling part — which the UpstreamSessionRegistry has replaced — is now
dead weight, and naming the file "session pool" no longer reflects what
this module actually does for the codebase.

Changes:
  - git mv mcpgateway/services/mcp_session_pool.py → session_affinity.py
    (history-preserving; git blame / log --follow still work).
  - Bulk-update every `mcpgateway.services.mcp_session_pool` import across
    production (7 files) and tests (12 files) to
    `mcpgateway.services.session_affinity`. Mechanical sed, reviewable
    as one pass.
  - No class/method/function renames in this commit — that's the next
    one, after the hollow.

Tests: 1481 pass, 2 skipped (pre-existing). No regressions.

Next: hollow out the pool-only code (PooledSession, acquire/release/
session() context manager, pool queue, health chain, identity hashing,
max-per-key semaphore, circuit breaker) from session_affinity.py. The
affinity surface stays. After that, a final commit renames MCPSessionPool
→ SessionAffinity and updates method names to drop the pool/streamablehttp
prefixes where they're now misleading.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* refactor(services): hollow out pool machinery from session_affinity.py (#4205)

Deletes the upstream-MCP session-pooling code that no service-layer
caller uses anymore — the UpstreamSessionRegistry replaced it four
commits back. What remains in session_affinity.py is the multi-worker
cluster-affinity surface that streamablehttp_transport.py still relies
on: Redis-backed downstream-session→worker mapping, worker heartbeat,
atomic SET NX ownership claim, Lua CAS reclaim from dead workers,
cross-worker session-owner HTTP/RPC forwarding, the pub/sub RPC
listener, and the is_valid_mcp_session_id validator.

Net: session_affinity.py goes from 2417 → 1030 lines. Class name,
method names, and module-level accessor names are unchanged in this
commit so the diff is purely "delete dead code"; the rename is the
next commit.

Deleted from the file:
  - class TransportType (moved to upstream_session_registry earlier)
  - class PooledSession
  - _get_cleanup_timeout helper (read an obsolete setting)
  - PoolKey / HttpxClientFactory / IdentityExtractor type aliases
  - DEFAULT_IDENTITY_HEADERS frozenset
  - MCPSessionPool.__init__'s 16 pool-specific parameters and their
    corresponding self._* fields (pools, active sets, locks, semaphores,
    circuit breaker state, pool_last_used, eviction throttling, pool-hit
    metrics, identity_headers / identity_extractor, health-check
    config, max_total_keys / max_total_sessions)
  - __aenter__ / __aexit__ (pool session context manager)
  - _compute_identity_hash, _make_pool_key
  - _get_or_create_lock, _get_or_create_pool
  - _is_circuit_open / _record_failure / _record_success (circuit breaker)
  - acquire(), release()
  - _maybe_evict_idle_pool_keys
  - _validate_session, _run_health_check_chain (pool health chain)
  - _session_owner_coro, _create_session, _close_session
  - get_metrics() (pool-level stats)
  - session() context manager

Rewritten:
  - close_all() — now stops the heartbeat and RPC-listener tasks and
    clears the in-memory session mapping. No pool state to drain.
  - drain_all() — clears the in-memory session mapping only. Kept for
    the SIGHUP handler contract; upstream-session lifetime is owned by
    the registry now.
  - init_mcp_session_pool() — signature collapses from 18 parameters
    to 3 (message_handler_factory, enable_notifications,
    notification_debounce_seconds). All pool tunables are gone.

Callers updated:
  - main.py startup: single init_mcp_session_pool() with no args,
    gated only on settings.mcpgateway_session_affinity_enabled (the
    `or settings.mcp_session_pool_enabled` fork is redundant now).
  - main.py post-startup: notification service start moves under the
    affinity guard alongside heartbeat and RPC-listener startup
    (they share lifecycle).
  - main.py shutdown: same simplification.
  - _create_jwt_identity_extractor in main.py is now unused; it'll
    disappear in the config-cleanup commit (#20) along with the
    mcp_session_pool_jwt_identity_extraction setting.

Tests: 2064 pass across service + transport + cache suites. 2 pre-existing
skips. No regressions.

Next: the mechanical class/method rename (MCPSessionPool → SessionAffinity,
plus the corresponding method rename-downs of the `pool_` / `streamable_http_`
prefixes that no longer reflect what this module does).

Signed-off-by: Jonathan Springer <jps@s390x.com>

* refactor(services): rename MCPSessionPool → SessionAffinity (#4205)

Mechanical rename across 22 files. No behavioural changes. Follows the
hollow commit that removed the pool machinery; what survives is
affinity-only, so the class name and public API now match the module name.

Renames applied verbatim (order matters — longer substrings first):

  Class:
    MCPSessionPool                         → SessionAffinity

  Module-level accessors:
    get_mcp_session_pool                   → get_session_affinity
    init_mcp_session_pool                  → init_session_affinity
    close_mcp_session_pool                 → close_session_affinity
    drain_mcp_session_pool                 → drain_session_affinity
    start_pool_notification_service        → start_affinity_notification_service
    _mcp_session_pool (module global)      → _session_affinity

  Methods (public surface that callers use):
    register_pool_session_owner            → register_session_owner
    cleanup_streamable_http_session_owner  → cleanup_session_owner
    get_streamable_http_session_owner      → get_session_owner
    forward_streamable_http_to_owner       → forward_to_owner

  Internal helpers:
    _get_pool_session_owner                → _get_session_owner
    _cleanup_pool_session_owner            → _cleanup_session_owner
    _pool_owner_key                        → _session_owner_key

Kept as-is (not misleading after the hollow):
  - is_valid_mcp_session_id   (validates the downstream Mcp-Session-Id)
  - forward_request_to_owner  (already transport-agnostic)
  - register_gateway_capabilities_for_notifications
  - unregister_gateway_from_notifications
  - _session_mapping_redis_key, _worker_heartbeat_key
  - Redis key string literals (mcpgw:pool_owner:*, mcpgw:worker_heartbeat:*,
    mcpgw:session_mapping:*) — left untouched so a worker rolling between
    this branch and main's tip stays interoperable with in-flight Redis
    state. The Python identifiers move; the wire format does not.

Files touched:
  mcpgateway/services/session_affinity.py, upstream_session_registry.py,
  notification_service.py, server_classification_service.py,
  mcpgateway/main.py, admin.py, cache/session_registry.py,
  handlers/signal_handlers.py, transports/streamablehttp_transport.py,
  plus 13 corresponding test files.

Tests: 2064 pass, 2 pre-existing skips (unchanged). No regressions.

Closes task #23 (hollow-and-rename). Next up: task #20 (delete the 18
obsolete mcp_session_pool_* settings from config.py now that nothing
reads them), task #21 (delete orphan pool tests), task #22 (the #4205
counter-server e2e reproducer that lets the issue close).

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(gateway): evict upstream sessions on gateway delete and connect-field update (#4205)

Codex stop-time review caught a gap in the upstream-session isolation
work: after an admin deletes a gateway, or updates its URL / auth
fields, the UpstreamSessionRegistry still holds ClientSessions pinned
to the stale gateway_id. Subsequent acquire() calls keep returning
them, so in-flight downstream sessions keep talking to the old URL
with the old credentials until the downstream session itself ends.

Fix:
  - New module-level helper _evict_upstream_sessions_for_gateway(id)
    in gateway_service.py. Forwards to registry.evict_gateway(id).
    Best-effort: tolerates an uninitialized registry (tests, early
    startup) and swallows unexpected registry errors so gateway
    mutations never block on upstream teardown failures.
  - GatewayService.delete_gateway — calls the helper right after
    db.commit(), before cache invalidation / notification. Captures
    every upstream session bound to the now-gone gateway.
  - GatewayService.update_gateway — captures original_auth_value,
    original_auth_query_params, original_oauth_config alongside the
    existing original_url / original_auth_type. After db.commit(),
    if ANY connect-affecting field changed, calls the helper. Non-
    connect changes (name, description, tags, passthrough_headers,
    visibility, etc.) deliberately leave upstream sessions alone so
    the 1:1 downstream-session connection-reuse benefit survives
    cosmetic edits.

Tests (3 new in test_upstream_session_registry_lifecycle.py):
  - helper forwards gateway_id to registry.evict_gateway and returns
    the eviction count.
  - helper returns 0 and does not raise when the registry singleton
    is not initialized.
  - helper swallows unexpected registry exceptions (e.g. Redis down)
    so gateway mutation paths stay robust.

1619 service-layer tests pass. 2 pre-existing skips unchanged.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(gateway): evict upstream sessions on TLS / mTLS field update (#4205)

Codex stop-time review flagged that the previous eviction-on-update
commit handled url + auth fields but missed the TLS/mTLS material that
equally changes the upstream connection envelope. Admin rotating a CA
bundle, updating its signature, switching the signing algorithm, or
pushing new mTLS client cert/key would leave upstream sessions pinned
to the pre-rotation TLS context — the next acquire would hand the
stale ClientSession back and keep using the old cert material until
the downstream session died.

Change:
  - update_gateway now snapshots original_{ca_certificate,
    ca_certificate_sig, signing_algorithm, client_cert, client_key}
    alongside the existing URL/auth originals, and adds them to the
    "did any connect-affecting field change?" disjunction that decides
    whether to call _evict_upstream_sessions_for_gateway(gateway.id).
  - Non-connect fields (name, description, tags, passthrough_headers,
    visibility) still skip eviction, so cosmetic edits keep the 1:1
    connection-reuse benefit.

Plus one contract test:
  - test_connect_field_inventory_matches_gateway_model ties
    _CONNECT_FIELD_NAMES to both (a) the source of update_gateway (via
    a grep for "original_<field>" and the bare field name) and (b) the
    Gateway ORM columns. Adding a new TLS / auth / URL column to the
    Gateway model without wiring it through the eviction check — or
    renaming one of the existing originals — now fails this test with
    a specific message rather than silently regressing #4205's intent.

317 service-layer tests pass (+1 over the previous commit's 316).

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(gateway): evict upstream sessions on transport update (#4205)

Codex stop-time review flagged one more connect-affecting field the
previous two eviction commits missed: gateway.transport. Switching a
gateway between SSE and STREAMABLE_HTTP pins a different upstream MCP
client class in the registry — tool_service/prompt_service/resource_service
map gateway.transport into the TransportType passed to registry.acquire,
so stale sessions returned for the wrong transport would continue to
speak the old protocol against a server now expecting the new one.

Changes:
  - Capture original_transport alongside the other connect originals.
  - Add it to the change-detection check.
  - The 11-expression disjunction tripped ruff's PLR0916 (too-many-bool);
    refactored the comparison into a tuple of (current, original) pairs
    evaluated via any(), which is also easier to extend next time a new
    connect-affecting column is added to Gateway.
  - Contract test's _CONNECT_FIELD_NAMES grows to 11, now including
    "transport". The grep-and-ORM-…
jonpspri added a commit that referenced this pull request May 8, 2026
* feat: Add grpc_service_id to Tool schema

IBM#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

AI-usage: full

* feat: Add tool registration and deletion for registering/deleting gRPC services

IBM#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

AI-usage: full

* feat: Add load_file_descriptors for grpc serialized proto descriptor bytes

IBM#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

AI-usage: full

* feat: Add gRPC tool invocation support

IBM#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

AI-usage: full

* test: Unit tests for gRPC tool registration

IBM#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

AI-usage: full

* style: make lint

IBM#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* test: Fix missing grpc_service_id in tool unit test

IBM#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

AI-usage: full

* test: Flesh out test coverage for new gRPC code paths

IBM#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

AI-usage: full

* fix: Better error handling for protobuf descriptor pool adding

IBM#2854
Branch: GrpcMethodsAsTools-2854

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Inherit visibility from parent gRPC service when mapping to tools

IBM#2854
Branch: GrpcMethodsAsTools-2854
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix alembic migration after rebase

IBM#2854
Branch: GrpcMethodsAsTools-2854
AI-usage: full
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Repoint alembic migration parent to current head after rebase

Updates down_revision from x7h8i9j0k1l2 to aa1b2c3d4e5f, the current
head on origin/main. Several new migrations have landed since this PR
was last rebased, so the previous parent is no longer the head.

Verified single head: alembic heads -> w7x8y9z0a1b2 (head)

IBM#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* test: Add grpc_service_id=None to TestRustMcpExecutionPlan SimpleNamespace mocks

Two tests added to TestRustMcpExecutionPlan after this PR was last
rebased construct tool mocks via SimpleNamespace. The PR's change to
_build_tool_cache_payload reads tool.grpc_service_id, so those mocks
now AttributeError. Mirrors b576ec4's earlier fix for the same
pattern in MagicMock(spec=DbTool) helpers.

Affected:
- TestRustMcpExecutionPlan.test_prepare_rust_mcp_tool_execution_uses_live_gateway_auth_fields_for_loaded_tools
- TestRustMcpExecutionPlan.test_prepare_rust_mcp_tool_execution_uses_live_gateway_string_auth_values

IBM#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* style: make black isort

Tightens whitespace and import grouping per project lint config:
- mcpgateway/services/grpc_service.py: black removes a stray blank line
- tests/unit/mcpgateway/{services/test_grpc_service,services/test_tool_service,test_translate_grpc}.py: isort regroups stdlib/third-party/first-party/local imports

IBM#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(grpc): use protobuf 6.x kwarg in MessageToDict and isolate descriptor pool

Two issues addressed in mcpgateway/translate_grpc.py:

1. Pre-existing runtime bug: invoke()/invoke_streaming() called
   json_format.MessageToDict(..., including_default_value_fields=True),
   the parameter name from protobuf 4.x. The repo pins protobuf>=6.33.6
   (see pyproject.toml), where the parameter was renamed to
   always_print_fields_with_no_presence. Calls raised TypeError at
   runtime, breaking gRPC tool invocation. Renamed the kwarg and
   replaced the # pylint: disable=unexpected-keyword-arg suppression
   with an anti-regression comment.

2. Descriptor pool poisoning: GrpcEndpoint shared
   descriptor_pool.Default(), the process-wide singleton. Reflected
   FileDescriptorProto from one (untrusted) upstream service could
   leak across requests and cause symbol collisions or type confusion
   in subsequent calls. Switched to a per-endpoint
   descriptor_pool.DescriptorPool(); MessageFactory now binds to that
   private pool.

Updated tests/unit/mcpgateway/test_translate_grpc.py monkeypatches to
expose DescriptorPool() instead of Default() and MessageFactory(pool=...)
instead of MessageFactory().

Closes review: B2, B11

IBM#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(grpc): security and robustness hardening from review

Addresses BLOCKING findings from a multi-agent code review of PR IBM#3202:

Security:
* Reject local-only gRPC schemes (unix:, unix-abstract:, vsock:, fd:)
  in _validate_grpc_target since they bypass the IP-based SSRF model.
  Also normalize dns:/// and ipv4:/ipv6: name-resolver prefixes and
  parse bracketed IPv6 hosts correctly so the host-extraction step
  cannot be tricked. (review B1)
* Add _enforce_descriptor_limits to bound reflected
  FileDescriptorProto count, per-blob size, and aggregate size before
  storing or loading. Hardcoded constants (not settings) so a config
  change cannot silently weaken the DoS defense. (review B3)
* Add _validate_reflected_tool_name that runs reflected tool names
  through SecurityValidator.validate_tool_name() before persisting.
  Reflected names from upstream gRPC servers no longer bypass the
  injection / length / character checks applied to user-registered
  tools. (review B4)
* Strip metadata pseudo-keys (anything starting with '_') when
  copying discovered_services into endpoint._services so the
  _file_descriptors entry can never be confused with a service.
* Decode stored descriptors with base64.b64decode(..., validate=True)
  to fail fast on tampered DB content.

Robustness / Layer 1 invariants:
* _sync_tools_from_reflection now runs each method through a
  per-tool try/except so a single bad method cannot abort the
  entire reflection sync. Counts created / updated / failed.
  (review B9)
* _sync_tools_from_reflection now propagates parent service
  visibility, team_id, and owner_email to existing reflected tools,
  not just to newly created ones. Closes a Layer 1 token-scoping gap
  where a public->team visibility change on the parent service was
  silently ignored on already-discovered tools. (review B5)

Operational:
* invoke_method gains a timeout parameter (default settings.tool_timeout)
  and wraps endpoint.start() / endpoint.invoke() in asyncio.wait_for so
  a slow upstream cannot tie up a worker indefinitely. (review B6)
* tool_service.invoke_tool gRPC branch wraps the call in
  asyncio.wait_for, surfaces ToolTimeoutError + post-invoke timeout
  hook (matches A2A branch), preserves CancelledError instead of
  swallowing it, and logs with %-style + exc_info=True. (review B6/B7/B8)
* invoke_method preserves CancelledError, logs with %-style +
  exc_info=True, re-raises GrpcServiceError/GrpcServiceNotFoundError
  unwrapped instead of double-wrapping them. (review B8)

IBM#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* refactor(grpc): drop redundant inline settings import

The inline 'from mcpgateway.config import settings' in
_validate_grpc_target was made redundant by hoisting the same import
to module top in the security hardening commit.

Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* test(grpc): cover security-hardening helpers and Layer 1 invariants

Adds two test classes exercising the helpers introduced by the
security/robustness hardening commit:

TestSecurityHardening:
* _validate_grpc_target rejects unix:/unix-abstract:/vsock:/fd: schemes
* _validate_grpc_target normalizes dns:///, dns://, dns:, ipv4:, ipv6:
  name-resolver prefixes
* _validate_grpc_target accepts bracketed IPv6 literals and rejects
  malformed bracket syntax
* _enforce_descriptor_limits enforces count, per-blob, and aggregate
  size caps; happy path passes within bounds
* _validate_reflected_tool_name rejects empty / over-length / injection
  patterns and accepts proto-style identifiers

TestVisibilityPropagation:
* _sync_tools_from_reflection propagates parent service visibility,
  team_id, and owner_email to existing reflected tools (Layer 1
  invariant).
* _sync_tools_from_reflection isolates per-method failures: a method
  with an invalid name is skipped while the rest of the sync proceeds
  (B9 regression test).

All 13 new tests pass; the broader gRPC + tool_service + translate_grpc
suite is green at 520 tests.

IBM#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(grpc): close remaining review gaps from iteration-2 oracle pass

Addresses BLOCKING items the second-pass review identified as still
present after the first round of fixes.

* B1 (DNS resolution): _validate_grpc_target now delegates the
  hostname/IP-network/DNS policy to SecurityValidator._validate_ssrf so
  hostnames like metadata.google.internal are resolved and the resolved
  IPs go through the same blocklist as HTTP targets. Removes the
  duplicate hand-rolled IP-network logic that previously lived inline.

* B5 (visibility on update): GrpcService.update_service now snapshots
  visibility / team_id / owner_email before applying the update and
  bulk-updates every child Tool with grpc_service_id == service.id in
  the same transaction whenever any of those fields change. Closes the
  Layer 1 token-scoping gap where a public->team change on the parent
  service was silently ignored on already-discovered tools.

* B6 (real gRPC deadlines): GrpcEndpoint.start, _discover_services,
  and _discover_service_details now accept a timeout kwarg that is
  threaded into ServerReflectionInfo, and GrpcEndpoint.invoke binds
  the deadline onto the underlying unary_unary call so a slow upstream
  cannot keep an executor thread alive after asyncio.wait_for cancels
  the wrapping coroutine. invoke_method passes the effective tool
  timeout into both endpoint.start and endpoint.invoke.

* B7 (cancellation propagation in outer except BaseException): adds
  except asyncio.CancelledError: raise immediately before the three
  outer except BaseException as e blocks in tool_service.invoke_tool
  so a cancellation is never wrapped as a ToolInvocationError.

* B12 (protobuf 6.x compatibility): MessageFactory.GetPrototype was
  removed in protobuf >= 5.x. Both the unary and streaming invocation
  paths in translate_grpc now call message_factory.GetMessageClass(...)
  instead. Updated the corresponding unit-test mocks.

* Cleanup: drops unused import ipaddress that became dead after
  the SecurityValidator delegation.

Test mocks updated to match the new signatures (timeout kwarg on the
populate_service / _populate callbacks; GetMessageClass at the module
level instead of MessageFactory.GetPrototype).

Lint clean (black, isort, ruff E3/E4/E7/E9/F/D1). 520 targeted tests
pass.

IBM#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(grpc): preserve timeout=0 fail-fast semantics in deadline ternaries

The 'if timeout' ternary in three reflection/unary call sites silently
drops timeout=0, but gRPC accepts timeout=0 as the explicit 'fail
immediately' deadline. Switched to 'if timeout is not None' so a
caller-supplied 0.0 reaches the underlying gRPC call with the right
semantics.

Found in iteration-3 oracle review (final pass, no new BLOCKING items).

Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(grpc): apply non-blocking review nits flagged across the three review passes

Bundle of low-risk, non-blocking improvements that surfaced during the
deep review iterations. Each item below was explicitly classified as
fix-now in the review summary; broader / design-dependent items were
filed as follow-ups under IBM#4612 and IBM#4613.

* Schema parity (#1): add grpc_service_id to ToolRead in schemas.py
  alongside the existing gateway_id field so API consumers can identify
  gRPC-discovered tools (review S2.4 / TD1).

* db.py cleanup (IBM#2): drop redundant nullable=True from
  Tool.grpc_service_id; Mapped[Optional[str]] already implies it and
  the sibling gateway_id column omits it (review TD2).

* translate_grpc.load_file_descriptors (IBM#3): widen the parameter type
  from List[bytes] to Sequence[bytes] and reject a single bytes object
  passed by mistake. Without the guard Python would silently iterate
  byte-by-byte (review TD3).

* translate_grpc descriptor pool conflict (IBM#4): replace the inaccurate
  'no-op if already added' comment with an explicit TypeError handler.
  protobuf raises TypeError when a file with the same name has
  conflicting content; the existing descriptor stays authoritative
  (review S3.3).

* test_sync_tools_removes_stale_tools (IBM#5): replace the brittle
  string-match assertion ('DELETE' in str(call)) with an explicit
  call_count == 4 (1 select + 3 deletes), which no longer depends on
  the SQLAlchemy Delete object repr (review S2.6).

* translate_grpc close() (IBM#6): convert the f-string log to lazy
  %-style for consistency with the rest of this PR's logging
  (review N2.1).

* grpc_service _sync_tools_from_reflection (IBM#7): document why
  input_schema['properties'] is empty by design — gRPC arg shape is
  validated at the protobuf invocation layer, not the MCP tool-call
  layer; the actual proto types live in the x-grpc-* extensions
  (review N2.2).

* TestInvokeMethodGuards (IBM#8): add 5 edge-case tests covering
  invoke_method paths the previous suite did not exercise:
  - service-not-found  -> GrpcServiceNotFoundError
  - disabled service   -> GrpcServiceError('is disabled')
  - invalid method format (no dot) -> GrpcServiceError
  - _validate_grpc_target spy called with service.target
  - _validate_tls_path spy called for both cert and key paths

525 targeted tests pass (was 520). Lint clean.

Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* style: clean lint cycle (make autoflake/isort/black/ruff/bandit/interrogate/pylint)

Remaining fixes the project lint cycle surfaced after the fix-now bundle landed:

* isort: re-sort sibling imports in test_grpc_service.py
  (_enforce_descriptor_limits before _GRPC_MAX_*; case-insensitive order).

* ruff PLW0108 (unnecessary lambda) in test_translate_grpc.py: align the
  4 mocks I added with the 7 pre-existing patterns in the same file
  (lambda **_kw: MagicMock() / lambda *_a, **_kw: MagicMock()).

* ruff RET501 in test_grpc_service.py::tls_spy: drop the redundant
  trailing 'return None'.

* pylint no-member on the alembic migration: add the project-standard
  '# pylint: disable=no-member' header that 9 other migrations use to
  silence alembic.op's dynamic-attribute false positives.

* pylint try-except-raise on tool_service.invoke_tool gRPC branch:
  the 'except asyncio.CancelledError: raise' clause LOOKS like a no-op
  but exists specifically so the LATER 'except Exception' cannot swallow
  cancellation (PR IBM#3202 review B7). Added an inline disable plus a
  comment citing B7 so the next lint sweep doesn't delete the
  anti-regression code.

Verified after fix:
  make autoflake / isort / black -> clean
  make ruff (E3,E4,E7,E9,F,D1)  -> All checks passed
  make bandit                   -> 0 issues, 18022 LoC
  make interrogate              -> 100% docstring coverage
  make pylint                   -> 10.00/10 on all 6 production files
  pytest (targeted gRPC suite)  -> 525/525 passing

Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix: Re-parent alembic migration to current head after second rebase

Updates down_revision from aa1b2c3d4e5f to 926d3e07d098. Origin/main
gained the CPEX plugin framework replacement (IBM#3754) which introduced
two new migrations and a mergepoint, leaving our grpc_service_id
migration as a sibling head. Re-pointing to the new head returns the
chain to single-headed.

Verified single head:
  alembic heads -> w7x8y9z0a1b2 (head)
  alembic upgrade head -> clean on a fresh sqlite DB

IBM#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix(grpc): restore reserved/multicast guard and update test_grpc_service_no_grpc

Two related fixes for the no-grpc test file that surfaced after rebase:

1. Restore is_reserved / is_multicast pre-check in _validate_grpc_target.
   The earlier delegation to SecurityValidator._validate_ssrf inadvertently
   dropped this guard because the shared validator only checks
   blocked-networks / localhost / private. IP literals like 224.0.0.1
   (multicast) or 0.0.0.0 (reserved) would have slipped past unless they
   were also in ssrf_blocked_networks. The restored check excludes
   loopback so ::1 (which Python flags as both is_loopback AND
   is_reserved) still passes through the localhost gate as intended.

2. Update test_grpc_service_no_grpc.py:
   - Add timeout=None kwarg to the three FakeEndpoint.start() and
     invoke() methods so they accept the per-RPC deadline added in the
     security/robustness hardening commit.
   - Update test_validate_grpc_target_enforces_ssrf_rules regex
     patterns to match the SecurityValidator messages reached via
     delegation (blocked hostname X, localhost address which is
     blocked, private network address which is blocked).

All 4 gRPC test files pass: 528 / 528.

IBM#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* test: increase diff coverage for gRPC PR from 92% to 100% on grpc_service+translate_grpc

Adds 11 targeted tests + a decision-record comment to address the diff-coverage
report that flagged 18 previously-uncovered lines:

  grpc_service.py:    95.7% -> 100%  (7/7 covered)
  translate_grpc.py:  91.7% -> 100%  (3/3 covered)
  tool_service.py:    69.2% -> 62%   (5/8 covered + 3 documented)

New tests in tests/unit/mcpgateway/services/test_grpc_service.py:

* test_validate_grpc_target_empty_string                        -> line 134
* test_invoke_method_propagates_cancelled_error                 -> line 1021
* test_invoke_method_re_raises_timeout                          -> lines 1023-1024
* test_invoke_method_re_raises_grpc_service_error_unwrapped     -> line 1026
* test_update_service_propagates_visibility_to_child_tools      -> lines 503-504

New tests in tests/unit/mcpgateway/test_translate_grpc.py:

* test_load_file_descriptors_rejects_single_bytes               -> line 409
* test_load_file_descriptors_skips_pool_conflict                -> lines 420, 425

New tests in tests/unit/mcpgateway/services/test_tool_service.py:

* test_invoke_grpc_tool_propagates_cancellation                 -> line 5847
* test_invoke_grpc_tool_timeout_raises_tool_timeout_error       -> lines 5849-5850, 5852

Restored a regression: the earlier B1 refactor (delegating to
SecurityValidator._validate_ssrf) inadvertently dropped the is_reserved /
is_multicast guard, since SecurityValidator only checks blocked-networks /
localhost / private. Restored the local guard with an is_loopback exclusion
so IPv6 ::1 (which Python flags as both is_loopback AND is_reserved)
still passes through the localhost gate.

Coverage decision-record left in test_tool_service.py for the 3 remaining
B7 anti-regression lines (5446 REST, 5632 MCP, 5851 gRPC timeout-with-pm):
they are byte-identical to the gRPC variant at line 5847 (which is fully
exercised). Equivalent REST/MCP tests require coaxing CancelledError
through asyncio.wait_for, which Python's event loop converts to TimeoutError
in some states. Protecting one branch is sufficient to detect a regression
that would affect all three.

541 tests passing, ruff/black/isort/pylint clean.

IBM#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix: re-parent alembic migration onto current head + restore interrogate 100%

Two CI failures triggered by main moving forward during this PR's review:

* alembic multi-head: PR IBM#4501 ("add ondelete=CASCADE to FK constraints")
  introduced 9fb98535724d as a sibling of our w7x8y9z0a1b2, both parented
  to 926d3e07d098. Re-parents w7x8y9z0a1b2 -> 9fb98535724d so the chain
  has a single head again and bootstrap_db / SQLite+Postgres upgrade
  validation pass. Verified single head with alembic heads.

* interrogate --fail-under=100: three nested helper functions had no
  docstrings, dropping the package to 99.9%. One is mine
  (GrpcEndpoint.invoke._call introduced by the gRPC timeout fix in
  PR IBM#3202 review B6); the other two (_user_obj_to_dict._iso,
  _user_dict_to_obj._dt in services/email_auth_service.py) landed
  on main via PR IBM#4595 and were already failing main's CI. All three
  added one-line docstrings.

Cascade fixed by these two changes:
  - SQLite + PostgreSQL Fresh/Upgrade (alembic head check)
  - pytest (py3.12) (bootstrap_db on alembic head check)
  - playwright-ci-smoke (depends on bootstrap)
  - sql-sanitizer E2E (depends on bootstrap)
  - interrogate (mcpgateway) (docstring coverage)
  - Run pre-commit hooks (interrogate is a pre-commit hook)

The 2 email_auth_service docstrings are technically out of scope for
this PR but the local lint cycle scoping I had been using
(make interrogate TARGET=<my-files>) didn't catch them. Folding
them in here unblocks PR-level CI rather than waiting for a separate
fix to main.

Verified locally:
  alembic heads      -> single head w7x8y9z0a1b2
  interrogate        -> 100% on full mcpgateway/
  black --check      -> clean
  ruff check         -> clean
  pylint             -> 10.00/10
  targeted gRPC tests -> 541 passing

IBM#2854
Branch: GrpcMethodsAsTools-2854
Signed-off-by: Jonathan Springer <jps@s390x.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Jonathan Springer <jps@s390x.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Jonathan Springer <jps@s390x.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.