Skip to content

Benchmark docker compose deployment fix#4003

Closed
dima-zakharov wants to merge 4 commits intomainfrom
fix-benchmark2
Closed

Benchmark docker compose deployment fix#4003
dima-zakharov wants to merge 4 commits intomainfrom
fix-benchmark2

Conversation

@dima-zakharov
Copy link
Copy Markdown
Collaborator

@dima-zakharov dima-zakharov commented Apr 2, 2026

🔗 Related Issue

The deployment and benchmark creates situation when tools not found:

make testing-rebuild-rust-full

As a result of this one cannot test with make benchmark-mcp-tools


📝 Summary

The PR fixes the gateway deployment issue, the tools list is empty when deployed with : make testing-up-rust-full


🏷️ Type of Change

  • Bug fix
  • Feature / Enhancement
  • Documentation
  • Refactor
  • Chore (deps, CI, tooling)
  • Other (describe below)

🧪 Verification

Check Command Status
Lint suite make lint
Unit tests make test
Coverage ≥ 80% make coverage

✅ Checklist

  • Code formatted (make black isort pre-commit)
  • Tests added/updated for changes
  • Documentation updated (if applicable)
  • No secrets or credentials committed

📓 Notes (optional)

Screenshots, design decisions, or additional context.

Signed-off-by: Dmitry Zakharov <zakharov@ibm.com>
Copy link
Copy Markdown
Collaborator

@madhu-mohan-jaishankar madhu-mohan-jaishankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix correctly addresses the stale cache issue after gateway registration by invalidating the registry and tool lookup caches immediately after db.flush().

LGTM

@dima-zakharov dima-zakharov changed the title Benchmark docker compose deploymtn fix Benchmark docker compose deployment fix Apr 3, 2026
@brian-hussey brian-hussey self-assigned this Apr 3, 2026
Copy link
Copy Markdown
Collaborator

@Lang-Akshay Lang-Akshay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks you @dima-zakharov for the PR. Please implement following changes.

Finding #1: Race Condition in Cache Invalidation (CWE-362)

Problem

Cache invalidation broadcasts fire before the database transaction commits in register_gateway(). Other workers receiving the invalidation signal may query the DB before the commit completes, potentially caching empty/stale data.

Current Flow:

  1. db.flush() — writes to DB (uncommitted)
  2. Cache invalidation + Redis broadcast (lines 485–495)
  3. Return to caller
  4. db.commit() happens later in get_db() dependency

Caution

Impact: Workers can cache stale data during the window between invalidation broadcast and commit.

Solution

Move cache invalidation to occur after db.commit(), matching the pattern already used in update_gateway() (lines 1634–1646). Either:

  • Add explicit db.commit() before invalidation in register_gateway()
  • Use post-commit hooks/events to trigger invalidation

Finding #2: Unconditional Expensive Cache Operations (CWE-400)

Problem

invalidate_gateway() is called unconditionally on every gateway registration, even when the gateway has zero tools. This triggers O(N) Redis operations (key scans + pub/sub broadcast) unnecessarily.

Caution

Impact:

  • Wasted resources on empty gateways
  • No rate limiting on POST /gateways
  • Potential thundering herd on Redis pub/sub channel

Solution

Quick fix: Only call invalidate_gateway() when tools exist:

if tools:
    await tool_lookup_cache.invalidate_gateway(str(db_gateway.id))

Additional improvements:

  • Add rate limiting to gateway registration endpoint
  • Optimize invalidate_gateway() to check for cached data before scanning
  • Consider lazy invalidation strategies

Add tests for this

@crivetimihai
Copy link
Copy Markdown
Member

Thanks for identifying this stale-cache issue, @dima-zakharov — the root cause you found was real and impactful for benchmark workflows.

However, this fix has already been superseded by #3839 (feat(discovery): add automatic tool discovery with hot/cold classification), which landed on main as 506151675. That PR independently addressed the same root cause with a more comprehensive approach:

  • db.flush()db.commit() — resolves the race condition where cache invalidation fired before the transaction committed (the concern @Lang-Akshay raised in Finding Configure Renovate #1).
  • Unconditional cache invalidation for gateways, tools, resources, prompts, tool lookup, and admin stats — covers all cases regardless of whether the gateway registered with tools/resources/prompts.
  • Additional invalidations (invalidate_gateways(), admin_stats_cache.invalidate_tags()) that this PR was missing.

After rebasing this branch onto current main, the net diff is zero — every change here is already present in a stricter form.

@Lang-Akshay — both of your review findings are also addressed on main:

  1. Race condition: Fixed by the flushcommit change.
  2. Unconditional expensive ops: Main chose unconditional invalidation, which is the safer default for a low-frequency registration operation where correctness matters more than saving a few Redis round-trips.

Closing as already fixed. Thanks again for the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants