
test(rate-limiter): toggle convergence-time probe + TTL-expiry cycle #4511

Draft

gandhipratik203 wants to merge 7 commits into main from test/rate-limiter-toggle-convergence

Conversation

gandhipratik203 (Collaborator) commented Apr 29, 2026

Summary

Two integration tests covering rate-limiter behaviour under a runtime mode toggle via the admin API. Marked draft because they raise design questions about the rate limiter's toggle-latency contract that need team input before a final shape lands.

The rate limiter is the only stateful plugin we ship — counter state lives in Redis, mode is embedded in a per-worker cached plugin instance. That makes its response to a runtime mode toggle eventually consistent, not instantaneous, in a way the other supported plugins (Output Length Guard, Secrets Detection) aren't.

What's in this PR

Two tests, both currently passing:

TestRateLimiterToggleAfterTTLExpiry — correctness contract

tests/integration/test_rate_limiter_dynamic_behavior.py

Three-phase cycle: enforce → wait past Redis counter TTL → disabled → wait → enforce. Validates that re-enforcement works after a quiet period long enough for counters to age out. Tagged @pytest.mark.slow because of two ~75s sleeps.
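A condensed sketch of the cycle, assuming a _set_plugin_mode admin helper and a simplified burst helper that returns allowed/blocked counts (illustrative only; the real _send_tool_burst takes server_id and tool_name):

import time

TTL_WAIT_S = 75  # assumed to exceed the Redis counter-window TTL

def test_toggle_after_ttl_expiry():
    _set_plugin_mode("enforce")
    assert _send_tool_burst(n=10).blocked > 0   # phase 1: limiter enforces
    time.sleep(TTL_WAIT_S)                      # counters age out of Redis
    _set_plugin_mode("disabled")
    assert _send_tool_burst(n=10).blocked == 0  # phase 2: nothing blocked
    time.sleep(TTL_WAIT_S)
    _set_plugin_mode("enforce")
    assert _send_tool_burst(n=10).blocked > 0   # phase 3: re-enforcement works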

TestRateLimiterToggleLiveState::test_measure_convergence_time_for_mode_toggle — latency probe

tests/integration/test_rate_limiter_toggle_live_state.py

Flips mode, then polls every second until traffic actually reflects the new mode, recording the elapsed time and the per-poll trajectory. No threshold-based assertion on the convergence time — the test fails only if convergence never happens within MAX_CONVERGENCE_S. Useful both as a regression pin (sudden slowdown caught) and as a measurement tool.
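In outline, the probe loop looks like this (a sketch; result shape and constants simplified from the actual test):

import time

MAX_CONVERGENCE_S = 120  # assumed ceiling; exceeding it is the only failure mode

def measure_convergence(expected_blocked):
    start = time.monotonic()
    trajectory = []
    while (elapsed := time.monotonic() - start) < MAX_CONVERGENCE_S:
        r = _send_tool_burst(n=6)               # simplified burst helper
        trajectory.append((elapsed, r.allowed, r.blocked))
        if r.blocked == expected_blocked:       # traffic reflects the new mode
            return elapsed, trajectory
        time.sleep(1.0)                         # poll every second
    raise AssertionError(f"no convergence within {MAX_CONVERGENCE_S}s")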

Sample observation

One local run, single replica × 24 workers, image built from current main:

enforce → disabled : converged in 22.4s
  t= 0.00s   4/6 allowed,  2/6 blocked
  t= 2.84s   2/6 allowed,  4/6 blocked
  t= 7.20s   1/6 allowed,  5/6 blocked  (got worse before better)
  t=11.30s   0/6 allowed,  6/6 blocked
  t=15.42s   5/6 allowed,  1/6 blocked  (oscillates)
  t=20.86s   4/6 allowed,  2/6 blocked
  t=22.37s   6/6 allowed,  0/6 blocked  ← converged

disabled → enforce : converged in 9.7s

The ~22s disabled-convergence with mid-trajectory oscillation is suspicious — it looks like pub/sub-driven invalidation reaches most workers within a few seconds, but a fraction takes much longer to converge, possibly relying on the 30s TTL-based cache safety net rather than the pub/sub eviction. Convergence happens; just not as fast as expected.

Why this is different from stateless plugins

Stateless plugins decide based only on the current request — the moment a worker's mode flips to disabled, every subsequent request reflects it. Rate limiter decides based on (mode, persistent counter state). When mode flips, the counter is still hot in Redis, so any worker still serving requests under stale-mode-enforce will still block them. The convergence time is a property of how long it takes every worker's plugin-manager cache to rebuild with the new mode.
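Schematically (pure illustration, not the plugins' real code):

def violates(request):                     # stand-in per-request check
    return len(request) > 100

def stateless_decision(cached_mode, request):
    # Output Length Guard / Secrets Detection: the verdict depends only on
    # the current request, so nothing survives a toggle
    return cached_mode == "enforce" and violates(request)

def rate_limiter_decision(cached_mode, redis_count, limit):
    # the verdict couples the per-worker cached mode (stale until that
    # worker's plugin-manager cache rebuilds) with counter state that
    # persists in Redis across the toggle
    return cached_mode == "enforce" and redis_count > limit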

Open design questions for review

  1. Toggle latency expectation. What's the SLA we want to commit to? "Within N seconds" feels right, but what's N? Operators flipping a plugin off in an incident response need to know.
  2. Counter window on disabled → enforce. Today we don't reset the window — flipping back to enforce on an already-hot bucket immediately blocks. Is that the right call? Some operators might expect a fresh window on re-enable. Either is defensible.
  3. Should disabled mode actually reset Redis state? Today it stops checking counters but doesn't clear them. Same question from the other direction.
  4. Eventual-consistency contract for stateful plugins more generally. Rate limiter is the only stateful one today, but cpex-plugins#49 has follow-ups (quota accounting, cascade limits) that would be similarly stateful. Worth deciding the convergence-time SLA at the framework level once.

What this PR is not

Not a code fix. The rate limiter works; the question is how it behaves under toggle and what we want that behaviour to be. The tests document current behaviour and provide a measurement harness for any future fix.

How to run

Prerequisites

  • Local docker-compose stack from this repo brought up via make compose-up (3 gateway replicas behind nginx on :8080, Redis, Postgres, fast-test-server registered).
  • Gateway image built from current main — run make docker-build if your local image is stale.
  • Default dev credentials and JWT secret from your local .env (the .env.example defaults work for this test).
  • uv sync --extra plugins so cpex-rate-limiter is importable in the test env.

Run

# convergence probe — ~1 minute, emits the trajectory on every run
uv run pytest tests/integration/test_rate_limiter_toggle_live_state.py \
  --with-integration -s -v

# TTL-expiry cycle — ~3 minutes due to two ~75s sleeps; tagged @pytest.mark.slow
uv run pytest \
  tests/integration/test_rate_limiter_dynamic_behavior.py::TestRateLimiterToggleAfterTTLExpiry \
  --with-integration -s -v

Both tests skip cleanly if http://localhost:8080/health isn't reachable.
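The guard is effectively a module-level skipif, along these lines (httpx chosen for illustration; any HTTP client works):

import httpx
import pytest

def _gateway_up() -> bool:
    try:
        return httpx.get("http://localhost:8080/health", timeout=2.0).status_code == 200
    except httpx.HTTPError:
        return False

pytestmark = pytest.mark.skipif(not _gateway_up(), reason="gateway stack not reachable on :8080")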


Update — empirical evidence and layered coverage

Since the original draft, this branch has grown into a layered toggle-behaviour story. This update is appended rather than edited in; the original sections remain the right snapshot of the problem statement and findings at the time the PR was opened.

Layered toggle-behaviour coverage now on this branch

| Layer | Pin | File |
| --- | --- | --- |
| A | reload_tenant → manager.shutdown invocation | tests/unit/mcpgateway/plugins/framework/test_tenant_plugin_manager_tool_scoped.py |
| A-init | reload_tenant → new manager.initialize invocation | same |
| B-0 | manager-mediated plugin wipe via direct factory.reload_tenant | tests/integration/test_plugin_manager_wipe_on_disable.py |
| B-0.5 | publisher↔subscriber round-trip via real Redis pub/sub | same |
| C (existing) | end-to-end convergence probe via HTTP+MCP | tests/integration/test_rate_limiter_toggle_live_state.py |
| Multi-replica probe | end-to-end convergence via Redis pub/sub only, no HTTP amplification | tests/integration/test_rate_limiter_toggle_via_redis_pubsub.py |

Empirical answer to design question #3

Question #3 asked whether disabled mode should clear Redis counter state. The answer landed on "yes, clear them," and the implementation lives on cpex-plugins branch feat/rate-limiter-wipe-on-disable-only. Running the new multi-replica pub/sub probe against a wipe-enabled gateway image (3 replicas × 24 workers, plain docker-compose stack):

Convergence trajectory (counter_exists per poll):
  t=  0.00s  counter_exists=True
  t=  0.11s  counter_exists=False  ← converged

✓ Wipe converged in 0.11s

vs. the ~22s baseline observed in the existing live-state probe at the top of this PR — a ~170× improvement on the disabled-direction convergence. Note the wipe doesn't accelerate the cache rebuild; it makes the rebuild's user-visible latency irrelevant — a stale-mode worker with empty counters cannot block any request.

Pre-conditions discovered during empirical runs

Two things had to be in place for the wipe path to actually fire from a clean stack — worth flagging on the cpex-plugins wipe-on-disable PR description as deployment caveats:

  1. Plugin must be loaded for the wipe to fire. plugins/config.yaml ships rate-limiter with mode: "disabled", and the framework loader skips disabled plugins. The probe first toggles to enforce via the admin API so the plugin is genuinely loaded.
  2. At least one cached manager required. factory.invalidate_all() iterates only cached managers; a fresh gateway with zero traffic has zero cached managers, so the broadcast triggers no reload_tenant and no plugin.shutdown. The probe sends a 20-request warm-up burst before the disable to ride gunicorn's load balancer across multiple workers. Both steps are condensed in the sketch below.
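Condensed, using the helper signatures quoted in the commits below (the wait helper and constant are hypothetical):

_set_plugin_mode("enforce")                  # 1. load the plugin (ships disabled in config.yaml)
_send_tool_burst(server_id, tool_name, 20)   # 2. warm managers across the 72 workers
_set_plugin_mode("disabled")                 # broadcast now reaches cached managers; wipe fires
elapsed = wait_until_counter_gone()          # hypothetical poll helper, cf. trajectory above
assert elapsed < MAX_CONVERGENCE_S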

Status of original design questions

Question #3 is answered empirically above (yes: wipe counter state on disable). Questions #1, #2, and #4 remain open for review.
Two complementary integration tests covering rate-limiter behaviour when
the plugin's mode is toggled at runtime via the admin API. The rate
limiter is the only stateful plugin in the gateway's catalog (counter
state lives in Redis; mode is embedded into a per-worker cached plugin
instance), so its response to a runtime mode toggle is eventually
consistent rather than instantaneous. These tests describe both halves
of that contract.

TestRateLimiterToggleAfterTTLExpiry (in test_rate_limiter_dynamic_behavior.py)
is the *correctness* test: enforce -> wait past Redis counter TTL ->
disabled -> wait -> enforce. Validates that re-enforcement works after a
quiet period long enough for counters to age out. Tagged @pytest.mark.slow
because of the two ~75s sleeps. Acts as a regression pin for the happy
path.

TestRateLimiterToggleLiveState::test_measure_convergence_time_for_mode_toggle
(new file) is the *latency* probe: flip enforce -> disabled, then poll
every second until traffic actually reflects the new mode, recording the
elapsed time. Same pattern for disabled -> enforce. No threshold-based
assertion on the convergence time — the test only fails if convergence
never happens within MAX_CONVERGENCE_S. Records the per-poll observation
trajectory so the actual convergence path (and any oscillation) is
visible in the test output. Useful both as a regression pin (sudden
slowdown caught) and as a measurement tool for informing a future
toggle-latency SLA on the rate-limiter plugin.

Also fixes _send_tool_burst in test_rate_limiter_dynamic_behavior.py to
do an MCP initialize handshake (required by the streamable HTTP
transport), set the MCP-Protocol-Version + Accept: text/event-stream
headers, and use the trailing-slash form of /servers/{id}/mcp/ to avoid
a 307 redirect.
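For reference, the handshake shape (a sketch against the MCP streamable HTTP transport; auth elided, protocol version illustrative, server_id/tool_name supplied by the test):

import httpx

url = f"http://localhost:8080/servers/{server_id}/mcp/"   # trailing slash avoids the 307
headers = {"Accept": "application/json, text/event-stream",
           "MCP-Protocol-Version": "2025-06-18"}

with httpx.Client(headers=headers) as c:
    init = c.post(url, json={"jsonrpc": "2.0", "id": 1, "method": "initialize",
                             "params": {"protocolVersion": "2025-06-18", "capabilities": {},
                                        "clientInfo": {"name": "probe", "version": "0"}}})
    c.headers["mcp-session-id"] = init.headers["mcp-session-id"]   # session issued by server
    c.post(url, json={"jsonrpc": "2.0", "method": "notifications/initialized"})
    c.post(url, json={"jsonrpc": "2.0", "id": 2, "method": "tools/call",
                      "params": {"name": tool_name, "arguments": {}}})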

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
When an operator flips a plugin's mode via the admin API, the rebuild
path runs through TenantPluginManagerFactory.reload_tenant(), which
evicts the cached manager, awaits old.shutdown() (propagating to
plugin.shutdown(), where stateful plugins react), and builds a new
manager.

The "new manager built" step is already pinned (test_factory_reload_-
tool_context).  The "old.shutdown() invoked" step is *not* — only its
exception-handling behaviour is.  test_factory_reload_shutdown_-
exception and test_factory_build_manager_old_shutdown_fails would
both pass if reload_tenant silently dropped the shutdown call
entirely.  test_factory_shutdown_propagates_exceptions pins shutdown
invocation, but on the whole-factory shutdown path, not the toggle
path.

A regression at this boundary would silently break the operator-
visible toggle contract for stateful plugins — wipe-on-disable
(#4576), audit logging on disable, etc. would
no-op despite every surrounding test staying green.  This commit
fills that gap with a single test that wraps the manager's shutdown
in an AsyncMock(wraps=...) spy and asserts it was awaited exactly
once on reload.
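A minimal shape of that test (fixture plumbing elided; exact factory method signatures assumed):

import pytest
from unittest.mock import AsyncMock

@pytest.mark.asyncio
async def test_reload_tenant_invokes_old_manager_shutdown(factory):
    old = await factory.get_manager("tenant-1")      # build and cache a manager
    old.shutdown = AsyncMock(wraps=old.shutdown)     # spy preserving real behaviour
    await factory.reload_tenant("tenant-1")          # the operator-toggle path
    old.shutdown.assert_awaited_once()               # fails if reload drops the call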

Layered toggle-behaviour coverage on this branch (after this commit):
  Layer A — framework boundary (reload_tenant lifecycle)   [unit, this commit]
  Layer C — convergence measurement harness                [integration, existing]

Layer B (plugin-side propagation: manager.shutdown -> plugin.shutdown
fires for the rate limiter under the manager) and Layer D (convergence
threshold under wipe-on-disable) will land as follow-on commits on
this same branch as their prerequisites are met.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Symmetric counterpart to the previous commit
(test_factory_reload_tenant_invokes_old_manager_shutdown).  That commit
pinned the disable-direction boundary — old manager's shutdown fires
on reload.  This commit pins the enable-direction boundary — new
manager's initialize fires on reload.

test_factory_reload_tool_context already proves a new manager
instance is constructed (manager2 is not manager1), but a regression
that constructed without initializing would still pass that test —
instances exist, they would just be unable to dispatch hooks.
Without initialize firing, every plugin under the new manager would
silently no-op after a toggle to enforce: hooks wouldn't run,
mode-change-on-init logic (e.g. the rate limiter re-arming its
counter window after wipe-on-disable) wouldn't fire, and the
operator-visible "I just re-enabled this plugin" promise would
silently break.

Captures via an instance list rather than AsyncMock because the
assertion needs to identify *which* manager was initialized (must
match manager2), not just count calls — the initial get_manager
above already triggers one initialize that we deliberately exclude
by installing the spy *after* it.
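In outline (manager class name and fixtures assumed):

initialized = []
real_initialize = TenantPluginManager.initialize     # class name assumed

async def spying_initialize(self):
    initialized.append(self)                         # record WHICH instance initialized
    return await real_initialize(self)

manager1 = await factory.get_manager("tenant-1")     # fires one initialize we exclude
monkeypatch.setattr(TenantPluginManager, "initialize", spying_initialize)
await factory.reload_tenant("tenant-1")
manager2 = await factory.get_manager("tenant-1")
assert initialized == [manager2]                     # the new manager, exactly once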

Layered toggle-behaviour coverage on this branch (after this commit):
  Layer A  — framework boundary, shutdown side    [unit, prev commit]
  Layer Ai — framework boundary, initialize side  [unit, this commit]
  Layer C  — convergence measurement harness      [integration, existing]

Layer B (single-replica integration: admin toggle -> rate-limiter
plugin shutdown fires -> counter cleared) and Layer D (convergence
threshold under wipe-on-disable) are next on the ladder.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 gandhipratik203 marked this pull request as ready for review May 3, 2026 07:59
… real plugin + Redis

Adds the integration-test rung between the framework-boundary unit
tests in test_tenant_plugin_manager_tool_scoped.py (which use stub
plugins) and the full HTTP-driven tests in
test_rate_limiter_dynamic_behavior.py (which require a running gateway
with admin auth).

This test loads a real cpex-rate-limiter plugin into a real
TenantPluginManagerFactory against a real Redis, deposits counter
keys via manager.invoke_hook, simulates an operator mode toggle to
"disabled" by setting the framework's mode key in Redis directly,
then drives factory.reload_tenant and asserts the counter keys were
wiped — proving the rebuild path reaches the plugin's wipe-on-disable
code path end-to-end without HTTP, pubsub, or admin-handler
involvement.

What this pins (the boundary between Layer A and the HTTP layer):
  * factory.reload_tenant -> manager.shutdown -> registry.shutdown ->
    plugin.shutdown all fire in sequence against a real plugin.
  * The rate-limiter plugin's wipe-on-disable code path is reached,
    reads the mode key, and clears its counters from Redis.
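The flow, condensed (hook name, payload, mode-key and counter-key patterns are assumptions; the real test derives them from the plugin config):

import redis.asyncio as aioredis

r = aioredis.from_url("redis://localhost:16380")           # the fixture's hermetic Redis
manager = await factory.get_manager("tenant-1")            # real factory, real plugin
await manager.invoke_hook("tool_pre_invoke", payload)      # deposits counter keys
await r.set("plugins:mode:RateLimiterPlugin", "disabled")  # simulate the toggle directly
await factory.reload_tenant("tenant-1")                    # shutdown chain reaches the plugin
assert not await r.keys("ratelimit:*")                     # wipe-on-disable cleared counters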

Out of scope (covered by existing tests at adjacent layers):
  * Admin HTTP -> reload_tenant pathway.
  * Pubsub-driven multi-worker fan-out.
  * Convergence timing across multiple workers.

Layered toggle-behaviour coverage on this branch (after this commit):
  Layer A   — framework boundary, shutdown side    [unit, prev commit]
  Layer Ai  — framework boundary, initialize side  [unit, prev commit]
  Layer B-0 — manager-mediated plugin wipe         [integration, this commit]
  Layer C   — convergence measurement harness      [integration, existing]

Layer D (convergence threshold under wipe-on-disable) is the next
rung once empirical validation against a wipe-enabled gateway image
confirms the hypothesis.

CI behaviour: this test lives in tests/integration/ and is therefore
auto-skipped by the pytest_integration_mark plugin unless the runner
passes --with-integration.  CI today does NOT pass that flag, so the
test does not run automatically on PR CI — same as every other
integration test in this directory.  It is exercised locally by
developers who explicitly opt in.  Once the cpex-plugins
wipe-on-disable PR merges and the wipe-enabled wheel is published,
local --with-integration runs will start exercising the wipe
assertions; CI inclusion would require wiring up an integration-test
job, which is framework-level work tracked separately.

Skip-guards:
  * Skips when docker is unavailable (cannot stand up the test Redis).
  * Skips when the installed cpex-rate-limiter wheel does not include
    the wipe-on-disable code path (no _wipe_my_counters attribute on
    the plugin class) — i.e. when running against the PyPI baseline
    before the cpex-plugins wipe PR merges.

Test infrastructure additions:
  * tests/integration/fixtures/configs/rate_limiter_redis_only.yaml
    — minimal plugin config loading just the rate limiter with
    backend=redis, sourcing redis_url via "{{ env.REDIS_URL }}"
    (Jinja substitution at load time; matches the pattern the
    gateway itself uses for its own Redis URL).
  * Dedicated host port 16380 so the test container is hermetic and
    never collides with a developer's local Redis on 6379 or with
    other smoke-test stacks.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Adds B-0.5 — the next rung past B-0 — which exercises the real
publish/subscribe round-trip that B-0 deliberately skipped.  B-0
sets the mode key in Redis directly and calls factory.reload_tenant
directly; this test calls publish_plugin_mode_change (the function
the admin handler invokes), captures the resulting message off the
real Redis channel, and feeds it through _handle_invalidation_message
(the function the gateway's listener calls).

What B-0.5 pins that B-0 cannot:

  * The publisher's actual wire format on the channel — channel name,
    JSON keys, value types — matches what _handle_invalidation_message
    accepts and routes to invalidate_all_plugin_managers ->
    factory.invalidate_all -> reload_tenant -> manager.shutdown ->
    plugin.shutdown -> wipe.
  * A regression where publish_plugin_mode_change and the handler
    drift on format (e.g. one renames a JSON key without the other)
    would still pass every existing mocked-JSON pubsub test, because
    those tests construct the message bytes themselves.  This test
    PUBLISHes via the real publisher and receives off the real
    channel, so any drift breaks the round-trip.
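Condensed round-trip (channel name and handler signature assumed; publish_plugin_mode_change and _handle_invalidation_message are the real functions under test):

import redis.asyncio as aioredis

INVALIDATION_CHANNEL = "plugins:invalidate"              # channel name assumed

r = aioredis.from_url("redis://localhost:16380")
pubsub = r.pubsub()
await pubsub.subscribe(INVALIDATION_CHANNEL)
await publish_plugin_mode_change("RateLimiterPlugin", "disabled")   # real publisher
msg = await pubsub.get_message(ignore_subscribe_messages=True, timeout=5.0)
await _handle_invalidation_message(msg["data"])          # real listener entry point
assert not await r.keys("ratelimit:*")                   # wipe fired through the chain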

Wiring fix discovered while writing this test: the framework's
shared-redis access goes through a dependency-inversion shim
(set_shared_redis_provider in mcpgateway/plugins/framework/_redis.py),
not via mcpgateway.utils.redis_client._client.  Documented in the
test's comment so future readers don't go down the wrong path.

Layered toggle-behaviour coverage on this branch (after this commit):
  Layer A    — framework boundary, shutdown side    [unit]
  Layer Ai   — framework boundary, initialize side  [unit]
  Layer B-0  — manager-mediated plugin wipe         [integration]
  Layer B-0.5 — pubsub round-trip wipe              [integration, this commit]
  Layer C    — convergence measurement harness      [integration, existing]

Layer D (convergence threshold under wipe-on-disable) remains gated
on empirical validation against a wipe-enabled gateway image.

CI behaviour and skip-guards: identical to B-0 (auto-skipped by the
pytest_integration_mark plugin unless --with-integration is passed;
skips when cpex-rate-limiter wheel lacks the wipe-on-disable code
path; runs locally against a wipe-enabled wheel).

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…lti-replica wipe probe via Redis pubsub

Two coupled changes:

1. Align the test fixtures' plugin name with the gateway's own
   plugins/config.yaml convention.  Before this commit, the
   wipe-on-disable yaml fixture and the boundary tests used
   name="RateLimiter" while the gateway's production plugin uses
   name="RateLimiterPlugin", creating two disjoint mode-key
   namespaces.  Touches the yaml fixture and the explicit mode-key
   writes in B-0 / B-0.5.  No behavioural change to either test —
   they were internally consistent before; this just makes them
   match the gateway convention.

2. Add tests/integration/test_rate_limiter_toggle_via_redis_pubsub.py
   — a multi-replica wipe-on-disable convergence probe that bypasses
   HTTP entirely.  Talks only to Redis: SETs the mode key, PUBLISHes
   the invalidation frame, then polls until the counter key
   disappears.  Records elapsed time as the convergence measurement.
   Companion to test_rate_limiter_toggle_live_state.py (Layer C),
   which drives the same toggle via HTTP+MCP and is therefore
   amplified by the STREAMABLEHTTP transport.  This new test
   measures the wipe-on-disable claim itself, not "the claim plus
   measurement overhead."  (Core loop sketched at the end of this
   commit message.)

   Skip behaviour: skips when the gateway stack isn't running or
   Redis isn't reachable.  Does not skip when wipe-on-disable code
   is missing from the running gateway — fails with a clear
   timeout message instead, which is the right semantics for a
   deployment that has wipe enabled.

   Known caveat (not fixed by this commit, surfaced empirically
   while testing): factory.invalidate_all() iterates *cached*
   managers only.  A fresh gateway with no traffic has zero
   cached managers, so the wipe-on-disable path is unreachable
   until a request warms a manager for some context.  Worth
   flagging in the cpex-plugins wipe-on-disable PR description as
   a deployment caveat.  Following commits will add a one-request
   warm-up step to the test so it can run from a clean stack;
   this commit lands the test as written and the rename together.
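The probe's core, roughly (key names, frame payload, and the ceiling are assumptions; the real frame format is owned by publish_plugin_mode_change):

import json
import time
import redis

MODE_KEY = "plugins:mode:RateLimiterPlugin"        # assumed
INVALIDATION_CHANNEL = "plugins:invalidate"        # assumed
COUNTER_KEY = "ratelimit:demo"                     # assumed
MAX_CONVERGENCE_S = 30                             # generous given the ~0.11s observation
frame = {"plugin": "RateLimiterPlugin", "mode": "disabled"}   # keys assumed

r = redis.Redis(host="localhost", port=6379)
r.set(MODE_KEY, "disabled")
r.publish(INVALIDATION_CHANNEL, json.dumps(frame))
start = time.monotonic()
while r.exists(COUNTER_KEY):                       # poll until the wipe lands
    if time.monotonic() - start > MAX_CONVERGENCE_S:
        raise AssertionError("wipe never converged; is wipe-on-disable in the image?")
    time.sleep(0.05)
print(f"wipe converged in {time.monotonic() - start:.2f}s")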

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
araujof (Member) commented May 3, 2026

DO NOT MERGE before #3754 is merged.
It touches files under mcpgateway/plugins.

…stack

Adds the two pre-condition steps the test needed to actually exercise
the wipe path against a fresh gateway, discovered empirically while
running the test against the wipe-enabled multi-replica stack:

  1. _set_plugin_mode("enforce") via the existing admin-API helper.
     plugins/config.yaml ships with mode=disabled for the rate limiter,
     and the framework loader does not instantiate disabled plugins
     under any cached manager.  No plugin instance ⇒ no shutdown to
     fire ⇒ no wipe path to run, regardless of how many subscribers
     the broadcast reaches.  Toggling to enforce first writes the
     mode key, publishes the invalidation frame, and primes every
     worker's local override map so the next manager-build
     instantiates the plugin in enforce mode.

  2. _send_tool_burst(server_id, tool_name, 20).  factory.invalidate_all()
     iterates only *cached* managers.  Each gunicorn worker has
     its own factory; without prior traffic on a worker, that
     factory's _managers dict is empty and the broadcast triggers no
     reload_tenant.  20 MCP tools/call requests ride gunicorn's load
     balancer across many of the 72 workers (3 replicas × 24 workers),
     so multiple workers cache a manager — one warm worker firing its
     plugin's shutdown is enough since the wipe runs SCAN+DEL against
     shared Redis, but more warm workers raises the floor against
     transient races.  These are amplified by the STREAMABLEHTTP
     transport but are a one-off pre-condition; the convergence
     measurement loop below them stays HTTP-free.

Empirically confirmed against a multi-replica wipe-enabled stack:
the wipe converges in ~0.11s — vs. the ~22s baseline observed in
this PR's existing live-state probe.  ~170x improvement, consistent
with the hypothesis that wipe-on-disable doesn't speed up the cache
rebuild but makes the rebuild's user-visible latency irrelevant
(stale-mode workers with empty counters cannot block any request).

It took two iterations to land this:
  * 1-warm-up run: didn't fire (only 1 of 72 workers warmed; not
    determinative on its own — see below).
  * 20-warm-ups + enforce-pre-condition run: ✓ converged in ~0.11s.

The 1-warm-up failure was almost certainly because mode=disabled in
plugins/config.yaml prevented the plugin from being loaded under
the warmed manager (path B in the iterative diagnosis), not because
1 warm worker was insufficient.  The enforce pre-condition was the
load-bearing fix; the bump from 1 to 20 warm-ups is defensive
against load-balancing races.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 gandhipratik203 marked this pull request as draft May 3, 2026 12:30
gandhipratik203 (Collaborator, Author) commented:
@araujof Noted, won't merge with mcpgateway/plugins changes. I'll remove the files under mcpgateway/plugins so the PR only contains the tests/integration changes.
