feat: add retry-with-exponential-backoff plugin#3774
Conversation
2ab60e5 to
8b4910e
Compare
|
@madhu-mohan-jaishankar PR #3774 Review Findings SummaryOverviewPR: #3774 — Test, harden and document retry with exponential backoff plugin Findings by Category🔴 Critical Issues (Must Fix Before Merge)1. Resource Fetch Retry Not Implemented
2. No Timeout Handling During Retry Sleep
🟡 Medium Severity Issues3. Potential Retry Count Off-by-One
🟢 Low Severity Issues4. Missing Hook in Plugin Manifest
5. Inconsistent Plugin Descriptions
6. Missing Config Field in Manifest
7. Magic Number in Conversion
📊 Performance Opportunities8. Python Success Path Inefficiency (10-15% gain)
9. Config Merging Overhead (5-8% gain)
10. State Key String Allocation (1-3% gain)
11. JSON Parsing Overhead (20-30% gain when enabled)
🏗️ Refactoring Opportunities12. Extract Retry Orchestration Service
13. Consolidate Plugin Configuration
14. Extract Failure Detection Logic
🔒 Security Findings✅ No Security Issues Found
📋 Implementation Gaps15. No Retry Budget Tracking
16. No Retry Reason Tracking
17. No Retry Metrics
18. No State Cleanup on Restart
19. No Client Disconnect Detection
20. No Retry Policy Validation
21. Missing Usage Examples
22. No Migration Guide
Summary by PriorityMust Fix (Before Merge)
Should Fix (Follow-up PR)
Nice to Have (Future Enhancements)10-22. Performance micro-optimizations, refactoring opportunities, implementation gaps Overall AssessmentQuality: 🟢 HIGH — Well-tested, secure, production-ready |
PR Review: feature/retry-with-exponential-backoffSummaryAdds a new active retry-with-exponential-backoff plugin with optional Rust extension for improved performance on failure detection paths. The plugin detects transient tool failures and signals the gateway to retry with computed delays, replacing a previous advisory-only implementation. Closes #3746. Findings
Fixes Applied
Remaining Issues
RecommendationReady to merge after fixing remaining issues |
|
I would be very convenient to have benchmark comparison script for this plugin in order to check possible improvements in current version or in future versions. |
@msureshkumar88 please find the comments for critical issues Timeout on Retry Count Off-by-One: |
483bedf
909e0aa to
96a0b65
Compare
dima-zakharov
left a comment
There was a problem hiding this comment.
No changes since last review. Approve.
✅ PR #3774 Approval CompleteSummaryCreated comprehensive approval documentation for PR #3774 with future improvements tracked. Files Created
Approval Decision✅ APPROVED - Ready to MergeKey Points:
Status of msureshkumar88's CommentsCritical Issues (RESOLVED)
Low Severity (TRACKED FOR FUTURE)4-7. Documentation inconsistencies → Follow-up PR Future Improvements TrackedHigh Priority (Next Sprint):
Medium Priority (Future Sprints):
Low Priority (Backlog):
RecommendationMERGE NOW with future improvements tracked in follow-up issues. The PR delivers significant value and is production-ready. Remaining items are enhancements, not bugs. |
- Add retry_delay_ms field to PluginResult in models.py - Add recursive retry loop in tool_service.py invoke_tool (retry_attempt param) - Fix manager.py to propagate retry_delay_ms signal across plugin chain - Add RetryWithBackoffPlugin with full-jitter exponential backoff - Add plugin-manifest.yaml and package __init__.py - Add 35 unit tests covering all components Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
…f Rust plugin Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
…l raises Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
Signed-off-by: Dmitry Zakharov <zakharov@ibm.com>
…koff Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
Signed-off-by: Dmitry Zakharov <zakharov@ibm.com>
Signed-off-by: Dmitry Zakharov <zakharov@ibm.com>
…source retry limitation Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
…x Rust monotonic clock - Refactor retry logic into _run_timeout_post_invoke and _retry_tool_invocation helpers, eliminating 4 copies of timeout post-invoke and 3 copies of retry invocation code - Add retry support for timeout path via ToolTimeoutError.retry_delay_ms - Add status-code-aware isError handling: non-transient HTTP errors (400, 401, 404) are no longer blindly retried when the gateway can extract the status code from httpx.HTTPStatusError - Switch Rust state TTL from SystemTime to Instant (monotonic clock) to match Python's time.monotonic() and avoid wall-clock jump issues - Add defensive guard in _run_timeout_post_invoke for None plugin_manager - Add state TTL eviction, docstrings, and check_text_content to plugin manifest - Add tests: timeout retry path, timeout no-retry re-raise, _run_timeout_post_invoke hook invocation, HTTP status forwarding to plugin, non-/servers/ MCP path passthrough - Remove unused ToolHookType imports in test_tool_service_coverage.py Signed-off-by: Jonathan Springer <jps@s390x.com>
03bbd99
96a0b65 to
03bbd99
Compare
🔗 Related Issue
Closes #3746
📝 Summary
This PR implements an active retry-with-exponential-backoff plugin (
RetryWithBackoffPlugin) for the MCP ContextForge gateway. Unlike the previous advisory-only stub (which simply annotated metadata without retrying), this is a fully active plugin that:retry_delay_msfield inToolPostInvokeResult.max_tool_retriesceiling, clamping both global and per-tool override configs.retry_with_backoff_rust) while transparently falling back to a pure-Python implementation.Architecture: Separation of Concerns
The design cleanly separates responsibilities between the plugin and the gateway:
_is_failure)_compute_delay_ms)_STATEdict keyed bytool:request_id)tool_service.pyinvoke_tool)🏷️ Type of Change
🧪 Verification
make lintmake testmake coverage✅ Checklist
make black isort pre-commit)⚡ Performance Benchmarks (Rust vs Python)
The Rust extension (
retry_with_backoff_rust) is an optional drop-in accelerator for the hot failure-detection path. If the.sois not installed, the plugin transparently falls back to pure Python with no behavioural difference.Sequential latency (200,000 iterations each)
is_error=True)sc=500)is_error=False)sc=200)Concurrent throughput (50 tools × 40 reqs, 8 threads, 300 reps)
Key callouts:
Mutex-protectedHashMap— no GIL dependency, correct under true parallelism..sois not installed, the plugin transparently falls back to pure Python with no behaviour change.📓 Notes (optional)
Design Decisions
Gateway owns the sleep and loop; plugin owns detection and delay — This avoids blocking the plugin chain and keeps the plugin stateless from the gateway's perspective. The gateway's existing invoke_tool method was extended with a retry_attempt counter parameter and a recursive tail-call for retries.
retry_delay_ms = 0 on success — On a successful tool invocation, the plugin resets the per-invocation state and returns retry_delay_ms=0. The gateway treats 0 as "no retry needed."
State keyed by (tool_name, request_id) — Each independent tool invocation gets fresh retry state. Retries of the same invocation share state because the gateway passes the same GlobalContext (and thus the same request_id) on every retry attempt. State is cleaned up on success or max-retry exhaustion to avoid memory leaks.
check_text_content is off by default — Signal 3 (text content JSON parsing) is an opt-in escape hatch for pre-2025-spec MCP servers. It is disabled by default because it can false-positive on tools that legitimately return dictionaries containing status_code as informational payload (e.g., monitoring or proxy tools).
Rust extension is optional and zero-cost when absent — The try/except ImportError pattern is the standard Python idiom for optional compiled extensions. The import cost is zero after the first module load.