fix(retry): retry on socket connection closed unexpectedly (#1881) #1882
Merged
Adding .gitkeep for PR creation (default mode). This file will be removed when the task is complete. Issue: #1881
Classify socket/connection-level network disconnects as transient, retryable errors in classifyRetryableError(). Previously the Claude/Codex CLI's synthetic "API Error: The socket connection was closed unexpectedly" message matched no known transient pattern, so a single dropped socket aborted the whole solve session (exit code 1, zero retries), discarding all in-progress work and API spend. Now socket connection was closed unexpectedly, socket hang up, ECONNRESET, connection reset, Connection error, fetch failed and network connection lost are retryable, so the session resumes via --resume on the existing exponential-backoff path. Because classifyRetryableError is the single shared classifier, the fix covers the Claude, Codex and Agent execution loops. Adds tests/test-issue-1881-socket-error-retry.mjs (19 assertions) and a deep case study under docs/case-studies/issue-1881/.
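The resume-on-retry behaviour described above can be pictured with a minimal sketch. Everything here is illustrative: `runWithRetry`, `runSession`, `classify` and the option names are hypothetical and are not hive-mind's actual API. The `resume` flag only stands in for passing `--resume` to the CLI.

```javascript
// Hypothetical retry loop (names and defaults are assumptions, not project code).
// runSession({ resume }) runs one attempt; classify(err) returns { retryable }.
async function runWithRetry(runSession, classify, opts = {}) {
  const {
    maxRetries = 3,
    baseDelayMs = 1000,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = opts;

  let resume = false;
  for (let attempt = 0; ; attempt++) {
    try {
      return await runSession({ resume });
    } catch (err) {
      const { retryable } = classify(err);
      // Non-transient errors and exhausted retries propagate to the caller.
      if (!retryable || attempt >= maxRetries) throw err;
      // Exponential backoff: baseDelayMs, then 2x, 4x, ...
      await sleep(baseDelayMs * 2 ** attempt);
      // Later attempts continue the existing session instead of starting over.
      resume = true;
    }
  }
}
```

With a classifier that treats a dropped socket as retryable, a single disconnect costs one backoff delay rather than the whole session.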
Working session summary
Everything is already complete. Nothing to do. Final state of PR #1882:
What was delivered:
This summary was automatically extracted from the AI working session output.
🤖 Solution Draft Log
This log file contains the complete execution trace of the AI solution draft process.
💰 Cost estimation:
📊 Context and tokens usage:
Claude Opus 4.8: (2 sub-sessions)
Total: (24.8K new + 210.2K cache writes + 4.0M cache reads) input tokens, 30.9K output tokens, $4.197557 cost
Claude Opus 4.7:
Total: (7 new + 73.0K cache writes + 72.0K cache reads) input tokens, 1.2K output tokens, $0.521708 cost
🤖 Models used:
📎 Log file uploaded as Gist (1957KB)
The working session has ended; feel free to review and add any feedback on the solution draft.
✅ Ready to merge
This pull request is now ready to be merged:
Monitored by hive-mind with the --auto-restart-until-mergeable flag
Summary
Fixes #1881.
A `solve` run aborted mid-session with `API Error: The socket connection was closed unexpectedly`. The Claude CLI surfaced this as a synthetic assistant message (`"model": "<synthetic>"`) plus a `result` event with `is_error: true`. hive-mind detected the error but, because the message matched no known transient pattern, failed the whole session immediately (exit code 1, zero retries), discarding ~35 turns of work and ~$5.28 of API spend.
Root cause
`classifyRetryableError()` in `src/tool-retry.lib.mjs` recognised `Overloaded`, `Request timed out`, `stream disconnected before completion`, `503` and `500` as retryable, but not socket-level network disconnects. `The socket connection was closed unexpectedly` was therefore classified as non-retryable, so the unified transient-retry path was skipped even though this is a textbook transient network error and the session was fully resumable via `--resume`.
The fix
Add a socket/connection branch to `classifyRetryableError()`, matching `socket connection was closed unexpectedly`, `socket hang up`, `ECONNRESET`, `connection reset`, `network connection lost`, `Connection error` and `fetch failed`. These flow through to the existing retry-with-`--resume`, exponential-backoff path (`isCapacity: false`, so no spurious model switch).
Because `classifyRetryableError` is the single shared classifier used by the Claude (`src/claude.lib.mjs`), Codex (`src/codex.lib.mjs`) and Agent (`src/agent.lib.mjs`) execution loops, one change fixes every tool's retry path.
A repo-wide search confirmed no other code special-cases socket strings.
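As a rough illustration of what such a socket/connection branch could look like, here is a minimal sketch. The pattern list mirrors the strings named in this PR. The function name `classifySocketError`, the regexes, and the `retryable`/`label` property names are assumptions, not the code in `src/tool-retry.lib.mjs`.

```javascript
// Hypothetical sketch of a socket/connection branch (not the project's code).
const SOCKET_PATTERNS = [
  /socket connection was closed unexpectedly/i,
  /socket hang up/i,
  /ECONNRESET/i,
  /connection reset/i,
  /network connection lost/i,
  /connection error/i,
  /fetch failed/i,
];

function classifySocketError(error) {
  // Accept either a plain string or an SDK-style { message } object.
  const text = typeof error === 'string' ? error : String(error?.message ?? '');
  const matched = SOCKET_PATTERNS.some((pattern) => pattern.test(text));
  if (!matched) return { retryable: false };
  return {
    retryable: true,
    // A dropped socket is not a capacity problem, so no model switch is implied.
    isCapacity: false,
    label: 'Socket/connection closed unexpectedly',
  };
}
```

Matching on substrings rather than exact messages means wrapped forms such as `API Error: The socket connection was closed unexpectedly` are caught without a separate rule.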
How to reproduce / test
`tests/test-issue-1881-socket-error-retry.mjs` (default suite, 19 assertions):
- The reported error is retryable, carries the `Socket/connection closed unexpectedly` label and `isCapacity: false`, and is still retryable when wrapped in the SDK `{ message }` object shape.
- Related patterns (`socket hang up`, `ECONNRESET`, `connection reset`, `Connection error.`, `fetch failed`, `network connection lost`) are retryable.
- Unrelated errors (`ENOENT`, `SyntaxError`, `Permission denied`, `context_length_exceeded`) stay non-retryable; pre-existing transient patterns (`Overloaded`, `Request timed out`, `stream disconnected before completion`, `503`) still work.
Full default suite: 235 test files pass.
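The coverage listed above lends itself to a table-driven check. The sketch below is hypothetical and is not the contents of the real test file: `findMisclassified` and the stand-in classifier are invented for illustration, the `{ retryable }` return shape is an assumption, and the case table is not the suite's 19 assertions.

```javascript
// Hypothetical table-driven check. `classify` is any function returning
// { retryable: boolean }; the real suite would import classifyRetryableError.
const CASES = [
  // [input, expected retryable]
  ['API Error: The socket connection was closed unexpectedly', true],
  [{ message: 'The socket connection was closed unexpectedly' }, true],
  ['socket hang up', true],
  ['ECONNRESET', true],
  ['connection reset', true],
  ['Connection error.', true],
  ['fetch failed', true],
  ['network connection lost', true],
  ['Overloaded', true],
  ['Request timed out', true],
  ['stream disconnected before completion', true],
  ['503', true],
  ['ENOENT', false],
  ['SyntaxError', false],
  ['Permission denied', false],
  ['context_length_exceeded', false],
];

// Returns the inputs whose classification differs from the expectation.
function findMisclassified(classify) {
  return CASES
    .filter(([input, expected]) => classify(input).retryable !== expected)
    .map(([input]) => input);
}

// Stand-in classifier so the sketch runs on its own (an assumption, not project code).
const TRANSIENT = /socket connection was closed unexpectedly|socket hang up|ECONNRESET|connection reset|network connection lost|connection error|fetch failed|overloaded|request timed out|stream disconnected before completion|\b503\b/i;
const standIn = (error) => ({
  retryable: TRANSIENT.test(typeof error === 'string' ? error : String(error?.message ?? '')),
});

console.log(findMisclassified(standIn)); // empty array when every case matches
```

Keeping negative cases such as `ENOENT` in the same table guards against a pattern that is too broad and starts retrying genuine failures.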
Case study
A deep analysis (timeline, evidence, root cause, online/upstream research) lives in `docs/case-studies/issue-1881/`, with the raw solve log and error excerpt under `docs/case-studies/issue-1881/data/`.
Upstream
`The socket connection was closed unexpectedly` is a known upstream Claude Code / Anthropic SDK issue, already extensively reported (anthropics/claude-code#48837, #51107, #54287, #60133, #56711, #49761). Reported root causes are VPN/firewall/QUIC interference, HTTP/3 no-fallback, and idle-socket teardown (missing `SO_KEEPALIVE`). All are transient, client-side/network failures. No new, non-duplicate upstream issue is warranted; the actionable in-our-control defect is hive-mind aborting instead of retrying a resumable session, which this PR fixes.