Skip to content

fix(retry): retry on socket connection closed unexpectedly (#1881) #1882

Merged
konard merged 2 commits into main from issue-1881-bf6cf2d08eda
Jun 10, 2026

Conversation

konard commented Jun 10, 2026

Summary

Fixes #1881.

A solve run aborted mid-session with:

API Error: The socket connection was closed unexpectedly. For more information,
pass `verbose: true` in the second argument to fetch()

The Claude CLI surfaced this as a synthetic assistant message
("model": "<synthetic>") plus a result event with is_error: true. hive-mind
detected the error but, because the message matched no known transient pattern,
failed the whole session immediately (exit code 1, zero retries),
discarding ~35 turns of work and ~$5.28 of API spend.

Root cause

classifyRetryableError() in src/tool-retry.lib.mjs recognised Overloaded,
Request timed out, stream disconnected before completion, 503 and 500 as
retryable, but not socket-level network disconnects. The message
"The socket connection was closed unexpectedly" was therefore classified as non-retryable, so the unified
transient-retry path was skipped even though this is a textbook transient network
error and the session was fully resumable via --resume.

The fix

Add a socket/connection branch to classifyRetryableError(), matching
socket connection was closed unexpectedly, socket hang up, ECONNRESET,
connection reset, network connection lost, Connection error and
fetch failed. These flow through to the existing retry path, which resumes the
session with --resume and backs off exponentially (isCapacity: false, so no
spurious model switch).
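
A minimal sketch of what such a socket/connection branch could look like. The function name classifySocketError, the helper toMessage and the returned field names (retryable, label) are assumptions made for illustration; only the matched strings, the label text and isCapacity: false come from this description, and the real classifyRetryableError() in src/tool-retry.lib.mjs may be structured differently.

```javascript
// Hypothetical sketch, not the actual code from src/tool-retry.lib.mjs.
// Patterns are case-insensitive and carry no /g flag, so .test() is stateless.
const SOCKET_PATTERNS = [
  /socket connection was closed unexpectedly/i,
  /socket hang up/i,
  /ECONNRESET/i,
  /connection reset/i,
  /network connection lost/i,
  /connection error/i,
  /fetch failed/i,
];

// Accept a plain string, an Error, or an SDK-style { message } object.
function toMessage(error) {
  if (typeof error === 'string') return error;
  if (error && typeof error.message === 'string') return error.message;
  return String(error);
}

// Returns a verdict object; the field names are assumptions for this sketch.
function classifySocketError(error) {
  const message = toMessage(error);
  const matched = SOCKET_PATTERNS.some((pattern) => pattern.test(message));
  if (!matched) {
    return { retryable: false, label: null, isCapacity: false };
  }
  return {
    retryable: true,
    label: 'Socket/connection closed unexpectedly',
    // A dropped socket is not a capacity problem, so no model switch is implied.
    isCapacity: false,
  };
}
```

Matching on substrings rather than on the full message keeps the branch tolerant of prefixes such as "API Error:" that the CLI adds around the underlying text.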

Because classifyRetryableError is the single shared classifier used by the
Claude (src/claude.lib.mjs), Codex (src/codex.lib.mjs) and Agent
(src/agent.lib.mjs) execution loops, one change fixes every tool's retry path.
A repo-wide search confirmed no other code special-cases socket strings.
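
For orientation, here is a rough sketch of how an execution loop might consume a shared classifier: retry only when the verdict is retryable, wait with exponential backoff, and carry the session forward so the next attempt resumes instead of restarting. Everything here (runWithRetry, runTool, resumeSessionId, error.sessionId, maxRetries, baseDelayMs, the injectable sleep) is hypothetical; the real loops in src/claude.lib.mjs, src/codex.lib.mjs and src/agent.lib.mjs pass --resume to the CLI and may differ in every detail.

```javascript
// Hypothetical sketch of a retry loop; none of these names come from hive-mind.
function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// maxRetries counts retries after the first attempt.
async function runWithRetry(runTool, classify, options = {}) {
  const { maxRetries = 3, baseDelayMs = 1000, sleep = defaultSleep } = options;
  let sessionId = null;
  for (let attempt = 0; ; attempt++) {
    try {
      // A non-null resumeSessionId stands in for passing --resume to the CLI.
      return await runTool({ resumeSessionId: sessionId });
    } catch (error) {
      const verdict = classify(error);
      // Give up on non-transient errors or once the retry budget is spent.
      if (!verdict.retryable || attempt >= maxRetries) throw error;
      // Remember the session so the next attempt resumes the existing work.
      if (error && error.sessionId) sessionId = error.sessionId;
      // Exponential backoff: baseDelayMs, then 2x, 4x, and so on.
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

Injecting sleep keeps the sketch testable without real delays; a production loop would more likely also add jitter and log each retry.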

How to reproduce / test

tests/test-issue-1881-socket-error-retry.mjs (default suite, 19 assertions):

  • The exact issue #1881 message is classified as retryable, gets the
    Socket/connection closed unexpectedly label and isCapacity: false, and is still
    retryable when wrapped in the SDK { message } object shape.
  • Related signatures (socket hang up, ECONNRESET, connection reset,
    Connection error., fetch failed, network connection lost) are retryable.
  • Regression guards: non-transient errors (ENOENT, SyntaxError,
    Permission denied, context_length_exceeded) stay non-retryable; pre-existing
    transient patterns (Overloaded, Request timed out,
    stream disconnected before completion, 503) still work.
node tests/test-issue-1881-socket-error-retry.mjs   # 19 passed, 0 failed

Full default suite: 235 test files pass.
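
The checks listed above lend themselves to a table-driven layout. The sketch below is illustrative only and is not the content of tests/test-issue-1881-socket-error-retry.mjs: the cases table, runChecks and the stand-in classifier are hypothetical, and the retryable field name is an assumption. In the repository the stand-in would be replaced by the real classifyRetryableError import.

```javascript
// Hypothetical table-driven check; the real test file may be organised differently.
const cases = [
  { input: 'API Error: The socket connection was closed unexpectedly.', retryable: true },
  { input: { message: 'socket hang up' }, retryable: true }, // SDK { message } shape
  { input: 'read ECONNRESET', retryable: true },
  { input: 'fetch failed', retryable: true },
  // Regression guards: non-transient errors must stay non-retryable.
  { input: 'ENOENT: no such file or directory', retryable: false },
  { input: 'Permission denied', retryable: false },
  { input: 'context_length_exceeded', retryable: false },
];

// Runs every case against the given classifier and tallies the results.
function runChecks(classify) {
  let passed = 0;
  let failed = 0;
  for (const { input, retryable } of cases) {
    const actual = Boolean(classify(input).retryable);
    if (actual === retryable) {
      passed++;
    } else {
      failed++;
      console.error('FAIL', JSON.stringify(input), 'expected retryable =', retryable);
    }
  }
  return { passed, failed };
}

// Stand-in classifier so the sketch runs on its own.
const standIn = (error) => {
  const message = typeof error === 'string' ? error : String(error && error.message);
  const pattern = /socket|ECONNRESET|connection reset|network connection lost|connection error|fetch failed/i;
  return { retryable: pattern.test(message) };
};

console.log(runChecks(standIn));
```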

Case study

A deep analysis (timeline, evidence, root cause, online/upstream research) lives in
docs/case-studies/issue-1881/, with the raw solve log and error excerpt under
docs/case-studies/issue-1881/data/.

Upstream

"The socket connection was closed unexpectedly" is a known upstream Claude Code /
Anthropic SDK issue, already extensively reported (anthropics/claude-code#48837,
#51107, #54287, #60133, #56711, #49761). The reported root causes are VPN/firewall/QUIC
interference, HTTP/3 no-fallback, and idle-socket teardown (missing
SO_KEEPALIVE). All are transient, client-side/network failures. No new,
non-duplicate upstream issue is warranted; the actionable in-our-control defect is
hive-mind aborting instead of retrying a resumable session, which this PR fixes.

Adding .gitkeep for PR creation (default mode).
This file will be removed when the task is complete.

Issue: #1881
@konard konard self-assigned this Jun 10, 2026
Classify socket/connection-level network disconnects as transient,
retryable errors in classifyRetryableError(). Previously the Claude/Codex
CLI's synthetic "API Error: The socket connection was closed unexpectedly"
message matched no known transient pattern, so a single dropped socket
aborted the whole solve session (exit code 1, zero retries), discarding all
in-progress work and API spend.

Now socket connection was closed unexpectedly, socket hang up, ECONNRESET,
connection reset, Connection error, fetch failed and network connection lost
are retryable, so the session resumes via --resume on the existing
exponential-backoff path. Because classifyRetryableError is the single shared
classifier, the fix covers the Claude, Codex and Agent execution loops.

Adds tests/test-issue-1881-socket-error-retry.mjs (19 assertions) and a deep
case study under docs/case-studies/issue-1881/.
@konard konard changed the title from "[WIP] CLAUDE execution failed with API Error: The socket connection was closed unexpectedly." to "fix(retry): retry on socket connection closed unexpectedly (#1881)" Jun 10, 2026
@konard konard marked this pull request as ready for review June 10, 2026 01:02
konard commented Jun 10, 2026

Working session summary

Everything is already complete. Nothing to do.

Final state of PR #1882:

  • Branch: issue-1881-bf6cf2d08eda ✓ pushed, clean working tree
  • HEAD: e921b34b fix(retry): retry on "socket connection was closed unexpectedly" (#1881)
  • Title: fix(retry): retry on socket connection closed unexpectedly (#1881)
  • Status: OPEN, not draft, MERGEABLE
  • All 10 non-skipped CI checks: SUCCESS (lint, test-compilation, test-execution, test-suites, memory-check-linux, validate-docs, check-file-line-limits, Check for Changesets, Check for Manual Version Changes, detect-changes)

What was delivered:

  • src/tool-retry.lib.mjs — added socket/connection branch to classifyRetryableError() (matches socket connection was closed unexpectedly, socket hang up, ECONNRESET, connection reset, network connection lost, Connection error, fetch failed); isCapacity: false so no spurious model switch. Because this is the single shared classifier, the fix covers Claude, Codex and Agent paths at once.
  • tests/test-issue-1881-socket-error-retry.mjs — 19 assertions covering the exact issue message, related signatures, non-transient regression guards, and pre-existing transient patterns. Part of the default suite (235 files pass).
  • docs/case-studies/issue-1881/ — deep case study with timeline, root cause, online/upstream research, and raw evidence under data/.
  • .changeset/issue-1881-socket-error-retry.md — patch bump.
  • .prettierignore — case-study data folder added (matching repo convention).

This summary was automatically extracted from the AI working session output.

konard commented Jun 10, 2026

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

  • Public pricing estimate: $4.719265
  • Calculated by Anthropic: $0.521708
  • Difference: $-4.197557 (-88.95%)
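
The difference line follows from the two figures above it. A small sketch of the arithmetic, with variable names chosen here for illustration (the log does not say how the bot computes it):

```javascript
// Figures copied from the cost estimation above; variable names are illustrative.
const publicEstimate = 4.719265;
const anthropicCalculated = 0.521708;

// Difference is taken relative to the public pricing estimate.
const difference = anthropicCalculated - publicEstimate;
const percent = (difference / publicEstimate) * 100;

console.log(`$${difference.toFixed(6)} (${percent.toFixed(2)}%)`);
```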

📊 Context and tokens usage:

Claude Opus 4.8: (2 sub-sessions)

  1. 109.7K / 1M (11%) input tokens, 23.3K / 128K (18%) output tokens
  2. 73.0K / 1M (7%) input tokens, 8.8K / 128K (7%) output tokens

Total: (24.8K new + 210.2K cache writes + 4.0M cache reads) input tokens, 30.9K output tokens, $4.197557 cost

Claude Opus 4.7:

  • 73.0K / 1M (7%) input tokens, 1.2K / 128K (1%) output tokens

Total: (7 new + 73.0K cache writes + 72.0K cache reads) input tokens, 1.2K output tokens, $0.521708 cost

🤖 Models used:

  • Tool: Anthropic Claude Code
  • Requested: opus
  • Main model: Claude Opus 4.8 (claude-opus-4-8)
  • Additional models:
    • Claude Opus 4.7 (claude-opus-4-7)

📎 Log file uploaded as Gist (1957KB)


The working session has now ended; feel free to review and add any feedback on the solution draft.

konard commented Jun 10, 2026

✅ Ready to merge

This pull request is now ready to be merged:

  • All CI checks have passed
  • No merge conflicts
  • No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

@konard konard merged commit 4b9c2b7 into main Jun 10, 2026
25 checks passed

Development

Successfully merging this pull request may close these issues.

CLAUDE execution failed with API Error: The socket connection was closed unexpectedly.