Extract IO errors from h2 for streaming retries of Connection Reset #15675

Merged: konstin merged 1 commit into main from konsti/h2ohwherearemyerrors on Sep 4, 2025
@konstin konstin commented Sep 4, 2025

Our streaming retries were missing connection reset errors as h2 was shadowing IO errors (hyperium/h2#862).

**Test plan**

In one terminal:

```
cargo run python uninstall 3.12 && cargo run python install 3.12 -vv
```

In another terminal:

```
sudo tcpkill -i wlp2s0 port 443
```

Output:

```
error: Failed to install cpython-3.12.11-linux-x86_64-gnu
  Caused by: Request failed after 3 retries
  Caused by: Failed to download https://github.com/astral-sh/python-build-standalone/releases/download/20250902/cpython-3.12.11%2B20250902-x86_64-unknown-linux-gnu-install_only_stripped.tar.gz
  Caused by: error sending request for url (https://github.com/astral-sh/python-build-standalone/releases/download/20250902/cpython-3.12.11%2B20250902-x86_64-unknown-linux-gnu-install_only_stripped.tar.gz)
  Caused by: client error (SendRequest)
  Caused by: connection error
  Caused by: connection reset
```

I don't know how to test that from inside Rust.

Fix #14171 (again, hopefully)
@konstin konstin requested a review from zanieb September 4, 2025 11:26
@konstin konstin added the bug (Something isn't working) and network (Network connectivity e.g. proxies, DNS, and SSL) labels Sep 4, 2025

@zanieb zanieb left a comment

😭

@konstin konstin merged commit 4a1813f into main Sep 4, 2025
157 checks passed
@konstin konstin deleted the konsti/h2ohwherearemyerrors branch September 4, 2025 12:45
tmeijn pushed a commit to tmeijn/dotfiles that referenced this pull request Sep 12, 2025
This MR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [astral-sh/uv](https://github.com/astral-sh/uv) | patch | `0.8.15` -> `0.8.17` |

MR created with the help of [el-capitano/tools/renovate-bot](https://gitlab.com/el-capitano/tools/renovate-bot).

**Proposed changes to behavior should be submitted there as MRs.**

---

### Release Notes

<details>
<summary>astral-sh/uv (astral-sh/uv)</summary>

### [`v0.8.17`](https://github.com/astral-sh/uv/blob/HEAD/CHANGELOG.md#0817)

[Compare Source](astral-sh/uv@0.8.16...0.8.17)

Released on 2025-09-10.

##### Enhancements

- Improve error message for HTTP validation in auth services ([#15768](astral-sh/uv#15768))
- Respect `PYX_API_URL` when suggesting `uv auth login` on 401 ([#15774](astral-sh/uv#15774))
- Add pyx as a supported PyTorch index URL ([#15769](astral-sh/uv#15769))

##### Bug fixes

- Avoid initiating login flow for invalid API keys ([#15773](astral-sh/uv#15773))
- Do not search for a password for requests with a token attached already ([#15772](astral-sh/uv#15772))
- Filter pre-release Python versions in `uv init --script` ([#15747](astral-sh/uv#15747))

### [`v0.8.16`](https://github.com/astral-sh/uv/blob/HEAD/CHANGELOG.md#0816)

[Compare Source](astral-sh/uv@0.8.15...0.8.16)

##### Enhancements

- Allow `--editable` to override `editable = false` annotations ([#15712](astral-sh/uv#15712))
- Allow `editable = false` for workspace sources ([#15708](astral-sh/uv#15708))
- Show a dedicated error for virtual environments in source trees on build ([#15748](astral-sh/uv#15748))
- Support Android platform tags ([#15646](astral-sh/uv#15646))
- Support iOS platform tags ([#15640](astral-sh/uv#15640))
- Support scripts with inline metadata in `--with-requirements` and `--requirements` ([#12763](astral-sh/uv#12763))

##### Preview features

- Support `--no-project` in `uv format` ([#15572](astral-sh/uv#15572))
- Allow `uv format` in unmanaged projects ([#15553](astral-sh/uv#15553))

##### Bug fixes

- Avoid erroring when `match-runtime` target is optional ([#15671](astral-sh/uv#15671))
- Ban empty usernames and passwords in `uv auth` ([#15743](astral-sh/uv#15743))
- Error early for parent path in build backend ([#15733](astral-sh/uv#15733))
- Retry on IO errors during HTTP/2 streaming ([#15675](astral-sh/uv#15675))
- Support recursive requirements and constraints inclusion ([#15657](astral-sh/uv#15657))
- Use token store credentials for `uv publish` ([#15759](astral-sh/uv#15759))
- Fix virtual environment activation script compatibility with latest nushell ([#15272](astral-sh/uv#15272))
- Skip Python interpreters that cannot be queried with permission errors ([#15685](astral-sh/uv#15685))

##### Documentation

- Clarify that `uv auth` commands take a URL ([#15664](astral-sh/uv#15664))
- Improve the CLI help for options that accept requirements files ([#15706](astral-sh/uv#15706))
- Adds example for caching for managed Python downloads in Docker builds ([#15689](astral-sh/uv#15689))

</details>

swoboda1337 added a commit to swoboda1337/esphome that referenced this pull request Feb 10, 2026
Upgrade uv from 0.6.14 to 0.10.1 to pick up the fix for HTTP/2
connection reset retry handling (astral-sh/uv#15675, released in
0.8.16). Also set UV_HTTP_RETRIES=10 (default 3) to better handle
transient network errors during PlatformIO penv bootstrap.

Remove the UV_CACHE_DIR override since pioarduino now handles this
upstream (pioarduino/platform-espressif32#386).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
danielhanchen added a commit to unslothai/unsloth that referenced this pull request Jun 16, 2026
…oad failures (#6281)

* Make Studio installer resilient to transient uv download failures

Updating an existing Studio install via install.sh could hard-fail and roll
back when a wheel download (torch, unsloth) hit a transient connection reset:

  x Failed to download unsloth==2026.6.6
  error decoding response body -> error reading a body from connection
  -> connection reset
  restoring previous environment after failed install...

Root cause: that error chain is a mid-stream HTTP/2 body read failure. uv did
not retry this class until 0.8.16 (astral-sh/uv#15675, h2 was shadowing the
underlying IO error), but the installer pinned UV_MIN_VERSION=0.7.22, so a stale
uv got zero retries and a single blip aborted the whole update under set -e.

Fix (installer only, backwards compatible, no change on success):
- Raise UV_MIN_VERSION to 0.8.16 so stale uv is upgraded to a version that
  retries HTTP/2 streaming body errors.
- Export UV_HTTP_RETRIES=5 and UV_HTTP_TIMEOUT=180 (override-preserving :=).
- Add run_install_cmd_retry (retry-with-backoff around run_install_cmd) and use
  it for the network-heavy uv pip install steps (torch, unsloth, unsloth-zoo
  from git, ROCm torch repair, no-torch runtime deps). Local editable overlays
  and venv creation are left to fail fast.

run_install_cmd_retry preserves the final exit code on permanent failure, so the
existing set -e rollback trap still fires.
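
The retry-with-backoff approach described above can be sketched in POSIX sh roughly as follows. This is an illustration only, not the installer's actual run_install_cmd_retry: the name retry_cmd and its argument order are assumptions made for this example. The command runs inside an `if` so a failure does not trip set -e, the delay doubles between attempts, and the last exit code is returned once attempts run out, so a caller's rollback trap would still see the failure.

```shell
# Override-preserving defaults: only assigned when unset or empty.
: "${UV_HTTP_RETRIES:=5}"
: "${UV_HTTP_TIMEOUT:=180}"
export UV_HTTP_RETRIES UV_HTTP_TIMEOUT

# Hypothetical helper (name and signature are assumptions for this sketch).
# Usage: retry_cmd MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
retry_cmd() {
    _rc_max=$1
    _rc_delay=$2
    shift 2
    _rc_attempt=1
    while :; do
        # Running the command as an `if` condition keeps set -e from aborting.
        if "$@"; then
            return 0
        else
            _rc_status=$?
        fi
        # Out of attempts: hand the last exit code back to the caller.
        if [ "$_rc_attempt" -ge "$_rc_max" ]; then
            return "$_rc_status"
        fi
        echo "attempt $_rc_attempt/$_rc_max failed (exit $_rc_status); retrying in ${_rc_delay}s" >&2
        sleep "$_rc_delay"
        _rc_delay=$((_rc_delay * 2))      # exponential backoff
        _rc_attempt=$((_rc_attempt + 1))
    done
}
```

A call such as `retry_cmd 3 2 uv pip install torch` would make up to three attempts, sleeping 2s and then 4s between them.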

* Apply the same transient-download resilience to the Windows installer

install.ps1 is the native-Windows installer and had the identical issue as
install.sh: it pinned $UvMinVersion=0.7.22 (below uv 0.8.16, which is where uv
started retrying HTTP/2 streaming body errors), set no UV_HTTP_* defaults, and
ran each 'uv pip install' once via Invoke-InstallCommand, so a single connection
reset aborted the update and triggered the Exit-InstallFailure rollback.

install.ps1:
- Raise $UvMinVersion to 0.8.16.
- Default $env:UV_HTTP_RETRIES=5 and $env:UV_HTTP_TIMEOUT=180 (preserving overrides).
- Add Invoke-InstallCommandRetry and use it for the network-heavy uv pip install
  steps (torch, unsloth, unsloth-zoo from git, ROCm torch, no-torch runtime deps).
  Local editable overlays and venv creation stay single-shot.

install.sh:
- Align UNSLOTH_INSTALL_RETRIES sanitization with the PowerShell version: a
  non-positive-integer value now falls back to the default of 3 instead of
  silently disabling retries (set =1 to disable). Keeps both installers identical.

* Adopt pre-marker Studio llama.cpp and sidecar dirs on update

After the uv retry fix, an update now reaches studio/setup.sh, whose
Studio-owned ownership guard rejects a llama.cpp or sidecar venv created by an
earlier install that predates the .unsloth-studio-owned marker:

  ERROR: .../llama.cpp already exists and is not marked as a Studio-owned
         llama.cpp install.

The marker and UNSLOTH_PREBUILT_INFO.json were introduced in the same commit,
so a directory from before that point carries neither signal and a legitimate
self-update fails for anyone who installed earlier (reported on issue #6274).

Fold a one-time adoption into _assert_studio_owned_or_absent (setup.sh) and
Assert-StudioOwnedOrAbsent (setup.ps1): when a custom-home directory lacks the
marker, backfill it and proceed only when there is positive evidence it belongs
to an established Studio home -- the directory carries UNSLOTH_PREBUILT_INFO.json,
or STUDIO_HOME already holds Studio's CLI shim or studio.conf from a prior run.
Both installers write the shim and studio.conf only after invoking setup, so a
fresh install into a dirty custom home (the case the guard protects) does not
have them yet and is still rejected. The venv marker is excluded because install
writes it before setup and so cannot tell a prior install from a fresh one.

* Review fixes: restrict llama.cpp adoption to dir-local evidence; restore install.sh +x

Addresses the PR review on the marker-migration change.

P1 - the adoption helper keyed on root-level Studio sentinels
($STUDIO_HOME/bin/unsloth, share/studio.conf), so once a home was recognized
every unmarked child passed to the guard became adoptable, and an unrelated
directory at a Studio-managed path could be silently marked and overwritten.
Base adoption on evidence inside the directory instead:
  - UNSLOTH_PREBUILT_INFO.json, written by the prebuilt llama.cpp installer (the
    default path, in place well before the marker), or
  - a top-level llama-quantize symlink, written by source builds (a plain
    llama.cpp checkout keeps the binary under build/bin, not a root symlink).
A foreign llama.cpp now stays rejected even inside an established Studio home,
and sidecar venvs (no such fingerprint) stay subject to the strict guard; their
marker has been written since the guard was introduced, so a real custom install
already carries it.

P2 - restore the executable bit on install.sh; a stray mode change to 100644
would break ./install.sh --local on Unix.

On Windows the prebuilt metadata is the signal; source builds are git checkouts
indistinguishable from a user clone, so they are left to the strict guard.

* Bound UNSLOTH_INSTALL_RETRIES / _DELAY before numeric use

An oversized all-digit override (e.g. a fat-fingered
"99999999999999999999") passed the digit-only validation and then reached the
numeric comparison: POSIX `[ -ge ]` errored with "Illegal number" mid-loop and
could spin instead of falling back, and PowerShell's `[int]` cast threw an
Int32 overflow under $ErrorActionPreference = "Stop" before any install ran.

Sanitize with a length guard + range check (sh) and [int]::TryParse with bounds
(ps1), so out-of-range or oversized values fall back to the default. Bounds:
1..100 retries, 0..3600s base delay.
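
The length guard plus range check described above might look like the following in POSIX sh. The function name sanitize_int, its argument order, and the 4-digit length cap are assumptions for illustration (four digits is enough to cover the stated 0..3600 bound); the installer's actual code may differ.

```shell
# Hypothetical helper: print VALUE if it is a decimal integer in [MIN, MAX],
# otherwise print DEFAULT.
# Usage: sanitize_int VALUE DEFAULT MIN MAX
sanitize_int() {
    _si_value=$1
    _si_default=$2
    _si_min=$3
    _si_max=$4
    case $_si_value in
        '' | *[!0-9]*)
            # Empty or contains a non-digit character.
            echo "$_si_default"
            return 0
            ;;
    esac
    # Length guard: reject oversized digit strings before any numeric
    # comparison, so `[ -ge ]` never sees a number it cannot represent.
    if [ "${#_si_value}" -gt 4 ]; then
        echo "$_si_default"
        return 0
    fi
    if [ "$_si_value" -ge "$_si_min" ] && [ "$_si_value" -le "$_si_max" ]; then
        echo "$_si_value"
    else
        echo "$_si_default"
    fi
}
```

With this, `sanitize_int "$UNSLOTH_INSTALL_RETRIES" 3 1 100` would yield 3 for an empty, non-numeric, oversized, or out-of-range value.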

* Studio installers: scope llama.cpp adoption to prebuilt metadata; reject leading-zero retry delay

setup.sh: drop the top-level llama-quantize symlink as an ownership-adoption signal, leaving UNSLOTH_PREBUILT_INFO.json as the sole fingerprint. The shared ownership guard runs immediately before a destructive replace / rm -rf, and a bare root llama-quantize symlink is user-creatable (a user can keep their own llama.cpp build with such a convenience symlink at a custom UNSLOTH_STUDIO_HOME), so the old check could adopt and then delete a user directory. This matches the Windows installer, which already keeps markerless source builds strict. Pre-marker prebuilt installs still adopt via the metadata file, so the original update fix is preserved.
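
The resulting adoption rule can be sketched as below. This is a hypothetical illustration, not the actual _assert_studio_owned_or_absent: the function name studio_owned_or_absent and its return convention are assumptions, while the .unsloth-studio-owned marker and UNSLOTH_PREBUILT_INFO.json file names come from the description above. A directory is accepted if it is absent or already marked, adopted (marker backfilled) only when the prebuilt metadata file is present, and rejected otherwise, including when it merely contains a llama-quantize symlink.

```shell
# Hypothetical sketch of the ownership guard (name and return codes assumed).
# Returns 0 if DIR is absent or may be treated as Studio-owned, 1 otherwise.
studio_owned_or_absent() {
    _so_dir=$1
    # Absent: nothing to protect.
    if [ ! -e "$_so_dir" ]; then
        return 0
    fi
    # Already marked as Studio-owned.
    if [ -f "$_so_dir/.unsloth-studio-owned" ]; then
        return 0
    fi
    # Pre-marker prebuilt install: the metadata file is the sole fingerprint,
    # so backfill the marker once and proceed.
    if [ -f "$_so_dir/UNSLOTH_PREBUILT_INFO.json" ]; then
        : > "$_so_dir/.unsloth-studio-owned"
        return 0
    fi
    # Anything else (including a bare llama-quantize symlink) stays rejected.
    return 1
}
```

A caller would run this check before any destructive replace and abort when it returns non-zero.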

install.sh: reject leading-zero values for UNSLOTH_INSTALL_RETRY_DELAY. A value like 08 or 09 passed the range check but then hit the backoff doubling $((_ricr_delay * 2)), where a non-octal leading zero is a fatal arithmetic error mid-retry. The 0?* pattern routes such values to the default; bare 0 stays valid.
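
Shell arithmetic treats a constant with a leading 0 as octal, which is why 08 and 09 are invalid inside $(( )). A minimal sketch of the kind of `case` check described above follows; the function name is_plain_decimal is an assumption for this example, and only the 0?* pattern is taken from the description.

```shell
# Hypothetical check: succeeds only for "0" or a digit string with no
# leading zero, so the value is safe to use in $(( )) arithmetic.
is_plain_decimal() {
    case $1 in
        '' | *[!0-9]*) return 1 ;;  # empty or contains a non-digit
        0?*) return 1 ;;            # leading zero followed by more digits, e.g. 08
        *) return 0 ;;
    esac
}
```

A value that fails this check would be routed to the default, while a bare 0 passes.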

* Tighten the comments added in this PR

* Condense the comments in this PR

Linked issue: Python installation fails during streamed unpack with connection reset error
