Pin bitsandbytes to continuous-release_main on ROCm (4-bit decode fix) #4954
danielhanchen merged 8 commits into main from
Conversation
bitsandbytes 0.49.2 on PyPI ships with a broken 4-bit GEMV kernel on
every ROCm target:
- CDNA (gfx90a / gfx942 / gfx950 = MI210 / MI300X / MI350) via a
broken blocksize=32/64 warp64 GEMV kernel whose tests were
explicitly skipped with ROCM_WARP_SIZE_64 guards because the
code was known broken.
- RDNA3 / RDNA3.5 (gfx1100-1103 / gfx1150-1152) via a compile-time
BNB_WARP_SIZE macro in the host-side dispatch that resolves to
64 when the multi-arch wheel is compiled with CDNA as the
primary target, so num_blocks is wrong on RDNA and half the GEMV
output is never written.
At decode shape (1, 1, hidden) both bugs produce NaN. Training is
unaffected because training shapes are (batch, seq_len > 1, hidden)
and never touch the GEMV path. The crash during autoregressive
inference surfaces as _assert_async_cuda_kernel in torch.multinomial,
which on HIP becomes a hard HSA_STATUS_ERROR_EXCEPTION instead of
a clean Python error.
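The greedy-decode collapse can be illustrated without a GPU: once every logit is NaN, all ordering comparisons are false, so a max-style pick falls back to the first index (token id 0). A minimal pure-Python sketch, illustrative only (real decode uses torch.argmax on GPU logits):

```python
# All-NaN logits, as produced by the broken 4-bit GEMV at seq_len=1.
logits = [float("nan")] * 8

# Greedy pick: every comparison against NaN is False, so max() keeps the
# first candidate it saw -- index 0 -- regardless of the rest.
best = max(range(len(logits)), key=lambda i: logits[i])
print(best)  # 0
```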
Both bugs are fixed by bitsandbytes commit 713a3b8 ("[ROCm] Enable
blocksize 32 4-bit quantization and GEMV kernels on AMD CDNA",
PR #1887, merged 2026-03-09) which replaces BNB_WARP_SIZE with a
runtime hipDeviceGetAttribute query and ships a working CDNA warp64
kernel. That commit has not shipped to PyPI yet, but
continuous-release_main wheels are published on every push to bnb
main via GitHub Releases.
Point the ROCm install path at the continuous-release_main x86_64 and
aarch64 wheels and fall back to PyPI >=0.49.1 when the pre-release is
unreachable (offline installs, firewalled hosts, or architectures not
covered by the pre-release wheels). Drop the pin once bnb cuts a
0.50+ tag on PyPI.
Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1): direct
bnb GEMV shape test now returns 0.0078 max abs error at seq_len=1
(no NaN) vs NaN on 0.49.2, and full Unsloth + for_inference + 4-bit
sampling generation works end-to-end.
NVIDIA / CPU / Mac / Windows paths are unaffected -- the helper is
gated on the ROCm torch index and platform.machine().
Code Review
This pull request introduces a mechanism to install a pre-release version of bitsandbytes for AMD ROCm environments to address a 4-bit GEMV kernel bug. The changes include new helper functions in both the shell installer and the Python installation stack to attempt downloading specific wheels from GitHub with a fallback to PyPI. The review feedback highlights the fragility of hardcoding version numbers in the wheel URLs and suggests using a safer printing utility for error logs in the Python implementation.
        _bnb_whl_url="https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl"
        ;;
    aarch64|arm64)
        _bnb_whl_url="https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_aarch64.whl"
The wheel URLs contain a hardcoded version number (1.33.7.preview). If the upstream bitsandbytes project updates the version number in their continuous-release_main tag, these URLs will return a 404 error, breaking the installer for ROCm users. Consider if there is a way to resolve the latest asset URL dynamically or ensure the version number remains stable.
| "bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl" | ||
| ), | ||
| "aarch64": ( | ||
| "https://github.com/bitsandbytes-foundation/bitsandbytes/releases/" | ||
| "download/continuous-release_main/" | ||
| "bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_aarch64.whl" |
Similar to the shell script, these URLs hardcode the 1.33.7.preview version. This makes the installation process fragile if the upstream continuous release updates its versioning. Since this is a pre-release pin, it might be worth adding a comment about the expected stability of this specific version string.
def pip_install_try(
    label: str,
    *args: str,
    constrain: bool = True,
) -> bool:
    """Try to install with pip/uv. Returns True on success, False on failure
    (without raising or exiting). For optional install attempts with a
    follow-up fallback, such as the bnb ROCm pre-release wheel.
    """
    constraint_args: list[str] = []
    if constrain and CONSTRAINTS.is_file():
        constraint_args = ["-c", str(CONSTRAINTS)]

    if USE_UV:
        cmd = _build_uv_cmd(args) + constraint_args
    else:
        cmd = _build_pip_cmd(args) + constraint_args

    if VERBOSE:
        _step(_LABEL, f"{label}...", _dim)
    result = subprocess.run(
        cmd,
        stdout = subprocess.PIPE,
        stderr = subprocess.STDOUT,
    )
    if result.returncode == 0:
        return True
    if VERBOSE and result.stdout:
        print(result.stdout.decode(errors = "replace"))
    return False
The pip_install_try function is a good addition for handling optional installation steps. However, it uses print directly for error output (line 686). To maintain consistency with the rest of the installer and ensure compatibility with potentially non-UTF-8 consoles on Windows, consider using the _safe_print helper defined earlier in the file.
def pip_install_try(
    label: str,
    *args: str,
    constrain: bool = True,
) -> bool:
    """Try to install with pip/uv. Returns True on success, False on failure
    (without raising or exiting). For optional install attempts with a
    follow-up fallback, such as the bnb ROCm pre-release wheel.
    """
    constraint_args: list[str] = []
    if constrain and CONSTRAINTS.is_file():
        constraint_args = ["-c", str(CONSTRAINTS)]
    if USE_UV:
        cmd = _build_uv_cmd(args) + constraint_args
    else:
        cmd = _build_pip_cmd(args) + constraint_args
    if VERBOSE:
        _step(_LABEL, f"{label}...", _dim)
    result = subprocess.run(
        cmd,
        stdout = subprocess.PIPE,
        stderr = subprocess.STDOUT,
    )
    if result.returncode == 0:
        return True
    if VERBOSE and result.stdout:
        _safe_print(result.stdout.decode(errors = "replace"))
    return False

The 16-bit fallback in studio/backend/core/inference/inference.py was
added as a workaround for a bug that this PR already fixes at the
install layer: bitsandbytes <= 0.49.2 has a broken 4-bit GEMV kernel
on every ROCm target, which NaNs at decode shape (seq_len=1) and
crashes autoregressive inference. bnb PR #1887 (commit 713a3b8, in
0.50.0.dev0+, pinned by install.sh / install_python_stack.py in this
PR) restores correct 4-bit decode on MI300X and was verified working
end-to-end with full Unsloth + for_inference + sampling.

Revert the dual code path so ROCm and NVIDIA both go through the normal
FastLanguageModel.from_pretrained + for_inference flow:
- Remove the conditional `from unsloth import` that skipped the import
  on ROCm. The monkey-patches it was trying to avoid were never the
  cause of the crash; bnb 4-bit GEMV was.
- Remove the `if _hw_module.IS_ROCM:` branch in load_model that loaded
  with plain transformers + PEFT + bfloat16, and the
  `_resolve_fp16_base` helper it relied on.
- Remove the `get_chat_template is not None` fallback in
  _load_chat_template_info -- get_chat_template is now always imported.
- Refactor the audio/vision ROCm guard to check _hw_module.IS_ROCM
  directly instead of the removed _IS_ROCM_ENV global. Audio and vision
  on ROCm still need separate validation (FastVisionModel and the CSM
  audio codecs were never tested on HIP) so the guard stays for now.

Add _bnb_rocm_4bit_ok() as a runtime safety net for users who install
from this PR before the install.sh bnb pin kicks in, or whose installer
fell back to the PyPI pin because the continuous-release wheel was
unreachable. When the installed bnb is < 0.50 on ROCm, force
load_in_4bit=False and strip any -unsloth-bnb-4bit / -bnb-4bit suffix
from the model path so a pre-quantized repo resolves to its FP16
sibling instead of pulling bnb back in via the repo's
quantization_config.
LoRA adapters whose base is a pre-quantized repo on old bnb will still
fail inside Unsloth's loader -- the only real fix there is
`unsloth studio update`.

Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1):
- HAPPY path (bnb 0.50.0.dev0, load_in_4bit=True, pre-quantized repo):
  loads in 4-bit via the fixed GEMV, generation returns "Paris." for
  greedy and sampling.
- SAFETY-NET path (simulated old bnb, suffix-stripped to the FP16
  sibling, load_in_4bit=False): loads in bf16, generation returns
  "Paris." for greedy and sampling.

Net diff is ~45 lines smaller than the pre-revert state because the
entire plain-transformers 16-bit branch is gone.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa9fbe6035
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
        # inside Unsloth's loader -- the only real fix there is
        # `unsloth studio update` to pick up bnb >= 0.50.
        _load_path = config.path
        if not _bnb_rocm_4bit_ok() and load_in_4bit:
Apply ROCm fallback even when 4-bit is disabled
The ROCm safety path is gated by `not _bnb_rocm_4bit_ok() and load_in_4bit`, so it does not run for requests that already set load_in_4bit=False. That is a problem for pre-quantized *-bnb-4bit model paths, because those configs can still route through bitsandbytes despite load_in_4bit=False; with bnb <0.50 this reintroduces the broken ROCm decode path (NaNs/crashes) instead of the intended 16-bit fallback. This regression is introduced here because the previous ROCm loader path always resolved away pre-quantized suffixes before loading.
load_model() can be called many times in a single session but the bnb version and hardware state cannot change at runtime, so memoise the check. First call is ~1.9 ms (dominated by the lazy `import bitsandbytes` inside the try block), subsequent calls drop to sub-microsecond dict lookups. Zero behavioral change.
Comment-only cleanup across install.sh, studio/install_python_stack.py, and studio/backend/core/inference/inference.py. No behavioral change.
for more information, see https://pre-commit.ci
Studio's ROCm support is brand new (PR #4720, merged today) and every fresh install pulls the bnb continuous-release_main wheel via install.sh / install_python_stack.py in this same PR. There are no existing ROCm Studio installs carrying bnb < 0.50, so the defensive version-check fallback is guarding against a scenario that cannot actually occur. Delete the helper, the functools import, and the safety-net block -- inference.py now calls FastLanguageModel.from_pretrained directly with no ROCm branching.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 63d296cd58
        dtype = dtype,
        load_in_4bit = load_in_4bit,
        device_map = device_map,
Keep ROCm loads out of broken 4-bit fallback path
The new ROCm installer paths explicitly fall back to bitsandbytes>=0.49.1 when the GitHub pre-release wheel is unreachable (offline/firewalled hosts), and those same code paths note that ROCm 4-bit decode is broken in that fallback. But load_model now always routes ROCm text models through FastLanguageModel.from_pretrained(..., load_in_4bit=load_in_4bit) with the default still True, so these fallback environments will continue to hit the known-broken 4-bit kernels (NaNs/crashes) during generation instead of degrading to the previous 16-bit-safe behavior. A runtime guard is needed here to force a non-4bit path when the fixed bnb build is not present.
…bnb fix

Vision inference was blocked by the same bnb 4-bit GEMV bug that
affected text inference (vision models use bnb 4-bit for the LM
backbone). With bnb 0.50+ pinned in install.sh /
install_python_stack.py, vision works end-to-end on MI300X:
Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit loaded in 4-bit via
FastVisionModel + for_inference returns a correct answer to a
multimodal prompt.

Audio (CSM) was never actually blocked by HIP — on this hardware CSM
loads and runs its backbone forward pass fine with bnb 0.50, then
fails during generate() with a transformers-level kwarg validation
mismatch in generation_csm.py (`backbone_last_hidden_state` rejected).
That's a pre-existing transformers/CSM integration bug that reproduces
identically on NVIDIA, so the ROCm-gated guard was never actually
protecting users from anything HIP-specific.

Remove the combined audio/vision guard and the now-unused _hw_module
import. Also restore the one-word "Can be" in an inline comment that
drifted during the earlier comment-shortening pass, so the inference.py
delta vs pre-#4720 is exactly the max_seq_length<=0 crash fix and
nothing else.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d5342be634
| f"yet supported on AMD ROCm. Use GGUF inference instead." | ||
| ) | ||
|
|
||
| if config.is_audio: |
Re-add ROCm gate before audio/vision Unsloth loading
This commit removes the ROCm guard that previously blocked config.is_audio/config.is_vision from entering Unsloth inference paths, so ROCm now proceeds into FastVisionModel/FastModel loaders without a ROCm-safe fallback. That is a regression because the prior behavior intentionally failed fast with a clear GGUF recommendation; now ROCm users can hit runtime HIP kernel failures (especially when the installer falls back to PyPI bitsandbytes, which this same commit marks as decode-broken) instead of receiving a deterministic actionable error.
/gemini review |
Code Review
This pull request enables Unsloth support on AMD ROCm by implementing a mechanism to install a specific bitsandbytes pre-release wheel that fixes 4-bit GEMV issues, with fallbacks to PyPI. It also removes previous ROCm-specific workarounds in the inference backend. Feedback suggests refactoring hardcoded URL strings in both the installation script and the Python stack installer to improve maintainability and reduce duplication.
| case "$_ARCH" in | ||
| x86_64|amd64) | ||
| _bnb_whl_url="https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl" | ||
| ;; | ||
| aarch64|arm64) | ||
| _bnb_whl_url="https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_aarch64.whl" | ||
| ;; | ||
| *) | ||
| _bnb_whl_url="" | ||
| ;; | ||
| esac |
To improve maintainability and reduce duplication, you could define a base URL for the wheel and append the architecture-specific part. This would make it easier to update the pinned version in the future.
| case "$_ARCH" in | |
| x86_64|amd64) | |
| _bnb_whl_url="https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl" | |
| ;; | |
| aarch64|arm64) | |
| _bnb_whl_url="https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_aarch64.whl" | |
| ;; | |
| *) | |
| _bnb_whl_url="" | |
| ;; | |
| esac | |
| _bnb_base_url="https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24" | |
| case "$_ARCH" in | |
| x86_64|amd64) | |
| _bnb_whl_url="${_bnb_base_url}_x86_64.whl" | |
| ;; | |
| aarch64|arm64) | |
| _bnb_whl_url="${_bnb_base_url}_aarch64.whl" | |
| ;; | |
| *) | |
| _bnb_whl_url="" | |
| ;; | |
| esac |
_BNB_ROCM_PRERELEASE_URLS: dict[str, str] = {
    "x86_64": (
        "https://github.com/bitsandbytes-foundation/bitsandbytes/releases/"
        "download/continuous-release_main/"
        "bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl"
    ),
    "aarch64": (
        "https://github.com/bitsandbytes-foundation/bitsandbytes/releases/"
        "download/continuous-release_main/"
        "bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_aarch64.whl"
    ),
}
To improve maintainability, you could define the base URL and wheel filename as constants. This avoids repeating the long URL string and makes future version updates easier.
_BNB_BASE_URL = (
    "https://github.com/bitsandbytes-foundation/bitsandbytes/releases/"
    "download/continuous-release_main"
)
_BNB_WHEEL_TEMPLATE = "bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_{arch}.whl"
_BNB_ROCM_PRERELEASE_URLS: dict[str, str] = {
    "x86_64": f"{_BNB_BASE_URL}/{_BNB_WHEEL_TEMPLATE.format(arch='x86_64')}",
    "aarch64": f"{_BNB_BASE_URL}/{_BNB_WHEEL_TEMPLATE.format(arch='aarch64')}",
}
Summary

- Pin `bitsandbytes` on ROCm hosts to the `continuous-release_main` wheel from the upstream bnb GitHub release, which contains the CDNA/RDNA 4-bit GEMV fix in bnb PR #1887 (merged 2026-03-09, post-0.49.2).
- Fall back to PyPI `bitsandbytes>=0.49.1` when the pre-release URL is unreachable (offline installs, firewalled hosts, or architectures not covered by the pre-release wheels).
- Implemented in both `install.sh` and `studio/install_python_stack.py`; gated on the ROCm torch index + `platform.machine()`, so NVIDIA / CPU / Mac / Windows paths are untouched.

Why

`bitsandbytes` 0.49.2 on PyPI ships with a broken 4-bit GEMV kernel on every ROCm target:
- CDNA (`gfx90a`/`gfx942`/`gfx950` = MI210 / MI300X / MI350): broken blocksize=32/64 warp64 GEMV kernel. The corresponding tests were explicitly skipped with `ROCM_WARP_SIZE_64` guards in 0.49.2 because the code was known broken.
- RDNA3 / RDNA3.5 (`gfx1100`-`gfx1103`/`gfx1150`-`gfx1152`): compile-time `BNB_WARP_SIZE` macro in host-side dispatch resolves to 64 when the multi-arch wheel is compiled with CDNA as the primary target, so `num_blocks` is wrong on RDNA and half the GEMV output is never written.

At decode shape `(batch=1, seq_len=1, hidden)` both bugs produce NaN. Training is unaffected because training shapes are `(batch, seq_len > 1, hidden)` and never touch the GEMV path -- it's a GEMM at training shapes and works correctly.

The crash during autoregressive inference surfaces as `_assert_async_cuda_kernel` inside `torch.multinomial`, which on HIP becomes a hard `HSA_STATUS_ERROR_EXCEPTION` rather than a clean Python error. Greedy decode silently returns garbage (first token OK, subsequent tokens collapse to `argmax(NaN) = 0`, which decodes to `!`). Either way, inference is broken.

Both bugs are fixed by bnb commit `713a3b8` (PR #1887), which replaces the compile-time macro with a cached `hipDeviceGetAttribute(hipDeviceAttributeWarpSize)` runtime query and ships a working CDNA warp64 GEMV kernel. That commit has not shipped to PyPI yet; the continuous-release_main wheels are published on every push to bnb main via GitHub Releases.

Verification

On an MI300X VF (`gfx942`, ROCm 7.2, torch 2.10.0+rocm7.1):
- Direct bnb 4-bit `Linear4bit` shape test vs dequantized reference
- End-to-end Unsloth + 4-bit + `for_inference` + sampling (was previously crashing with `hipErrorLaunchFailure` on 0.49.2)

Platform safety

- Gated on `TORCH_INDEX_URL` matching `*/rocm*` (bash) and `rocm_torch_ready` (Python). Never executes on NVIDIA installs.
- When `_bnb_whl_url` is empty, falls through directly to the PyPI fallback with no attempted pre-release download.

Test plan

- `_bnb_rocm_prerelease_url()` across x86_64, amd64, aarch64, arm64, riscv64 (uppercase alias handled)
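The two-step flow the summary describes (pre-release wheel first, PyPI pin second) reduces to a few lines. In this sketch `try_install` is injected so the logic stays testable without touching the network; the function name and return values are illustrative, not the installer's actual code:

```python
from typing import Callable, Optional

def install_bnb_rocm(
    wheel_url: Optional[str],
    try_install: Callable[[str], bool],
) -> str:
    """Sketch: try the continuous-release_main wheel when the arch has
    one, then fall back to the PyPI floor bitsandbytes>=0.49.1."""
    if wheel_url and try_install(wheel_url):
        return "prerelease"
    if try_install("bitsandbytes>=0.49.1"):
        return "pypi"
    return "failed"  # callers treat this as a hard install error
```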