
[ROCm] Windows workflow for creating wheels with ROCm 7.2.1 support #1915

Merged
matthewdouglas merged 2 commits into bitsandbytes-foundation:main from sstamenk:rocm_windows_workflow
Apr 8, 2026

sstamenk (Contributor) commented Apr 6, 2026

Add Windows ROCm build support and documentation

Summary

  • Add CI workflow support for building bitsandbytes with ROCm on Windows, using pip-installable ROCm SDK wheels
  • Update documentation to reflect Windows ROCm support across installation guide, README, and troubleshooting

CI Build

  • .github/scripts/build-rocm.sh: Add Windows branch that installs ROCm SDK wheels from repo.radeon.com, expands the devel tarball via rocm-sdk init, and builds with Ninja + Clang from the SDK. Targets RDNA 3/3.5/4 consumer GPU architectures.
  • .github/workflows/python-package.yml: Add windows-2025 + ROCm 7.2.1 to the build-rocm matrix via include (gated to 7.2.1 only). Add conditional MSVC setup step and Linux-only disk cleanup guard.

CMakeLists.txt fixes

  • Path normalization: Use file(TO_CMAKE_PATH ...) when reading $ENV{ROCM_PATH} to convert Windows backslashes to forward slashes, preventing CMake escape errors in generated HIP compiler files.
  • HIP_PLATFORM detection: Set HIP_PLATFORM=amd explicitly because the pip SDK's hip-config.cmake cannot auto-detect it.
  • DLL output directory: Extend RUNTIME_OUTPUT_DIRECTORY / LIBRARY_OUTPUT_DIRECTORY to all WIN32 builds (not just MSVC), so Clang/Ninja HIP builds place the DLL in bitsandbytes/.
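As a rough illustration of the path-normalization fix, the Python snippet below mimics what file(TO_CMAKE_PATH ...) does to a Windows ROCM_PATH: backslash separators become forward slashes, so a fragment such as \r or \h in the path can no longer be read as an escape sequence once the path is written into a quoted argument in a generated CMake file. This is a Python analogue for explanation only, not the CMake code from the PR, and the example path is hypothetical.

```python
from pathlib import PureWindowsPath


def to_cmake_path(native_path: str) -> str:
    """Rough Python analogue of CMake's file(TO_CMAKE_PATH ...) for a single
    Windows path: convert backslash separators to forward slashes."""
    return PureWindowsPath(native_path).as_posix()


# Hypothetical ROCM_PATH value. Pasted verbatim into a quoted CMake argument,
# "\r" would be read as a carriage return and "\h" as an invalid escape.
rocm_path = r"C:\rocm\hip"
print(to_cmake_path(rocm_path))  # C:/rocm/hip
```

CMake's real command also treats its input as a native path list (semicolon-separated on Windows), which this one-path sketch ignores.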

Documentation

  • docs/source/installation.mdx: Add Windows row to ROCm wheel table, add Windows compile-from-source tab with pip SDK workflow, update PyTorch ROCm links to cover Windows, add warning about GPU arch coverage in pre-built wheels.
  • README.md: Add AMD GPU row to Windows platform support table.
  • docs/source/errors.mdx: Add generic CUDA/ROCm version mismatch troubleshooting section, ROCm GPU arch mismatch section, and general diagnostics guidance.

Minor fixes

  • bitsandbytes/diagnostics/cuda.py: Add amdhip64*.dll Windows pattern to HIP runtime library search.
  • bitsandbytes/cuda_specs.py: Mark get_rocm_warpsize() as dead code (no callers in the codebase).
  • tests/test_linear4bit.py: Guard torch.distributed.is_nccl_available() with getattr fallback so the FSDP test is cleanly skipped instead of crashing during collection on torch builds without distributed backends (e.g. Windows ROCm).
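The two Python-side fixes above can be sketched roughly as follows. This is not the bitsandbytes source: the helper names find_hip_runtime and nccl_available are hypothetical, and the Linux libamdhip64.so* pattern is an assumption added only for contrast with the amdhip64*.dll Windows pattern described in the PR.

```python
import fnmatch
from types import SimpleNamespace

# Glob patterns for the HIP runtime library. The Windows entry mirrors the
# amdhip64*.dll pattern from the PR; the Linux entry is an assumed example.
HIP_RUNTIME_PATTERNS = ("libamdhip64.so*", "amdhip64*.dll")


def find_hip_runtime(filenames):
    """Return the file names that look like a HIP runtime library
    (matching is case-insensitive, as file names are on Windows)."""
    return [
        name
        for name in filenames
        if any(fnmatch.fnmatch(name.lower(), pat) for pat in HIP_RUNTIME_PATTERNS)
    ]


def nccl_available(dist_module) -> bool:
    """getattr-guarded NCCL check: torch builds without distributed backends
    may not define is_nccl_available at all, so a missing attribute is
    treated as "not available" instead of raising AttributeError."""
    check = getattr(dist_module, "is_nccl_available", None)
    return bool(check is not None and check())


print(find_hip_runtime(["amdhip64_7.dll", "kernel32.dll"]))  # ['amdhip64_7.dll']
print(nccl_available(SimpleNamespace()))  # False
```

In a test module, a guard like this would typically feed pytest.mark.skipif(not nccl_available(torch.distributed), reason=...), so the FSDP test is skipped at collection time rather than crashing.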

Testing

Verified locally on Windows 11 with AMD Radeon 780M (gfx1103):

  • CMake configures successfully with pip-installed ROCm SDK 7.2.1
  • HIP compilation produces libbitsandbytes_rocm72.dll
  • GPU quantization and Linear8bitLt pass end-to-end
  • python -m bitsandbytes sanity check fails with a segfault when using PyTorch wheels from repo.radeon.com/rocm/windows/rocm-rel-7.2.1/ due to missing gfx1103 support in that build, but works on other architectures
  • Full test suite run with TheRock 7.12 release from repo.amd.com/rocm/whl/gfx110X-all/ (which includes gfx1103 PyTorch support):
28 failed, 2628 passed, 160 skipped, 29 deselected, 30 xfailed, 110 warnings in 2100.71s (0:35:00)

The 28 failures are pre-existing fp32 numerical-precision threshold tests, not regressions from this PR.

sstamenk (Contributor, Author) commented Apr 6, 2026

Note on the GPU architecture list: I limited the Windows build targets to gfx1100;gfx1101;gfx1102;gfx1150;gfx1151;gfx1200;gfx1201 because those are the only architectures supported by the PyTorch wheels at repo.radeon.com/rocm/windows/rocm-rel-7.2.1/. Other architectures should work in theory; I've confirmed this on gfx1103 (Radeon 780M), where bitsandbytes compiles, loads, and runs quantization functions end-to-end. I can add other architectures to the list if we want to go that way.
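To see whether a given GPU is covered by the pre-built Windows wheels, one can compare its architecture name against the semicolon-separated target list. The sketch below is hypothetical (the helper names are not part of bitsandbytes). On ROCm builds of PyTorch, torch.cuda.get_device_properties(0).gcnArchName typically reports a string such as "gfx1103", sometimes with feature suffixes, which the helper strips before comparing.

```python
# Build targets quoted above for the Windows ROCm 7.2.1 wheels.
WINDOWS_ROCM_ARCHS = "gfx1100;gfx1101;gfx1102;gfx1150;gfx1151;gfx1200;gfx1201"


def parse_arch_list(arch_string: str) -> list[str]:
    """Split a semicolon-separated architecture list, ignoring empty entries."""
    return [arch.strip() for arch in arch_string.split(";") if arch.strip()]


def has_prebuilt_kernels(gpu_arch: str, arch_string: str = WINDOWS_ROCM_ARCHS) -> bool:
    """Return True if gpu_arch is one of the compiled targets.

    Feature suffixes such as ":sramecc+:xnack-" are dropped before comparing.
    """
    base_arch = gpu_arch.split(":")[0]
    return base_arch in parse_arch_list(arch_string)


print(has_prebuilt_kernels("gfx1151"))  # True
print(has_prebuilt_kernels("gfx1103"))  # False: needs a source build
```

A False result matches the gfx1103 case described above: the library still works there, but only when compiled from source with that architecture added to the target list.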

Why not TheRock preview releases: I decided not to include TheRock 7.11/7.12 preview releases from repo.amd.com/rocm/whl/ because the artifacts are split by architecture family (e.g. gfx110X-all/, gfx120X-all/), which doesn't map cleanly to a single install URL in the build script. There are plans upstream to move to unified builds, so we can update this when that lands.

@matthewdouglas Are there any potential CI issues in adding this Windows support for ROCm that I'm not aware of?

sstamenk force-pushed the rocm_windows_workflow branch from dcf3e92 to 9c1dff7 on April 6, 2026 16:12
matthewdouglas added this to the v0.50.0 milestone Apr 6, 2026
matthewdouglas added the Documentation, Windows, ROCm, and Build CI/CD labels Apr 6, 2026
github-actions (bot) commented Apr 6, 2026

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

matthewdouglas (Member) commented:

Looks good, thanks! A couple quick questions:

  • Is ROCm 7.2 the only version we want to build/support for Windows so far, or should there be any earlier ones?
  • On the Linux builds we set it to MinSizeRel and used --offload-compress to minimize size. Is this not supported for Windows?

sstamenk (Contributor, Author) commented Apr 7, 2026

"Is ROCm 7.2 the only version we want to build/support for Windows so far, or should there be any earlier ones?"

I opted for 7.2.1 only because of how the ROCm wheels are named at https://repo.radeon.com/rocm/windows/: this version names its tarball rocm-<rocm_version>.tar.gz, while 7.1.1 uses rocm-0.1-dev0.tar.gz. In theory we could also support 7.1.1; I would just have to update the logic to parse this correctly.

For other versions I'm not sure. We could probably build using the HIP SDK, but that would be a completely different approach. This one seemed easier, and I think the wheels are more recent (as far as I know, the latest HIP SDK version is 7.1). I can look into this and address it in a follow-up if needed.
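The naming difference can be made concrete with a small parser: a regex keyed on the version embedded in the tarball name recovers 7.2.1 but finds nothing in the 7.1.1 placeholder name, so extra logic would be needed there. The sketch below is hypothetical, not the build script's code; the fallback assumes the release directory follows the rocm-rel-<version>/ pattern seen for 7.2.1, which is an assumption for 7.1.1.

```python
import re
from typing import Optional

# rocm-<rocm_version>.tar.gz, as used by the 7.2.1 release.
_VERSIONED_TARBALL = re.compile(r"^rocm-(\d+\.\d+\.\d+)\.tar\.gz$")
# Assumed release-directory naming, e.g. rocm-rel-7.2.1/.
_RELEASE_DIR = re.compile(r"rocm-rel-(\d+\.\d+\.\d+)")


def rocm_version_from_tarball(name: str) -> Optional[str]:
    """Return the ROCm version embedded in the tarball name, or None."""
    match = _VERSIONED_TARBALL.match(name)
    return match.group(1) if match else None


def rocm_version(release_dir: str, tarball: str) -> Optional[str]:
    """Prefer the tarball name; fall back to the release directory when the
    tarball uses a placeholder such as rocm-0.1-dev0.tar.gz."""
    version = rocm_version_from_tarball(tarball)
    if version is None:
        match = _RELEASE_DIR.search(release_dir)
        version = match.group(1) if match else None
    return version


print(rocm_version_from_tarball("rocm-7.2.1.tar.gz"))  # 7.2.1
print(rocm_version_from_tarball("rocm-0.1-dev0.tar.gz"))  # None
```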

"On the Linux builds we set it to MinSizeRel and used --offload-compress to minimize size. Is this not supported for Windows?"

Added.
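For reference, here is a hedged Python sketch of how a configure command carrying the Linux size settings could be assembled. CMAKE_BUILD_TYPE=MinSizeRel, CMAKE_HIP_FLAGS and the Ninja generator are standard CMake, and --offload-compress is a Clang offload flag; the COMPUTE_BACKEND and BNB_ROCM_ARCH option names and the helper itself are assumptions, so check them against build-rocm.sh before relying on them.

```python
import shlex


def rocm_configure_command(rocm_archs: str, size_optimized: bool = True) -> list[str]:
    """Build a hypothetical CMake configure command for a HIP build.

    COMPUTE_BACKEND and BNB_ROCM_ARCH are assumed option names.
    """
    cmd = [
        "cmake", "-G", "Ninja",
        "-DCOMPUTE_BACKEND=hip",
        f"-DBNB_ROCM_ARCH={rocm_archs}",
    ]
    if size_optimized:
        # Optimize for size and compress the embedded device code objects.
        cmd += ["-DCMAKE_BUILD_TYPE=MinSizeRel", "-DCMAKE_HIP_FLAGS=--offload-compress"]
    else:
        cmd.append("-DCMAKE_BUILD_TYPE=Release")
    cmd.append(".")  # configure from the repository root
    return cmd


# shlex.join quotes the semicolon-separated arch list for a POSIX shell.
print(shlex.join(rocm_configure_command("gfx1100;gfx1151")))
```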

matthewdouglas (Member) commented:

Sticking to 7.2.1+ works for me, thanks!

matthewdouglas merged commit 250cdb3 into bitsandbytes-foundation:main Apr 8, 2026
92 checks passed
0xDELUXA commented:
This is great!

One thing I want to note is that bitsandbytes also works on GPUs other than those listed in:

bnb_rocm_arch="gfx1100;gfx1101;gfx1102;gfx1150;gfx1151;gfx1200;gfx1201"

It has been validated, via my fork, to work on all RDNA2, RDNA3, RDNA3.5, and RDNA4 GPUs supported by ROCm (TheRock) on Windows.

sstamenk (Contributor, Author) commented Apr 14, 2026

I have no doubt that all of TheRock's supported architectures work. The main issue is that TheRock artifacts are currently split by architecture family, which complicates generating a single wheel that supports all of the architectures. Once TheRock unifies them, we can address this in a follow-up. That's the main reason I decided to stick with 7.2.1 and the architectures its PyTorch wheels are built for.

Are you saying that building for all of the architectures works with only, for example, gfx1151-specific wheels installed? I haven't explicitly tested this.

0xDELUXA commented Apr 14, 2026

"Are you saying that building all of the architectures works when only having for example gfx1151 specific wheels installed? I haven't explicitly tested this."

I’ve built a single upstream wheel with TheRock covering all RDNA architectures, and we’ve tested it on several GPUs from RDNA2 through RDNA4; it works as expected.

sstamenk (Contributor, Author) commented:

Let me do some testing. If I don't run into any issues, I'm open to expanding the Windows support matrix to cover the TheRock 7.11 and 7.12 releases from https://repo.amd.com/rocm/whl/. Whether to add this coverage is up to @matthewdouglas.

There are two considerations to keep in mind:

0xDELUXA commented Apr 15, 2026

Yeah, based on my experience, installing TheRock ROCm (Windows) for one architecture allows bitsandbytes to be built for all others.
