[ROCm] Windows workflow for creating wheels with ROCm 7.2.1 support #1915
|
Note on GPU architecture list: Limited the Windows build targets to
Why not TheRock preview releases: I decided not to include TheRock 7.11/7.12 preview releases from repo.amd.com/rocm/whl/ because the artifacts are split by architecture family (e.g.
@matthewdouglas Are there any potential CI issues in adding this Windows support for ROCm that I'm not aware of? |
Looks good, thanks! A couple quick questions:
|
I opted only for 7.2.1 because of the way ROCm wheels are tagged at https://repo.radeon.com/rocm/windows/ (this version names them with
For other versions I'm not sure; we can probably build using the HIP SDK, but that would be a completely different approach. This seemed easier, and I think the wheels are more recent (the latest version of the HIP SDK is 7.1, afaik). I can look into this and address it in a follow-up if needed.
Added |
|
Sticking to 7.2.1+ works for me, thanks! |
|
This is great! One thing I want to note is that bitsandbytes also works on GPUs other than those listed in:
bnb_rocm_arch="gfx1100;gfx1101;gfx1102;gfx1150;gfx1151;gfx1200;gfx1201"
It has been validated via my fork that it works on all ROCm (TheRock)-supported RDNA2, RDNA3, RDNA3.5, and RDNA4 GPUs on Windows. |
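For reference, here is a minimal sketch of how the semicolon-separated list quoted above could be checked against a given GPU target. The helper names `parse_arch_list` and `is_arch_covered` are hypothetical and not part of bitsandbytes. The sketch also assumes that a reported architecture string may carry feature suffixes after a colon. It only models membership in the list, not whether a kernel would actually run on the device.

```python
# Architecture list quoted in the comment above (semicolon-separated, CMake style).
BNB_ROCM_ARCH = "gfx1100;gfx1101;gfx1102;gfx1150;gfx1151;gfx1200;gfx1201"


def parse_arch_list(value: str) -> list[str]:
    """Split a semicolon-separated arch list, dropping empty entries."""
    return [arch.strip() for arch in value.split(";") if arch.strip()]


def is_arch_covered(gpu_arch: str, arch_list: str = BNB_ROCM_ARCH) -> bool:
    """Return True if the base architecture name appears in the list.

    Assumption: anything after the first ':' is a feature suffix
    (for example ':sramecc+') and is ignored for the comparison.
    """
    base = gpu_arch.split(":", 1)[0].strip().lower()
    return base in parse_arch_list(arch_list)


if __name__ == "__main__":
    # gfx1103 is the Radeon 780M used for local testing in this PR.
    for arch in ("gfx1100", "gfx1103", "gfx1201"):
        print(arch, "covered" if is_arch_covered(arch) else "not covered")
```

Under these assumptions, gfx1103 is reported as not covered because it does not appear in the quoted list.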
|
I have no doubt that all of the TheRock-supported architectures work. The main issue is that TheRock artifacts are currently split by architecture family, which complicates generating a single wheel that supports all of the architectures. Once TheRock unifies them, we can address this in a follow-up. That's the main reason why I decided to stick with 7.2.1 and the architectures that are built by its PyTorch wheels. Are you saying that building all of the architectures works when only, for example, gfx1151-specific wheels are installed? I haven't explicitly tested this. |
I’ve built a single upstream wheel with TheRock, covering all RDNA architectures, and we’ve tested it on certain GPUs across RDNA2 through RDNA4; it works as expected. |
|
Let me do some testing, and if I don't run into any issues, I'm open to expanding the Windows support matrix to cover the TheRock 7.11 and 7.12 releases from https://repo.amd.com/rocm/whl/. It would be up to @matthewdouglas whether he agrees with adding this coverage. There are two considerations to keep in mind:
|
|
Yeah, based on my experience, installing TheRock ROCm (Windows) for one architecture allows bitsandbytes to be built for all others. |
Add Windows ROCm build support and documentation
Summary
CI Build
- .github/scripts/build-rocm.sh: Add Windows branch that installs ROCm SDK wheels from repo.radeon.com, expands the devel tarball via rocm-sdk init, and builds with Ninja + Clang from the SDK. Targets RDNA 3/3.5/4 consumer GPU architectures.
- .github/workflows/python-package.yml: Add windows-2025 + ROCm 7.2.1 to the build-rocm matrix via include (gated to 7.2.1 only). Add conditional MSVC setup step and Linux-only disk cleanup guard.
CMakeLists.txt fixes
- file(TO_CMAKE_PATH ...) when reading $ENV{ROCM_PATH} to convert Windows backslashes to forward slashes, preventing CMake escape errors in generated HIP compiler files.
- HIP_PLATFORM=amd explicitly because the pip SDK's hip-config.cmake cannot auto-detect it.
- RUNTIME_OUTPUT_DIRECTORY/LIBRARY_OUTPUT_DIRECTORY to all WIN32 builds (not just MSVC), so Clang/Ninja HIP builds place the DLL in bitsandbytes/.
Documentation
- docs/source/installation.mdx: Add Windows row to ROCm wheel table, add Windows compile-from-source tab with pip SDK workflow, update PyTorch ROCm links to cover Windows, add warning about GPU arch coverage in pre-built wheels.
- README.md: Add AMD GPU row to Windows platform support table.
- docs/source/errors.mdx: Add generic CUDA/ROCm version mismatch troubleshooting section, ROCm GPU arch mismatch section, and general diagnostics guidance.
Minor fixes
- bitsandbytes/diagnostics/cuda.py: Add amdhip64*.dll Windows pattern to HIP runtime library search.
- bitsandbytes/cuda_specs.py: Mark get_rocm_warpsize() as dead code (no callers in the codebase).
- tests/test_linear4bit.py: Guard torch.distributed.is_nccl_available() with getattr fallback so the FSDP test is cleanly skipped instead of crashing during collection on torch builds without distributed backends (e.g. Windows ROCm).
Testing
Verified locally on Windows 11 with AMD Radeon 780M (gfx1103):
- libbitsandbytes_rocm72.dll
- Linear8bitLt pass end-to-end
- python -m bitsandbytes sanity check fails with a segfault when using PyTorch wheels from repo.radeon.com/rocm/windows/rocm-rel-7.2.1/ due to missing gfx1103 support in that build, but works on other architectures
The 28 failures are pre-existing fp32 numerical precision threshold tests -- not regressions from this PR.
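The getattr guard mentioned under Minor fixes for tests/test_linear4bit.py could look roughly like the sketch below. This is not the code from the PR: the helper name `nccl_available` is hypothetical, and the stand-in objects built with `SimpleNamespace` only imitate a torch build with and without distributed backends so the example runs without PyTorch installed.

```python
from types import SimpleNamespace


def nccl_available(torch_module) -> bool:
    """Report NCCL availability without raising AttributeError.

    Falls back to False when `distributed` or `is_nccl_available`
    is missing, which is assumed to be the case on some torch builds.
    """
    dist = getattr(torch_module, "distributed", None)
    check = getattr(dist, "is_nccl_available", None)
    return bool(check()) if callable(check) else False


# Stand-ins for illustration only (not real torch modules).
full_build = SimpleNamespace(
    distributed=SimpleNamespace(is_nccl_available=lambda: True)
)
no_backends = SimpleNamespace(distributed=SimpleNamespace())
no_distributed = SimpleNamespace()

if __name__ == "__main__":
    print(nccl_available(full_build))
    print(nccl_available(no_backends))
    print(nccl_available(no_distributed))
```

In a test module, a helper like this could feed a `pytest.mark.skipif` condition, so the test is skipped at collection time instead of raising.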