
Add experimental install guide for ROCm #1550

Open

xzuyn wants to merge 4 commits into axolotl-ai-cloud:main from xzuyn:rocm_readme

Conversation

@xzuyn
Contributor

@xzuyn xzuyn commented Apr 19, 2024

Description

This adds a guide on how to install Axolotl for ROCm users.

Currently you need to install the packages included in `pip install -e '.[deepspeed]'`, then uninstall torch, xformers, and bitsandbytes so that you can install the ROCm versions of torch and bitsandbytes. The process is definitely janky, since you install packages you don't want just to uninstall them afterwards.

Installing the ROCm version of torch first, to try to skip a step, results in Axolotl failing to install, so this order is necessary without changes to the requirements.txt or setup.py.
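A minimal sketch of that install order, run inside a venv. The ROCm wheel index URL and version here are illustrative assumptions, not from the PR; match them to your system:

```shell
# Sketch of the install-then-uninstall order described above.
# Index URL and ROCm version are assumptions; check what your system needs.

# 1. Install axolotl and its extras (this pulls in the CUDA builds we don't want)
pip install -e '.[deepspeed]'

# 2. Remove the CUDA-specific packages
pip uninstall -y torch xformers bitsandbytes

# 3. Reinstall torch from the ROCm wheel index
pip install torch --index-url https://download.pytorch.org/whl/rocm6.1

# 4. Install a ROCm-enabled bitsandbytes build separately (fork or rocm_enabled branch)
```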

Improvements could be made to this setup by preventing torch, bitsandbytes, and xformers from being installed at all, by modifying setup.py to include [amd] and [nvidia] options. That way we would skip the install-then-uninstall step done before installing the required packages.
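The extras idea could look roughly like this in setup.py. This is a sketch only; the package names, extra names, and the decision to leave `amd` empty are assumptions for illustration, not axolotl's actual setup.py:

```python
# Sketch: split GPU-stack dependencies into [nvidia]/[amd] extras so the base
# install pulls in neither CUDA nor ROCm builds. All names are illustrative.

base_deps = [
    "transformers",
    "peft",
]

extras = {
    # CUDA users opt in explicitly:
    "nvidia": ["torch", "bitsandbytes", "xformers"],
    # ROCm wheels come from a separate index, so this extra stays empty and
    # the guide tells AMD users which ROCm builds to install by hand:
    "amd": [],
    "deepspeed": ["deepspeed"],
}

# In setup.py this would be wired up roughly as:
# setup(..., install_requires=base_deps, extras_require=extras)
```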

Motivation and Context

I still see people on places like Reddit asking if it's possible yet to train AI stuff using AMD hardware. I want more people to know it's possible, although still experimental.

How has this been tested?

Using my personal system: Ubuntu 22.04.4, with ROCm 6.1 and an RX 7900 XTX. I've been using Axolotl (and other PyTorch-based AI tools like kohya_ss) this way for months on various versions of PyTorch and ROCm without major issues.

The only time I've had issues was when ROCm 6.0.2 released; after I upgraded, training only output a loss of 0. This might've just been an issue with how I upgraded.

Screenshots (if appropriate)

Types of changes

This only adds to the README.md file.

Social Handles (Optional)

@winglian
Collaborator

Thanks for this @xzuyn. Would it be helpful if I handled this in the Docker images for you? Do you use the Docker images?

@ehartford does this line up with the AMD work that you've been doing?

@ehartford
Contributor

I'm happy to test it

@xzuyn
Contributor Author

xzuyn commented Apr 19, 2024

would it be helpful if I handled this in the docker images for you? Do you use the docker images?

The only time I use Docker is with RunPod, but then I'm using an NVIDIA GPU. Even though the setup is a little janky, the ROCm setup is fairly straightforward for me to do in a venv.

`main` branch is `0.41.3.post1`, so using `rocm` branch brings us to `0.42.0`
@xzuyn
Contributor Author

xzuyn commented Apr 20, 2024

I updated the README to use the `rocm` branch instead of the `main` branch, as that seems to be newer.

Although arlo-phoenix now recommends using the official `rocm_enabled` branch of bitsandbytes instead of his fork. It's more up to date (26 commits behind at `0.44.0.dev0` vs. 140 commits behind at `0.42.0`), and it initially looks to install without issue, but when running Axolotl I get an error:

```
Could not find the bitsandbytes CUDA binary at PosixPath('/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_hip_nohipblaslt.so')
Could not load bitsandbytes native library: /media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 122, in <module>
    lib = get_native_library()
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 104, in get_native_library
    dll = ct.cdll.LoadLibrary(str(binary_path))
  File "/usr/lib/python3.10/ctypes/__init__.py", line 452, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: cannot open shared object file: No such file or directory
```

So for now this is the latest I can confirm works for me.
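The traceback above is about which native `libbitsandbytes*.so` files the installed package actually shipped. A quick probe to list them (my addition, not part of the PR; safe to run even when bitsandbytes is absent):

```python
import importlib.util
from pathlib import Path

def list_bnb_native_libs():
    """Return the libbitsandbytes*.so files the installed package ships,
    or an empty list if bitsandbytes is not installed."""
    spec = importlib.util.find_spec("bitsandbytes")
    if spec is None or spec.origin is None:
        return []
    pkg_dir = Path(spec.origin).parent
    return sorted(p.name for p in pkg_dir.glob("libbitsandbytes*"))

print(list_bnb_native_libs())
```

If the HIP build you expect is missing from this list, the install shipped the wrong (or no) native binary for your backend.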

@ehartford
Contributor

I go the other way around: I modify the requirements.txt so it doesn't install torch, bitsandbytes, xformers, flash attention, triton, deepspeed, etc., then I install those manually myself.
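That filter-first approach can be sketched as below. The excluded package list follows the comment above, and the toy requirements.txt only stands in for axolotl's real one:

```shell
# Toy requirements.txt standing in for axolotl's real one (illustration only).
printf 'torch==2.3.0\naccelerate\nxformers\npeft\ndeepspeed\n' > requirements.txt

# Strip the GPU-stack packages before installing anything.
grep -vE '^(torch|bitsandbytes|xformers|flash-attn|triton|deepspeed)' \
    requirements.txt > requirements-rocm.txt

cat requirements-rocm.txt
# then: pip install -r requirements-rocm.txt
# and install the ROCm builds of torch, bitsandbytes, etc. by hand
```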

@lizamd

lizamd commented May 2, 2024

Hi @xzuyn, thanks for the effort. I have been trying this on ROCm with no luck yet. Can we connect internally at AMD? Is it OK to put contact information on your profile? My email is liz.li@amd.com.

Without this you get `NameError: name 'amdsmi' is not defined`
@winglian
Collaborator

winglian commented Dec 5, 2024

Looks like it should work out of the box according to our docs: https://github.com/axolotl-ai-cloud/axolotl/blob/main/docs/amd_hpc.qmd

@hsmallbone

hsmallbone commented Jan 9, 2025

I have gotten Axolotl working with DeepSpeed on ROCm (MI250X). I can assist with the install guide: the guide currently in production doesn't mention the bitsandbytes requirement, and I believe flash-attn can now be pip installed directly. ROCm 6.1+ is essentially mandatory for most packages.

https://huggingface.co/docs/bitsandbytes/main/en/installation?platform=ROCm#installation

```shell
pip install --no-deps --force-reinstall 'https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_multi-backend-refactor/bitsandbytes-0.44.1.dev0-py3-none-manylinux_2_24_x86_64.whl'
```

@bursteratom
Contributor

@hsmallbone you mentioned that flash attention with ROCm support can now be installed with pip; can you point me to any blog post that confirms that? I was only able to find the official AMD blog post, which says you still have to install from source.
