
Add experimental install guide for ROCm #1550

Open

xzuyn wants to merge 4 commits into axolotl-ai-cloud:main from xzuyn:rocm_readme

Conversation

@xzuyn
Contributor

@xzuyn xzuyn commented Apr 19, 2024

Description

This adds a guide on how to install Axolotl for ROCm users.

Currently you need to install the packages included in `pip install -e '.[deepspeed]'`, then uninstall torch, xformers, and bitsandbytes so that you can install the ROCm versions of torch and bitsandbytes. The process is definitely janky, since you install packages you don't want just to uninstall them afterwards.

Installing the ROCm version of torch first, to try to skip a step, results in Axolotl failing to install, so this order is necessary without changes to the requirements.txt or setup.py.
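A minimal sketch of that install order, run inside a venv. The ROCm wheel index URL and version here are illustrative assumptions, not from the PR; match them to your system:

```shell
# Sketch of the install-then-uninstall order described above.
# Index URL and ROCm version are assumptions; check what your system needs.

# 1. Install axolotl and its extras (this pulls in the CUDA builds we don't want)
pip install -e '.[deepspeed]'

# 2. Remove the CUDA-specific packages
pip uninstall -y torch xformers bitsandbytes

# 3. Reinstall torch from the ROCm wheel index
pip install torch --index-url https://download.pytorch.org/whl/rocm6.1

# 4. Install a ROCm-enabled bitsandbytes build separately (fork or rocm_enabled branch)
```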

Improvements could be made to this setup by preventing torch, bitsandbytes, and xformers from being installed at all, by modifying setup.py to include [amd] and [nvidia] options. That way we would skip the install-then-uninstall step done before installing the required packages.
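The extras idea could look roughly like this in setup.py. This is a sketch only; the package names, extra names, and the decision to leave `amd` empty are assumptions for illustration, not axolotl's actual setup.py:

```python
# Sketch: split GPU-stack dependencies into [nvidia]/[amd] extras so the base
# install pulls in neither CUDA nor ROCm builds. All names are illustrative.

base_deps = [
    "transformers",
    "peft",
]

extras = {
    # CUDA users opt in explicitly:
    "nvidia": ["torch", "bitsandbytes", "xformers"],
    # ROCm wheels come from a separate index, so this extra stays empty and
    # the guide tells AMD users which ROCm builds to install by hand:
    "amd": [],
    "deepspeed": ["deepspeed"],
}

# In setup.py this would be wired up roughly as:
# setup(..., install_requires=base_deps, extras_require=extras)
```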

Motivation and Context

I still see people on places like Reddit asking if it's possible yet to train AI stuff using AMD hardware. I want more people to know it's possible, although still experimental.

How has this been tested?

Using my personal system: Ubuntu 22.04.4, with ROCm 6.1 and an RX 7900 XTX. I've been using Axolotl (and other PyTorch-based AI tools like kohya_ss) this way for months on various versions of PyTorch and ROCm without major issues.

The only time I've had issues was when ROCm 6.0.2 released; after I upgraded, training only output a loss of 0. This might've just been an issue with how I upgraded.

Screenshots (if appropriate)

Types of changes

This only adds to the README.md file.

Social Handles (Optional)

@winglian
Collaborator

Thanks for this @xzuyn. Would it be helpful if I handled this in the Docker images for you? Do you use the Docker images?

@ehartford does this line up with the AMD work that you've been doing?

@ehartford
Contributor

I'm happy to test it

@xzuyn
Contributor Author

xzuyn commented Apr 19, 2024

would it be helpful if I handled this in the docker images for you? Do you use the docker images?

The only time I use Docker is with RunPod, but then I'm using an NVIDIA GPU. Even though the setup is a little janky, the ROCm setup is fairly straightforward for me to do in a venv.

`main` branch is `0.41.3.post1`, so using `rocm` branch brings us to `0.42.0`
@xzuyn
Contributor Author

xzuyn commented Apr 20, 2024

I updated the README to use the `rocm` branch instead of the `main` branch, as that seems to be newer.

Although arlo-phoenix now recommends using the official `rocm_enabled` branch of bitsandbytes instead of his fork. It's more up to date (26 commits behind at `0.44.0.dev0` vs. 140 commits behind at `0.42.0`), and it initially looks to install without issue, but when running Axolotl I get an error:

```
Could not find the bitsandbytes CUDA binary at PosixPath('/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_hip_nohipblaslt.so')
Could not load bitsandbytes native library: /media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 122, in <module>
    lib = get_native_library()
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 104, in get_native_library
    dll = ct.cdll.LoadLibrary(str(binary_path))
  File "/usr/lib/python3.10/ctypes/__init__.py", line 452, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: cannot open shared object file: No such file or directory
```

So for now this is the latest I can confirm works for me.
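The traceback above is about which native `libbitsandbytes*.so` files the installed package actually shipped. A quick probe to list them (my addition, not part of the PR; safe to run even when bitsandbytes is absent):

```python
import importlib.util
from pathlib import Path

def list_bnb_native_libs():
    """Return the libbitsandbytes*.so files the installed package ships,
    or an empty list if bitsandbytes is not installed."""
    spec = importlib.util.find_spec("bitsandbytes")
    if spec is None or spec.origin is None:
        return []
    pkg_dir = Path(spec.origin).parent
    return sorted(p.name for p in pkg_dir.glob("libbitsandbytes*"))

print(list_bnb_native_libs())
```

If the HIP build you expect is missing from this list, the install shipped the wrong (or no) native binary for your backend.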

@ehartford
Contributor

I go the other way around: I modify the requirements.txt so it doesn't install torch, bitsandbytes, xformers, flash attention, triton, deepspeed, etc., then I install those manually myself.
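That filter-first approach can be sketched as below. The excluded package list follows the comment above, and the toy requirements.txt only stands in for axolotl's real one:

```shell
# Toy requirements.txt standing in for axolotl's real one (illustration only).
printf 'torch==2.3.0\naccelerate\nxformers\npeft\ndeepspeed\n' > requirements.txt

# Strip the GPU-stack packages before installing anything.
grep -vE '^(torch|bitsandbytes|xformers|flash-attn|triton|deepspeed)' \
    requirements.txt > requirements-rocm.txt

cat requirements-rocm.txt
# then: pip install -r requirements-rocm.txt
# and install the ROCm builds of torch, bitsandbytes, etc. by hand
```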

@lizamd

lizamd commented May 2, 2024

Hi @xzuyn, thanks for the effort. I have been trying this on ROCm with no luck yet. Can we connect internally at AMD? Is it OK to put contact information on your profile? My email is liz.li@amd.com.

Without this you get `NameError: name 'amdsmi' is not defined`
@winglian
Collaborator

winglian commented Dec 5, 2024

Looks like it should work out of the box according to our docs: https://github.com/axolotl-ai-cloud/axolotl/blob/main/docs/amd_hpc.qmd

@hsmallbone

hsmallbone commented Jan 9, 2025

I have gotten Axolotl working with DeepSpeed on ROCm (MI250X). I can assist with the install guide: the guide currently in production doesn't mention the bitsandbytes requirement, and I believe flash-attn can now be pip installed directly. ROCm 6.1+ is essentially mandatory for most packages.

https://huggingface.co/docs/bitsandbytes/main/en/installation?platform=ROCm#installation

```shell
pip install --no-deps --force-reinstall 'https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_multi-backend-refactor/bitsandbytes-0.44.1.dev0-py3-none-manylinux_2_24_x86_64.whl'
```

@bursteratom
Contributor

@hsmallbone you mentioned that flash attention with ROCm support can now be installed with pip; can you point me to any blog post that confirms that? I was only able to find the official AMD blog post, which says you still have to install from source.
