Add experimental install guide for ROCm #1550
xzuyn wants to merge 4 commits into axolotl-ai-cloud:main
Conversation
Thanks for this @xzuyn. Would it be helpful if I handled this in the Docker images for you? Do you use the Docker images? @ehartford, does this line up with the AMD work that you've been doing?
I'm happy to test it |
The only time I use Docker is with RunPod, but then I'm using an NVIDIA GPU. Even though the setup is a little janky, the ROCm setup is fairly straightforward for me to do in a venv.
The `main` branch is `0.41.3.post1`, so using the `rocm` branch brings us to `0.42.0`.
I updated the readme to use the … Although arlo-phoenix now recommends using the official ROCm … So for now this is the latest I can confirm works for me.
I go the other way around |
Hi @xzuyn, thanks for the effort. I have been trying this on ROCm with no luck yet. Can we connect internally at AMD? Is it okay to put contact information on your profile? My email is: liz.li@amd.com
Without this you get `NameError: name 'amdsmi' is not defined`
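For context, that `NameError` happens when code references the `amdsmi` module on a machine where it isn't installed. A guarded import like the following (a generic sketch illustrating the kind of fix, not the actual patch) avoids the crash:

```python
# Guarded import: amdsmi ships with ROCm and is absent on most systems.
# Downstream code can branch on HAS_AMDSMI instead of hitting a NameError.
try:
    import amdsmi  # hypothetical usage; only present on ROCm installs
    HAS_AMDSMI = True
except ImportError:
    amdsmi = None
    HAS_AMDSMI = False

if not HAS_AMDSMI:
    print("amdsmi not available; skipping AMD GPU telemetry")
```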
Looks like it should work out of the box according to our docs: https://github.com/axolotl-ai-cloud/axolotl/blob/main/docs/amd_hpc.qmd
I have gotten Axolotl working with DeepSpeed on ROCm (MI250X). I can assist with the install guide (the guide currently in production doesn't mention the bitsandbytes requirement, and I believe flash-attn can now be directly pip installed). ROCm 6.1+ is essentially mandatory for most packages. https://huggingface.co/docs/bitsandbytes/main/en/installation?platform=ROCm#installation
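A sketch of building bitsandbytes with ROCm support, based on the installation docs linked above (the branch name and CMake flag are taken from those docs and may have changed since):

```shell
# Build bitsandbytes against ROCm/HIP from source (sketch; check the
# linked docs for the current branch and supported ROCm versions).
git clone -b multi-backend-refactor https://github.com/bitsandbytes-foundation/bitsandbytes.git
cd bitsandbytes
cmake -DCOMPUTE_BACKEND=hip -S .
make
pip install .
```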
@hsmallbone you mentioned that flash attention with ROCm support can now be installed with pip; can you point me to any blog post that confirms that? I was only able to find the official AMD blog post, which says you still have to install from source.
Description
This adds a guide on how to install Axolotl for ROCm users.
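In outline, the install order the guide documents looks like this (the ROCm wheel index URL and version tag below are assumptions, not part of this PR; pick the ones matching your system):

```shell
# 1. Install Axolotl with its default (CUDA-oriented) dependencies first;
#    installing ROCm torch before this step makes the install fail.
pip install -e '.[deepspeed]'

# 2. Remove the packages that must be swapped for ROCm builds.
pip uninstall -y torch xformers bitsandbytes

# 3. Reinstall torch from a ROCm wheel index
#    (index URL / ROCm version here are an assumption).
pip install torch --index-url https://download.pytorch.org/whl/rocm6.1

# 4. Install a ROCm-capable bitsandbytes, e.g. per the bitsandbytes docs.
```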
Currently you need to install the packages included in `pip install -e '.[deepspeed]'`, then uninstall `torch`, `xformers`, and `bitsandbytes`, so you can then install the ROCm versions of `torch` and `bitsandbytes`. The process is definitely janky, since you install stuff you don't want just to uninstall it afterwards.

Installing the ROCm version of `torch` first to try to skip a step results in Axolotl failing to install, so this order is necessary without changes to the `readme.txt` or `setup.py`.

Improvements could be made to this setup by preventing `torch`, `bitsandbytes`, and `xformers` from being installed, by modifying `setup.py` to include `[amd]` and `[nvidia]` options. That way we would skip the install-then-uninstall step done before installing the required packages.

Motivation and Context
I still see people on places like Reddit asking if it's possible yet to train AI stuff using AMD hardware. I want more people to know it's possible, although still experimental.
How has this been tested?
Using my personal system: Ubuntu 22.04.4 and ROCm 6.1 with an RX 7900 XTX. I've been using Axolotl (and other PyTorch-based AI tools like kohya_ss) this way for months on various versions of PyTorch and ROCm without major issues.
The only time I've had issues was when ROCm 6.0.2 released and caused training to only output 0 loss after I upgraded. This might've just been an issue with how I upgraded.
Screenshots (if appropriate)
Types of changes
This only makes additions to the `README.md` file.

Social Handles (Optional)