AMD Support - Segmentation Fault #272
AMD GPUs are not supported at the moment. Maybe we could try to add support for them. |
This is probably related to the export variable, "export HSA_OVERRIDE_GFX_VERSION=10.3.0" not being set because Navi 23 is not officially supported at the moment. |
That would be fantastic if you could! I really wanted to experiment with it locally and I did try, but at most I can only use the Model Inference and do the first two steps of training. Alternatively, is there a way to train it on the CPU, like with the other steps, if you aren't able to provide support for AMD? Apologies for displaying my ignorance; I'm not a coder and I only started dabbling with it 3 days ago, after seeing some impressive examples of what it was able to do. Regardless, I wanted to say: you're all doing an incredible job, keep up the amazing work! :-) |
Have you tried installing PyTorch 2.1.0 + ROCm 5.5? |
To make sure I got this right, I uninstalled and reinstalled everything in the virtual env:

```shell
pip freeze | xargs pip uninstall -y
pip install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.5
# Workaround for bug #1109
cat requirements-dml.txt | xargs -I _ pip install "_"
python infer-web.py
```

I got this output:

```
/home/nato/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
2023-08-30 16:43:28 | INFO | faiss.loader | Loading faiss with AVX2 support.
2023-08-30 16:43:28 | INFO | faiss.loader | Successfully loaded faiss with AVX2 support.
No supported Nvidia GPU found
use cpu instead
Use Language: en_US
Running on local URL: http://0.0.0.0:7865
```

System info:

```
OS: Pop!_OS 22.04 LTS x86_64
Host: MS-7D53 1.0
Kernel: 6.4.6-76060406-generic
CPU: AMD Ryzen 5 5600X (12) @ 3.700GHz
GPU: AMD ATI Radeon RX 6700 XT
Memory: 32007 MiB
```
|
So does RVC not support AMD :( was wondering why it wasn't exporting anymore. Shame :( |
You can use AMD cards on the Windows version if you install the DirectML (DML) version. |
That's not very useful. We want AMD support, not Windows vendor-locking... |
@NatoBoram |
To make sure I got this right, I uninstalled and reinstalled everything in the virtual env:

```shell
pip freeze | xargs pip uninstall -y
# Notice the version number change
pip install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6
# Workaround for bug #1109
cat requirements-dml.txt | xargs -I _ pip install "_"
python infer-web.py
```

I got this output:

```
/home/nato/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
2023-09-07 12:11:48 | INFO | faiss.loader | Loading faiss with AVX2 support.
2023-09-07 12:11:48 | INFO | faiss.loader | Successfully loaded faiss with AVX2 support.
2023-09-07 12:11:49 | INFO | configs.config | No supported Nvidia GPU found
2023-09-07 12:11:49 | INFO | configs.config | Use cpu instead
2023-09-07 12:11:49 | INFO | __main__ | Use Language: en_US
Running on local URL: http://0.0.0.0:7865
```

System info:

```
OS: Pop!_OS 22.04 LTS x86_64
Host: MS-7D53 1.0
Kernel: 6.4.6-76060406-generic
CPU: AMD Ryzen 5 5600X (12) @ 3.700GHz
GPU: AMD ATI Radeon RX 6700 XT
Memory: 32007 MiB
```

Running this on |
Don't mind the message; it's only for display. Internally it just attempts to match the device name against known NVIDIA generations and displays this if there's nothing it knows about. Reference here:
It should work if you give it index 0, with 0 being your only CUDA device available. Can ROCm users tell me if training works for them? It seemed to work initially, until I somehow managed to trigger a kernel panic. Now training is awfully slow. |
I made a pull request with some instructions on how to run RVC with ROCm. Training is running with ~40 sec per epoch on a RX6700XT (12GB) with a batch size of 16. |
Also encountering a segfault on Ubuntu 22.04 while doing any operation with PTH files. ONNX files work just fine, but conversion doesn't, so there are no models available for use until this is fixed. |
@Ecstatify Please keep me updated on this, I am experiencing the exact same issue. I already replaced the entire Python runtime, but had no luck with it. The last step is going full nuclear and doing it on a fresh Linux install. During slow training, can you also see one Python multiprocessing thread being maxed out? |
So, after much trial and error, I got this working. I will list the instructions below on how I got training working on an AMD RX 6750 XT on Ubuntu Desktop 22.04.3 LTS as per

Some notes: I downloaded the code directly, not a release version. It should work on a release version, but I downloaded the code and ran through the build instructions etc. A big thanks to this person on Reddit who wrote the original base instructions that I used and tweaked for RVC: Orion_light on Reddit. ALSO, I haven't tested rmvpe or rmvpe_gpu as I forgot to get the pretrains for them, but they should work. Side note: I believe rmvpe was having issues with audio longer than 3 minutes, at least for me.

Install Notes:
After booting back into Ubuntu, we will install ROCm and Pytorch.
Next we will build RVC V2 from source. This is pretty self-explanatory via the official docs, but I will retype them here, as there is some extra stuff for AMD on Linux.
Note about the interface: I had to use Harvest instead of rmvpe or rmvpe_gpu because I forgot to download that model. Also, for GPU indexes I put

At a batch size of 16 and training 300 epochs, I'm using 99% of my GPU as indicated by GPU%, and my temperature is around the low to mid 70s Celsius; I also do have some coil whine (reference AMD GPU). It also takes about 30-40 seconds per epoch. I hope this helped someone trying to set this up and train with their AMD GPU on Linux! |
Unfortunately, this didn't solve the problem, as this is basically how I
set it up in the first place, minus using the `python-is-python3`
metapackage instead of an alias, and using `apt` instead of `apt-get`,
because we are not in 2008 anymore and you may break things elsewhere on
your system by not doing these properly.
I've also just realized that these are the same instructions already provided, so they're already established to not work correctly for most.
|
Use

Use

In Ubuntu's default `.bashrc`:

```shell
# Alias definitions.
# You may want to put all your additions into a separate file like
# ~/.bash_aliases, instead of adding them here directly.
# See /usr/share/doc/bash-doc/examples in the bash-doc package.

if [ -f ~/.bash_aliases ]; then
    . ~/.bash_aliases
fi
```

This means you can put your aliases in `~/.bash_aliases`. |
I wrote this very late at night, so excuse the issues with certain commands. As stated in the original post, I wrote this in hopes it can help someone who wants a step-by-step guide, as well as hopefully help someone who is stuck. Side note: it's still training at around 100-185 watts depending on when you look at

Also, I did have the |
That's a completely different error from the one most people are having (the program just dies outright due to a segfault), and the steps above are the same steps everyone else had beforehand. So the question is: why does this work for you and not us? |
That, I'm unsure about, I did just finish training a model tonight and have shut down my Ubuntu install multiple times. I'm willing to provide any info I can and what you guys might need. Just let me know! |
Try running with

Edit: ParzivalWolfram, are you using the `HSA_OVERRIDE_GFX_VERSION` env variable? It is required, for reasons I could get into the details of. |
Yes.
|
Can confirm, no additional output. I will post the normal output I get. However, an interesting thing to note: when mine boots up, it detects the exact model of GPU I have; see here
|
I have a 7800 XT, so the strings may not be updated in ROCm yet as it's pretty new. I also forgot that I added some debug output of my own while tracking down a different problem, so if you're wondering what the extra debug line at the top is, that was my doing. |
Very interesting. I wonder if everyone having the segfault is on a newer AMD GPU (e.g. the RX 7000 series)? EDIT: OP has an RX 6000 series card, so it can't be that |
Not all the 6000/7000 series cards are on the same underlying chipset. You'd have to check the chipset on something like TechPowerUp's GPU database. I'd guess that's pretty likely, since per dmesg, it's dying in AMD's HIP libraries in particular for me. I only just noticed the log there. |
I'm gonna take a guess here, but you people might be using "outdated" ROCm installations. Mind sharing the distribution and the |
I'll share my working ROCm Device Libs tomorrow, but I did want to ask: after your kernel panic a few comments above, how did you fix your training speed? Mine seems to fluctuate a lot when training different models, referring to the wattage shown in |
|
I didn't really fix it; I started using one of the forks, the "Mangio" one. Still not sure what happened, but training speeds are fine on that. I tried profiling the code, but apparently it really was the training step that became slow. Might look into it some more at some point, if I ever get the time, because it's just that weird. |
The package does support your card... Get the PyTorch preview package for ROCm (which added support for the gfx1100 target, which is the 7900 XT, supposedly). Not sure on the situation for gfx1100+ cards, but having tried to override to gfx900 on a 10xx card, it didn't work, so I suppose you can't override across a major version and have it work. |
Interesting, I will give the Mangio one a try. As for my Device Libs, here they are and they work for me at least:
|
No dice. Libraries did not like that.
|
I came here from the overlapping circle of Stable Diffusion, and I've had the living nightmare of getting it to work on Linux with my AMD gear. The Linux installation of ROCm for my 7900 XTX needs the 'HSA....=11.0' line that you note there. There are also the versions of ROCm to consider: the 7900 XTX needs 5.6 (and 5.5), and lower cards need 5.4 as I recall (and a different HSA line). One of the issues I encountered on Linux was that it refused to add me to the usergroups; you should input the lines and reboot, then check you have actually been added. |
I got SD working on my old 5500xt pretty quickly (would not recommend, though, not enough VRAM at all), and with the same configs and software versions, it actually still works on my 7800xt. I know some distros make |
Any updates on this? Stable Diffusion and LLM training (like LLaMA and Mistral) work without a hitch with the usual ROCm PyTorch installations, but this just doesn't. PyTorch + ROCm 5.4.2 fails to see my 7900 XTX in the UI, then tries to use DML, then fails. PyTorch + ROCm 5.6 or 5.7 shows it in the UI, but fails.

This is how my console looks when trying to train. In the UI, all I did was change the experiment name and the training folder. Then I hit

Full log from console.
|
I installed the nightly build of PyTorch and training was so slow it took 3-9 minutes to reach 1 epoch. So I did a clean install of Artix Linux, created a Python 3.11 virtual environment, and installed the requirements-py311.txt file. I also disabled the iGPU in the BIOS. fairseq couldn't be installed, so I installed the .whl file from this link. I'm using the March 19, 2024 nightly build; here's a link for it. Edit: I've noticed I've been using Miniconda3 this whole time. Not sure if that helped. |
I got it to work on my Gentoo Linux with RX 7900 XT and ROCm 5.7.1 this way:
If you mess this step up, go back to step 1 and make sure you have a fresh environment.

That's it. I put it all in one little bash script so I don't have to type as much. However, before executing it you need to activate the minimamba environment, because that's not possible from scripts/subprocesses.

Then make it executable with |
System:
Manually installed PyTorch 2.0.1 for ROCm, then installed requirements from requirements.txt.
The webui boots up without problems, but when trying inference or training I get the following message:
Can we expect AMD support in the near future?