AMD Support - Segmentation Fault #272
AMD GPUs are not supported at the moment. Maybe we could try to add support for them. |
This is probably related to the export variable, "export HSA_OVERRIDE_GFX_VERSION=10.3.0" not being set because Navi 23 is not officially supported at the moment. |
That would be fantastic if you could! I really wanted to experiment with it locally and I did try, but at most I can only use the Model Inference and do the first two steps of training. Alternatively, is there a way to train it on the CPU, like with the other steps, if you aren't able to provide support for AMD? Apologies for displaying my ignorance; I'm not a coder and I only started dabbling with it 3 days ago, after seeing some impressive examples of what it was able to do. Regardless, I wanted to say: you're all doing an incredible job, keep up the amazing work! :-) |
Have you tried installing PyTorch 2.1.0 + ROCm 5.5? |
To make sure I got this right, I uninstalled and reinstalled everything in the virtual env:

```shell
pip freeze | xargs pip uninstall -y
pip install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.5
# Workaround for bug #1109
cat requirements-dml.txt | xargs -I _ pip install "_"
python infer-web.py
```

I got this output:

```
/home/nato/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
2023-08-30 16:43:28 | INFO | faiss.loader | Loading faiss with AVX2 support.
2023-08-30 16:43:28 | INFO | faiss.loader | Successfully loaded faiss with AVX2 support.
No supported Nvidia GPU found
use cpu instead
Use Language: en_US
Running on local URL: http://0.0.0.0:7865
```

System info:

```
OS: Pop!_OS 22.04 LTS x86_64
Host: MS-7D53 1.0
Kernel: 6.4.6-76060406-generic
CPU: AMD Ryzen 5 5600X (12) @ 3.700GHz
GPU: AMD ATI Radeon RX 6700 XT
Memory: 32007 MiB
```
|
So does RVC not support AMD :( was wondering why it wasn't exporting anymore. Shame :( |
You can use AMD cards on the Windows version if you install the DirectML (DML) version. |
That's not very useful. We want AMD support, not Windows vendor-locking... |
@NatoBoram |
To make sure I got this right, I uninstalled and reinstalled everything in the virtual env:

```shell
pip freeze | xargs pip uninstall -y
# Notice the version number change
pip install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6
# Workaround for bug #1109
cat requirements-dml.txt | xargs -I _ pip install "_"
python infer-web.py
```

I got this output:

```
/home/nato/.asdf/installs/python/3.10.13/lib/python3.10/site-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
2023-09-07 12:11:48 | INFO | faiss.loader | Loading faiss with AVX2 support.
2023-09-07 12:11:48 | INFO | faiss.loader | Successfully loaded faiss with AVX2 support.
2023-09-07 12:11:49 | INFO | configs.config | No supported Nvidia GPU found
2023-09-07 12:11:49 | INFO | configs.config | Use cpu instead
2023-09-07 12:11:49 | INFO | __main__ | Use Language: en_US
Running on local URL: http://0.0.0.0:7865
```

System info:

```
OS: Pop!_OS 22.04 LTS x86_64
Host: MS-7D53 1.0
Kernel: 6.4.6-76060406-generic
CPU: AMD Ryzen 5 5600X (12) @ 3.700GHz
GPU: AMD ATI Radeon RX 6700 XT
Memory: 32007 MiB
```

Running this on |
Don't mind the message; it's only for display. Internally it just attempts to match the device name against known NVIDIA generations and displays this if there's nothing it knows about. Reference here:
It should work if you give it index 0, with 0 being your only CUDA device available. Can ROCm users tell me if training works for them? It seemed to work initially, until I somehow managed to trigger a kernel panic. Now training is awfully slow. |
I made a pull request with some instructions on how to run RVC with ROCm. Training is running with ~40 sec per epoch on a RX6700XT (12GB) with a batch size of 16. |
Also encountering a segfault on Ubuntu 22.04 while doing any operation with PTH files. ONNX files work just fine, but conversion doesn't, so there are no models available for use until this is fixed. |
@Ecstatify Please keep me updated on this, I am experiencing the exact same issue. I already replaced the entire Python runtime, but had no luck with it. The last step is going full nuclear and doing it on a fresh Linux install. During slow training, can you also see one Python multiprocessing thread being maxed out? |
So, after much trial and error, I got this working. I will list the instructions below on how I got training working on an AMD RX 6750 XT on Ubuntu Desktop 22.04.3 LTS as per

Some notes: I downloaded the code directly, not a release version. It should work on a release version, but I downloaded the code and ran through the build instructions etc. A big thanks to this person on Reddit who wrote the original base instructions that I used and tweaked for RVC: Orion_light on Reddit. ALSO, I haven't tested rmvpe or rmvpe_gpu as I forgot to get the pretrains for them, but they should work. Side note: I believe rmvpe was having issues with audio longer than 3 minutes, at least for me.

Install Notes:
After booting back into Ubuntu, we will install ROCm and Pytorch.
Next we will build RVC V2 from source. This is pretty self-explanatory via the official docs, but I will retype them here, as there is some extra stuff for AMD on Linux.
Note about the interface: I had to use Harvest instead of rmvpe or rmvpe_gpu because I forgot to download that model. Also, for GPU indexes I put

At a batch size of 16 and training 300 epochs, I'm using 99% of my GPU as indicated by GPU%, and my temperature is around the low to mid 70s Celsius; I also do have some coil whine (reference AMD GPU). It also takes about 30-40 seconds per epoch. I hope this helped someone trying to set this up and train with their AMD GPU on Linux! |
Unfortunately, this didn't solve the problem, as this is basically how I
set it up in the first place, minus using the `python-is-python3`
metapackage instead of an alias, and using `apt` instead of `apt-get`,
because we are not in 2008 anymore and you may break things elsewhere on
your system by not doing these properly.
I've also just realized that these are the same instructions already provided, so they're already established to not work correctly for most.
|
Use

Use

In Ubuntu's default `.bashrc`:

```shell
# Alias definitions.
# You may want to put all your additions into a separate file like
# ~/.bash_aliases, instead of adding them here directly.
# See /usr/share/doc/bash-doc/examples in the bash-doc package.

if [ -f ~/.bash_aliases ]; then
    . ~/.bash_aliases
fi
```

This means you can put your aliases in `~/.bash_aliases`. |
I wrote this very late at night, so excuse the issues with certain commands. As stated in the original post, I wrote this in hopes it can help someone who wants a step-by-step guide, as well as hopefully help someone who is stuck. Side note: it's still training at around 100-185 watts depending on when you look at

Also, I did have the |
That's a completely different error from the one most people are having (the program just dies outright due to a segfault), and the steps above are the same steps everyone else had beforehand. So the question is: why does this work for you and not us? |
That, I'm unsure about, I did just finish training a model tonight and have shut down my Ubuntu install multiple times. I'm willing to provide any info I can and what you guys might need. Just let me know! |
Try running with

Edit: ParzivalWolfram, are you using the `HSA_OVERRIDE_GFX_VERSION` env variable? It is required, for reasons I could get into the details of. |
Yes.
|
Can confirm, no additional output. I will post the normal output I get. However, an interesting thing to note: when mine boots up, it detects the exact model of GPU I have; see here
|
I have a 7800 XT, so the strings may not be updated in ROCm yet as it's pretty new. I also forgot that I added some debug output of my own while tracking down a different problem, so if you're wondering what the extra debug line at the top is, that was my doing. |
Very interesting. I wonder if everyone having the segfault is on a newer AMD GPU (e.g. the RX 7000 series)? EDIT: OP has an RX 6000 series card, so it can't be that |
Not all the 6000/7000 series cards are on the same underlying chipset. You'd have to check the chipset on something like TechPowerUp's GPU database. I'd guess that's pretty likely, since per dmesg, it's dying in AMD's HIP libraries in particular for me. I only just noticed the log there. |
I'm gonna take a guess here, but you people might be using "outdated" ROCm installations. Mind sharing the distribution and the |
I'll share my working ROCm Device Libs tomorrow, but I did want to ask: after your kernel panic a few comments above, how did you fix your training speed? Mine seems to fluctuate a lot when training different models, referring to the wattage shown in |
|
I didn't really fix it; I started using one of the forks, the "Mangio" one. Still not sure what happened, but training speeds are fine on that. I tried profiling the code, but apparently it really was the training step that became slow. Might look into it some more at some point, if I ever get the time, because it's just that weird. |
The package does support your card... Get the PyTorch preview package for ROCm (which added support for the gfx1100 target, which is the 7900 XT, supposedly). Not sure on the situation for gfx1100+ cards, but having tried to override to gfx900 on a 10xx card, it didn't work, so I suppose you can't override across a major version and have it work. |
Interesting, I will give the Mangio one a try. As for my Device Libs, here they are and they work for me at least:
|
No dice. Libraries did not like that.
|
I came here from the overlapping circle of Stable Diffusion, and I've had the living nightmare of getting it to work on Linux with my AMD gear. The Linux installation of ROCm for my 7900 XTX needs the 'HSA....=11.0' line that you note there. There are also the versions of ROCm to consider: the 7900 XTX needs 5.6 (and 5.5), and lower cards need 5.4 as I recall (and a different HSA line). One of the issues I encountered on Linux was that it refused to add me to the usergroups; you should input the lines and reboot, then check you have actually been added. |
I got SD working on my old 5500xt pretty quickly (would not recommend, though, not enough VRAM at all), and with the same configs and software versions, it actually still works on my 7800xt. I know some distros make |
Any updates on this? Stable Diffusion and LLM training (like LLaMA and Mistral) work without a hitch with the usual ROCm PyTorch installations, but this just doesn't. PyTorch + ROCm 5.4.2 fails to see my 7900 XTX in the UI, then tries to use DML, then fails. PyTorch + ROCm 5.6 or 5.7 shows it in the UI, but fails.

This is how my console looks when trying to train. In the UI, all I did was change the experiment name and the training folder. Then I hit

Full log from console.
|
I installed the nightly build of PyTorch and training was so slow it took 3-9 minutes to reach 1 epoch. So I did a clean install of Artix Linux, created a Python 3.11 virtual environment, and installed the requirements-py311.txt file. I also disabled the iGPU in the BIOS. fairseq couldn't be installed, so I installed the .whl file from this link. I'm using the March 19, 2024 nightly build; here's a link for it. Edit: I've noticed I've been using Miniconda3 this whole time. Not sure if that helped. |
I got it to work on my Gentoo Linux with RX 7900 XT and ROCm 5.7.1 this way:
If you mess this step up, go back to step 1 and make sure you have a fresh environment.

That's it. I put it all in one little bash script so I don't have to type as much. However, before executing it you need to activate the minimamba environment, because that's not possible from scripts/subprocesses.

Then make it executable with |
System:
Manually installed PyTorch 2.0.1 for ROCm, then installed requirements from requirements.txt.
The webui boots up without problems, but when trying inference or training I get the following message:
Can we expect AMD support in the near future?