[Bug]: Stop breaking backwards compatibility or at least warn #1386

Open
danielzgtg opened this issue Dec 22, 2023 · 5 comments
@danielzgtg

Describe the bug

rocBLAS 5.6 fails with a confusing error message when it is mixed with ROCm 6.0 shared libraries or loads the ROCm 6.0 TensileLibrary data files.

To Reproduce

Steps to reproduce the behavior:

  1. Install ROCm 6.0
  2. pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6
  3. Install https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6/tree/rocm
  4. Run https://www.llamaindex.ai/ or https://github.com/AUTOMATIC1111/stable-diffusion-webui
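(For reference, a quick way to check whether the wheel's bundled rocBLAS and the system ROCm 6.0 libraries are being mixed; the venv and /opt paths below are assumptions based on the setup above:)

  # ROCm libraries bundled inside the PyTorch rocm5.6 wheel (assumes a venv named "venv")
  ls venv/lib/python3.11/site-packages/torch/lib/libroc*.so*
  # the copies shipped with the system ROCm 6.0 install
  ls /opt/rocm-6.0.0/lib/librocblas*
  # which librocblas actually gets mapped when torch is imported
  LD_DEBUG=libs python3 -c "import torch" 2>&1 | grep -i rocblas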

Expected behavior

I should not have had to spend an hour debugging this, only to find the problem with gdb. rocBLAS 5.6 should either succeed or give a clear error message when it loads the TensileLibrary from rocBLAS 6.0, or when it is loaded alongside mixed-in ROCm shared libraries.
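For anyone else hitting the same abort, this is roughly how the failing library can be located with gdb today (standard gdb usage, nothing rocBLAS-specific):

  # run the script under gdb; at the (gdb) prompt type "run", and after the
  # SIGABRT type "bt" -- the backtrace should point into librocblas / the Tensile loader
  gdb --args python3 ./main.py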

Log-files

$ ./main.py
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/config.json HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/configuration_stablelm_epoch.py HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/configuration_stablelm_epoch.py HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/config.json HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/modeling_stablelm_epoch.py HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/modeling_stablelm_epoch.py HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/generation_config.json HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/generation_config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /BAAI/bge-small-en-v1.5/resolve/main/config.json HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /BAAI/bge-small-en-v1.5/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /BAAI/bge-small-en-v1.5/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
https://huggingface.co:443 "HEAD /BAAI/bge-small-en-v1.5/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
DEBUG:llama_index.readers.file.base:> [SimpleDirectoryReader] Total files added: 1
> [SimpleDirectoryReader] Total files added: 1
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: What I Worked On

February 2021

Before college...
> Adding chunk: What I Worked On

February 2021

Before college...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: I couldn't have put this into words when I was ...
> Adding chunk: I couldn't have put this into words when I was ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: So I looked around to see what I could salvage ...
> Adding chunk: So I looked around to see what I could salvage ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: I didn't want to drop out of grad school, but h...
> Adding chunk: I didn't want to drop out of grad school, but h...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: We actually had one of those little stoves, fed...
> Adding chunk: We actually had one of those little stoves, fed...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: But Interleaf still had a few years to live yet...
> Adding chunk: But Interleaf still had a few years to live yet...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Painting students were supposed to express them...
> Adding chunk: Painting students were supposed to express them...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Meanwhile I'd been hearing more and more about ...
> Adding chunk: Meanwhile I'd been hearing more and more about ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: In return for that and doing the initial legal ...
> Adding chunk: In return for that and doing the initial legal ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Which meant being easy to use and inexpensive. ...
> Adding chunk: Which meant being easy to use and inexpensive. ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Nor had I changed my grad student lifestyle sig...
> Adding chunk: Nor had I changed my grad student lifestyle sig...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Now when I walked past charming little restaura...
> Adding chunk: Now when I walked past charming little restaura...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: A lot of Lisp hackers dream of building a new L...
> Adding chunk: A lot of Lisp hackers dream of building a new L...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Over the next several years I wrote lots of ess...
> Adding chunk: Over the next several years I wrote lots of ess...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: So we just made what seemed like the obvious ch...
> Adding chunk: So we just made what seemed like the obvious ch...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: I don't think it was entirely luck that the fir...
> Adding chunk: I don't think it was entirely luck that the fir...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: YC was different from other kinds of work I've ...
> Adding chunk: YC was different from other kinds of work I've ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: For the rest of 2013 I left running YC more and...
> Adding chunk: For the rest of 2013 I left running YC more and...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Now they are, though. Now you could continue us...
> Adding chunk: Now they are, though. Now you could continue us...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Notes

[1] My experience skipped a step in the ...
> Adding chunk: Notes

[1] My experience skipped a step in the ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Startups had once been much more expensive to s...
> Adding chunk: Startups had once been much more expensive to s...

rocBLAS error: Could not load /opt/rocm-6.0.0/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat

rocBLAS error: Could not initialize Tensile library
Aborted (core dumped)

Environment

Hardware description
CPU: AMD Ryzen 9 5900X 12-Core Processor
GPU: AMD Radeon RX 6650 XT
Software version
rocm-core: 6.0.0.60000-91~22.04
rocblas: 4.0.0.60000-91~22.04

environment.txt

Workaround

Recompile pytorch manually. This ensures that it loads the ROCm shared libraries from /opt instead of the copies bundled in the venv.
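A rough sketch of that rebuild, assuming the upstream pytorch sources and a ROCm install at /opt/rocm (exact environment variables can differ between PyTorch releases):

  git clone --recursive https://github.com/pytorch/pytorch && cd pytorch
  # hipify the CUDA sources (script ships in the pytorch tree)
  python3 tools/amd_build/build_amd.py
  # build against the system ROCm instead of bundling the libraries
  USE_ROCM=1 ROCM_PATH=/opt/rocm python3 setup.py install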

@mahmoodw
Contributor

Hello @danielzgtg,

Thank you for flagging the need for clearer error messages with ROCm and library version mismatches. Your feedback is vital in refining our library's usability.

Our team will investigate and refine the error notifications to offer guidance for resolving library version disparities. Additionally, we'll clarify any backward compatibility restrictions to assist users in navigating version conflicts more effectively.

We'll keep you updated on our progress as we work to enhance the error messages. Your patience and any additional insights during this process are immensely valuable.

Wasiq

@rkamd
Contributor

rkamd commented Jan 2, 2024

@danielzgtg,
Thanks for reporting the issue. Do you see the Tensile library files in that path?
The output of the command find /opt/ -name "TensileLibrary_*.dat" would help us debug further.
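For completeness, checking the wheel's bundled copy alongside the system install may also be useful here (the venv path is taken from the report above):

  find /opt/ -name "TensileLibrary_*.dat"
  find venv/lib/python3.11/site-packages/torch -name "TensileLibrary_*.dat"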

@Trat8547

Trat8547 commented Jan 7, 2024

That explains it. I spent the last week troubleshooting why ROCm suddenly stopped working, and it turns out to be a backwards compatibility issue. Quite frustrating.

@rkamd
Contributor

rkamd commented Jan 15, 2024

@danielzgtg and @Trat8547,
We were able to run the sample rocBLAS program against both the ROCm 5.6 and ROCm 6.0 releases, and internally we have not received any backward-compatibility reports from the frameworks team either.

That said, API breaks are generally expected when the major version changes (we follow semantic versioning). Reviewing the release notes, we do see breaking changes in HIP, and the appropriate notification is published here.

Those changes could have contributed to the issue reported here.

amcamd transferred this issue from ROCm/rocBLAS Jan 16, 2024
amcamd transferred this issue from ROCm/ROCm Jan 16, 2024
@danielzgtg
Author

Here: TensorLibrary.txt. I think the TensileLibrary_*.dat files are fine, and the problem is with the (lack of) version detection in the code that reads them.

Your linked https://rocm.docs.amd.com/en/latest/about/release-notes.html#hip appears to list only API-breaking changes. My issue is about ABI-breaking changes.

The problem is that the PyTorch ROCm wheels bundle .so files that overlap with the system versions in /opt/. Perhaps deleting the libroc* files from venv/lib/python3.11/site-packages/torch/lib/ would force the correct versions (i.e. the system ones) to be used. In any case, my issues on the other AMD repo suggested fixing this unnecessary shared-library bundling in pytorch, but perhaps rocBLAS itself should also detect the problem. I think glibc handles this properly and refuses to let the application run if the wrong version is used.

That is why rebuilding pytorch worked around this problem. But I would rather not wait for the long pytorch compile every time, and I also don't want the prepackaged pytorch builds to contain libroc*.so files that not only inflate the download size to gigabytes but also cause version conflicts.
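For reference, the deletion experiment mentioned above would look roughly like this; whether the system copies are then picked up depends on the wheel's RPATH, so treat it as an experiment and keep a backup:

  cd venv/lib/python3.11/site-packages/torch/lib/
  mkdir -p ../lib-bundled-backup
  # move the bundled ROCm libraries aside rather than deleting them outright
  mv libroc*.so* ../lib-bundled-backup/
  # check that torch now resolves the system libraries from /opt
  LD_LIBRARY_PATH=/opt/rocm-6.0.0/lib python3 -c "import torch; print(torch.version.hip)"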
