[Bug]: rocBLAS fails tests badly in FP16 for distro packages #1350

Open
littlewu2508 opened this issue Aug 6, 2023 · 8 comments

@littlewu2508

Describe the bug

The distro build of rocBLAS-5.6.0 (compiled with upstream LLVM 16) fails many FP16-related tests. The failures occur on both MI210 and Radeon VII. Details are in the gzipped test logs:

MI210-test.log.gz
RadeonVII-test.log.gz

The build log is also appended:
MI210-build.log.gz
RadeonVII-build.log.gz

@rkamd
Contributor

rkamd commented Aug 7, 2023

@littlewu2508 ,
Could you add the missing information, such as the build log and environment.txt, so that we can investigate the issue further?
Please refer to the Bug template here

@littlewu2508
Author

littlewu2508 commented Aug 9, 2023

To Reproduce

This result comes from running src_test in the Gentoo sci-libs/rocBLAS-5.6.0 package. The package currently lives in this test branch.

On a Gentoo system, you can replace the default repo with this experimental branch, then build and test rocBLAS:

cd /var/db/repos
mv gentoo{,.bak}
git clone -b rocm-5.6 https://github.com/littlewu2508/gentoo.git 
echo 'ACCEPT_KEYWORDS="~amd64"' > /etc/portage/make.conf
mkdir -p /etc/portage/env /etc/portage/package.use
echo 'FEATURES=test' > /etc/portage/env/test.conf
echo 'sci-libs/rocBLAS test.conf' >> /etc/portage/package.env
emerge "=sci-libs/rocBLAS-5.6.0"

Expected behavior

All tests pass.

Log-files

The complete build and test logs are:
MI210-test.log.gz
MI210-build.log.gz
RadeonVII-build.log.gz
RadeonVII-test.log.gz

Environment

There are two environments:

MI210

| Hardware | Description |
| --- | --- |
| CPU | AMD EPYC 7763 |
| GPU | AMD Instinct MI210 |

| Software | Version |
| --- | --- |
| kernel | Debian 6.1.27-1 (2023-05-08) x86_64 |
| llvm/clang | Gentoo 16.0.6 |
| rocm-core | Gentoo rocm-5.6.0 |
| rocblas | Gentoo rocm-5.6.0 |

MI210-environment.txt

Radeon VII

| Hardware | Description |
| --- | --- |
| CPU | AMD Ryzen 7 5800X |
| GPU | AMD Radeon VII |

| Software | Version |
| --- | --- |
| kernel | Linux 6.3.2 |
| llvm/clang | Gentoo 16.0.6 |
| rocm-core | Gentoo rocm-5.6.0 |
| rocblas | Gentoo rocm-5.6.0 |

RadeonVII-environment.txt

@rkamd rkamd self-assigned this Aug 14, 2023
@rkamd
Contributor

rkamd commented Aug 25, 2023

@littlewu2508 ,
I tried to follow the steps you provided to reproduce the issue in a Gentoo environment, but I was unable to compile rocBLAS because of the following error:
(masked by: ~amd64 keyword)

I tried some steps to unmask it, but had no luck. I am not very familiar with the Gentoo environment. Any pointers on how to proceed?

I was not able to reproduce this issue using ROCm 5.6 on Ubuntu.

@littlewu2508
Author

> @littlewu2508, I tried to follow the steps you provided to reproduce the issue in a Gentoo environment, but I was unable to compile rocBLAS because of the following error: (masked by: ~amd64 keyword)

Sorry, I made a mistake in the reproduction steps. Try adding ACCEPT_KEYWORDS="amd64" to echo 'ACCEPT_KEYWORDS="~amd64"' > /etc/portage/make.conf

> I tried some steps to unmask it, but had no luck. I am not very familiar with the Gentoo environment. Any pointers on how to proceed?

> I was not able to reproduce this issue using ROCm 5.6 on Ubuntu.

If you're using the official ROCm stack shipped by repo.radeon.com with an upstream kernel installed, you shouldn't encounter this issue. I could not reproduce it either on Debian 12 with the .deb packages from repo.radeon.com installed. So I guess it is Gentoo's use of upstream LLVM that causes all the discrepancies.

@rkamd
Contributor

rkamd commented Aug 28, 2023

@littlewu2508,
Thanks for the updated steps, I will try to reproduce. I had a discussion internally with the ROCm team, and we are guessing that an ABI mismatch could be causing the half-precision tests to fail.

Would you be able to try some of the suggestions from the ROCm team in rocFFT issue #439?

To reproduce the error, you could use the sample program provided here in the Gentoo environment.

And maybe you could try this suggestion to verify whether it resolves the issue.

@littlewu2508
Author

> @littlewu2508, Thanks for the updated steps, I will try to reproduce. I had a discussion internally with the ROCm team, and we are guessing that an ABI mismatch could be causing the half-precision tests to fail.

> Would you be able to try some of the suggestions from the ROCm team in rocFFT issue #439?

> To reproduce the error, you could use the sample program provided here in the Gentoo environment.

> And maybe you could try this suggestion to verify whether it resolves the issue.

Thank you very much for these suggestions. I have also reproduced the float16.cpp issue: only -O3 generates sensible output. I will keep tracking ROCm/rocFFT#439.

@littlewu2508
Author

> @littlewu2508, the Fedora fix for half precision is below: https://src.fedoraproject.org/fork/tstellar/rpms/compiler-rt/blob/0459cbc5d9eb15f1ad51d74707b4988049183708/f/0001-compiler-rt-Fix-FLOAT16-feature-detection.patch

Thank you! Is this patch submitted to llvm-project upstream?

thesamesam pushed a commit to llvm/llvm-project that referenced this issue Jan 24, 2024
CMAKE_TRY_COMPILE_TARGET_TYPE defaults to EXECUTABLE, which causes
any feature detection code snippet without a main function to fail,
so we need to make sure it gets explicitly set to STATIC_LIBRARY.

Bug: ROCm/rocFFT#439
Bug: ROCm/rocBLAS#1350
Bug: https://bugs.gentoo.org/916069
Closes: #69842

Reviewed by: thesamesam, mgorny