[Feature]: about 7900xtx benchmark. #1380
Comments
Hello @Axl-zhang,

Thank you for bringing the performance concern regarding our recently supported architecture to our attention. Your feedback is instrumental in ensuring optimal functionality across all platforms. To address this matter, I'll be promptly directing this issue to our specialized team for thorough investigation and analysis.

Sometimes, certain transpose configurations, such as NT, undergo specific tuning efforts to enhance performance on varying architectures. In the interim, I kindly suggest exploring different transpose configurations and comparing their impact on performance. Trying out alternatives like NT could potentially reveal improvements. Please feel free to report back on any findings or improvements observed with these adjustments.

Your assistance in this matter is highly appreciated. Thank you for your patience and cooperation as we work to resolve this issue.

Best regards,

Assigning to Carson for gfx1100 fp32 Tensile tuning.
The headline 61 TFLOPS spec of RDNA3 is kind of a lie. To achieve it, RDNA3 adds a limited FP32 dual-issue capability over RDNA2. To use the dual-issue capability, some stars have to align though:
TL;DR: except for very specific circumstances, 30-ish TFLOPS is the max you can expect out of RDNA3.
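For reference, here is a rough sketch of where the 61 and 30-ish figures come from. The hardware numbers are assumptions taken from the public 7900 XTX spec sheet, not from this thread: 6144 stream processors (96 CUs × 64 lanes) and a boost clock of about 2.5 GHz. The function name is made up for illustration.

```python
# Back-of-the-envelope peak FP32 throughput for a 7900 XTX.
# ASSUMPTIONS (public spec sheet, not stated in this issue):
#   6144 stream processors (96 CUs x 64 lanes), ~2.5 GHz boost clock.

def peak_fp32_tflops(stream_processors: int, clock_ghz: float, dual_issue: bool) -> float:
    """Peak TFLOPS = lanes * 2 FLOPs per FMA * issue width * clock (GHz) / 1000."""
    flops_per_fma = 2                      # one multiply + one add
    issue_width = 2 if dual_issue else 1   # RDNA3 dual issue doubles the paper number
    return stream_processors * flops_per_fma * issue_width * clock_ghz / 1000.0

STREAM_PROCESSORS = 6144  # assumed
BOOST_CLOCK_GHZ = 2.5     # assumed

with_dual_issue = peak_fp32_tflops(STREAM_PROCESSORS, BOOST_CLOCK_GHZ, dual_issue=True)
without_dual_issue = peak_fp32_tflops(STREAM_PROCESSORS, BOOST_CLOCK_GHZ, dual_issue=False)

print(f"peak with dual issue:    {with_dual_issue:.2f} TFLOPS")
print(f"peak without dual issue: {without_dual_issue:.2f} TFLOPS")
```

Under these assumptions the dual-issue figure lands at about 61 TFLOPS and the single-issue figure at about half of that, which is the ceiling for code that cannot dual issue.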
Several blogs claim the clpeak test can achieve 80% of theoretical performance when using wave64, but Tensile on Navi3x currently sets the wavefront size to 32 only, which achieves half of theoretical performance.
In wave64 mode the hardware can dual-issue halves of the wave. This doesn't help with operations that cannot be dual-issued at all though, which, if I understand the ISA documentation correctly, is the case in GEMM.
Suggestion Description
The benchmark is shown below. FP32 performance is very poor: the theoretical peak is 61 TFLOPS, but the actual result is 28 TFLOPS, less than half of theoretical.
The RTX 4090 reaches 74 TFLOPS in FP32 (theoretical 81 TFLOPS). Is there any room for further improvement, or are there any suggestions for optimization?
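To make the comparison concrete, a minimal sketch of the efficiency arithmetic using only the figures quoted above. The helper names are hypothetical, and `gemm_tflops` assumes the usual 2·M·N·K FLOP count for a GEMM, which is how a measured kernel time is normally converted to TFLOPS.

```python
# Efficiency = achieved TFLOPS / theoretical TFLOPS, using the numbers from this issue.
# Helper names are illustrative only; they are not part of rocBLAS or any other library.

def efficiency(achieved_tflops: float, peak_tflops: float) -> float:
    """Fraction of the theoretical peak that was actually reached."""
    return achieved_tflops / peak_tflops

def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Convert one GEMM's wall time to TFLOPS, assuming 2*M*N*K floating-point ops."""
    return 2.0 * m * n * k / seconds / 1e12

xtx_eff = efficiency(28.0, 61.0)      # 7900 XTX, FP32, as reported above
rtx_eff = efficiency(74.0, 81.0)      # RTX 4090, FP32, as reported above

print(f"7900 XTX: {xtx_eff:.1%} of theoretical")
print(f"RTX 4090: {rtx_eff:.1%} of theoretical")
```

That works out to roughly 46% of theoretical on the 7900 XTX versus roughly 91% on the RTX 4090.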
#####################
test platform:
#########################
FP16 benchmark:
#########################
FP32 benchmark:
rocminfo
Operating System
Ubuntu 22.04.3 LTS
GPU
7900xtx
ROCm Component
rocBLAS