Report your benchmark results here! #8
Comments
I'd love to add to the benchmarks list. I've got two questions:
Cheers! |
Hi ibonito1, OpenCL support on EPYC CPUs is a bit difficult as these are not officially supported by AMD. Being x86-64, they should work with the Intel OpenCL CPU Runtime though, or alternatively with POCL. Fingers crossed! Regards, |
C-Dub2022 thank you very much for the RX 580 benchmark! If you can post the FP16S and FP16C benchmarks as well, I'll add them to the readme! |
MarcoAurelioFerrari thank you! |
Hi dongwang22, thanks for the benchmark! For the visual interface, uncomment the corresponding define in src/defines.hpp and recompile. Regards, |
AMD 5700 XT (FP32/FP16S/FP16C benchmark result tables) |
Hi gittigittibangbang, thanks for the benchmarks! Efficiency is ~60% which is typical for the AMD GPUs. Performance is limited by VRAM bandwidth only, and the RX 6800 would presumably perform exactly the same. The benchmark setup is a 256³ box, that fills 1.5GB (FP32) or 0.9GB (FP16) of VRAM. The large infinity cache (128MB) is only an insignificant fraction of that so does not significantly boost performance. |
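The quoted VRAM figures can be reproduced from the per-cell memory layout. A minimal sketch, assuming a D3Q19 layout with 19 DDFs per cell plus density, 3 velocity components in FP32, and 1 flag byte (the 93 B/cell and 55 B/cell counts are inferred from the quoted totals, not taken from the source code):

```python
# Reproduce the VRAM figures for a 256^3 benchmark box.
# Assumption: 19 DDFs per cell (4 B each in FP32, 2 B in FP16S/FP16C),
# plus rho/ux/uy/uz always in FP32 and 1 flag byte per cell.
cells = 256**3  # 16,777,216 lattice cells

def vram_gib(ddf_bytes):
    per_cell = 19 * ddf_bytes + 4 * 4 + 1  # DDFs + FP32 fields + flag
    return cells * per_cell / 2**30

print(f"FP32: {vram_gib(4):.1f} GiB")   # 93 B/cell -> ~1.5 GiB
print(f"FP16: {vram_gib(2):.1f} GiB")   # 55 B/cell -> ~0.9 GiB
```

Either way, 256³ occupies only ~1 GB, so a 128 MB cache cannot hold a meaningful fraction of the working set.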
Vega 8 in R7 4750G (benchmark result tables) |
Is it possible to add a ready-to-run benchmark for macOS so we can get more results on Mac? Especially as the test is bandwidth-limited and Apple silicon should be good at this. Not to mention relatively cheap 64GB+ VRAM, as they share the same main memory. |
RTX 3060 Laptop GPU with 12700H on ASUS ROG M16, Turbo mode (120W GPU TDP) and external laptop fan:

```
PS C:\Software\FluidX3D> .\FluidX3D-Benchmark-FP32-FP16C-Windows.exe
PS C:\Software\FluidX3D> .\FluidX3D-Benchmark-FP32-FP16S-Windows.exe
```
|
@HAL9000COM thanks for the Vega 8 benchmarks! Quick question: is your RAM 2x16GB DDR4-3200MT/s? And do you have an idea why the GPU shows up 3 times? |
@edmond1992 unfortunately I don't have a Mac, so I can't compile and add the executables for MacOS. But the code should work as-is; just compile it with the third line in make.sh and you'll get the FP32 benchmark. Uncomment FP16S/FP16C in src/defines.hpp and recompile to get the other 2 benchmarks. |
Cross compile?
|
2x32GB DDR4-3200 OC to 3533. No idea why GPU shows up multiple times. After some reboot, it now shows up as two devices. |
GTX 1050 on an old gaming laptop. It's amazing I figured out how to even run this and get a benchmark. Now I'm going to try and figure out how to run the simulation on an STL (or similar) file. I know how to use Blender quite well, but this is my first time with Visual Studio or command line stuff. I'm so out of my depth here 😟 |
@ProjectPhysX have now added the FP16 benchmarks for the RTX 3080 Ti. Updated FP32 (was concurrently baking a fluid in Blender when I ran the last one): |
Hi @SirWixy, thank you so much for the benchmarks! Can you post the FP16S and FP16C results too? |
@gittigittibangbang thanks for the benchmarks! For the CPU you can just stop it with Ctrl+C after it has leveled at constant performance, and take the last MLUPs/s reading. Can you post the program header with the Xeon Gold for the specs, and performance values for FP16S and FP16C too for the Xeon? Thanks! |
FP32/FP32: 132 MLUPs/s, 20 GB/s bandwidth, 8 steps/s |
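Those numbers are internally consistent with LBM being bandwidth-bound. A back-of-envelope sketch, assuming D3Q19 and counting only DDF traffic (each step reads and writes all 19 DDFs per cell; other traffic such as flags is ignored here):

```python
# MLUPs/s follows directly from memory bandwidth for a bandwidth-bound LBM.
# Assumption: D3Q19, DDF traffic only (2 * 19 values per cell per step).
def mlups(bandwidth_gb_s, ddf_bytes):
    bytes_per_lup = 2 * 19 * ddf_bytes  # read + write of all 19 DDFs
    return bandwidth_gb_s * 1e9 / bytes_per_lup / 1e6

print(round(mlups(20, 4)))  # FP32 at 20 GB/s -> 132 MLUPs/s
```

20 GB/s divided by 152 bytes per lattice update gives exactly the reported 132 MLUPs/s.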
Hi @biergaizi, amazing that it worked! Not the first instance where PoCL beat Nvidia's own runtime. 🖖😛 Kind regards, |
This worked perfectly. It even fixed the performance problem of FP32/FP16S on the Nvidia compiler (PoCL has low performance probably because of a code generation problem). Now the performance in both cases is close to the Nvidia A100! The only exception is FP32/FP16C - the custom floating-point format probably either increased the arithmetic intensity beyond the FP32 non-FMA limit, or hit other restrictions. The CMP 170HX suddenly has its killer app now. FP32/FP32
FP32/FP16S
FP32/FP16C
Patch
Even better, the minimum change needed for this workaround is just a two-line patch:
|
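The FP32/FP16C observation above can be reasoned about with the roofline model: once the decoding arithmetic for a custom floating-point format pushes arithmetic intensity past the machine balance point, the kernel flips from bandwidth-bound to compute-bound. A generic sketch with made-up device numbers (not actual CMP 170HX specs):

```python
# Roofline model: attainable throughput = min(peak compute, intensity * bandwidth).
# PEAK and BW below are hypothetical, chosen only to illustrate the crossover.
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    return min(peak_gflops, flops_per_byte * bandwidth_gb_s)

PEAK, BW = 6000.0, 1500.0  # hypothetical GFLOPs/s and GB/s
balance = PEAK / BW        # 4 FLOPs/byte: the machine balance point
print(attainable_gflops(PEAK, BW, 2.0))  # below balance -> bandwidth-bound
print(attainable_gflops(PEAK, BW, 8.0))  # above balance -> compute-bound, capped at PEAK
```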
D3Q19 SRT (FP32/FP32)
D3Q19 SRT (FP32/FP16S)
D3Q19 SRT (FP32/FP16C)
|
Haven't seen results for the Nvidia A30. OpenCL Benchmark
FP32/FP16C
FP32/FP16S
FP32/FP32
|
Apple M1 Ultra 128GB
FP16S
FP16C
|
How is everyone doing the benchmarks for multi-GPU configurations? I'm playing around with MI25s and not seeing anywhere near what the specs would suggest I should. I'm wondering if I have a hardware bottleneck or if I missed something in the setup. |
@marcc1229 use the "2/4/8 GPUs" lines in the benchmark setup, and for memory use a value close to the VRAM capacity of one GPU.

The multi-GPU communication has some performance overhead, which shrinks relative to domain compute time the larger the resolution is. The highest possible resolution is the best performing and also the most interesting case for multi-GPU, as at lower resolution a single GPU would be sufficient. But performance at similarly large resolutions should not be too different. For the single-GPU benchmark the resolution should not matter at all, as long as it's sufficiently large for full hardware saturation. However, the older GCN/Vega GPUs can have vastly different performance for slightly different grid resolution / workgroup count, the cursed memory bandwidth anomaly which is a problem of the hardware architecture. Try some different large resolutions.

A potential bottleneck could be PCIe communication. If you have a server where each GPU is connected by PCIe 3.0 x16 or x8, this should not be an issue. But, for example, cheap crypto-mining hardware with these USB 3 / PCIe 3.0 x1 connections is problematic. |
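The point about overhead shrinking with resolution follows from surface-to-volume scaling: for a cubic N³ box split between GPUs along one axis, the halo layer exchanged per step grows as N², while the cells each GPU computes grow as N³, so the communication share falls as roughly 1/N. A rough sketch:

```python
# Halo-exchange cells vs. computed cells for an N^3 box split between GPUs
# along one axis. The ratio (a proxy for communication overhead) falls ~1/N.
def halo_fraction(n, gpus=2):
    halo_cells = n * n           # one N x N boundary layer per domain split
    domain_cells = n**3 // gpus  # cells each GPU computes per step
    return halo_cells / domain_cells

for n in (128, 256, 512):
    print(n, f"{halo_fraction(n):.3%}")  # fraction halves each time N doubles
```

This is only a cell-count proxy, not a timing model; the actual overhead also depends on PCIe latency and bandwidth.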
The small but apparently decently mighty original M1 (2020 MBP).
|
Howdy! Benchmark results below for the new Nvidia L40S being tested in the Marigold Systems Lab, requested from the /r/Nvidia subreddit. Nvidia L40S |
@dboswell-marigoldsystems thank you!! |
@lslowmotion for single-GPU, performance is mostly independent of grid size / memory occupation; use the default 256³. |
@ProjectPhysX yea. Also, to complete the ones above, here are single and dual 3090s in FP32/FP16C to add to the benchmark table. Hope these help! |
Hi @lslowmotion, today I realized that with an optimization in update v2.11, I accidentally stepped on a bug in Nvidia's OpenCL driver, which caused failure of memory allocation for larger simulations, including your benchmark runs at larger resolution. This is now fixed in the master branch! Large resolutions up to 2x ~23000 MB are now working again also with the FP16 types. Kind regards, |
|
Results on AMD Radeon RX 590 8GB (running on Rusticl, Mesa OpenCL 1.2). So if you want to use an open-source OpenCL implementation (Clover or Rusticl), use Clover until Rusticl becomes better. Clover by default is OpenCL 1.1 conformant, but you can export:
to use OpenCL 1.2 |
RoG Strix Laptop:
Interesting how my laptop 3080 Ti beats the other laptops' RTX 4080! |
Hi @gitcnd, thanks a lot! Can you please add the FP16S and FP16C benchmarks too? |
Sorry about that - here they are:
|
And just for giggles... (the slowest benchmark here so far :-)
|
@gitcnd Are both DIMM slots on the laptop populated for the Intel iGPU benchmark? If not, the results would be even slower... 😄 |
You are welcome to report your benchmark results for the FP32/FP16S/FP16C accuracy levels here.
Especially numbers for AMD GPUs are desired for GCN/RDNA/RDNA2 architectures.
Thank you!