Report your benchmark results here! #8

ProjectPhysX · 2022-09-22T16:14:09Z

You are welcome to report your benchmark results for the FP32/FP16S/FP16C accuracy levels here.
Numbers for AMD GPUs with GCN/RDNA/RDNA2 architectures are especially welcome.
Thank you!

ibonito1 · 2022-09-23T15:56:57Z

I'd love to add to the benchmarks list. I've got two questions:

I want to benchmark a dual Epyc system (so specifically the CPUs actually). How would I do that (under Windows, but Linux would also be fine), if I have a GPU installed? It always automatically detects the GPU when running the benchmark “releases”.
How to post the benchmarks? Just copy the console output in here?

Cheers!

ProjectPhysX · 2022-09-24T11:25:20Z

Hi ibonito1,

OpenCL support on EPYC CPUs is a bit difficult as these are not officially supported by AMD. Being x86-64, they should work with the Intel OpenCL CPU Runtime though, or alternatively with POCL. Fingers crossed!
To run on a specific device, in the console run ./FluidX3D.exe 2 (on Linux) or FluidX3D.exe 2 (on Windows), to select device with ID 2 for example.
You can just copy the console output here.

Regards,
Moritz

C-Dub2022 · 2022-10-03T07:53:18Z

AMD Radeon RX 580:

ProjectPhysX · 2022-10-04T16:30:02Z

C-Dub2022 thank you very much for the RX 580 benchmark! If you can post the FP16S and FP16C benchmarks as well, I'll add them to the readme!

C-Dub2022 · 2022-10-04T20:11:30Z

Hopefully this is helpful. Let me know if there is anything else I can do.

MarcoAurelioFerrari · 2022-10-07T16:28:42Z

RTX 3060 12GB - v1.1

FP32-FP16C

FP32-FP16S

FP32-FP32

ProjectPhysX · 2022-10-08T06:56:35Z

MarcoAurelioFerrari thank you!

dongwang22 · 2022-10-14T07:35:59Z

Could you please tell me how to open the visual interface for the flow domain, as described in the readme file? You said that pressing 2 turns on the velocity field, but it does not work in the benchmark case. How can I generate pictures like the ones you present on Twitter?

ProjectPhysX · 2022-10-14T12:15:15Z

Hi dongwang22,

thanks for the benchmark! For the visual interface, uncomment #define WINDOWS_GRAPHICS and comment out #define BENCHMARK in src/defines.hpp, and uncomment for example the Taylor-Green setup in src/setup.cpp. Then compile and you should see the graphical interface, where you can toggle rendering modes with keys 1/2/3/4. To generate videos, see the other setups: basically write a C++ loop that repeatedly runs some LBM time steps and renders images with the corresponding methods of the LBM class.

Regards,
Moritz

fkay1 · 2022-10-17T10:02:20Z

AMD 5700 XT

|----------------.------------------------------------------------------------|
| Device ID 0 | gfx1010:xnack- |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | gfx1010:xnack- |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3444.0 (PAL,LC) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 20 at 1905 MHz (2560 cores, 9.754 TFLOPs/s) |
| Memory, Cache | 8176 MB, 16 KB global / 64 KB local |
| Buffer Limits | 6949 MB global, 7116390 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 1366 | 209 GB/s | 81 | 9996 60% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 1368 |

|----------------.------------------------------------------------------------|
| Device ID 0 | gfx1010:xnack- |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | gfx1010:xnack- |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3444.0 (PAL,LC) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 20 at 1905 MHz (2560 cores, 9.754 TFLOPs/s) |
| Memory, Cache | 8176 MB, 16 KB global / 64 KB local |
| Buffer Limits | 6949 MB global, 7116390 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 3253 | 250 GB/s | 194 | 9988 80% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 3253 |

|----------------.------------------------------------------------------------|
| Device ID 0 | gfx1010:xnack- |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | gfx1010:xnack- |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3444.0 (PAL,LC) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 20 at 1905 MHz (2560 cores, 9.754 TFLOPs/s) |
| Memory, Cache | 8176 MB, 16 KB global / 64 KB local |
| Buffer Limits | 6949 MB global, 7116390 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 3044 | 234 GB/s | 181 | 9992 20% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 3049 |

funlennysub · 2022-10-18T16:01:14Z

FP32/FP16C

FP32/FP16S

FP32/FP32

nicandris · 2022-10-18T16:15:34Z

RTX 2080 SUPER

gittigittibangbang · 2022-10-22T15:45:52Z

I tried a 6900XT, but the score is lower than anticipated. The max bandwidth seems to be limited to 300GB/s, although GPUZ says it's connected via PCIe 4.0 16x and should top out at 512GB/s. The GPU clock is at 2540MHz and the memory clock at 2000MHz. GPU and memory controller loads are at 100%.

With the 3D Taylor-Green model and FP32/FP16S, the MLUPs/s and the bandwidth go through the roof. I'll try some other models, too. FP32/FP32 goes up to 2400 MLUPs/s and 370GB/s, with FP32/FP16C it's 9000 MLUPs/s and 700GB/s.

ProjectPhysX · 2022-10-22T17:32:21Z

Hi gittigittibangbang, thanks for the benchmarks! Efficiency is ~60%, which is typical for AMD GPUs. Performance is limited by VRAM bandwidth only, and the RX 6800 would presumably perform exactly the same. The benchmark setup is a 256³ box that fills 1.5GB (FP32) or 0.9GB (FP16) of VRAM. The large Infinity Cache (128MB) is only a small fraction of that, so it does not significantly boost performance.
With a smaller 128³ box however, which only fills 186MB (FP32) or 76MB (FP16), almost the entire grid fits in the cache and effective bandwidth is much larger.
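
The Bandwidth and Memory Usage rows in the logs can be reproduced from the MLUPs/s reading and the grid size. A minimal sketch, assuming 153 bytes of memory traffic per cell update for FP32 and 77 bytes for FP16S/FP16C (factors inferred from the numbers posted in this thread, not taken from the FluidX3D source); all function names are hypothetical:

```python
# Bytes of VRAM traffic per lattice cell update, inferred from the posted logs
# (assumption: 19 DDFs read + 19 DDFs written per step, plus 1 flag byte).
BYTES_PER_UPDATE = {"FP32": 19 * 2 * 4 + 1, "FP16": 19 * 2 * 2 + 1}  # 153 and 77

# Bytes of VRAM allocated per cell (assumption: 19 DDFs, plus density and
# 3 velocity components in FP32, plus 1 flag byte = 17 extra bytes).
BYTES_PER_CELL = {"FP32": 19 * 4 + 17, "FP16": 19 * 2 + 17}  # 93 and 55


def bandwidth_gb_s(mlups: float, precision: str) -> float:
    """Effective memory bandwidth in GB/s for a given MLUPs/s reading."""
    return mlups * 1e6 * BYTES_PER_UPDATE[precision] / 1e9


def vram_mib(n: int, precision: str) -> float:
    """VRAM footprint in MiB of an n x n x n grid."""
    return n**3 * BYTES_PER_CELL[precision] / 2**20


def efficiency(measured_gb_s: float, spec_gb_s: float) -> float:
    """Fraction of the data-sheet memory bandwidth actually reached."""
    return measured_gb_s / spec_gb_s


if __name__ == "__main__":
    # RX 5700 XT readings posted earlier in this thread
    print(round(bandwidth_gb_s(1366, "FP32")), "GB/s (FP32/FP32)")
    print(round(bandwidth_gb_s(3253, "FP16")), "GB/s (FP32/FP16S)")
    # 256^3 benchmark box, compare with the "Memory Usage" rows
    print(vram_mib(256, "FP32"), "MiB,", vram_mib(256, "FP16"), "MiB")
    # RX 6900 XT: ~300 GB/s measured against the 512 GB/s quoted above
    print(f"{efficiency(300, 512):.0%}")
```

With these assumptions the sketch agrees with the posted logs (209 GB/s and 250 GB/s for the RX 5700 XT, 1488 MB and 880 MB for the 256³ box), which suggests the byte counts are at least consistent with what the benchmark reports.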

HAL9000COM · 2022-10-22T18:51:53Z

Vega 8 in R7 4750G
|----------------.------------------------------------------------------------|
| Device ID 0 | gfx90c |
| Device ID 1 | gfx90c |
| Device ID 2 | gfx90c |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | gfx90c |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3380.6 (PAL,HSAIL) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 8 at 2100 MHz (512 cores, 2.150 TFLOPs/s) |
| Memory, Cache | 26899 MB, 16 KB global / 32 KB local |
| Buffer Limits | 19382 MB global, 19847731 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 246 | 38 GB/s | 15 | 9999 90% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 263 |

|----------------.------------------------------------------------------------|
| Device ID 0 | gfx90c |
| Device ID 1 | gfx90c |
| Device ID 2 | gfx90c |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | gfx90c |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3380.6 (PAL,HSAIL) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 8 at 2100 MHz (512 cores, 2.150 TFLOPs/s) |
| Memory, Cache | 26899 MB, 16 KB global / 32 KB local |
| Buffer Limits | 19382 MB global, 19847731 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 505 | 39 GB/s | 30 | 9998 80% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 511 |

|----------------.------------------------------------------------------------|
| Device ID 0 | gfx90c |
| Device ID 1 | gfx90c |
| Device ID 2 | gfx90c |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | gfx90c |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3380.6 (PAL,HSAIL) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 8 at 2100 MHz (512 cores, 2.150 TFLOPs/s) |
| Memory, Cache | 26899 MB, 16 KB global / 32 KB local |
| Buffer Limits | 19382 MB global, 19847731 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 466 | 36 GB/s | 28 | 9998 80% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 501 |

edmond1992 · 2022-10-23T05:00:06Z

Is it possible to add a ready-to-run benchmark for macOS so we can get more results on Mac?
The test is bandwidth-limited, and Apple silicon should be good at this.
Not to mention the relatively cheap 64GB+ of VRAM, since the GPU shares the same main memory.

edmond1992 · 2022-10-23T05:25:08Z

RTX3060 Laptop GPU with 12700H on ASUS ROG M16 Turbo mode (120W GPU TDP) and external laptop fan
PS C:\Software\FluidX3D> .\FluidX3D-Benchmark-FP32-FP32-Windows.exe
.-----------------------------------------------------------------------------.
| ______________ ______________ |
| \ ________ | | ________ / |
| \ \ | | | | / / |
| \ \ | | | | / / |
| \ \ | | | | / / |
| \ _.-" | | "-._/ / |
| \ .-" _ "-. / |
| .-" .-" "-. "-./ |
| .-" .-"-. "-. |
| \ v" "v / |
| \ \ / / |
| \ \ / / |
| \ \ / / |
| \ ' / |
| \ / |
| \ / |
| ' ╕ Moritz Lehmann |
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA GeForce RTX 3060 Laptop GPU |
| Device ID 1 | Intel(R) Iris(R) Xe Graphics |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA GeForce RTX 3060 Laptop GPU |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 512.78 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 30 at 1425 MHz (3840 cores, 10.944 TFLOPs/s) |
| Memory, Cache | 6143 MB, 840 KB global / 48 KB local |
| Buffer Limits | 1535 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 2014 | 308 GB/s | 120 | 9999 90% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2019 |

PS C:\Software\FluidX3D> .\FluidX3D-Benchmark-FP32-FP16C-Windows.exe
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA GeForce RTX 3060 Laptop GPU |
| Device ID 1 | Intel(R) Iris(R) Xe Graphics |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA GeForce RTX 3060 Laptop GPU |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 512.78 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 30 at 1425 MHz (3840 cores, 10.944 TFLOPs/s) |
| Memory, Cache | 6143 MB, 840 KB global / 48 KB local |
| Buffer Limits | 1535 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP16C) |
| Memory Usage | CPU 272 MB, GPU 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 3523 | 271 GB/s | 210 | 9996 60% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 3572 |

PS C:\Software\FluidX3D> .\FluidX3D-Benchmark-FP32-FP16S-Windows.exe
|----------------.------------------------------------------------------------|
| Device ID 0 | NVIDIA GeForce RTX 3060 Laptop GPU |
| Device ID 1 | Intel(R) Iris(R) Xe Graphics |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA GeForce RTX 3060 Laptop GPU |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 512.78 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 30 at 1425 MHz (3840 cores, 10.944 TFLOPs/s) |
| Memory, Cache | 6143 MB, 840 KB global / 48 KB local |
| Buffer Limits | 1535 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| LBM Type | D3Q19 SRT (FP32/FP16S) |
| Memory Usage | CPU 272 MB, GPU 880 MB |
| Max Alloc Size | 608 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 3991 | 307 GB/s | 238 | 9989 90% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 4012 |

PS C:\Software\FluidX3D>

ProjectPhysX · 2022-10-23T08:01:59Z

@HAL9000COM thanks for the Vega 8 benchmarks! Quick question: Is your RAM 2x16GB DDR4-3200MT/s? And do you have an idea why the GPU shows up 3 times?

ProjectPhysX · 2022-10-23T08:06:01Z

Is it possible to add a ready-to-run benchmark for macOS so we can get more results on Mac? The test is bandwidth-limited, and Apple silicon should be good at this. Not to mention the relatively cheap 64GB+ of VRAM, since the GPU shares the same main memory.

@edmond1992 unfortunately I don't have a Mac, so I can't compile and add executables for macOS. But the code should work as-is; just compile it with the third line in make.sh and you'll get the FP32 benchmark. Uncomment FP16S/FP16C in src/defines.hpp and recompile to get the other 2 benchmarks.

edmond1992 · 2022-10-23T08:09:22Z

Cross compile?


HAL9000COM · 2022-10-23T11:18:50Z

@HAL9000COM thanks for the Vega 8 benchmarks! Quick question: Is your RAM 2x16GB DDR4-3200MT/s? And do you have an idea why the GPU shows up 3 times?

2x32GB DDR4-3200, overclocked to 3533. No idea why the GPU shows up multiple times. After some rebooting, it now shows up as two devices.
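
For an integrated GPU like the Vega 8, the bandwidth ceiling is set by system RAM. A rough sketch of the estimate, assuming a dual-channel configuration with 64-bit (8-byte) channels and that the RAM ran at 3533 MT/s during the benchmark, neither of which the thread confirms; the function name is hypothetical:

```python
def ddr_peak_gb_s(mega_transfers_per_s: float,
                  channels: int = 2,
                  bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    Assumes each channel moves `bytes_per_transfer` bytes per transfer
    (8 bytes for a standard 64-bit DDR4 channel).
    """
    return mega_transfers_per_s * 1e6 * channels * bytes_per_transfer / 1e9


if __name__ == "__main__":
    peak = ddr_peak_gb_s(3533)   # overclocked DDR4-3533, dual channel (assumed)
    measured = 38                # GB/s, Vega 8 FP32/FP32 run in this thread
    print(f"peak {peak:.1f} GB/s, reached {measured / peak:.0%}")
```

Under these assumptions the peak is about 56.5 GB/s, so the 38 GB/s reading would correspond to roughly two thirds of the theoretical bandwidth.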

skoz90 · 2022-10-24T17:03:14Z

Nvidia Quadro RTX 5000

SLGY · 2022-10-25T02:42:45Z

GTX 1050 on an old gaming laptop. It's amazing I figured out how to even run this and get a benchmark. Now I'm going to try and figure out how to run the simulation on an STL (or similar) file. I know how to use Blender quite well, but this is my first time with Visual Studio or command line stuff. I'm so out of my depth here 😟

SLGY · 2022-10-25T03:41:07Z

@ProjectPhysX have now added the FP16 benchmarks

RTX 3080 Ti

Updated FP32 (was concurrently baking a fluid in Blender when I ran the last one):

FP16S:

FP16C:

ProjectPhysX · 2022-10-25T06:30:50Z

Hi @SirWixy, thank you so much for the benchmarks! Can you post the FP16S and FP16C results too?

gittigittibangbang · 2022-10-25T07:38:14Z

Quadro RTX 4000 below. I also tried two Xeon Gold 5218 (2x16 cores), with the FP32/FP32 benchmark they top out at 126MLUPs/s, 20GB/s and 8 steps/s. I did not have the patience to run it to the end. The speedup with GPUs is really dramatic, damn.

ProjectPhysX · 2022-10-25T09:30:17Z

@gittigittibangbang thanks for the benchmarks! For the CPU you can just stop it with Ctrl+C after it has leveled at constant performance, and take the last MLUPs/s reading. Can you post the program header with the Xeon Gold for the specs, and performance values for FP16S and FP16C too for the Xeon? Thanks!
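
When a run is stopped early with Ctrl+C, the last progress row is the number to report. A small sketch for pulling that reading out of pasted console output; the regular expressions are written against the table layout seen in this thread (an assumption, other versions may print differently), and the function names are hypothetical:

```python
import re

# Matches a progress row such as "| 3253 | 250 GB/s | 194 | 9988 80% | 0s |"
ROW = re.compile(r"\|\s*(\d+)\s*\|\s*(\d+)\s*GB/s\s*\|\s*(\d+)\s*\|")
# Matches the summary line that is only printed when the run finishes
PEAK = re.compile(r"Peak MLUPs/s\s*=\s*(\d+)")


def last_reading(log: str):
    """Return (MLUPs/s, GB/s, steps/s) from the last progress row, or None."""
    rows = ROW.findall(log)
    return tuple(int(v) for v in rows[-1]) if rows else None


def peak_mlups(log: str):
    """Return the reported peak MLUPs/s, or None if the run was interrupted."""
    match = PEAK.search(log)
    return int(match.group(1)) if match else None


# Excerpt copied from the RX 5700 XT FP32/FP16S log in this thread
sample = """\
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 3253 | 250 GB/s | 194 | 9988 80% | 0s |
| Info: Peak MLUPs/s = 3253 |
"""

if __name__ == "__main__":
    print(last_reading(sample), peak_mlups(sample))
```

For an interrupted CPU run there is no "Peak MLUPs/s" line, so `peak_mlups` returns `None` and `last_reading` supplies the value to post.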

gittigittibangbang · 2022-10-25T10:39:55Z

|----------------.------------------------------------------------------------|
| Device ID 0 | Quadro RTX 4000 |
| Device ID 1 | Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 1 |
| Device Name | Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz |
| Device Vendor | Intel(R) Corporation |
| Device Driver | 6.4.0.37 |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 32 at 2300 MHz (16 cores, 1.178 TFLOPs/s) |
| Memory, Cache | 261766 MB, 256 KB global / 32 KB local |
| Buffer Limits | 65441 MB global, 128 KB constant

FP32/FP32: 132MLUPs/s, 20GB/s bandwidth, 8 steps/s
FP32/FP16C: 270MLUPs/s, 21GB/s bandwidth, 16 steps/s
FP32/FP16S: 135MLUPs/s, 10GB/s bandwidth, 8 steps/s

ProjectPhysX · 2023-10-27T11:14:40Z

Hi @biergaizi,

amazing that it worked! Not the first instance where PoCL beat Nvidia's own runtime. 🖖😛
You can try inserting "\n #pragma OPENCL FP_CONTRACT OFF" here and see if this fixes the bad performance on the Nvidia compiler.

Kind regards,
Moritz

biergaizi · 2023-10-27T11:36:14Z

You can try inserting "\n #pragma OPENCL FP_CONTRACT OFF" here and see if this fixes the bad performance on the Nvidia compiler.

This worked perfectly. It even fixed the performance problem of FP32/FP16S on the Nvidia compiler (PoCL has low performance, probably because of a code generation problem). Now the performance in both cases is close to the Nvidia A100! The only exception is FP32/FP16C - the custom floating-point format probably either increased the arithmetic intensity beyond the FP32 non-FMA limit, or hit other restrictions.

The CMP 170HX suddenly has its killer app now.

FP32/FP32

|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA Graphics Device                                     |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.104.05                                                 |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 70 at 1410 MHz (8960 cores, 25.267 TFLOPs/s)               |
| Memory, Cache  | 7961 MB, 1960 KB global / 48 KB local                      |
| Buffer Limits  | 1990 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    7583 |   1160 GB/s |       452 |         9994  40% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 7585                                                   |

FP32/FP16S

|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|   12386 |    954 GB/s |       738 |         9997  70% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 12392                                                  |

FP32/FP16C

|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    6853 |    528 GB/s |       408 |         9985  50% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 6859                                                   |

Patch

Even better, the minimal change needed for this workaround is just a two-line patch:

diff --git a/src/lbm.cpp b/src/lbm.cpp
index d99202f..28aeb25 100644
--- a/src/lbm.cpp
+++ b/src/lbm.cpp
@@ -286,6 +286,8 @@ void LBM_Domain::enqueue_unvoxelize_mesh_on_device(const Mesh* mesh, const uchar
 }
 
 string LBM_Domain::device_defines() const { return
+       "\n     #pragma OPENCL FP_CONTRACT OFF"  // prevents implicit FMA optimizations
+       "\n     #define fma(a, b, c) ((a) * (b) + (c))"  // shadows OpenCL explicit function fma()
        "\n     #define def_Nx "+to_string(Nx)+"u"
        "\n     #define def_Ny "+to_string(Ny)+"u"
        "\n     #define def_Nz "+to_string(Nz)+"u"

SphaeroX · 2023-11-07T19:15:41Z

D3Q19 SRT (FP32/FP32)

|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA GeForce RTX 4080 Laptop GPU |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 537.42 |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 58 at 2280 MHz (7424 cores, 33.853 TFLOPs/s) |
| Memory, Cache | 12281 MB, 1624 KB global / 48 KB local |
| Buffer Limits | 3070 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| Info: Allocating memory. This may take a few seconds. |
|-----------------.-----------------------------------------------------------|
| Grid Resolution | 256 x 256 x 256 = 16777216 |
| Grid Domains | 1 x 1 x 1 = 1 |
| LBM Type | D3Q19 SRT (FP32/FP32) |
| Memory Usage | CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size | 1216 MB |
| Time Steps | 10 |
| Kin. Viscosity | 1.00000000 |
| Relaxation Time | 3.50000000 |
| Reynolds Number | Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs | Bandwidth | Steps/s | Current Step | Time Remaining |
| 2544 | 389 GB/s | 152 | 9992 20% | 0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2577 |

D3Q19 SRT (FP32/FP16S)

| Info: Peak MLUPs/s = 5086

D3Q19 SRT (FP32/FP16C)

| Info: Peak MLUPs/s = 5114

fiftyfathoms · 2023-11-27T12:51:30Z

Haven't seen results for Nvidia A30.

OpenCL Benchmark

.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA A30                                                 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA A30                                                 |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.129.03                                                 |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 56 at 1440 MHz (3584 cores, 10.322 TFLOPs/s)               |
| Memory, Cache  | 24062 MB, 1568 KB global / 48 KB local                     |
| Buffer Limits  | 6015 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         5.053 TFLOPs/s (1/2 ) |
| FP32  compute                                        10.215 TFLOPs/s ( 1x ) |
| FP16  compute                                          not supported        |
| INT64 compute                                         1.990  TIOPs/s (1/4 ) |
| INT32 compute                                        10.285  TIOPs/s ( 1x ) |
| INT16 compute                                         8.158  TIOPs/s (2/3 ) |
| INT8  compute                                         8.316  TIOPs/s (2/3 ) |
| Memory Bandwidth ( coalesced read      )                        806.94 GB/s |
| Memory Bandwidth ( coalesced      write)                        900.40 GB/s |
| Memory Bandwidth (misaligned read      )                        651.78 GB/s |
| Memory Bandwidth (misaligned      write)                         80.94 GB/s |
| PCIe   Bandwidth (send                 )                         19.16 GB/s |
| PCIe   Bandwidth (   receive           )                         13.22 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   12.30 GB/s |
|-----------------------------------------------------------------------------|
|-----------------------------------------------------------------------------|
| Done. Press Enter to exit.                                                  |
'-----------------------------------------------------------------------------'

FP32/FP16C

.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.10 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA A30                                                 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA A30                                                 |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.129.03                                                 |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 56 at 1440 MHz (3584 cores, 10.322 TFLOPs/s)               |
| Memory, Cache  | 24062 MB, 1568 KB global / 48 KB local                     |
| Buffer Limits  | 6015 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    5712 |    440 GB/s |       340 |         9994  40% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 5726                                                   |

FP32/FP16S

.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.10 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA A30                                                 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA A30                                                 |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.129.03                                                 |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 56 at 1440 MHz (3584 cores, 10.322 TFLOPs/s)               |
| Memory, Cache  | 24062 MB, 1568 KB global / 48 KB local                     |
| Buffer Limits  | 6015 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    9718 |    748 GB/s |       579 |         9993  30% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 9721                                                   |

FP32/FP32

.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.10 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA A30                                                 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA A30                                                 |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.129.03                                                 |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 56 at 1440 MHz (3584 cores, 10.322 TFLOPs/s)               |
| Memory, Cache  | 24062 MB, 1568 KB global / 48 KB local                     |
| Buffer Limits  | 6015 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    5002 |    765 GB/s |       298 |         9997  70% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 5004                                                   |

Willian-Zhang · 2023-11-28T17:47:59Z

Apple M1 Ultra 128G
FP32

.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.10 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | Apple M1 Ultra                                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Apple M1 Ultra                                             |
| Device Vendor  | Apple                                                      |
| Device Driver  | 1.2 1.0                                                    |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 64 at 1000 MHz (8192 cores, 16.384 TFLOPs/s)               |
| Memory, Cache  | 98304 MB, 0 KB global / 32 KB local                        |
| Buffer Limits  | 18432 MB global, 1048576 KB constant                       |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    4448 |    681 GB/s |       265 |         9987  70% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 4519                                                   |

FP16S


|----------------.------------------------------------------------------------|
| Device ID    0 | Apple M1 Ultra                                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Apple M1 Ultra                                             |
| Device Vendor  | Apple                                                      |
| Device Driver  | 1.2 1.0                                                    |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 64 at 1000 MHz (8192 cores, 16.384 TFLOPs/s)               |
| Memory, Cache  | 98304 MB, 0 KB global / 32 KB local                        |
| Buffer Limits  | 18432 MB global, 1048576 KB constant                       |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    8286 |    638 GB/s |       494 |         9995  50% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 8418                                                   |

FP16C

|----------------.------------------------------------------------------------|
| Device ID    0 | Apple M1 Ultra                                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Apple M1 Ultra                                             |
| Device Vendor  | Apple                                                      |
| Device Driver  | 1.2 1.0                                                    |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 64 at 1000 MHz (8192 cores, 16.384 TFLOPs/s)               |
| Memory, Cache  | 98304 MB, 0 KB global / 32 KB local                        |
| Buffer Limits  | 18432 MB global, 1048576 KB constant                       |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    6794 |    523 GB/s |       405 |         9979  90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 6915                                                   |

marcc1229 · 2023-12-07T22:16:31Z

How is everyone running the benchmarks for multi-GPU configurations? I'm playing around with MI25s and not seeing anywhere near the performance the specs suggest I should. I'm wondering if I have a hardware bottleneck or if I missed something in the setup.

ProjectPhysX · 2023-12-08T06:10:27Z

@marcc1229 use the "2/4/8 GPUs" lines in the benchmark setup, and for memory use a value close to the VRAM capacity of one GPU, like 15800u. For fine-tuning you can also set the resolution directly, for example const uint3 lbm_N = uint3(464u);
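To get a feel for which resolution fits a given memory budget, here is a rough sketch. It is not FluidX3D's own resolution helper; the function names are made up for illustration. It assumes the memory value is in MB (2^20 bytes) and that GPU memory scales linearly with the cell count. The single-GPU logs in this thread report 1488 MB (FP32/FP32) and 880 MB (FP32/FP16S/C) for 256 x 256 x 256 cells, which works out to 93 and 55 bytes per cell. Multi-GPU domains will need somewhat more, so treat the result as an upper bound.

```cpp
#include <cmath>
#include <cstdint>

// Bytes per lattice cell implied by a benchmark log:
// reported GPU memory in MB (2^20 bytes) divided by the number of cells.
double bytes_per_cell(const double memory_mb, const std::uint64_t cells) {
    return memory_mb * 1048576.0 / static_cast<double>(cells);
}

// Largest cubic edge length n such that n^3 cells fit into memory_mb,
// assuming memory scales linearly with the cell count (an assumption).
std::uint32_t max_cubic_resolution(const double memory_mb, const double bytes_cell) {
    const double cells = memory_mb * 1048576.0 / bytes_cell;
    std::uint32_t n = static_cast<std::uint32_t>(std::cbrt(cells));
    // Correct for floating-point rounding of cbrt in either direction.
    while (static_cast<double>(n + 1u) * (n + 1u) * (n + 1u) <= cells) n++;
    while (n > 0u && static_cast<double>(n) * n * n > cells) n--;
    return n;
}
```

For example, `max_cubic_resolution(15800.0, bytes_per_cell(880.0, 16777216ull))` gives an estimate of the largest cube that fits in 15800 MB at FP16 memory precision on one device.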

The multi-GPU communication has some performance overhead, which shrinks relative to domain compute time the larger the resolution is. The highest possible resolution is the best performing and also the most interesting case for multi-GPU, as at lower resolution a single GPU would be sufficient. But performance at similarly large resolutions should not be too different.
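A simplified way to see why the overhead shrinks: if each domain is modeled as a cube of edge length n that exchanges a one-cell-thick layer on every face it shares with a neighbor, the exchanged cells grow with n^2 while the computed cells grow with n^3. The sketch below only illustrates this surface-to-volume argument; the function name is hypothetical, and the data volume FluidX3D actually transfers per step may differ.

```cpp
#include <cstdint>

// Ratio of cells in one-cell-thick boundary layers on `shared_faces` faces of a
// cubic domain with edge length n (n > 0) to all cells in that domain.
// Simplifies to shared_faces / n, so it halves when n doubles.
double halo_fraction(const std::uint64_t n, const std::uint64_t shared_faces) {
    const std::uint64_t halo_cells = shared_faces * n * n;
    const std::uint64_t volume_cells = n * n * n;
    return static_cast<double>(halo_cells) / static_cast<double>(volume_cells);
}
```

Under this model, going from n = 256 to n = 512 per domain halves the share of cells that have to cross PCIe each step.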

For the single-GPU benchmark the resolution should not matter at all as long as it's sufficiently large for full hardware saturation.

However, the older GCN/Vega GPUs can show vastly different performance for slightly different grid resolutions / workgroup counts. This is the cursed memory bandwidth anomaly, a problem of the hardware architecture. Try a few different large resolutions.

A potential bottleneck could be PCIe communication. If you have a server where each GPU is connected by PCIe 3.0 x16 or x8, this should not be an issue. But cheap crypto mining hardware with USB 3 / PCIe 3.0 x1 connections, for example, is problematic.

marcc1229 · 2023-12-08T16:24:14Z

This is what I'm getting with two MI25s flashed with the WX 9100 BIOS, running at PCIe 3.0 x16. I couldn't let them run all the way through because I don't have proper cooling set up yet. I just wanted to test these before committing to buying more and designing a proper cooling setup. I'm a mechanic by trade and I'm trying to use this to help design an aero/cooling setup for a long-running car project, so my apologies if I end up asking incredibly stupid questions. I'm learning as I go.

Alex-Vasile · 2023-12-12T23:59:27Z

The small but apparently decently mighty original M1 (2020 MBP).

.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.11 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | Apple M1                                                   |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Apple M1                                                   |
| Device Vendor  | Apple                                                      |
| Device Driver  | 1.2 1.0 (macOS)                                            |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 8 at 1000 MHz (1024 cores, 2.048 TFLOPs/s)                 |
| Memory, Cache  | 10922 MB, 0 KB global / 32 KB local                        |
| Buffer Limits  | 2048 MB global, 1048576 KB constant                        |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     377 |     58 GB/s |        22 |         9998  80% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 384                                                    |

.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.11 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | Apple M1                                                   |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Apple M1                                                   |
| Device Vendor  | Apple                                                      |
| Device Driver  | 1.2 1.0 (macOS)                                            |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 8 at 1000 MHz (1024 cores, 2.048 TFLOPs/s)                 |
| Memory, Cache  | 10922 MB, 0 KB global / 32 KB local                        |
| Buffer Limits  | 2048 MB global, 1048576 KB constant                        |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     752 |     58 GB/s |        45 |         9999  90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 758                                                    |

.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.11 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | Apple M1                                                   |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Apple M1                                                   |
| Device Vendor  | Apple                                                      |
| Device Driver  | 1.2 1.0 (macOS)                                            |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 8 at 1000 MHz (1024 cores, 2.048 TFLOPs/s)                 |
| Memory, Cache  | 10922 MB, 0 KB global / 32 KB local                        |
| Buffer Limits  | 2048 MB global, 1048576 KB constant                        |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     755 |     58 GB/s |        45 |         9998  80% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 759                                                    |

jbruck · 2023-12-20T10:21:30Z

Windows 11
NVIDIA GeForce MX450
MLUPs/s 185

dboswell-marigoldsystems · 2023-12-22T19:35:47Z

Howdy! Benchmark Results below for the new Nvidia L40S being tested in the Marigold Systems Lab, requested from the /r/Nvidia Subreddit.

FP32-16C

FP32-16S

FP32-FP32

FluidX3D Benchmark.docx

Nvidia L40s
Dell PowerEdge R760
Ubuntu Server 22.04.3 LTS
Nvidia 535.129 Driver

marigoldsystems.com

ProjectPhysX · 2023-12-22T21:19:07Z

@dboswell-marigoldsystems thank you!!

Jake1402 · 2024-01-07T15:23:00Z

RTX3050

lslowmotion · 2024-01-18T12:51:26Z

Wait. Is this right for 3090 FP32/FP16S? I got over 658k MLUPs/s just by changing uint memory to 24000u.

Also, for 2 3090s I got 167k MLUPs/s

Is it required to keep the memory size at 1488u? Because the 1488u one looks normal to me compared to those on the benchmark sheet.

Also, here are the results using FP32/FP32 on 1488u memory

ProjectPhysX · 2024-01-18T20:26:40Z

@lslowmotion for single-GPU, performance is mostly independent of grid size / memory occupation, so use the default 256³ / 1488u MB here.
For multi-GPU benchmarking, a larger grid size is a bit faster, because domain communication time becomes smaller relative to domain compute time. Since the OS itself needs a few hundred MB of VRAM, with 24000 MB the memory allocation will fail (without error message unfortunately), kernels don't actually execute and you get unphysically high scores. Use a bit less than max VRAM capacity, like 23500u. Thanks!
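
The headroom rule above can be sketched in a few lines. This is only an illustration, not code from FluidX3D: the function name `safe_memory_mb` and the 500 MB default reserve are assumptions, chosen so that a 24000 MB card yields the suggested 23500. The amount the OS actually needs varies by system.

```python
def safe_memory_mb(vram_mb: int, reserve_mb: int = 500) -> int:
    """Return a benchmark memory budget in MB that leaves headroom for the OS.

    reserve_mb is an assumed value standing in for "a few hundred MB";
    tune it for the system at hand.
    """
    if reserve_mb < 0 or reserve_mb >= vram_mb:
        raise ValueError("reserve_mb must be in [0, vram_mb)")
    return vram_mb - reserve_mb


if __name__ == "__main__":
    # 24000 MB card, as in the RTX 3090 runs discussed above
    print(f"{safe_memory_mb(24000)}u")
```

A larger reserve, for example `safe_memory_mb(24000, 1000)`, is the safer choice when a desktop session is also using the GPU.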

lslowmotion · 2024-01-19T03:44:42Z

@ProjectPhysX yea with 23000u now it looks more in line with how it should be. Thanks.

Also to complete the ones above, here are single and dual 3090s in FP32/FP16C to add to the benchmark table. Hope these help!

ProjectPhysX · 2024-01-24T20:26:26Z

Hi @lslowmotion,

today I realized that with an optimization in update v2.11, I accidentally stepped on a bug in Nvidia's OpenCL driver, which caused failure of memory allocation for larger simulations, including your benchmark runs at larger resolution. This is now fixed in the master branch! Large resolutions up to 2x ~23000 MB are now working again also with the FP16 types.
Apologies for the trouble!

Kind regards,
Moritz

marcc1229 · 2024-02-02T01:32:30Z

These are MI25s flashed with the WX 9100 BIOS, mounted directly to the board.

gryoung4727 · 2024-02-12T05:22:09Z

Results for the ASUS 4070 Ti Super 16GB card, non overclocked.

mckirkus · 2024-02-16T00:41:21Z

RTX 3080 12GB edition - FP16S

RTX 3080 12GB edition - FP16C

RTX 3080 12GB edition - FP32

SLGY · 2024-03-08T03:26:53Z

Here's a multi GPU (technically) result for a Tesla K80 (2 core) GPU. There's a single core K80 (12GB) result in the benchmarks, but now that we have multi GPU functionality here's the 2 core K80 (24GB) result!

chconnor · 2024-03-11T20:00:46Z

|                                     \ /               FluidX3D Version 2.14 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce GTX 1060 6GB                                |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce GTX 1060 6GB                                |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.161.07 (Linux)                                         |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 10 at 1784 MHz (1280 cores, 4.567 TFLOPs/s)                |
| Memory, Cache  | 6064 MB, 480 KB global / 48 KB local                       |
| Buffer Limits  | 1516 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     995 |    152 GB/s |        59 |         9997  70% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 997                                                    |


|                                     \ /               FluidX3D Version 2.14 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce GTX 1060 6GB                                |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce GTX 1060 6GB                                |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.161.07 (Linux)                                         |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 10 at 1784 MHz (1280 cores, 4.567 TFLOPs/s)                |
| Memory, Cache  | 6064 MB, 480 KB global / 48 KB local                       |
| Buffer Limits  | 1516 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    1924 |    148 GB/s |       115 |         9994  40% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 1925                                                   |

|                                     \ /               FluidX3D Version 2.14 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce GTX 1060 6GB                                |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce GTX 1060 6GB                                |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 535.161.07 (Linux)                                         |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 10 at 1784 MHz (1280 cores, 4.567 TFLOPs/s)                |
| Memory, Cache  | 6064 MB, 480 KB global / 48 KB local                       |
| Buffer Limits  | 1516 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    1772 |    136 GB/s |       106 |         9994  40% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 1785                                                   |
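
The "Memory Usage" and "Max Alloc Size" rows in these tables follow from the grid size. A short sketch, assuming a per-cell layout that reproduces the numbers shown: 19 density distribution functions (DDFs) per cell stored as FP32 or FP16, plus an FP32 density, three FP32 velocity components and a 1-byte flag. The layout and the function name are assumptions for illustration, not taken from this thread.

```python
MIB = 1024 * 1024  # the console output reports binary megabytes


def memory_usage_mb(nx: int, ny: int, nz: int, ddf_bytes: int) -> dict:
    """Estimate memory figures for a D3Q19 grid.

    ddf_bytes: bytes per stored DDF (4 for FP32, 2 for FP16S/FP16C).
    Assumed layout: 19 DDFs on the GPU only; density (4 B),
    velocity (3 x 4 B) and flags (1 B) on both CPU and GPU.
    """
    cells = nx * ny * nz
    ddf_per_cell = 19 * ddf_bytes
    fields_per_cell = 4 + 3 * 4 + 1
    return {
        "cpu_mb": cells * fields_per_cell // MIB,
        "gpu_mb": cells * (ddf_per_cell + fields_per_cell) // MIB,
        "max_alloc_mb": cells * ddf_per_cell // MIB,  # largest single buffer: the DDFs
    }


if __name__ == "__main__":
    print("FP32:", memory_usage_mb(256, 256, 256, 4))
    print("FP16:", memory_usage_mb(256, 256, 256, 2))
```

For the default 256³ grid this gives 272 MB on the CPU side, 1488 MB (FP32) or 880 MB (FP16) on the GPU, and a largest buffer of 1216 MB or 608 MB, matching the tables above.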

matteocavestri · 2024-04-29T17:48:19Z

Results on AMD Radeon RX590 8GB (Running on Clover-Mesa OpenCL 1.2)

FP32

FP16C

FP16S

matteocavestri · 2024-04-29T19:36:31Z

Results on AMD Radeon RX590 8GB (Running on Rusticl-Mesa OpenCL 1.2)

FP32

FP16C

FP16S

So if you want to use an open-source OpenCL implementation (Clover or Rusticl), use Clover until Rusticl becomes better.

Clover by default is OpenCL 1.1 conformant, but you can export:

CLOVER_DEVICE_VERSION_OVERRIDE=1.2
CLOVER_DEVICE_CLC_VERSION_OVERRIDE=1.2

to use OpenCL 1.2
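
One way to apply these overrides for a single run, without changing the shell environment, is to pass them to the child process. A minimal sketch: the two variable names are the ones quoted above, while the helper names and the `bin/FluidX3D` path in the usage comment are placeholders to adapt to the local build.

```python
import os
import subprocess

# Overrides quoted above: make Clover report OpenCL 1.2 instead of 1.1.
CLOVER_OVERRIDES = {
    "CLOVER_DEVICE_VERSION_OVERRIDE": "1.2",
    "CLOVER_DEVICE_CLC_VERSION_OVERRIDE": "1.2",
}


def clover_env(base=None):
    """Return a copy of the environment with the Clover overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(CLOVER_OVERRIDES)
    return env


def run_with_clover_overrides(command):
    """Run a command (list of strings) with the overrides set for that process only."""
    return subprocess.run(command, env=clover_env(), check=False)


# Usage (hypothetical binary path):
# run_with_clover_overrides(["bin/FluidX3D"])
```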

gitcnd · 2024-05-20T12:03:16Z

RoG Strix Laptop:

|                                     \ /               FluidX3D Version 2.16 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce RTX 3080 Ti Laptop GPU                      |
| Device ID    1 | Intel(R) UHD Graphics 770                                  |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce RTX 3080 Ti Laptop GPU                      |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 516.40 (Windows)                                           |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 58 at 1590 MHz (7424 cores, 23.608 TFLOPs/s)               |
| Memory, Cache  | 16383 MB, 1624 KB global / 48 KB local                     |
| Buffer Limits  | 4095 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    2972 |    455 GB/s |       177 |         9992  20% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2985                                                   |

Interesting how my laptop 3080 Ti beats the other laptops' RTX 4080!

ProjectPhysX · 2024-05-20T14:45:33Z

Hi @gitcnd, thanks a lot! Can you please add the FP16S and FP16C benchmarks too?
Almost all RTX 40 series GPUs have severely reduced memory bus width and memory bandwidth as compared to their RTX 30 predecessors, making them slower in compute applications.
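
The link between memory bandwidth and benchmark score can be checked against the tables earlier in this thread. A rough sketch, assuming each D3Q19 cell update loads 19 values, stores 19 values and touches a 1-byte flag, which gives 153 bytes per update with FP32 storage and 77 bytes with FP16 storage. These per-cell figures and the function name are assumptions for illustration, not output from FluidX3D.

```python
# Assumed memory traffic per cell update: 19 loads + 19 stores + 1 flag byte.
BYTES_PER_CELL = {
    "FP32": 19 * 2 * 4 + 1,  # 153
    "FP16": 19 * 2 * 2 + 1,  # 77
}


def bandwidth_gb_s(mlups: float, storage: str) -> float:
    """Memory bandwidth in GB/s implied by a score in MLUPs/s."""
    return mlups * 1e6 * BYTES_PER_CELL[storage] / 1e9


if __name__ == "__main__":
    # (device, storage, MLUPs/s, reported GB/s) copied from tables above
    readings = [
        ("GTX 1060 6GB", "FP32", 995, 152),
        ("GTX 1060 6GB", "FP16", 1924, 148),
        ("RTX 3080 Ti Laptop", "FP32", 2972, 455),
    ]
    for device, storage, mlups, reported in readings:
        estimate = bandwidth_gb_s(mlups, storage)
        print(f"{device} {storage}: {estimate:.0f} GB/s estimated, {reported} GB/s reported")
```

Under this assumption the score scales directly with achieved bandwidth, which is why halving the bytes per cell with FP16 storage roughly doubles MLUPs/s on the same card.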

gitcnd · 2024-05-20T15:54:18Z

Sorry about that - here they are:

|                                     \ /               FluidX3D Version 2.16 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce RTX 3080 Ti Laptop GPU                      |
| Device ID    1 | Intel(R) UHD Graphics 770                                  |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce RTX 3080 Ti Laptop GPU                      |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 516.40 (Windows)                                           |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 58 at 1590 MHz (7424 cores, 23.608 TFLOPs/s)               |
| Memory, Cache  | 16383 MB, 1624 KB global / 48 KB local                     |
| Buffer Limits  | 4095 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    5832 |    449 GB/s |       348 |         9993  30% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 5908                                                   |


|                                     \ /               FluidX3D Version 2.16 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce RTX 3080 Ti Laptop GPU                      |
| Device ID    1 | Intel(R) UHD Graphics 770                                  |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce RTX 3080 Ti Laptop GPU                      |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 516.40 (Windows)                                           |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 58 at 1590 MHz (7424 cores, 23.608 TFLOPs/s)               |
| Memory, Cache  | 16383 MB, 1624 KB global / 48 KB local                     |
| Buffer Limits  | 4095 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    5759 |    443 GB/s |       343 |         9983  30% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 5780                                                   |

gitcnd · 2024-05-20T16:09:58Z

And just for giggles... (the slowest benchmark here so far :-)

|                                     \ /               FluidX3D Version 2.16 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce RTX 3080 Ti Laptop GPU                      |
| Device ID    1 | Intel(R) UHD Graphics 770                                  |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
| Device Name    | Intel(R) UHD Graphics 770                                  |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 31.0.101.3962 (Windows)                                    |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 32 at 1550 MHz (256 cores, 0.794 TFLOPs/s)                 |
| Memory, Cache  | 12955 MB, 1920 KB global / 64 KB local                     |
| Buffer Limits  | 4095 MB global, 4194296 KB constant                        |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     243 |     19 GB/s |        14 |         9999  90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 246                                                    |

C:\Users\cnd\Downloads\FluidX3D>bin\FluidX3D.exe -h
Lattice Boltzmann CFD software by Dr. Moritz Lehmann
Usage:
  bin\FluidX3D.exe [OPTION...]

  -h, --help            Print help
  -x arg                X proportion factor (default: 1.0)
  -y arg                Y proportion factor (default: 1.0)
  -z arg                Z proportion factor (default: 1.0)
  -r, --resolution arg  Resolution (default: 4096)
      --re arg          Reynolds number (default: 100000.0)
  -u arg                Velocity (default: 0.1)
  -t, --time arg        Time (default: 10000)
      --scale arg       Scale (default: 0.9)
  -f, --file arg        Filename (default: input.stl)
  -a, --aoa arg         Angle of attack (default: -5.0)
      --camx arg        Camera X (default: 19.0)
      --camy arg        Camera Y (default: 19.1)
      --camz arg        Camera Z (default: 19.2)
      --camzoom arg     Camera Zoom (default: 1.0)
      --camrx arg       Camera Rotation X (default: 33.0)
      --camry arg       Camera Rotation Y (default: 42.0)
      --camfov arg      Camera Field of View (default: 68.0)
  -s, --secs arg        Seconds (default: 10.0)
  -w, --window          Enable window instead of fullscreen mode
      --wait            Wait for keypress befor ending
      --pause           Do not auto-start the simulation
  -d, --display arg     Display (default: 0,1)

biergaizi · 2024-05-20T16:30:11Z

@gitcnd Are both DIMM slots on the laptop populated for the Intel iGPU benchmark? If not, the results would be even slower... 😄

gitcnd · 2024-05-22T01:58:19Z

Yes - everything is populated and replaced for max performance (including special low-latency RAM: I replaced the originals).

This was the fastest laptop in the world when I finished upgrading it :-)

ProjectPhysX added the "help wanted" label Sep 22, 2022

ProjectPhysX pinned this issue Oct 20, 2022
