
Broader vendor support for hardware acceleration #400

Open
ttraenkler opened this issue May 10, 2024 · 3 comments

Comments

@ttraenkler

ttraenkler commented May 10, 2024

I understand the aim of the llm.c project is to be minimal and educational, yet its minimal approach makes it an interesting, portable, and performant target for a variety of platforms. To leverage GPGPU hardware acceleration, llm.c currently supports the proprietary CUDA C API, which is limited to Nvidia hardware. Adding generic hardware acceleration for more operating systems and GPU vendors, such as Apple Silicon (think M4) and AMD, would be desirable.

Targeting each of these independently using their native APIs would be tedious, so one could consider the abstract wgpu GPGPU C API, which translates to native DirectX 12, Metal, and Vulkan API calls and runs natively on common operating systems and GPUs. In addition, it can also run in browsers that support WebGPU, WebGL, or WebAssembly. In fact, wgpu's modern API is inspired by WebGPU, which is already supported in Chrome and coming soon to Safari and Firefox, but wgpu extends this to native platforms.

I am also interested in alternatives if this turns out to be out of scope given the project's limited resources, but maybe others are also interested in contributing to broader hardware support if it is welcomed by the maintainers. Is there interest in providing hardware acceleration for a broader hardware base in this or a later phase of the project? I would also be interested in feedback from others on whether wgpu would be a suitable abstraction, and how this option contrasts with SYCL #31 (which others have pointed out) in terms of maturity, performance, and support.

@anthonix
Contributor

Playing devil's advocate, I don't think we need an abstraction layer to target AMD devices, as I was able to build llm.c for the 7900 XTX and Radeon VII with minimal changes [1], and it is >30% faster than PyTorch nightly out of the box with no AMD-specific optimization. That is what is so great about this repo! :)

Regarding going further with AMD-specific optimizations, an extra abstraction layer absolutely will not help with that, and I had some fun down the rabbit hole writing some large chunks of RDNA3 assembly (unfortunately I couldn't get the compiler to do what was needed from C, or even from inline assembly). It enabled capturing the performance still left on the table, but at that point it goes against many of the goals of this repo and might be better referred to as rdna3-llm.s haha.

In terms of Metal, I think there is a Metal fork, and I vaguely recall hearing that Metal has something similar to hipify, so is it really that tedious to target without a new abstraction layer?

[1] https://github.com/anthonix/llm.c (fp32 at least, will get the mixed precision working now and update shortly)

@ttraenkler
Author

ttraenkler commented May 10, 2024

Playing devil's advocate, I don't think we need an abstraction layer to target AMD devices, as I was able to build llm.c for the 7900 XTX and Radeon VII with minimal changes [1], and it is >30% faster than PyTorch nightly out of the box with no AMD-specific optimization. That is what is so great about this repo! :)

Congrats! IMHO, it would be great to have an AMD port merged into this repo at some point, or at least linked from the README.

In terms of Metal, I think there is a Metal fork, and I vaguely recall hearing that Metal has something similar to hipify, so is it really that tedious to target without a new abstraction layer?

Same here, it would be great to have this merged or linked. Do you remember the URL of the Metal fork?

Obviously a well-optimized implementation in a vendor-specific API should be faster, though wgpu would still be a much faster fallback than the CPU for platforms that are not supported directly.

@anthonix
Contributor

[...] Congrats! IMHO, it would be great to have an AMD port merged into this repo at some point or linked from the readme.

Yeah definitely congrats to the authors of this repo! (as I say, I didn't have to do much to make it work on AMD devices!)

As for merging AMD support, perhaps eventually, but things are moving really fast here, so it might be worth holding off a bit until things stabilize? There is something to be said for keeping things simple and only having to deal with Nvidia while blazing a path, and I wouldn't want maintaining AMD support to get in the way of, or slow down, the great trailblazing progress that is happening right now.

I think there is a Metal fork [...]
[...] would be great to have this merged or linked. Do you remember the URL of the Metal fork?

It's at the bottom of the README. I just took a look, unfortunately it hasn't been touched in ages (relative to how fast this repo moves -- two weeks, haha).
