Add support for AMD devices #413

anthonix · 2024-05-14T23:14:12Z

This adds support for AMD devices unobtrusively, keeping changes and new code to a minimum.

Performance with bfloat16 on a 7900 XTX is ~50,000 toks/sec for a single GPU, and ~210,000 toks/sec for 4x GPUs (as a frame of reference, the latest pytorch 2.4.0.dev20240513 runs at ~42,000 toks/sec on a single device).
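As a rough reading of the figures above (all are approximate, so treat the results as ballpark only): the single-GPU number is about 1.19x the PyTorch nightly, and the 4x GPU number is close to linear scaling. A minimal Python sketch of that arithmetic follows; the helper names are hypothetical and not part of this PR.

```python
# Throughput figures quoted in the PR description (approximate, tokens/sec).
single_gpu = 50_000      # one 7900 XTX, bfloat16, this PR
four_gpu = 210_000       # four 7900 XTX, bfloat16, this PR
pytorch_single = 42_000  # PyTorch 2.4.0.dev20240513, single device


def speedup(new: float, baseline: float) -> float:
    """Ratio of two throughputs (values above 1.0 mean `new` is faster)."""
    return new / baseline


def scaling_efficiency(multi: float, single: float, n_devices: int) -> float:
    """Multi-GPU throughput relative to ideal linear scaling (1.0 = ideal)."""
    return multi / (single * n_devices)


if __name__ == "__main__":
    print(f"vs PyTorch: {speedup(single_gpu, pytorch_single):.2f}x")
    print(f"4-GPU scaling efficiency: {scaling_efficiency(four_gpu, single_gpu, 4):.2f}")
```

Because the inputs are rounded, an efficiency slightly above 1.0 is within the rounding error and should not be read as super-linear scaling.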

Should this be merged here, or maintained as a separate fork?

anthonix · 2024-05-14T23:22:59Z

FWIW, I've been playing around with more aggressive AMD-specific optimizations that realize good gains over this, but I thought it was worth getting some feedback first on how baseline AMD support could be integrated.

karpathy · 2024-05-16T20:09:22Z

This is very interesting to browse through and see! But yes, I think separate fork makes a lot more sense for AMD, and super happy to link to it in the notable forks section. It's cool that you seem to be getting some very nice throughputs!

anthonix · 2024-05-16T20:23:17Z

Sounds good, will open another pull req with a link to the fork in the README.

anthonix added 5 commits May 16, 2024 12:15

Add support for AMD devices

85ac5fa

Fix casts

136490b

train_gpt2_fp32 builds for AMD, runs at ~23,700 toks/sec on 7900 XTX

0aedcba

AMD builds work for profile_gpt2 / test_gpt2 / test_gpt2_fp32

ba5048b

Use DPP insns for faster reductions

8f36b75

anthonix force-pushed the master branch from 8a8c02c to 8f36b75 Compare May 16, 2024 19:16

anthonix closed this May 16, 2024
