
considerably speed up CPU matmul #411

Merged 2 commits into karpathy:master on May 24, 2024

Conversation

ngc92
Contributor

@ngc92 ngc92 commented May 14, 2024

Not sure how much we care about perf of the CPU version. This is trying to give a substantial boost, while still keeping complexity in check: without the comments, the matmul still fits on a single screen.

While it would have been possible to either just print an error message for "bad" shapes, or to write a more generic function that handles the unrollable part followed by an epilogue loop, I think the approach of having a slow but very simple version alongside a fast but shape-restricted version is better. It keeps the fast algorithm more readable, and having the slow version there as a reference has pedagogical value in itself.

I have also removed the helper row pointers (out_bt), because they make it more difficult to see the strides of the individual memory accesses.

On my system, I observed about 20% end-to-end speedup (~5s to ~4s per step) with this change.
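The two-version approach can be sketched roughly like this (a minimal illustration of the idea, not the PR's actual code; the function names, signatures, and the unroll factor of 8 are assumptions):

```c
#include <assert.h>
#include <stddef.h>

// Slow reference: out[b,o] = bias[o] + sum_c inp[b,c] * weight[o,c]
// (row-major, weight stored as OC x C)
void matmul_forward_slow(float* out, const float* inp, const float* weight,
                         const float* bias, int B, int C, int OC) {
    for (int b = 0; b < B; b++) {
        for (int o = 0; o < OC; o++) {
            float val = (bias != NULL) ? bias[o] : 0.0f;
            for (int c = 0; c < C; c++) {
                val += inp[b * C + c] * weight[o * C + c];
            }
            out[b * OC + o] = val;
        }
    }
}

// Fast version, restricted to OC % 8 == 0: compute 8 output channels per
// inner loop so the compiler can keep 8 accumulators in registers and
// reuse each loaded input element for 8 FMAs.
void matmul_forward_fast(float* out, const float* inp, const float* weight,
                         const float* bias, int B, int C, int OC) {
    assert(OC % 8 == 0);  // shape restriction; fall back to the slow path otherwise
    for (int b = 0; b < B; b++) {
        for (int o = 0; o < OC; o += 8) {
            float acc[8];
            for (int k = 0; k < 8; k++) {
                acc[k] = (bias != NULL) ? bias[o + k] : 0.0f;
            }
            for (int c = 0; c < C; c++) {
                float x = inp[b * C + c];
                for (int k = 0; k < 8; k++) {
                    acc[k] += x * weight[(o + k) * C + c];
                }
            }
            for (int k = 0; k < 8; k++) {
                out[b * OC + o + k] = acc[k];
            }
        }
    }
}
```

Because the strides are written out explicitly (no helper row pointers), it is easy to see that the inner loop reads `inp` once and streams 8 rows of `weight` contiguously.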

@azret
Contributor

azret commented May 14, 2024

Would be nice to set up a dev\CPU area for CPU kernels, just like we have for CUDA, specifically for cases like this.

@azret
Contributor

azret commented May 14, 2024

@ngc92 I don't see the same performance improvements as you do.

== kernel #1 (nn.dev.MatMul_+matmul_forward_slow) ==

out:
OK: CPU: 51.56322 == GPU: 51.56322
OK: CPU: -13.7409 == GPU: -13.74089
OK: CPU: -2.222031 == GPU: -2.222029
OK: CPU: -13.50699 == GPU: -13.50698
OK: CPU: -18.66131 == GPU: -18.6613

OK

== kernel #2 (nn.dev.MatMul_+matmul_forward_new) ==

out:
OK: CPU: 51.56322 == GPU: 51.56321
OK: CPU: -13.7409 == GPU: -13.7409
OK: CPU: -2.222031 == GPU: -2.222035
OK: CPU: -13.50699 == GPU: -13.50699
OK: CPU: -18.66131 == GPU: -18.66132
ERROR: CPU: -67.68067 != GPU: -67.68078

FAILED.

kernel #1 forward (nn.dev.MatMul_+matmul_forward_slow), 3360.00 ms
kernel #2 forward (nn.dev.MatMul_+matmul_forward_new), 7265.00 ms

The MSVC compiler does a pretty good job optimizing matmul_forward_slow:

--- Analyzing function: matmul_forward_slow
D:\SRC\nn.cs\nn\cpu\MatMul.c(15) : info C5001: loop vectorized
D:\SRC\nn.cs\nn\cpu\MatMul.c(13) : info C5002: loop not vectorized due to reason '1106'
D:\SRC\nn.cs\nn\cpu\MatMul.c(12) : info C5002: loop not vectorized due to reason '1106'

--- Analyzing function: matmul_forward
D:\SRC\nn.cs\nn\cpu\MatMul.c(46) : info C5002: loop not vectorized due to reason '1200'
D:\SRC\nn.cs\nn\cpu\MatMul.c(55) : info C5002: loop not vectorized due to reason '1200'
D:\SRC\nn.cs\nn\cpu\MatMul.c(53) : info C5002: loop not vectorized due to reason '1106'
D:\SRC\nn.cs\nn\cpu\MatMul.c(62) : info C5002: loop not vectorized due to reason '1200'
D:\SRC\nn.cs\nn\cpu\MatMul.c(43) : info C5002: loop not vectorized due to reason '1106'
D:\SRC\nn.cs\nn\cpu\MatMul.c(42) : info C5002: loop not vectorized due to reason '1106'

Microsoft's compiler messages for loop vectorization https://learn.microsoft.com/en-us/cpp/error-messages/tool-errors/vectorizer-and-parallelizer-messages?view=msvc-170

@ngc92
Contributor Author

ngc92 commented May 14, 2024

This seems to indicate more that MSVC is doing a terrible job with the second loop, not a particularly good job at the first loop. GCC produces the following beautiful assembly:

.L11:
vmovups ymm0, YMMWORD PTR [rcx+rax]
vfmadd231ps     ymm1, ymm0, YMMWORD PTR [rsi+rax]
vfmadd231ps     ymm2, ymm0, YMMWORD PTR [r8+rax]
vfmadd231ps     ymm3, ymm0, YMMWORD PTR [r10+rax]
vfmadd231ps     ymm4, ymm0, YMMWORD PTR [r14+rax]
vfmadd231ps     ymm5, ymm0, YMMWORD PTR [r13+0+rax]
vfmadd231ps     ymm6, ymm0, YMMWORD PTR [r12+rax]
vfmadd231ps     ymm7, ymm0, YMMWORD PTR [rbx+rax]
vfmadd231ps     ymm8, ymm0, YMMWORD PTR [r11+rax]
add     rax, 32
cmp     r9, rax
jne     .L11

This relies critically on `-Ofast`, though, as it requires reordering the floating-point summation.

Could this just be because MSVC compiles without anything like `-march=native`, and thus emits SSE2 code without any AVX instructions? If I add `/arch:AVX2` as an option, I get

$LL13@matmul_for:
lea     eax, DWORD PTR [r14+r10]
movsxd  rcx, eax
movsxd  rax, r10d
vmovss  xmm0, DWORD PTR [r9+rcx*4]
vfmadd231ss xmm1, xmm0, DWORD PTR [rdi+rax*4]
lea     eax, DWORD PTR [r10+rdx]
movsxd  rcx, eax
lea     eax, DWORD PTR [r15+r10]
vfmadd231ss xmm2, xmm0, DWORD PTR [rdi+rcx*4]
movsxd  rcx, eax
lea     eax, DWORD PTR [r12+r10]
vfmadd231ss xmm3, xmm0, DWORD PTR [rdi+rcx*4]
movsxd  rcx, eax
lea     eax, DWORD PTR [r10+rbp]
vfmadd231ss xmm4, xmm0, DWORD PTR [rdi+rcx*4]
movsxd  rcx, eax
lea     eax, DWORD PTR [r10+r13]
vfmadd231ss xmm5, xmm0, DWORD PTR [rdi+rcx*4]
movsxd  rcx, eax
lea     eax, DWORD PTR [r11+r10]
vfmadd231ss xmm6, xmm0, DWORD PTR [rdi+rcx*4]
movsxd  rcx, eax
lea     eax, DWORD PTR [rbx+r10]
inc     r10d
vfmadd231ss xmm7, xmm0, DWORD PTR [rdi+rcx*4]
movsxd  rcx, eax
vfmadd231ss xmm8, xmm0, DWORD PTR [rdi+rcx*4]
sub     rsi, 1
jne     SHORT $LL13@matmul_for

Not as nice as GCC's code, but still decent enough. With 4 issue slots, the `lea`s and `mov`s shouldn't actually slow down the FMAs.
Here is a godbolt link to play around with this:
https://godbolt.org/z/vqhPqxaq4

@azret
Contributor

azret commented May 14, 2024

> This seems to indicate more that MSVC is doing a terrible job with the second loop, not a particularly good job at the first loop. GCC produces the following beautiful assembly:

👍

@rosslwheeler
Contributor

@ngc92 and @azret - can you add the `/openmp:llvm` flag, i.e. `/O2 /Oi /Ot /fp:fast /arch:AVX2 /openmp:llvm`?

@azret
Contributor

azret commented May 14, 2024

@ngc92 Thanks for the link!

Yeah. MSVC does a poor job with your version :(

kernel #1 forward (nn.dev.MatMul_+matmul_forward_slow_avx_2), 2640.00 ms
kernel #2 forward (nn.dev.MatMul_+matmul_forward_ngc92_avx_2), 6938.00 ms

@dagelf
Contributor

dagelf commented May 21, 2024

It's still 3x slower than PyTorch... but it's a huge jump 😄

Better way to run benchmarks:

( kill -STOP -1  # Stop all processes, NB don't run this outside a script!
timeout 40s ./train_gpt2
kill -CONT -1 )

Benchmark:

$ grep model /proc/cpuinfo |tail -1
model name	: Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz

(this)
step 0: train loss 5.356185 (took 3973.912109 ms)
step 1: train loss 4.301033 (took 3499.634311 ms)
step 2: train loss 4.623316 (took 3498.866642 ms)
step 3: train loss 4.600415 (took 3541.733232 ms)
step 4: train loss 4.616777 (took 3501.966997 ms)
step 5: train loss 4.231482 (took 3500.156813 ms)

vs

(previous)
step 0: train loss 5.356185 (took 5332.631576 ms)
step 1: train loss 4.301033 (took 4840.134017 ms)
step 2: train loss 4.623316 (took 4828.423850 ms)
step 3: train loss 4.600415 (took 4828.398214 ms)
step 4: train loss 4.616777 (took 4829.080307 ms)
step 5: train loss 4.231482 (took 4858.988674 ms)

Pytorch: #253

@dagelf
Contributor

dagelf commented May 21, 2024

@azret Try `grep model /proc/cpuinfo | tail -1` (or `wmic cpu get name` on Windows).

@karpathy karpathy merged commit 6e4296f into karpathy:master May 24, 2024
8 checks passed