
considerably speed up CPU matmul #411

Merged 2 commits into karpathy:master on May 24, 2024

Conversation

ngc92
Contributor

@ngc92 ngc92 commented May 14, 2024

Not sure how much we care about perf of the CPU version. This is trying to give a substantial boost, while still keeping complexity in check: without the comments, the matmul still fits on a single screen.

While it would have been possible to either just print an error message for "bad" shapes, or to write a more generic function that handles the unrollable part followed by an epilogue loop, I think the approach of having a slow but very simple version alongside a fast but shape-restricted version is better. It keeps the fast algorithm more readable, and having the slow version there as a reference has pedagogical value in itself.

I have also removed the helper row pointers (out_bt), because they make it more difficult to see the strides of the individual memory accesses.

On my system, I observed about 20% end-to-end speedup (~5s to ~4s per step) with this change.
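The two-version approach can be sketched roughly like this (a minimal illustration of the idea, not the PR's actual code; the function names, signatures, and the unroll factor of 8 are assumptions):

```c
#include <assert.h>
#include <stddef.h>

// Slow reference: out[b,o] = bias[o] + sum_c inp[b,c] * weight[o,c]
// (row-major, weight stored as OC x C)
void matmul_forward_slow(float* out, const float* inp, const float* weight,
                         const float* bias, int B, int C, int OC) {
    for (int b = 0; b < B; b++) {
        for (int o = 0; o < OC; o++) {
            float val = (bias != NULL) ? bias[o] : 0.0f;
            for (int c = 0; c < C; c++) {
                val += inp[b * C + c] * weight[o * C + c];
            }
            out[b * OC + o] = val;
        }
    }
}

// Fast version, restricted to OC % 8 == 0: compute 8 output channels per
// inner loop so the compiler can keep 8 accumulators in registers and
// reuse each loaded input element for 8 FMAs.
void matmul_forward_fast(float* out, const float* inp, const float* weight,
                         const float* bias, int B, int C, int OC) {
    assert(OC % 8 == 0);  // shape restriction; fall back to the slow path otherwise
    for (int b = 0; b < B; b++) {
        for (int o = 0; o < OC; o += 8) {
            float acc[8];
            for (int k = 0; k < 8; k++) {
                acc[k] = (bias != NULL) ? bias[o + k] : 0.0f;
            }
            for (int c = 0; c < C; c++) {
                float x = inp[b * C + c];
                for (int k = 0; k < 8; k++) {
                    acc[k] += x * weight[(o + k) * C + c];
                }
            }
            for (int k = 0; k < 8; k++) {
                out[b * OC + o + k] = acc[k];
            }
        }
    }
}
```

Because the strides are written out explicitly (no helper row pointers), it is easy to see that the inner loop reads `inp` once and streams 8 rows of `weight` contiguously.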

@azret
Contributor

azret commented May 14, 2024

Would be nice to set up a dev\CPU area for CPU kernels, just like we have for CUDA, specifically for cases like this.

@azret
Contributor

azret commented May 14, 2024

@ngc92 I don't see the same performance improvements as you do.

== kernel #1 (nn.dev.MatMul_+matmul_forward_slow) ==

out:
OK: CPU: 51.56322 == GPU: 51.56322
OK: CPU: -13.7409 == GPU: -13.74089
OK: CPU: -2.222031 == GPU: -2.222029
OK: CPU: -13.50699 == GPU: -13.50698
OK: CPU: -18.66131 == GPU: -18.6613

OK

== kernel #2 (nn.dev.MatMul_+matmul_forward_new) ==

out:
OK: CPU: 51.56322 == GPU: 51.56321
OK: CPU: -13.7409 == GPU: -13.7409
OK: CPU: -2.222031 == GPU: -2.222035
OK: CPU: -13.50699 == GPU: -13.50699
OK: CPU: -18.66131 == GPU: -18.66132
ERROR: CPU: -67.68067 != GPU: -67.68078

FAILED.

kernel #1 forward (nn.dev.MatMul_+matmul_forward_slow), 3360.00 ms
kernel #2 forward (nn.dev.MatMul_+matmul_forward_new), 7265.00 ms

The MSVC compiler does a pretty good job optimizing matmul_forward_slow:

--- Analyzing function: matmul_forward_slow
D:\SRC\nn.cs\nn\cpu\MatMul.c(15) : info C5001: loop vectorized
D:\SRC\nn.cs\nn\cpu\MatMul.c(13) : info C5002: loop not vectorized due to reason '1106'
D:\SRC\nn.cs\nn\cpu\MatMul.c(12) : info C5002: loop not vectorized due to reason '1106'

--- Analyzing function: matmul_forward
D:\SRC\nn.cs\nn\cpu\MatMul.c(46) : info C5002: loop not vectorized due to reason '1200'
D:\SRC\nn.cs\nn\cpu\MatMul.c(55) : info C5002: loop not vectorized due to reason '1200'
D:\SRC\nn.cs\nn\cpu\MatMul.c(53) : info C5002: loop not vectorized due to reason '1106'
D:\SRC\nn.cs\nn\cpu\MatMul.c(62) : info C5002: loop not vectorized due to reason '1200'
D:\SRC\nn.cs\nn\cpu\MatMul.c(43) : info C5002: loop not vectorized due to reason '1106'
D:\SRC\nn.cs\nn\cpu\MatMul.c(42) : info C5002: loop not vectorized due to reason '1106'

Microsoft's compiler messages for loop vectorization https://learn.microsoft.com/en-us/cpp/error-messages/tool-errors/vectorizer-and-parallelizer-messages?view=msvc-170

@ngc92
Contributor Author

ngc92 commented May 14, 2024

This seems to indicate more that MSVC is doing a terrible job with the second loop, not a particularly good job at the first loop. GCC produces the following beautiful assembly:

.L11:
vmovups ymm0, YMMWORD PTR [rcx+rax]
vfmadd231ps     ymm1, ymm0, YMMWORD PTR [rsi+rax]
vfmadd231ps     ymm2, ymm0, YMMWORD PTR [r8+rax]
vfmadd231ps     ymm3, ymm0, YMMWORD PTR [r10+rax]
vfmadd231ps     ymm4, ymm0, YMMWORD PTR [r14+rax]
vfmadd231ps     ymm5, ymm0, YMMWORD PTR [r13+0+rax]
vfmadd231ps     ymm6, ymm0, YMMWORD PTR [r12+rax]
vfmadd231ps     ymm7, ymm0, YMMWORD PTR [rbx+rax]
vfmadd231ps     ymm8, ymm0, YMMWORD PTR [r11+rax]
add     rax, 32
cmp     r9, rax
jne     .L11

This relies critically on `-Ofast`, though, as it requires reordering the floating-point summation.

Could this just be because MSVC compiles without anything like `-march=native`, and thus emits SSE2 code without any AVX instructions? If I add `/arch:AVX2` as an option, I get

$LL13@matmul_for:
lea     eax, DWORD PTR [r14+r10]
movsxd  rcx, eax
movsxd  rax, r10d
vmovss  xmm0, DWORD PTR [r9+rcx*4]
vfmadd231ss xmm1, xmm0, DWORD PTR [rdi+rax*4]
lea     eax, DWORD PTR [r10+rdx]
movsxd  rcx, eax
lea     eax, DWORD PTR [r15+r10]
vfmadd231ss xmm2, xmm0, DWORD PTR [rdi+rcx*4]
movsxd  rcx, eax
lea     eax, DWORD PTR [r12+r10]
vfmadd231ss xmm3, xmm0, DWORD PTR [rdi+rcx*4]
movsxd  rcx, eax
lea     eax, DWORD PTR [r10+rbp]
vfmadd231ss xmm4, xmm0, DWORD PTR [rdi+rcx*4]
movsxd  rcx, eax
lea     eax, DWORD PTR [r10+r13]
vfmadd231ss xmm5, xmm0, DWORD PTR [rdi+rcx*4]
movsxd  rcx, eax
lea     eax, DWORD PTR [r11+r10]
vfmadd231ss xmm6, xmm0, DWORD PTR [rdi+rcx*4]
movsxd  rcx, eax
lea     eax, DWORD PTR [rbx+r10]
inc     r10d
vfmadd231ss xmm7, xmm0, DWORD PTR [rdi+rcx*4]
movsxd  rcx, eax
vfmadd231ss xmm8, xmm0, DWORD PTR [rdi+rcx*4]
sub     rsi, 1
jne     SHORT $LL13@matmul_for

Not as nice as GCC's code, but still decent enough. With 4 issue slots, the `lea`s and `mov`s shouldn't actually slow down the FMAs.
Here is a godbolt link to play around with this:
https://godbolt.org/z/vqhPqxaq4

@azret
Contributor

azret commented May 14, 2024

> This seems to indicate more that MSVC is doing a terrible job with the second loop, not a particularly good job at the first loop. GCC produces the following beautiful assembly:

👍

@rosslwheeler
Contributor

@ngc92 and @azret - can you add the `/openmp:llvm` flag, i.e. `/O2 /Oi /Ot /fp:fast /arch:AVX2 /openmp:llvm`?

@azret
Contributor

azret commented May 14, 2024

@ngc92 Thanks for the link!

Yeah. MSVC does a poor job with your version :(

kernel #1 forward (nn.dev.MatMul_+matmul_forward_slow_avx_2), 2640.00 ms
kernel #2 forward (nn.dev.MatMul_+matmul_forward_ngc92_avx_2), 6938.00 ms

@dagelf
Contributor

dagelf commented May 21, 2024

It's still 3x slower than PyTorch... but it's a huge jump 😄

Better way to run benchmarks:

( kill -STOP -1  # Stop all processes, NB don't run this outside a script!
timeout 40s ./train_gpt2
kill -CONT -1 )

Benchmark:

$ grep model /proc/cpuinfo |tail -1
model name	: Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz

(this)
step 0: train loss 5.356185 (took 3973.912109 ms)
step 1: train loss 4.301033 (took 3499.634311 ms)
step 2: train loss 4.623316 (took 3498.866642 ms)
step 3: train loss 4.600415 (took 3541.733232 ms)
step 4: train loss 4.616777 (took 3501.966997 ms)
step 5: train loss 4.231482 (took 3500.156813 ms)

vs

(previous)
step 0: train loss 5.356185 (took 5332.631576 ms)
step 1: train loss 4.301033 (took 4840.134017 ms)
step 2: train loss 4.623316 (took 4828.423850 ms)
step 3: train loss 4.600415 (took 4828.398214 ms)
step 4: train loss 4.616777 (took 4829.080307 ms)
step 5: train loss 4.231482 (took 4858.988674 ms)

Pytorch: #253

@dagelf
Contributor

dagelf commented May 21, 2024

@azret Try `grep model /proc/cpuinfo | tail -1` (or `wmic cpu get name` on Windows).

@karpathy karpathy merged commit 6e4296f into karpathy:master May 24, 2024
8 checks passed