
numpy.matmul is slow #7569

Closed
pv opened this issue Apr 23, 2016 · 31 comments

@pv
Member

pv commented Apr 23, 2016

On numpy current master (da6e4c7), np.matmul is apparently not using BLAS:

>>> import numpy as np
>>> x = np.random.rand(5, 512, 512)
>>> %timeit np.matmul(x, x)
1 loops, best of 3: 526 ms per loop
>>> def xmul(a, b):
...     out = np.empty_like(a)
...     for j in range(a.shape[0]):
...         out[j] = np.dot(a[j], b[j])
...     return out
>>> %timeit xmul(x, x)
10 loops, best of 3: 28 ms per loop

Of course, it's a preliminary feature, but probably best to have an issue for this.

@djsutherland

The problem is that matmul just calls einsum for the stacked case. Setting it up to use BLAS for each matrix multiply would be useful.

@bordingj

Intel MKL supports ?gemm_batch

@ihincks

ihincks commented Jun 6, 2017

My current project's run time is bottlenecked by matmul. It's not too bad, but any idea how much faster it would be if it were properly interfaced to BLAS/MKL? I'm mostly multiplying (10000, 9, 9) by (10000, 9, 9).

@njsmith
Member

njsmith commented Jun 6, 2017

Doing lots of small (9x9) matrix multiplies is actually a case where using a fancy BLAS library might make you slower. The fancy BLAS routines win by carefully controlling which parts of your matrices are in cache when, so if your matrices fit entirely in cache there isn't much improvement possible, and the libraries can actually slow things down by trying to be clever and adding overhead.

That said, it's probably worth trying a manual loop like in the first comment in this thread and seeing how fast it goes in your case!
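
A quick way to check on your own shapes (a rough sketch; the exact numbers depend on your BLAS build and hardware, and the (10000, 9, 9) shapes are taken from the comment above):

import numpy as np
from timeit import timeit

a = np.random.rand(10000, 9, 9)
b = np.random.rand(10000, 9, 9)

def loop_dot(a, b):
    # One np.dot (BLAS) call per 9x9 pair, as in xmul from the first comment.
    out = np.empty_like(a)
    for j in range(a.shape[0]):
        out[j] = np.dot(a[j], b[j])
    return out

print("matmul  :", timeit(lambda: np.matmul(a, b), number=10))
print("dot loop:", timeit(lambda: loop_dot(a, b), number=10))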

@djsutherland

djsutherland commented Jun 6, 2017

I actually just tested it on my machine, calling MKL's cblas_dgemm_multiply through ctypes, and found that A @ B takes ~37ms, while the actual call to cblas_dgemm_multiply takes ~1.2ms - a 30x speedup! Unfortunately, naively turning that result (which is an array of pointers) into a Python structure takes ~183ms, so you'd have to speed that processing up if you wanted to use it. :)

https://gist.github.com/45d05765c6291e79489c8c31e1ca99ba
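
For reference, a minimal ctypes sketch of what such a batched call might look like (an assumption-laden illustration, not the code in the gist above: it assumes an LP64 MKL build exposed as libmkl_rt.so, C-contiguous float64 inputs, and the standard CBLAS enum values):

import ctypes
import numpy as np

mkl = ctypes.CDLL("libmkl_rt.so")  # adjust to however MKL is installed on your system

CBLAS_ROW_MAJOR = 101  # standard CBLAS enum values
CBLAS_NO_TRANS = 111

def batched_matmul_mkl(a, b):
    # Compute out[i] = a[i] @ b[i] with a single cblas_dgemm_batch call,
    # using one "group" that contains all `batch` identically-shaped products.
    batch, m, k = a.shape
    _, _, n = b.shape
    out = np.empty((batch, m, n), dtype=np.float64)

    dbl_p = ctypes.POINTER(ctypes.c_double)
    ptrs = lambda ary: (dbl_p * batch)(*(ary[i].ctypes.data_as(dbl_p) for i in range(batch)))
    ints = lambda *v: (ctypes.c_int * len(v))(*v)   # assumes MKL_INT == c_int (LP64)
    dbls = lambda *v: (ctypes.c_double * len(v))(*v)

    mkl.cblas_dgemm_batch(
        ctypes.c_int(CBLAS_ROW_MAJOR),
        ints(CBLAS_NO_TRANS), ints(CBLAS_NO_TRANS),  # transa/transb per group
        ints(m), ints(n), ints(k),                   # m/n/k per group
        dbls(1.0),                                   # alpha per group
        ptrs(a), ints(k),                            # A pointers, lda
        ptrs(b), ints(n),                            # B pointers, ldb
        dbls(0.0),                                   # beta per group
        ptrs(out), ints(n),                          # C pointers, ldc
        ctypes.c_int(1), ints(batch),                # group_count, group_size
    )
    return out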

@njsmith
Member

njsmith commented Jun 7, 2017

You mean MKL's proprietary cblas_dgemm_batch, right? Even sticking to ctypes, that could probably be sped up a lot by taking advantage of the fact that those arrays full of pointers could be fully constructed by calls to np.arange.

@djsutherland

Yep, that's what I was calling. Good point about arange, though I wasn't counting that time, only the actual function call.

@ihincks

ihincks commented Jun 7, 2017

Thanks for the tips! I will give things a whirl tomorrow.

@djsutherland

Two updates about my example code there:

  1. Reconstructing C was idiotic; there's already a C array. :)

  2. I tried to switch the pointer array construction to something faster:

import ctypes
import numpy as np

def pointer_array(ary):
    # Per-matrix data pointers; assumes `ary` is a C-contiguous float64 stack.
    # Note: `x` must stay alive while the returned pointer is in use.
    start = ary[0].ctypes.data
    step = ary[1].ctypes.data - start
    x = np.arange(start, start + step * len(ary), step, dtype=np.int64)
    return x.ctypes.data_as(ctypes.POINTER(ctypes.POINTER(ctypes.c_double)))

That seemed to work in terms of accessing things, but when I called the function it didn't actually change C. Not sure what's going on there.

@ihincks

ihincks commented Jun 7, 2017

Okay, I tried the first suggestion from @njsmith and replaced my matmul calls with the OP's xmul. This makes my project run about 50% faster. According to prun, about 95% of the runtime is now spent in xmul.

A @ B takes ~37ms, while the actual call to cblas_dgemm_multiply takes ~1.2ms - a 30x speedup!

This sounds like it could help me quite a bit if I keep everything in C arrays until the end of the algorithm.

@jakirkham
Contributor

If matmul is still using einsum, it is probably worth checking whether this is still slow, because einsum now uses BLAS routines for some cases where possible, thanks to PR (#9425). I should add that this is only on master, so it is not yet released, but it is scheduled for 1.14.0.
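
A quick way to see whether the new BLAS dispatch applies to this expression is to compare the default einsum call against the optimized path (a sketch; as discussed below, this particular batched contraction may not benefit):

import numpy as np

x = np.random.rand(5, 512, 512)

# The BLAS-backed paths from #9425 are only considered when the `optimize`
# keyword is enabled; the plain call uses the original C einsum loop.
r_plain = np.einsum("...ij,...jk->...ik", x, x)
r_opt = np.einsum("...ij,...jk->...ik", x, x, optimize=True)
assert np.allclose(r_plain, r_opt)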

@jakirkham
Contributor

jakirkham commented Mar 31, 2018

FTR, I retested this with NumPy 1.14.2 to see if it is still an issue given the changes to einsum, and it is still there. This uses @pv's original code.

Example:
In [1]: import numpy as np

In [2]: x = np.random.rand(5, 512, 512)

In [3]: %timeit np.matmul(x, x)
491 ms ± 3.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: def xmul(a, b):
   ...:     out = np.empty_like(a)
   ...:     for j in range(a.shape[0]):
   ...:         out[j] = np.dot(a[j], b[j])
   ...:     return out
   ...: 

In [5]: %timeit xmul(x, x)
19.3 ms ± 4.21 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit np.einsum("...ij,...jk->...ik", x, x)
536 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Environment:
name: test
channels:
- conda-forge
- defaults
dependencies:
- appnope=0.1.0=py36_0
- blas=1.1=openblas
- ca-certificates=2018.1.18=0
- certifi=2018.1.18=py36_0
- decorator=4.2.1=py36_0
- ipython=6.2.1=py36_1
- ipython_genutils=0.2.0=py36_0
- jedi=0.11.1=py36_0
- libgfortran=3.0.0=0
- ncurses=5.9=10
- numpy=1.14.2=py36_blas_openblas_200
- openblas=0.2.20=7
- openssl=1.0.2n=0
- parso=0.1.1=py_0
- pexpect=4.4.0=py36_0
- pickleshare=0.7.4=py36_0
- prompt_toolkit=1.0.15=py36_0
- ptyprocess=0.5.2=py36_0
- pygments=2.2.0=py36_0
- python=3.6.5=0
- readline=7.0=0
- setuptools=39.0.1=py36_0
- simplegeneric=0.8.1=py36_0
- six=1.11.0=py36_1
- sqlite=3.20.1=2
- tk=8.6.7=0
- traitlets=4.3.2=py36_0
- wcwidth=0.1.7=py36_0
- xz=5.2.3=0
- zlib=1.2.11=0

Edit: Added explicit einsum test to compare with.

@jakirkham
Contributor

@dgasmith, do you have any idea why einsum doesn't appear to use BLAS in my example above? Is my formulation problematic for some reason?

@dgasmith
Contributor

@jakirkham Yes, the optimization algorithm does not currently support looped GEMM. It is unclear whether it should, as the speed increase compared to the overhead of detection varies greatly with the size of the problem. Happy to write this in; just be aware that overhead can become a problem as we add more special cases.
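
For anyone who wants to see what the optimizer decides for this expression, np.einsum_path prints the planned contraction (a small sketch using the shapes from the benchmark above):

import numpy as np

x = np.random.rand(5, 512, 512)

# Show the contraction path the optimizer would choose for this expression.
path, info = np.einsum_path("...ij,...jk->...ik", x, x, optimize="optimal")
print(info)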

@rgommers
Member

More discussion on the same topic in gh-8957 (closed as a duplicate of this issue).

@mattip
Member

mattip commented Jul 26, 2018

PR #11133 implements matmul as a true ufunc, replicating dot's use of cblas functions. The comments about reshaping an (n1, n2, n3) 3-D ndarray as an (n1*n2, n3) 2-D array still apply.
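
For illustration, a small sketch of that reshape trick for the case it applies to directly, i.e. when the right-hand operand is a single shared 2-D matrix rather than a stack:

import numpy as np

a = np.random.rand(5, 512, 512)   # (n1, n2, n3) stack
b = np.random.rand(512, 128)      # one shared 2-D right-hand matrix

n1, n2, n3 = a.shape
# Fold the batch dimension into the rows, do a single large GEMM, then unfold.
out = (a.reshape(n1 * n2, n3) @ b).reshape(n1, n2, b.shape[1])

assert np.allclose(out, a @ b)    # matmul broadcasts b across the stack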

@WarrenWeckesser
Member

Relevant question on Stack Overflow: MATLAB matrix multiplication performance is 5x faster than NumPy

@mattip
Member

mattip commented Dec 12, 2018

The microbenchmark now gives similar times for np.matmul and xmul.

Example:
In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.17.0.dev0+a3be55e'

In [3]: x = np.random.rand(5, 512, 512)

In [4]: %timeit np.matmul(x, x)
9.19 ms ± 77.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: def xmul(a, b):
   ...:     out = np.empty_like(a)
   ...:     for j in range(a.shape[0]):
   ...:         out[j] = np.dot(a[j], b[j])
   ...:     return out
   ...: 

In [6]: %timeit xmul(x, x)
9.78 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit np.einsum("...ij,...jk->...ik", x, x)
394 ms ± 2.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@mattip
Member

mattip commented Dec 13, 2018

@pv can we close this now that matmul uses cblas?

@gnool

gnool commented Dec 22, 2018

I tried 1.16rc and tested matmul on two matrices of shape (5000,4,4) and (5000,4,1), and found that in the new version matmul is 2-3x slower than in 1.15. For these really small matrices, is there an alternative to matmul that I can use?
On the other hand, for matrices of shape (5000,4,4) and (5000,4,4), the new version was 4x faster. Is this because matrix multiplication between (4,4) and (4,1) does not benefit much from BLAS, while (4,4) and (4,4) does?

@eric-wieser
Member

eric-wieser commented Dec 22, 2018

@gnool: The memory order of the matrices matters. Can you show how you allocated them, or at least their .strides and .dtype?
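
For example, something along these lines would show the relevant layout information (shapes taken from the earlier comment):

import numpy as np

x = np.random.rand(5000, 4, 4)
y = np.random.rand(5000, 4, 1)

for name, arr in (("x", x), ("y", y)):
    print(name, arr.dtype, arr.strides, arr.flags["C_CONTIGUOUS"])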

@gnool

gnool commented Dec 22, 2018

@eric-wieser I have checked that the strides, dtype and memory order as seen from .flags are the same in both versions.

Here's the newbie code that I use for both 1.16rc and 1.15:

import numpy as np
from timeit import default_timer as dt

x = np.random.rand(5000,4,4)
y = np.random.rand(5000,4,1)
start = dt()
for i in range(15000):
    z = np.matmul(x, y)
print(dt()-start)

@jakirkham
Contributor

If you run the code below, what does this say?

python -c "from numpy.distutils import __config__; print(__config__.blas_opt_info)"

@gnool

gnool commented Dec 22, 2018

@jakirkham

{'libraries': ['openblas', 'openblas'], 'library_dirs': ['/usr/local/lib'], 'language': 'c', 'define_macros': [('HAVE_CBLAS', None)]}

@jakirkham
Contributor

Generally BLAS is targeted at larger matrices. Many BLAS implementations, like OpenBLAS, have a notable startup cost, which is usually amortized over these larger matrices, though perhaps not in your use case.

Have you looked at BLIS? Users have noted that it works well for small matrices.

@njsmith
Member

njsmith commented Dec 22, 2018

It would be nice if OpenBLAS would handle small matrices better. It's certainly possible – they can see the size, and decide to do something small and simple if their normal heavy-weight setup isn't going to be worthwhile. I wouldn't hold my breath though; this has been a weakness of theirs for years.

There's some discussion of adding a standard batched GEMM interface to the next version of BLAS, but I wouldn't hold my breath on that either: https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdBDvtD5I14QHp9OE/edit#

If you need optimal speed for large stacks of small matrices on numpy right now, I'd try np.einsum (e.g. z = np.einsum("ink,ikm->inm", x, y)), or possibly the Anaconda builds of numpy that use MKL, to check whether MKL handles the small matrices better than OpenBLAS does.
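
A quick sanity check that the explicit einsum form matches matmul for the shapes from the earlier comment (just a sketch; the relative timings will depend on the BLAS in use):

import numpy as np

x = np.random.rand(5000, 4, 4)
y = np.random.rand(5000, 4, 1)

z_matmul = np.matmul(x, y)
z_einsum = np.einsum("ink,ikm->inm", x, y)

assert z_einsum.shape == (5000, 4, 1)
assert np.allclose(z_matmul, z_einsum)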

@jakirkham
Contributor

Do you know if there was an issue raised about optimizing OpenBLAS for smaller matrices?

@jakirkham
Contributor

As to the MKL point, you may be on to something. I ran across this article, which suggests MKL may do better for small matrices.

@njsmith
Member

njsmith commented Dec 22, 2018

Do you know if there was an issue raised about optimizing OpenBLAS for smaller matrices?

Not off the top of my head.

@gnool

gnool commented Dec 22, 2018

Thanks @jakirkham and @njsmith. I replaced matmul with einsum and now I get the same performance regardless of which version I use. Thanks, guys!

@njsmith
Member

njsmith commented Dec 22, 2018

Cool!

Anyway, this issue was originally about matmul not using BLAS, and now it does, so closing.

@njsmith njsmith closed this as completed Dec 22, 2018
ricardoV94 added a commit to ricardoV94/pytensor that referenced this issue Dec 16, 2023
…ions

This rewrite replaces a batched matmul with a core matmul by raveling the batched dimensions of the batched input, doing a 2D matmul, and then unraveling the batched dimensions of the output.

This sidesteps the Blockwise Dot operation and allows specialization into BLAS routines.

The idea was taken from these two discussions:
numpy/numpy#7569
numpy/numpy#8957
ricardoV94 added a commit to pymc-devs/pytensor that referenced this issue Dec 20, 2023