
numpy.matmul is slow #7569

Closed
pv opened this issue Apr 23, 2016 · 31 comments

@pv
Member

pv commented Apr 23, 2016

On numpy current master (da6e4c7), np.matmul is apparently not using BLAS:

>>> import numpy as np
>>> x = np.random.rand(5, 512, 512)
>>> %timeit np.matmul(x, x)
1 loops, best of 3: 526 ms per loop
>>> def xmul(a, b):
...     out = np.empty_like(a)
...     for j in range(a.shape[0]):
...         out[j] = np.dot(a[j], b[j])
...     return out
>>> %timeit xmul(x, x)
10 loops, best of 3: 28 ms per loop

Of course, it's a preliminary feature, but probably best to have an issue for this.

@djsutherland

The problem is that matmul just calls einsum for the stacked case. Setting it up to use BLAS for each matrix multiply would be useful.

@bordingj

Intel MKL supports ?gemm_batch

@ihincks

ihincks commented Jun 6, 2017

My current project's run time is bottlenecked by matmul. It's not too bad, but any idea how much faster it would be if it were properly interfaced to BLAS/MKL? I'm mostly multiplying (10000, 9, 9) by (10000, 9, 9).

@njsmith
Member

njsmith commented Jun 6, 2017

Doing lots of small (9x9) matrix multiplies is actually a case where using a fancy BLAS library might make you slower. The fancy BLAS routines win by carefully controlling which parts of your matrices are in cache when, so if your matrices fit entirely in cache there isn't much improvement possible, and the libraries can actually slow things down by trying to be clever and adding overhead.

That said, it's probably worth trying a manual loop like in the first comment in this thread and seeing how fast it goes in your case!
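
A quick way to check on your own shapes (a rough sketch; the exact numbers depend on your BLAS build and hardware, and the (10000, 9, 9) shapes are taken from the comment above):

import numpy as np
from timeit import timeit

a = np.random.rand(10000, 9, 9)
b = np.random.rand(10000, 9, 9)

def loop_dot(a, b):
    # One np.dot (BLAS) call per 9x9 pair, as in xmul from the first comment.
    out = np.empty_like(a)
    for j in range(a.shape[0]):
        out[j] = np.dot(a[j], b[j])
    return out

print("matmul  :", timeit(lambda: np.matmul(a, b), number=10))
print("dot loop:", timeit(lambda: loop_dot(a, b), number=10))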

@djsutherland

djsutherland commented Jun 6, 2017

I actually just tested it on my machine, calling MKL's cblas_dgemm_multiply through ctypes, and found that A @ B takes ~37ms, while the actual call to cblas_dgemm_multiply takes ~1.2ms - a 30x speedup! Unfortunately, naively turning that result (which is an array of pointers) into a Python structure takes ~183ms, so you'd have to speed that processing up if you wanted to use it. :)

https://gist.github.com/45d05765c6291e79489c8c31e1ca99ba
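
For reference, a minimal ctypes sketch of what such a batched call might look like (an assumption-laden illustration, not the code in the gist above: it assumes an LP64 MKL build exposed as libmkl_rt.so, C-contiguous float64 inputs, and the standard CBLAS enum values):

import ctypes
import numpy as np

mkl = ctypes.CDLL("libmkl_rt.so")  # adjust to however MKL is installed on your system

CBLAS_ROW_MAJOR = 101  # standard CBLAS enum values
CBLAS_NO_TRANS = 111

def batched_matmul_mkl(a, b):
    # Compute out[i] = a[i] @ b[i] with a single cblas_dgemm_batch call,
    # using one "group" that contains all `batch` identically-shaped products.
    batch, m, k = a.shape
    _, _, n = b.shape
    out = np.empty((batch, m, n), dtype=np.float64)

    dbl_p = ctypes.POINTER(ctypes.c_double)
    ptrs = lambda ary: (dbl_p * batch)(*(ary[i].ctypes.data_as(dbl_p) for i in range(batch)))
    ints = lambda *v: (ctypes.c_int * len(v))(*v)   # assumes MKL_INT == c_int (LP64)
    dbls = lambda *v: (ctypes.c_double * len(v))(*v)

    mkl.cblas_dgemm_batch(
        ctypes.c_int(CBLAS_ROW_MAJOR),
        ints(CBLAS_NO_TRANS), ints(CBLAS_NO_TRANS),  # transa/transb per group
        ints(m), ints(n), ints(k),                   # m/n/k per group
        dbls(1.0),                                   # alpha per group
        ptrs(a), ints(k),                            # A pointers, lda
        ptrs(b), ints(n),                            # B pointers, ldb
        dbls(0.0),                                   # beta per group
        ptrs(out), ints(n),                          # C pointers, ldc
        ctypes.c_int(1), ints(batch),                # group_count, group_size
    )
    return out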

@njsmith
Member

njsmith commented Jun 7, 2017

You mean MKL's proprietary cblas_dgemm_batch, right? Even sticking to ctypes, that could probably be sped up a lot by taking advantage of the fact that those arrays full of pointers could be fully constructed by calls to np.arange.

@djsutherland

Yep, that's what I was calling. Good point about arange, though I wasn't counting that time, only the actual function call.

@ihincks

ihincks commented Jun 7, 2017

Thanks for the tips! I will give things a whirl tomorrow.

@djsutherland

Two updates about my example code there:

  1. Reconstructing C was idiotic; there's already a C array. :)

  2. I tried to switch the pointer array construction to something faster:

import ctypes
import numpy as np

def pointer_array(ary):
    # Per-matrix data pointers; assumes `ary` is a C-contiguous float64 stack.
    # Note: `x` must stay alive while the returned pointer is in use.
    start = ary[0].ctypes.data
    step = ary[1].ctypes.data - start
    x = np.arange(start, start + step * len(ary), step, dtype=np.int64)
    return x.ctypes.data_as(ctypes.POINTER(ctypes.POINTER(ctypes.c_double)))

That seemed to work in terms of accessing things, but when I called the function it didn't actually change C. Not sure what's going on there.

@ihincks

ihincks commented Jun 7, 2017

Okay, I tried the first suggestion from @njsmith and replaced my matmul calls with the OP's xmul. This makes my project run about 50% faster. According to prun, about 95% of the runtime is now spent in xmul.

A @ B takes ~37ms, while the actual call to cblas_dgemm_multiply takes ~1.2ms - a 30x speedup!

This sounds like it could help me quite a bit if I keep everything in C arrays until the end of the algorithm.

@jakirkham
Contributor

If matmul is still using einsum, it is probably worth checking whether this is still slow, because einsum now uses BLAS routines for some cases where possible, thanks to PR (#9425). I should add that this is only on master, so it is not yet released, but it is scheduled for 1.14.0.
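
A quick way to see whether the new BLAS dispatch applies to this expression is to compare the default einsum call against the optimized path (a sketch; as discussed below, this particular batched contraction may not benefit):

import numpy as np

x = np.random.rand(5, 512, 512)

# The BLAS-backed paths from #9425 are only considered when the `optimize`
# keyword is enabled; the plain call uses the original C einsum loop.
r_plain = np.einsum("...ij,...jk->...ik", x, x)
r_opt = np.einsum("...ij,...jk->...ik", x, x, optimize=True)
assert np.allclose(r_plain, r_opt)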

@jakirkham
Contributor

jakirkham commented Mar 31, 2018

FTR, I retested this with NumPy 1.14.2 to see if it is still an issue given the changes to einsum, and it is still there. This uses @pv's original code.

Example:
In [1]: import numpy as np

In [2]: x = np.random.rand(5, 512, 512)

In [3]: %timeit np.matmul(x, x)
491 ms ± 3.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: def xmul(a, b):
   ...:     out = np.empty_like(a)
   ...:     for j in range(a.shape[0]):
   ...:         out[j] = np.dot(a[j], b[j])
   ...:     return out
   ...: 

In [5]: %timeit xmul(x, x)
19.3 ms ± 4.21 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit np.einsum("...ij,...jk->...ik", x, x)
536 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Environment:
name: test
channels:
- conda-forge
- defaults
dependencies:
- appnope=0.1.0=py36_0
- blas=1.1=openblas
- ca-certificates=2018.1.18=0
- certifi=2018.1.18=py36_0
- decorator=4.2.1=py36_0
- ipython=6.2.1=py36_1
- ipython_genutils=0.2.0=py36_0
- jedi=0.11.1=py36_0
- libgfortran=3.0.0=0
- ncurses=5.9=10
- numpy=1.14.2=py36_blas_openblas_200
- openblas=0.2.20=7
- openssl=1.0.2n=0
- parso=0.1.1=py_0
- pexpect=4.4.0=py36_0
- pickleshare=0.7.4=py36_0
- prompt_toolkit=1.0.15=py36_0
- ptyprocess=0.5.2=py36_0
- pygments=2.2.0=py36_0
- python=3.6.5=0
- readline=7.0=0
- setuptools=39.0.1=py36_0
- simplegeneric=0.8.1=py36_0
- six=1.11.0=py36_1
- sqlite=3.20.1=2
- tk=8.6.7=0
- traitlets=4.3.2=py36_0
- wcwidth=0.1.7=py36_0
- xz=5.2.3=0
- zlib=1.2.11=0

Edit: Added explicit einsum test to compare with.

@jakirkham
Contributor

@dgasmith, do you have any idea why einsum doesn't appear to use BLAS in my example above? Is my formulation problematic for some reason?

@dgasmith
Contributor

@jakirkham Yes, the optimization algorithm does not currently support looped GEMM. It is unclear whether it should, as the speed increase compared to the overhead of detection varies greatly with the size of the problem. Happy to write this in; just be aware that overhead can become a problem as we add more special cases.
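
For anyone who wants to see what the optimizer decides for this expression, np.einsum_path prints the planned contraction (a small sketch using the shapes from the benchmark above):

import numpy as np

x = np.random.rand(5, 512, 512)

# Show the contraction path the optimizer would choose for this expression.
path, info = np.einsum_path("...ij,...jk->...ik", x, x, optimize="optimal")
print(info)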

@rgommers
Member

More discussion on the same topic in gh-8957 (closed as a duplicate of this issue).

@mattip
Member

mattip commented Jul 26, 2018

PR #11133 implements matmul as a true ufunc, replicating dot's use of cblas functions. The comments about reshaping an (n1, n2, n3) 3-D ndarray as an (n1*n2, n3) 2-D array still apply.
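
For illustration, a small sketch of that reshape trick for the case it applies to directly, i.e. when the right-hand operand is a single shared 2-D matrix rather than a stack:

import numpy as np

a = np.random.rand(5, 512, 512)   # (n1, n2, n3) stack
b = np.random.rand(512, 128)      # one shared 2-D right-hand matrix

n1, n2, n3 = a.shape
# Fold the batch dimension into the rows, do a single large GEMM, then unfold.
out = (a.reshape(n1 * n2, n3) @ b).reshape(n1, n2, b.shape[1])

assert np.allclose(out, a @ b)    # matmul broadcasts b across the stack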

@WarrenWeckesser
Member

Relevant question on Stack Overflow: MATLAB matrix multiplication performance is 5x faster than NumPy

@mattip
Member

mattip commented Dec 12, 2018

The microbenchmark now gives similar times for np.matmul and xmul.

Example:
In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.17.0.dev0+a3be55e'

In [3]: x = np.random.rand(5, 512, 512)

In [4]: %timeit np.matmul(x, x)
9.19 ms ± 77.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: def xmul(a, b):
   ...:     out = np.empty_like(a)
   ...:     for j in range(a.shape[0]):
   ...:         out[j] = np.dot(a[j], b[j])
   ...:     return out
   ...: 

In [6]: %timeit xmul(x, x)
9.78 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit np.einsum("...ij,...jk->...ik", x, x)
394 ms ± 2.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@mattip
Member

mattip commented Dec 13, 2018

@pv can we close this now that matmul uses cblas?

@gnool

gnool commented Dec 22, 2018

I tried 1.16rc and tested matmul on two matrices of shape (5000,4,4) and (5000,4,1), and found that in the new version matmul is 2-3x slower than in 1.15. For these really small matrices, is there an alternative to matmul that I can use?
On the other hand, for matrices of shape (5000,4,4) and (5000,4,4), the new version was 4x faster. Is this because matrix multiplication between (4,4) and (4,1) does not benefit much from BLAS, while (4,4) and (4,4) does?

@eric-wieser
Member

eric-wieser commented Dec 22, 2018

@gnool: The memory order of the matrices matters. Can you show how you allocated them, or at least their .strides and .dtype?
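
For example, something along these lines would show the relevant layout information (shapes taken from the earlier comment):

import numpy as np

x = np.random.rand(5000, 4, 4)
y = np.random.rand(5000, 4, 1)

for name, arr in (("x", x), ("y", y)):
    print(name, arr.dtype, arr.strides, arr.flags["C_CONTIGUOUS"])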

@gnool

gnool commented Dec 22, 2018

@eric-wieser I have checked that the strides, dtype and memory order as seen from .flags are the same in both versions.

Here's the newbie code that I use for both 1.16rc and 1.15:

import numpy as np
from timeit import default_timer as dt

x = np.random.rand(5000,4,4)
y = np.random.rand(5000,4,1)
start = dt()
for i in range(15000):
    z = np.matmul(x, y)
print(dt()-start)

@jakirkham
Contributor

If you run the code below, what does this say?

python -c "from numpy.distutils import __config__; print(__config__.blas_opt_info)"

@gnool

gnool commented Dec 22, 2018

@jakirkham

{'libraries': ['openblas', 'openblas'], 'library_dirs': ['/usr/local/lib'], 'language': 'c', 'define_macros': [('HAVE_CBLAS', None)]}

@jakirkham
Contributor

Generally BLAS is targeted at larger matrices. Many BLAS implementations, like OpenBLAS, have a notable startup cost, which is usually amortized over these larger matrices, though perhaps not in your use case.

Have you looked at BLIS? Users have noted that it works well for small matrices.

@njsmith
Member

njsmith commented Dec 22, 2018

It would be nice if OpenBLAS would handle small matrices better. It's certainly possible – they can see the size, and decide to do something small and simple if their normal heavy-weight setup isn't going to be worthwhile. I wouldn't hold my breath though; this has been a weakness of theirs for years.

There's some discussion of adding a standard batched GEMM interface to the next version of BLAS, but I wouldn't hold my breath on that either: https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdBDvtD5I14QHp9OE/edit#

If you need optimal speed for large stacks of small matrices on numpy right now, I'd try np.einsum (e.g. z = np.einsum("ink,ikm->inm", x, y)), or possibly the Anaconda builds of numpy that use MKL, to check whether MKL handles the small matrices better than OpenBLAS does.
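
A quick sanity check that the explicit einsum form matches matmul for the shapes from the earlier comment (just a sketch; the relative timings will depend on the BLAS in use):

import numpy as np

x = np.random.rand(5000, 4, 4)
y = np.random.rand(5000, 4, 1)

z_matmul = np.matmul(x, y)
z_einsum = np.einsum("ink,ikm->inm", x, y)

assert z_einsum.shape == (5000, 4, 1)
assert np.allclose(z_matmul, z_einsum)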

@jakirkham
Contributor

Do you know if there was an issue raised about optimizing OpenBLAS for smaller matrices?

@jakirkham
Contributor

As to the MKL point, you may be on to something. I ran across this article, which suggests MKL may do better for small matrices.

@njsmith
Member

njsmith commented Dec 22, 2018

Do you know if there was an issue raised about optimizing OpenBLAS for smaller matrices?

Not off the top of my head.

@gnool

gnool commented Dec 22, 2018

Thanks @jakirkham and @njsmith. I replaced matmul with einsum and now I get the same performance regardless of which version I use. Thanks, guys!

@njsmith
Member

njsmith commented Dec 22, 2018

Cool!

Anyway, this issue was originally about matmul not using BLAS, and now it does, so closing.

@njsmith njsmith closed this as completed Dec 22, 2018
ricardoV94 added a commit to ricardoV94/pytensor that referenced this issue Dec 16, 2023
…ions

This rewrite replaces a batched matmul with a core matmul by raveling the batched dimensions of the batched input, doing a 2D matmul, and then unraveling the batched dimensions of the output.

This sidesteps the Blockwise Dot operation and allows specialization into BLAS routines.

The idea was taken from these two discussions:
numpy/numpy#7569
numpy/numpy#8957
ricardoV94 added a commit to pymc-devs/pytensor that referenced this issue Dec 20, 2023