numpy.matmul is slow #7569
Comments
The problem is that
Intel MKL supports ?gemm_batch
My current project's run time is bottlenecked by
Doing lots of small (9x9) matrix multiplies is actually a case where using a fancy BLAS library might make you slower. The fancy BLAS stuff wins by carefully controlling which parts of your matrices are in cache when, so if your entire matrices fit in cache then there isn't much improvement possible, and the library can actually slow things down by trying to be clever and adding overhead. That said, it's probably worth trying a manual loop like in the first comment in this thread and seeing how fast it goes in your case!
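For illustration, a manual loop of the kind referred to above might look like the sketch below (the 9x9 stack size is taken from this comment; everything else is an assumption):

```python
import numpy as np

# A plain Python loop over np.dot, one small 9x9 multiply at a time.
a = np.random.rand(1000, 9, 9)
b = np.random.rand(1000, 9, 9)

out = np.empty_like(a)
for i in range(a.shape[0]):
    out[i] = np.dot(a[i], b[i])

# Should agree with the batched call.
assert np.allclose(out, np.matmul(a, b))
```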
I actually just tested it on my machine, calling MKL
You mean MKL's proprietary
Yep, that's what I was calling. Good point about
Thanks for the tips! I will give things a whirl tomorrow.
Two updates about my example code there:
```python
import ctypes
import numpy as np

def pointer_array(ary):
    # Build an array of pointers to each matrix in the stack (assumes a uniform stride).
    start = ary[0].ctypes.data
    step = ary[1].ctypes.data - start
    x = np.arange(start, start + step * len(ary), step, dtype=np.int64)
    return x.ctypes.data_as(ctypes.POINTER(ctypes.POINTER(ctypes.c_double)))
```

That seemed to work in terms of accessing things, but when I called the function it didn't actually change
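For context, below is a hedged sketch of how a pointer array like this might be passed to MKL's `cblas_dgemm_batch` through ctypes; the library name, LP64 integer width, row-major layout, and square-matrix shapes are all assumptions here, not details confirmed in this thread.

```python
import ctypes
import numpy as np

# Assumption: the single-dynamic-library MKL runtime is on the loader path.
mkl = ctypes.CDLL("libmkl_rt.so")

CBLAS_ROW_MAJOR = 101  # CblasRowMajor
CBLAS_NO_TRANS = 111   # CblasNoTrans

def pointer_array(ary):
    # Array of pointers to the start of each matrix in the stack (uniform stride assumed).
    start = ary[0].ctypes.data
    step = ary[1].ctypes.data - start
    ptrs = np.arange(start, start + step * len(ary), step, dtype=np.intp)
    return ptrs.ctypes.data_as(ctypes.POINTER(ctypes.POINTER(ctypes.c_double)))

def dgemm_batch(a, b, out):
    # One group of len(a) identical n x n products: out[i] = a[i] @ b[i].
    n = a.shape[-1]
    ints = ctypes.c_int * 1
    dbls = ctypes.c_double * 1
    mkl.cblas_dgemm_batch(
        ctypes.c_int(CBLAS_ROW_MAJOR),
        ints(CBLAS_NO_TRANS), ints(CBLAS_NO_TRANS),
        ints(n), ints(n), ints(n),
        dbls(1.0),
        pointer_array(a), ints(n),
        pointer_array(b), ints(n),
        dbls(0.0),
        pointer_array(out), ints(n),
        ctypes.c_int(1),   # group_count
        ints(len(a)),      # group_size for the single group
    )

a = np.ascontiguousarray(np.random.rand(1000, 9, 9))
b = np.ascontiguousarray(np.random.rand(1000, 9, 9))
out = np.empty_like(a)
dgemm_batch(a, b, out)
```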
Okay, I tried the first suggestion from @njsmith, and replaced my
This sounds like it could help me quite a bit if I keep everything in C arrays until the end of the algorithm.
If
FTR retested this with NumPy 1.14.2 to see if this is still an issue given the changes to

Example:

```
In [1]: import numpy as np

In [2]: x = np.random.rand(5, 512, 512)

In [3]: %timeit np.matmul(x, x)
491 ms ± 3.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: def xmul(a, b):
   ...:     out = np.empty_like(a)
   ...:     for j in range(a.shape[0]):
   ...:         out[j] = np.dot(a[j], b[j])
   ...:     return out
   ...:

In [5]: %timeit xmul(x, x)
19.3 ms ± 4.21 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit np.einsum("...ij,...jk->...ik", x, x)
536 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Environment:

```yaml
name: test
channels:
- conda-forge
- defaults
dependencies:
- appnope=0.1.0=py36_0
- blas=1.1=openblas
- ca-certificates=2018.1.18=0
- certifi=2018.1.18=py36_0
- decorator=4.2.1=py36_0
- ipython=6.2.1=py36_1
- ipython_genutils=0.2.0=py36_0
- jedi=0.11.1=py36_0
- libgfortran=3.0.0=0
- ncurses=5.9=10
- numpy=1.14.2=py36_blas_openblas_200
- openblas=0.2.20=7
- openssl=1.0.2n=0
- parso=0.1.1=py_0
- pexpect=4.4.0=py36_0
- pickleshare=0.7.4=py36_0
- prompt_toolkit=1.0.15=py36_0
- ptyprocess=0.5.2=py36_0
- pygments=2.2.0=py36_0
- python=3.6.5=0
- readline=7.0=0
- setuptools=39.0.1=py36_0
- simplegeneric=0.8.1=py36_0
- six=1.11.0=py36_1
- sqlite=3.20.1=2
- tk=8.6.7=0
- traitlets=4.3.2=py36_0
- wcwidth=0.1.7=py36_0
- xz=5.2.3=0
- zlib=1.2.11=0
```

Edit: Added explicit
@dgasmith, do you have any idea why
@jakirkham Yes, the optimization algorithm does not currently support looped GEMM. It is unclear whether it should, since the speed increase relative to the overhead of detection varies greatly with the size of the problem. Happy to write this in; just be aware that overhead can become a problem as we add more special cases.
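For reference, the contraction path that einsum's optimizer chooses can be inspected with `np.einsum_path`; a minimal check, reusing the arrays from the benchmark above, might look like this:

```python
import numpy as np

x = np.random.rand(5, 512, 512)

# Print the contraction order and estimated cost the optimizer would use
# for the batched matrix product.
path, info = np.einsum_path("...ij,...jk->...ik", x, x, optimize="optimal")
print(info)
```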
More discussion on the same topic in gh-8957 (closed as a duplicate of this issue).
PR #11133 implements matmul as a true ufunc, replicating dot's use of cblas functions. The comments about stacking a
Relevant question on Stack Overflow: MATLAB matrix multiplication performance is 5x faster than NumPy
The microbenchmark now gives similar times for

Example:

```
In [1]: import numpy as np

In [2]: np.__version__

In [3]: x = np.random.rand(5, 512, 512)

In [4]: %timeit np.matmul(x, x)

In [5]: def xmul(a, b):
   ...:

In [6]: %timeit xmul(x, x)

In [7]: %timeit np.einsum("...ij,...jk->...ik", x, x)
```
@pv can we close this now that matmul uses cblas? |
I tried 1.16rc and tested matmul on two matrices of shape (5000,4,4) and (5000,4,1), and found that in the new version matmul is 2-3x slower than in 1.15. For these really small matrices, is there an alternative to matmul that I can use?
@gnool: The memory order of the matrices matters. Can you show how you allocated them, or at least their
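For illustration, a minimal way to report the allocation details being asked about, using the shapes from the previous comment, might be:

```python
import numpy as np

a = np.random.rand(5000, 4, 4)
b = np.random.rand(5000, 4, 1)

# Show memory layout and strides; np.ascontiguousarray forces C order if needed.
print(a.flags["C_CONTIGUOUS"], a.strides, a.dtype)
a = np.ascontiguousarray(a)
b = np.ascontiguousarray(b)
out = np.matmul(a, b)  # shape (5000, 4, 1)
```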
@eric-wieser I have checked that the strides, dtype, and memory order as seen from .flags are the same in both versions. Here's the newbie code that I use for both 1.16rc and 1.15:
If you run the command below, what does it say?

```
python -c "from numpy.distutils import __config__; print(__config__.blas_opt_info)"
```
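As an aside, similar linkage information can be printed from within Python:

```python
import numpy as np

# Show which BLAS/LAPACK libraries NumPy was built against.
np.show_config()
```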
Generally BLAS is targeted at larger matrices. Many BLASes like OpenBLAS have a notable startup cost, which is usually amortized over these larger matrices, though perhaps not for your use case. Have you looked at BLIS? Users have noted it works well for small matrices.
It would be nice if OpenBLAS would handle small matrices better. It's certainly possible – they can see the size, and decide to do something small and simple if their normal heavy-weight setup isn't going to be worthwhile. I wouldn't hold my breath though; this has been a weakness of theirs for years.

There's some discussion of adding a standard batched GEMM interface to the next version of BLAS, but I wouldn't hold my breath on that either: https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdBDvtD5I14QHp9OE/edit#

If you need optimal speed for large stacks of small matrices on numpy right now, I'd try
Do you know if there was an issue raised about optimizing OpenBLAS for smaller matrices?
As to the MKL point, you may be on to something. I ran across this article, which suggested MKL may do better for small matrices.
Not off the top of my head.
Thanks @jakirkham and @njsmith. I replaced matmul with einsum and now I get the same performance regardless of which version I use. Thanks guys!
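For reference, the einsum replacement described here, with the shapes from the earlier comment, might look like the following (the exact subscripts used are an assumption):

```python
import numpy as np

a = np.random.rand(5000, 4, 4)
b = np.random.rand(5000, 4, 1)

# Batched matrix product expressed as an einsum, as a drop-in for np.matmul(a, b).
out = np.einsum("bij,bjk->bik", a, b)
assert np.allclose(out, np.matmul(a, b))
```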
Cool! Anyway, this issue was originally about
…ions This rewrite replaces a batched matmul with a core matmul by raveling the batched dimensions of the input, doing a 2D matmul, and then unraveling the batched dimensions of the output. This sidesteps the Blockwise Dot operation and allows specialization into BLAS routines. The idea was taken from these two discussions: numpy/numpy#7569 numpy/numpy#8957
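As a rough sketch of the ravel/matmul/unravel idea described in that rewrite (assuming only the left operand carries batch dimensions):

```python
import numpy as np

def batched_matmul_via_2d(a, b):
    # a has shape (..., n, k) with arbitrary batch dimensions; b has shape (k, m)
    # and is shared across the batch. Ravel the batch (and row) dimensions into
    # a single 2D GEMM, then unravel the result.
    *batch, n, k = a.shape
    out2d = a.reshape(-1, k) @ b
    return out2d.reshape(*batch, n, b.shape[-1])

a = np.random.rand(5, 7, 4, 3)
b = np.random.rand(3, 6)
assert np.allclose(batched_matmul_via_2d(a, b), a @ b)
```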
On numpy current master (da6e4c7), np.matmul is apparently not using BLAS:
Of course, it's a preliminary feature, but probably best to have an issue for this.