numpy.matmul speedup for multidimensional arrays #8957
Comments
I assume you mean
I was under the impression that we already do what is described there.
There's a tradeoff you're making there - you're making a copy and using more memory. However, there are some cases where there is no tradeoff to make - if all the iteration axes are contiguous (used in the loosest sense of having no gaps), then no copy is needed.
Neither matmul nor dot handles stacked arrays efficiently for the types for which BLAS is available; in fact, BLAS is only called when the arguments are unstacked arrays. Clever reshaping and loops could do much for those types. BTW, this means that matmul should not be a simple ufunc, although it might call a ufunc at the bottom layer.
Alternatively, this could also be seen as suggesting that a mechanism to allow specializing ufunc implementations for performance might be sensible.
I think the way to do it would be to reshape so that the largest possible matrices will be involved in the matrix multiplications, call the ufunc, and then reshape the result. If we only implement matrix-matrix multiplication we will want to do that in any case in order to handle vectors.
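The reshape-then-single-GEMM idea described above can be sketched roughly as follows, for the common case of a stack of matrices times a single 2-D matrix. `stacked_matmul_via_reshape` is a hypothetical helper for illustration, not a NumPy function:

```python
import numpy as np

def stacked_matmul_via_reshape(a, b):
    """Multiply a stack of matrices a (shape (..., m, k)) by a single
    matrix b (shape (k, n)) using one 2-D matmul instead of a loop."""
    k = a.shape[-1]
    # Collapse all batch dimensions plus the row dimension into one axis,
    # so a single GEMM covers the whole stack (copying only if needed).
    a2d = np.ascontiguousarray(a).reshape(-1, k)
    res2d = a2d @ b
    # Restore the batch dimensions on the result.
    return res2d.reshape(a.shape[:-1] + (b.shape[-1],))

x = np.random.rand(4, 5, 3, 2)
w = np.random.rand(2, 6)
out = stacked_matmul_via_reshape(x, w)
assert out.shape == (4, 5, 3, 6)
assert np.allclose(out, np.matmul(x, w))
```

This only covers the stacked-times-single case; general stacked-times-stacked multiplication cannot be collapsed into one GEMM this way.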
@charris: Do we always want to do that reshape, or only if we can do so without a copy? On a related note, there are cases where reshape makes a copy where here we wouldn't need to.
We make copies if needed for BLAS anyway. BLAS has stride limitations for matrix multiplications.
Right, but we do so for each matrix in turn, not for the whole array at once - right?
We only call BLAS for unstacked arrays and copy the whole arrays when needed.
Ah, good point. Re: "there are cases where reshape makes a copy where here we wouldn't need to" - sorting the strides would work:

```python
# Put the batch axes in descending-stride order; if they have no gaps,
# the transposed view is C-contiguous and the reshape below is free.
order = np.argsort(-np.array(arr.strides[:-2]))
arr_contig_ish = arr.transpose(tuple(order) + (-2, -1))
if arr_contig_ish.flags.c_contiguous:
    # reshape is free - collapse everything but the last axis into one
    arr_2d = arr_contig_ish.reshape(-1, arr.shape[-1])
    res_2d = do_the_thing(arr_2d)
    res = undo_the_flatten(res_2d, order)
```

(Here `do_the_thing` and `undo_the_flatten` stand in for the single 2-D BLAS call and the inverse reshape/transpose.)
Basically, BLAS only handles contiguous arrays, although the arrays can be transposed; at least one of the dimensions needs to have stride equal to the itemsize, and probably aligned at that.
@charris: for regular ufunc loops, the core function gets a whole array of scalars and the generic machinery attempts to pass in larger arrays when possible. It seems like a potential solution to this while keeping matmul as a gufunc would be if gufunc core functions similarly got a 1d array of core-dimension-sized arrays? (This would also probably be more efficient in general in cases where the core dimensions are small.) Maybe it already works that way, even?
That's how it works for gufuncs also; e.g. in numpy.linalg, allocated work arrays are reused for the inner loop.
The iterator can unroll and reorder the outer loop, but of course not the inner loop or mixed solutions.
Thank you all for your speedy thoughts, and for working on numpy (I am a new user)!
It seems that tensordot addresses your concern regarding the "automatic reshaping". As far as this automatic reshaping is concerned, I encourage you to have a look at one of our publications on tensor contractions, where we compare different approaches to one another and highlight their strengths and weaknesses: https://arxiv.org/abs/1607.00145

As was already correctly pointed out, the reshaping approach (or Transpose-Transpose-GEMM-Transpose, TTGT) has the disadvantage that it requires additional memory. Moreover, this approach is not well suited for "bandwidth-bound" tensor contractions (see paper), since the transpositions of the tensors before and after the GEMM account for pure overhead that could be avoided (see the GETT section).

Also, if one chooses to follow the TTGT path for tensor contractions, it is of critical importance that the tensor transpositions are performed as efficiently as possible. To this end, I would like to point you to this open-source library for tensor transpositions: https://github.com/springer13/hptt
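For reference, np.tensordot already follows the TTGT-style path internally (transpose, reshape to 2-D, one dot, reshape back). A quick check with illustrative shapes:

```python
import numpy as np

a = np.random.rand(4, 3, 5)
b = np.random.rand(5, 2)

# Contract the last axis of `a` with the first axis of `b`. Internally,
# tensordot transposes and reshapes both operands to 2-D and makes a
# single dot (GEMM) call, then reshapes the result.
out = np.tensordot(a, b, axes=([2], [0]))
assert out.shape == (4, 3, 2)
assert np.allclose(out, a @ b)
```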
@springer13: The "reshaping" optimization being discussed here is the observation that in many cases, you can implement a stacked matrix multiply (which I guess is a special case of tensor contractions) by doing a single call to GEMM with no copies or other overhead, but apparently numpy is instead doing a loop and calling GEMM many times. The "reshape" would be to keep the same storage buffer, but switch from viewing it as a stack of matrices to viewing it as one large 2-D matrix.
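A small sketch of that observation with made-up shapes: for a C-contiguous stack, re-viewing the buffer as one tall 2-D matrix is free, and `base` confirms no copy was made:

```python
import numpy as np

a = np.random.rand(4, 3, 5)   # stack of four 3x5 matrices
b = np.random.rand(5, 2)

# Re-view the (4, 3, 5) stack as a single (12, 5) matrix. Because the
# buffer is C-contiguous, this is a view on the same storage, not a copy.
a2d = a.reshape(12, 5)
assert a2d.base is a          # reshape returned a view

# One GEMM call now computes all four products at once.
res = (a2d @ b).reshape(4, 3, 2)
assert np.allclose(res, np.matmul(a, b))
```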
I could be totally wrong here, but this looks like a duplicate of issue ( #7569 ). |
Looks like it. Closing as duplicate, and will mention on that issue that there's interesting content here. |
…ions This rewrite replaces a batched matmul with a core matmul by raveling the batched dimensions of the batched input, doing a 2-D matmul, and then unraveling the batched dimensions of the output. This sidesteps the Blockwise Dot operation and allows specialization into BLAS routines. The idea was taken from these two discussions: numpy/numpy#7569 numpy/numpy#8957
I encountered a curious performance issue in numpy.dot multiplication of an N-dimensional array with a 2-dimensional array. I consistently found it to be a factor of 15-20 faster to first reshape the arrays to 2-dimensional arrays, do the multiplication on the reshaped arrays, and then reshape the result back.
I'm including my code at the end of this message. It appears that the same observation was made by hpaulj and buried in a StackOverflow question: http://stackoverflow.com/questions/33004551/why-is-b-numpy-dota-x-so-much-slower-looping-through-doing-bi-numpy. I would love for numpy.dot to do the reshaping automatically; it would be much more user-friendly.
Here is a related post: https://www.huyng.com/posts/faster-numpy-dot-product. I believe it is better to invoke BLAS through reshaping than the solution proposed there.
Thanks!
(tested on: numpy 1.12.1 on anaconda with python 3.6.0, mac on intel)
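The benchmark code referenced above did not survive extraction; a minimal reproduction along the same lines (illustrative shapes, not the original script) might look like:

```python
import numpy as np
from timeit import timeit

a = np.random.rand(30, 40, 50)   # N-dimensional operand
b = np.random.rand(50, 60)       # 2-dimensional operand

def dot_direct():
    return np.dot(a, b)

def dot_reshaped():
    # Collapse the leading axes, do one 2-D GEMM, then restore the shape.
    return np.dot(a.reshape(-1, 50), b).reshape(30, 40, 60)

# Both paths compute the same result.
assert np.allclose(dot_direct(), dot_reshaped())

t_direct = timeit(dot_direct, number=50)
t_reshaped = timeit(dot_reshaped, number=50)
print(f"direct: {t_direct:.4f}s  reshaped: {t_reshaped:.4f}s")
```

The reported speedup factor will vary with shapes, the BLAS backend, and the NumPy version.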