
numpy.matmul speedup for multidimensional arrays #8957

Closed
tondieker opened this issue Apr 18, 2017 · 19 comments
@tondieker

I encountered a curious performance issue in numpy.dot multiplication of an N-dimensional array with a 2-dimensional array. I consistently found it to be a factor of 15-20 faster to first reshape the arrays to 2-dimensional arrays, do the multiplication on the reshaped arrays, and then reshape the result back.

I'm including my code at the end of this message. It appears that the same observation has been made by hpaulj and buried in a StackOverflow question: http://stackoverflow.com/questions/33004551/why-is-b-numpy-dota-x-so-much-slower-looping-through-doing-bi-numpy. I would love to have numpy.dot do the reshaping automatically; it would be much more user-friendly.

Here is a related post: https://www.huyng.com/posts/faster-numpy-dot-product. I believe invoking BLAS through reshaping is better than the solution proposed there.

Thanks!

(tested on: numpy 1.12.1 on Anaconda with Python 3.6.0, Mac on Intel)

import numpy as np

n1 = 400
n2 = 50
n3 = 100
n4 = 200

A = np.random.random((n1,n2,n3))
B = np.random.random((n3,n4))

direct = A @ B # result has shape (n1,n2,n4)
indirect = (A.reshape((n1*n2,n3)) @ B).reshape((n1,n2,n4))

np.allclose(direct,indirect) # True

%timeit A @ B
#1 loop, best of 3: 195 ms per loop
%timeit (A.reshape((n1*n2,n3)) @ B).reshape((n1,n2,n4))
#10 loops, best of 3: 27.7 ms per loop


@eric-wieser
Member

I assume you mean np.matmul, not np.dot?

@eric-wieser
Member

eric-wieser commented Apr 18, 2017

Here is a related post

I was under the impression that we already do what is described there. However, it still seems that there is a 4x overhead in np.matmul over fastdot, so perhaps more investigation is needed. (Edit: I was compiling numpy without cblas.)

I believe it is better to invoke BLAS through reshaping than the solution proposed there.

There's a tradeoff you're making there - you're making a copy and using more memory.

However, there are some cases where there is no tradeoff to make - if all the iteration axes are contiguous (used in the loosest sense of having no gaps), then no copy is needed.
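
As an illustration of that case (a minimal sketch with arbitrary shapes, not numpy's internal code path): when the iteration axes have no gaps, collapsing them is a pure view, whereas a stack with gaps (here produced by slicing) forces reshape to copy.

import numpy as np

A = np.random.random((8, 6, 30, 40))                 # two iteration axes in front of a (30, 40) matrix
A_flat = A.reshape((-1,) + A.shape[-2:])             # collapse the iteration axes: (48, 30, 40)
print(np.shares_memory(A, A_flat))                   # True: the reshape is a view, no copy made

A_gappy = A[:, ::2]                                  # iteration axes now have gaps
A_gappy_flat = A_gappy.reshape((-1,) + A.shape[-2:])
print(np.shares_memory(A_gappy, A_gappy_flat))       # False: reshape had to copy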

@charris
Member

charris commented Apr 18, 2017

Neither matmul nor dot handles stacked arrays efficiently for those types for which blas is available; in fact, blas is only called when the arguments are unstacked arrays. Clever reshaping and loops could do much for those types. BTW, this means that matmul should not be a simple ufunc, although it might call a ufunc at the bottom layer.

@eric-wieser
Member

BTW, this means that matmul should not be a simple ufunc, although it might call a ufunc at the bottom layer.

Alternatively, this could also be seen as suggesting that a mechanism to allow specializing ufunc implementation for performance might be sensible.

@charris
Member

charris commented Apr 18, 2017

I think the way to do it would be to reshape so that the largest possible matrices are involved in the matrix multiplications, call the ufunc, and then reshape the result. If we only implement matrix-matrix multiplication, we will want to do that in any case in order to handle vectors.
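
For example (a sketch of just the vector case; A and v are arbitrary): a 1-D operand can be handled by viewing it as a one-column matrix, doing the matrix-matrix product, and dropping the appended axis from the result, which is what matmul's documented promotion rule amounts to.

import numpy as np

A = np.random.random((5, 3, 4))
v = np.random.random(4)                        # 1-D right operand

# Promote v to a one-column matrix, multiply, then drop the appended axis.
via_matrix = (A @ v[:, np.newaxis])[..., 0]    # shape (5, 3)
print(np.allclose(via_matrix, A @ v))          # True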

@eric-wieser
Member

eric-wieser commented Apr 18, 2017

@charris: Do we always want to do that reshape, or only if we can do so without a copy?

On a related note, there are cases where reshape makes a copy that we wouldn't need here - in this case, we don't care at all what order the "stack" dimensions are flattened in, as long as we can unflatten them in the same order.

@charris
Member

charris commented Apr 18, 2017

We make copies if needed for blas anyway. Blas has stride limitations for matrix multiplications.

@eric-wieser
Member

eric-wieser commented Apr 18, 2017

Right, but we do so for each matrix in turn, not for the whole array at once - right?

@charris
Member

charris commented Apr 18, 2017

We only call blas for unstacked arrays and copy the whole arrays when needed.

@eric-wieser
Member

eric-wieser commented Apr 18, 2017

Ah, good point.

Re: "there are cases where reshape makes a copy where here we wouldn't need to" - sorting the strides would work:

order = np.argsort(-np.array(arr.strides[:-2]))                  # stack axes, sorted by decreasing stride
arr_contig_ish = arr.transpose(tuple(order) + (arr.ndim - 2, arr.ndim - 1))

if arr_contig_ish.flags.c_contiguous:
    # reshape is free: collapse all stack axes into one
    arr_flat = arr_contig_ish.reshape((-1,) + arr_contig_ish.shape[-2:])

    res_flat = do_the_thing(arr_flat)          # placeholder for the core matmul

    res = undo_the_flatten(res_flat, order)    # placeholder: restore the stack shape and order

@charris
Member

charris commented Apr 18, 2017

Basically, blas only handles contiguous arrays. The arrays can be transposed, but at least one of the dimensions needs to have a stride equal to the itemsize, and the data probably needs to be aligned as well.
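
A rough sketch of that constraint (blas_compatible_2d is a hypothetical helper, not a numpy function): a 2-D operand can be handed to GEMM without copying only if the stride of one of its axes equals the itemsize, i.e. it is contiguous along rows or along columns.

import numpy as np

def blas_compatible_2d(a):
    # Hypothetical check: one axis must step by exactly one element in memory.
    return a.ndim == 2 and a.itemsize in (a.strides[0], a.strides[1])

a = np.ones((100, 50))
print(blas_compatible_2d(a))          # True: C-contiguous
print(blas_compatible_2d(a.T))        # True: the transposed view is Fortran-contiguous
print(blas_compatible_2d(a[:, ::2]))  # False: both strides exceed the itemsize, a copy is needed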

@njsmith
Member

njsmith commented Apr 19, 2017

@charris: for regular ufunc loops, the core function gets a whole array of scalars and the generic machinery attempts to pass in larger arrays when possible. It seems like a potential solution to this while keeping matmul as a gufunc would be if gufunc core functions similarly got a 1d array of core-dimension-sized arrays? (This would also probably be more efficient in general in cases where the core dimensions are small.) Maybe it already works that way, even?

@pv
Member

pv commented Apr 19, 2017

That's how it works for gufuncs also --- e.g. in numpy.linalg allocated work arrays are reused for the inner loop.
I'm not sure if the iterator unrolls outer contiguous dimensions, but I'd suppose so (without checking)...

@seberg
Member

seberg commented Apr 19, 2017

The iterator should unroll and reorder the outer loop, but of course not the inner loop or mixed solutions.

@tondieker
Author

Thank you all for your speedy thoughts, and for working on numpy (I am a new user)!
I just want to add to this discussion that there may be similar issues for numpy.tensordot; if so, it would make sense to update that function as well.

@springer13

It seems that tensordot addresses your concern regarding the "automatic reshaping".

As far as this automatic reshaping is concerned, I encourage you to have a look at one of our publications on tensor contractions, in which we compare different approaches to one another and highlight their strengths and weaknesses: https://arxiv.org/abs/1607.00145

As was already correctly pointed out, the reshaping approach (or Transpose-Transpose-GEMM-Transpose, TTGT) has the disadvantage that it requires additional memory. Moreover, this approach is not well suited for "bandwidth-bound" tensor contractions (see the paper), since the transpositions of the tensors before and after the GEMM are pure overhead that could be avoided (see the GETT section).

Also, if one chooses to follow the TTGT path for tensor contractions, it is of critical importance that the tensor transpositions are performed as efficiently as possible. To this end, I would like to point you to this open-source library for tensor transpositions: https://github.com/springer13/hptt

@njsmith
Member

njsmith commented May 19, 2017

@springer13: The "reshaping" optimization being discussed here is the observation that in many cases you can implement a stacked matrix multiply (which I guess is a special case of tensor contraction) with a single call to GEMM, with no copies or other overhead, but apparently numpy is instead doing a loop and calling GEMM many times. The "reshape" would be to keep the same storage buffer, but switch from viewing it as an (n1, n2, n3)-shaped array to viewing it as an (n1 * n2, n3)-shaped array.
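
Concretely (a minimal sketch, not numpy's internal code path): the 2-D view shares the original buffer, and one GEMM over it matches a Python loop of per-matrix products.

import numpy as np

A = np.random.random((400, 50, 100))             # stacked left operand
B = np.random.random((100, 200))

A2d = A.reshape(-1, A.shape[-1])                 # (400*50, 100) view of the same buffer
print(np.shares_memory(A, A2d))                  # True: no copy was made

single_gemm = (A2d @ B).reshape(A.shape[:-1] + (B.shape[-1],))
looped_gemm = np.stack([a @ B for a in A])       # one matrix product per stacked matrix
print(np.allclose(single_gemm, looped_gemm))     # True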

@jakirkham
Contributor

I could be totally wrong here, but this looks like a duplicate of issue ( #7569 ).

@rgommers rgommers changed the title from "numpy.dot speedup for multidimensional arrays" to "numpy.matmul speedup for multidimensional arrays" on Jul 26, 2018
@rgommers
Member

I could be totally wrong here, but this looks like a duplicate of issue ( #7569 ).

Looks like it. Closing as duplicate, and will mention on that issue that there's interesting content here.

ricardoV94 added a commit to pymc-devs/pytensor that referenced this issue Dec 20, 2023
…ions

This rewrite replaces a batched matmul with a core matmul by raveling the batched dimensions of the batched input, doing a 2-D matmul, and then unravelling the batched dimensions of the output.

This sidesteps the Blockwise Dot operation and allows specialization into BLAS routines.

The idea was taken from these two discussions:
numpy/numpy#7569
numpy/numpy#8957