numpy.matmul speedup for multidimensional arrays #8957
Comments
I assume you mean
I was under the impression that we already do what is described there.
There's a tradeoff you're making there - you're making a copy and using more memory. However, there are some cases where there is no tradeoff to make - if all the iteration axes are contiguous (used in the loosest sense of having no gaps), then no copy is needed.
Neither matmul nor dot handles stacked arrays efficiently for the types for which BLAS is available; in fact, BLAS is only called when the arguments are unstacked arrays. Clever reshaping and loops could do much for those types. BTW, this means that matmul should not be a simple ufunc, although it might call a ufunc at the bottom layer.
Alternatively, this could also be seen as suggesting that a mechanism to allow specializing ufunc implementations for performance might be sensible.
I think the way to do it would be to reshape so that the largest possible matrices will be involved in the matrix multiplications, call the ufunc, and then reshape the result. If we only implement matrix-matrix multiplication we will want to do that in any case in order to handle vectors.
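The reshape-then-single-GEMM idea described above can be sketched roughly as follows, for the common case of a stack of matrices times a single 2-D matrix. `stacked_matmul_via_reshape` is a hypothetical helper for illustration, not a NumPy function:

```python
import numpy as np

def stacked_matmul_via_reshape(a, b):
    """Multiply a stack of matrices a (shape (..., m, k)) by a single
    matrix b (shape (k, n)) using one 2-D matmul instead of a loop."""
    k = a.shape[-1]
    # Collapse all batch dimensions plus the row dimension into one axis,
    # so a single GEMM covers the whole stack (copying only if needed).
    a2d = np.ascontiguousarray(a).reshape(-1, k)
    res2d = a2d @ b
    # Restore the batch dimensions on the result.
    return res2d.reshape(a.shape[:-1] + (b.shape[-1],))

x = np.random.rand(4, 5, 3, 2)
w = np.random.rand(2, 6)
out = stacked_matmul_via_reshape(x, w)
assert out.shape == (4, 5, 3, 6)
assert np.allclose(out, np.matmul(x, w))
```

This only covers the stacked-times-single case; general stacked-times-stacked multiplication cannot be collapsed into one GEMM this way.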
@charris: Do we always want to do that reshape, or only if we can do so without a copy? On a related note, there are cases where reshape makes a copy where here we wouldn't need to.
We make copies if needed for BLAS anyway. BLAS has stride limitations for matrix multiplications.
Right, but we do so for each matrix in turn, not for the whole array at once - right?
We only call BLAS for unstacked arrays and copy the whole arrays when needed.
Ah, good point. Re: "there are cases where reshape makes a copy where here we wouldn't need to" - sorting the strides would work:

```python
# Put the batch axes in descending-stride order; if they have no gaps,
# the transposed view is C-contiguous and the reshape below is free.
order = np.argsort(-np.array(arr.strides[:-2]))
arr_contig_ish = arr.transpose(tuple(order) + (-2, -1))
if arr_contig_ish.flags.c_contiguous:
    # reshape is free - collapse everything but the last axis into one
    arr_2d = arr_contig_ish.reshape(-1, arr.shape[-1])
    res_2d = do_the_thing(arr_2d)
    res = undo_the_flatten(res_2d, order)
```

(Here `do_the_thing` and `undo_the_flatten` stand in for the single 2-D BLAS call and the inverse reshape/transpose.)
Basically, BLAS only handles contiguous arrays, although the arrays can be transposed; at least one of the dimensions needs to have stride equal to the itemsize, and probably aligned at that.
@charris: for regular ufunc loops, the core function gets a whole array of scalars and the generic machinery attempts to pass in larger arrays when possible. It seems like a potential solution to this while keeping matmul as a gufunc would be if gufunc core functions similarly got a 1d array of core-dimension-sized arrays? (This would also probably be more efficient in general in cases where the core dimensions are small.) Maybe it already works that way, even?
That's how it works for gufuncs also; e.g. in numpy.linalg, allocated work arrays are reused for the inner loop.
The iterator can unroll and reorder the outer loop, but of course not the inner loop or mixed solutions.
Thank you all for your speedy thoughts, and for working on numpy (I am a new user)!
It seems that tensordot addresses your concern regarding the "automatic reshaping". As far as this automatic reshaping is concerned, I encourage you to have a look at one of our publications on tensor contractions, where we compare different approaches to one another and highlight their strengths and weaknesses: https://arxiv.org/abs/1607.00145

As was already correctly pointed out, the reshaping approach (or Transpose-Transpose-GEMM-Transpose, TTGT) has the disadvantage that it requires additional memory. Moreover, this approach is not well suited for "bandwidth-bound" tensor contractions (see paper), since the transpositions of the tensors before and after the GEMM account for pure overhead that could be avoided (see the GETT section).

Also, if one chooses to follow the TTGT path for tensor contractions, it is of critical importance that the tensor transpositions are performed as efficiently as possible. To this end, I would like to point you to this open-source library for tensor transpositions: https://github.com/springer13/hptt
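For reference, np.tensordot already follows the TTGT-style path internally (transpose, reshape to 2-D, one dot, reshape back). A quick check with illustrative shapes:

```python
import numpy as np

a = np.random.rand(4, 3, 5)
b = np.random.rand(5, 2)

# Contract the last axis of `a` with the first axis of `b`. Internally,
# tensordot transposes and reshapes both operands to 2-D and makes a
# single dot (GEMM) call, then reshapes the result.
out = np.tensordot(a, b, axes=([2], [0]))
assert out.shape == (4, 3, 2)
assert np.allclose(out, a @ b)
```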
@springer13: The "reshaping" optimization being discussed here is the observation that in many cases, you can implement a stacked matrix multiply (which I guess is a special case of tensor contractions) by doing a single call to GEMM with no copies or other overhead, but apparently numpy is instead doing a loop and calling GEMM many times. The "reshape" would be to keep the same storage buffer, but switch from viewing it as a stack of matrices to viewing it as one large 2-D matrix.
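A small sketch of that observation with made-up shapes: for a C-contiguous stack, re-viewing the buffer as one tall 2-D matrix is free, and `base` confirms no copy was made:

```python
import numpy as np

a = np.random.rand(4, 3, 5)   # stack of four 3x5 matrices
b = np.random.rand(5, 2)

# Re-view the (4, 3, 5) stack as a single (12, 5) matrix. Because the
# buffer is C-contiguous, this is a view on the same storage, not a copy.
a2d = a.reshape(12, 5)
assert a2d.base is a          # reshape returned a view

# One GEMM call now computes all four products at once.
res = (a2d @ b).reshape(4, 3, 2)
assert np.allclose(res, np.matmul(a, b))
```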
I could be totally wrong here, but this looks like a duplicate of issue ( #7569 ). |
Looks like it. Closing as duplicate, and will mention on that issue that there's interesting content here. |
…ions This rewrite replaces a batched matmul with a core matmul by raveling the batched dimensions of the batched input, doing a 2-D matmul, and then unraveling the batched dimensions of the output. This sidesteps the Blockwise Dot operation and allows specialization into BLAS routines. The idea was taken from these two discussions: numpy/numpy#7569 numpy/numpy#8957
I encountered a curious performance issue in numpy.dot multiplication of an N-dimensional array with a 2-dimensional array. I consistently found it to be a factor of 15-20 faster to first reshape the arrays to 2-dimensional arrays, do the multiplication on the reshaped arrays, and then reshape the result back.
I'm including my code at the end of this message. It appears that the same observation was made by hpaulj and buried in a StackOverflow question: http://stackoverflow.com/questions/33004551/why-is-b-numpy-dota-x-so-much-slower-looping-through-doing-bi-numpy. I would love for numpy.dot to do the reshaping automatically; it would be much more user-friendly.
Here is a related post: https://www.huyng.com/posts/faster-numpy-dot-product. I believe it is better to invoke BLAS through reshaping than the solution proposed there.
Thanks!
(tested on: numpy 1.12.1 on anaconda with python 3.6.0, mac on intel)
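The benchmark code referenced above did not survive extraction; a minimal reproduction along the same lines (illustrative shapes, not the original script) might look like:

```python
import numpy as np
from timeit import timeit

a = np.random.rand(30, 40, 50)   # N-dimensional operand
b = np.random.rand(50, 60)       # 2-dimensional operand

def dot_direct():
    return np.dot(a, b)

def dot_reshaped():
    # Collapse the leading axes, do one 2-D GEMM, then restore the shape.
    return np.dot(a.reshape(-1, 50), b).reshape(30, 40, 60)

# Both paths compute the same result.
assert np.allclose(dot_direct(), dot_reshaped())

t_direct = timeit(dot_direct, number=50)
t_reshaped = timeit(dot_reshaped, number=50)
print(f"direct: {t_direct:.4f}s  reshaped: {t_reshaped:.4f}s")
```

The reported speedup factor will vary with shapes, the BLAS backend, and the NumPy version.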