You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
On the final line there, the result of a*b (this is not matrix multiplication, that is a @ b), is stored into the temporary variable c. This variable is already initialized with the product of two matrices, but this is not quite how variables work in Python. The value currently held by c is a PyObject* containing a NumPy array. At the start of the line c = a*b the thing that happens is a*b, which creates a new PyObject* pointing to a new NumPy array. Now the assignment operator kicks in, which essentially does the following:
Py_DECREF(c); // decrease the reference count of the old value
c = tmp; // re-assign the value pointed to by c to be the temporary value containing a*b
Py_INCREF(c); // increase the reference count of the new value
At the first decref, the reference count of c hits zero, triggering the garbage collector and destroying the value (calling PyObject_Free). Then the new pointer is given the variable spot and incremented.Thus this one line of Python code has essentially performed the following:
PyObject*tmp=PyArray_New(...) // calling PyMem_Alloc(nbytes), which might have been allocated using Python's arenamatmul(PyArray_DATA(tmp), PyArray_DATA(a), PyArray_DATA(b), ...); // not actually, because a*b is coordinate-wise, but in principlePyObject_Free(c) // calling PyMem_Free(nbytes) on former c bufferc=tmp;
To fix this problem, you need to either prevent the garbage collection of c, by pushing the result to a list or something similar instead, or better writing the result of the matmul directly into some already allocated Numpy array. This is my suggestion:
results=np.empty(BENCHMARK_REPEAT, 10, 10) # allocate an array big enough for the whole lot of resultsforiinrange(BENCHMARK_REPEAT):
a=a_matrices[i]
b=b_matrices[i]
np.matmul(a, b, out=results[i, :, :])
Notice that I've been quite careful to make sure that the shape of the results array guarantees that results[i, :, :] is a contiguous block of memory in C ordering (row major) so the write should be as efficient as possible. (Actually F ordering and results[:, :, i] would actually be better for the Fortran backend but let's ignore this.)
I hope this makes sense.
The text was updated successfully, but these errors were encountered:
Nice analysis, thanks @inakleinbottle! I gave these changes a quick run through and the timing on the Intel i7 was definitely an improvement (22.02 seconds vs. 39.92 seconds).
I had to make some adjustments to get it to work, which I did with minimal thought:
If you'd like to make a PR against the repo with the correct changes, I'll be happy to run the i7 and M1 Pro benchmarks again (so they're done in the same environment as the other benchmarks).
Hmm, I'm not sure why the empty setup that I used didn't work, probably the argument for empty needs to be a tuple. But this is pretty close. Note that this is not quite a like-for-like comparison because the old code was using component-wise multiplication and not matrix multiplication.
OK, I think I know why the Numpy matmul is not performing as expected. Consider the benchmarking loop in the Python code:
On the final line there, the result of
a*b
(this is not matrix multiplication, that isa @ b
), is stored into the temporary variablec
. This variable is already initialized with the product of two matrices, but this is not quite how variables work in Python. The value currently held byc
is aPyObject*
containing a NumPy array. At the start of the linec = a*b
the thing that happens isa*b
, which creates a newPyObject*
pointing to a new NumPy array. Now the assignment operator kicks in, which essentially does the following:At the first decref, the reference count of
c
hits zero, triggering the garbage collector and destroying the value (callingPyObject_Free
). Then the new pointer is given the variable spot and incremented.Thus this one line of Python code has essentially performed the following:To fix this problem, you need to either prevent the garbage collection of
c
, by pushing the result to a list or something similar instead, or better writing the result of the matmul directly into some already allocated Numpy array. This is my suggestion:Notice that I've been quite careful to make sure that the shape of the
results
array guarantees thatresults[i, :, :]
is a contiguous block of memory in C ordering (row major) so the write should be as efficient as possible. (Actually F ordering andresults[:, :, i]
would actually be better for the Fortran backend but let's ignore this.)I hope this makes sense.
The text was updated successfully, but these errors were encountered: