
Slowness of Python implementation #30

Open
inakleinbottle opened this issue Mar 25, 2024 · 2 comments

Comments

@inakleinbottle

inakleinbottle commented Mar 25, 2024

OK, I think I know why the NumPy matmul is not performing as expected. Consider the benchmarking loop in the Python code:

for count in range(BENCHMARK_REPEAT):
	for index in range(total):
		a = a_matrices[index]
		b = b_matrices[index]
		c = a * b

On the final line, the result of a*b (note that this is element-wise multiplication; matrix multiplication is a @ b) is stored in the variable c. It might look as though c is simply overwritten in place, but that is not quite how variables work in Python. The value currently bound to c is a PyObject* pointing to a NumPy array. When the line c = a*b executes, first a*b is evaluated, which creates a new PyObject* pointing to a freshly allocated NumPy array. Then the assignment rebinds c, which essentially does the following:

PyObject *old = c;  // keep the old value alive for a moment
c = tmp;            // rebind c to the temporary holding the result of a*b
Py_DECREF(old);     // drop the variable's reference to the old value

At that decref, the reference count of the old array hits zero, so CPython immediately deallocates it (ultimately calling PyObject_Free). Thus this one line of Python code has essentially performed the following:

PyObject* tmp = PyArray_New(...)    //  calling PyMem_Alloc(nbytes), which might have been allocated using Python's arena
matmul(PyArray_DATA(tmp), PyArray_DATA(a), PyArray_DATA(b), ...);     // not actually, because a*b is coordinate-wise, but in principle
PyObject_Free(c)  //  calling PyMem_Free(nbytes) on former c buffer
c = tmp;
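The cost of this per-iteration churn can be observed directly by timing a fresh-allocation loop against one that reuses a preallocated output buffer via NumPy's out= parameter. A small sketch (the sizes and repeat counts here are arbitrary placeholders of mine, not the repo's benchmark settings):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((200, 200))
b = rng.random((200, 200))
out = np.empty((200, 200))

# Fresh allocation every iteration: each a @ b creates a new array object
t0 = time.perf_counter()
for _ in range(1000):
    c = a @ b
t1 = time.perf_counter()

# Reuse one preallocated buffer via the out= parameter: no per-iteration alloc/free
for _ in range(1000):
    np.matmul(a, b, out=out)
t2 = time.perf_counter()

print(f"fresh alloc: {t1 - t0:.3f}s, reused buffer: {t2 - t1:.3f}s")
```

The absolute difference depends on the BLAS backend and array size; for small matrices like the 10x10 ones in this benchmark, the allocation overhead is a much larger fraction of each iteration.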

To fix this problem, you need either to prevent the old result from being deallocated on every iteration, by pushing each result onto a list or something similar, or, better, to write the result of the matmul directly into an already allocated NumPy array. This is my suggestion:

results = np.empty(BENCHMARK_REPEAT, 10, 10)   # allocate an array big enough for the whole lot of results
for i in range(BENCHMARK_REPEAT):
    a = a_matrices[i]
    b = b_matrices[i]
    np.matmul(a, b, out=results[i, :, :])

Notice that I've been quite careful to make sure that the shape of the results array guarantees that results[i, :, :] is a contiguous block of memory in C ordering (row major), so the write should be as efficient as possible. (F ordering with results[:, :, i] would actually be better for the Fortran backend, but let's ignore that here.)
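As a quick sanity check of the layout claim, NumPy exposes contiguity through an array's flags attribute; the shapes below are illustrative only:

```python
import numpy as np

# C order (the default): leading-index slices are contiguous row-major blocks
results = np.empty((5, 10, 10))
print(results[0].flags['C_CONTIGUOUS'])  # True

# F order: trailing-index slices are contiguous column-major blocks,
# which is what a Fortran backend would prefer
results_f = np.empty((10, 10, 5), order='F')
print(results_f[:, :, 0].flags['F_CONTIGUOUS'])  # True
```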

I hope this makes sense.

@llewelld
Contributor

llewelld commented Mar 25, 2024

Nice analysis, thanks @inakleinbottle! I gave these changes a quick run through and the timing on the Intel i7 was definitely an improvement (22.02 seconds vs. 39.92 seconds).

I had to make some adjustments to get it to work, which I did with minimal thought:

results = []
for i in range(total):
	results.append(np.empty(c_matrices[i].shape))
start_time = time.process_time()
for count in range(BENCHMARK_REPEAT):
	for i in range(total):
		a = a_matrices[i]
		b = b_matrices[i]
		np.matmul(a, b, out=results[i])
end_time = time.process_time()
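For anyone wanting to reproduce this outside the repo, here is a self-contained sketch of the adjusted loop; the matrix sizes, counts, and random inputs are my own placeholders, not the benchmark's actual setup:

```python
import time
import numpy as np

BENCHMARK_REPEAT = 100   # placeholder values, not the repo's settings
total = 50
rng = np.random.default_rng(42)
a_matrices = [rng.random((10, 10)) for _ in range(total)]
b_matrices = [rng.random((10, 10)) for _ in range(total)]

# Preallocate one output buffer per matrix pair, outside the timed region
results = [np.empty(a_matrices[i].shape) for i in range(total)]

start_time = time.process_time()
for count in range(BENCHMARK_REPEAT):
    for i in range(total):
        np.matmul(a_matrices[i], b_matrices[i], out=results[i])
end_time = time.process_time()

print(f"elapsed: {end_time - start_time:.3f}s")
```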

If you'd like to make a PR against the repo with the correct changes, I'll be happy to run the i7 and M1 Pro benchmarks again (so they're done in the same environment as the other benchmarks).

@inakleinbottle
Author

inakleinbottle commented Mar 25, 2024

Hmm, I'm not sure why the empty setup that I used didn't work; probably the argument to empty needs to be a tuple, i.e. np.empty((BENCHMARK_REPEAT, 10, 10)). But this is pretty close. Note that this is not quite a like-for-like comparison, because the old code was using component-wise multiplication rather than matrix multiplication.

EDIT: My apologies, I didn't see that you'd actually used np.matrix, which is deprecated in favour of just using a 2-dimensional ndarray and @ for matrix multiplication. https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
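To make the distinction concrete: for plain ndarrays, * is component-wise while @ is matrix multiplication, whereas the deprecated np.matrix class overloads * as matrix multiplication. A small sketch with arbitrary values:

```python
import numpy as np

a = np.arange(4.0).reshape(2, 2)
b = np.ones((2, 2))

elementwise = a * b   # ndarray: '*' is component-wise
matmul = a @ b        # ndarray: '@' is matrix multiplication

# np.matrix overloads '*' as matrix multiplication, which is why the
# original benchmark's 'a * b' was a true matmul for np.matrix inputs
m = np.matrix(a) * np.matrix(b)
print(np.allclose(matmul, m))  # True
```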
