
Slowness of Python implementation #30

Open
inakleinbottle opened this issue Mar 25, 2024 · 2 comments

Comments

@inakleinbottle

inakleinbottle commented Mar 25, 2024

OK, I think I know why the NumPy matmul is not performing as expected. Consider the benchmarking loop in the Python code:

for count in range(BENCHMARK_REPEAT):
	for index in range(total):
		a = a_matrices[index]
		b = b_matrices[index]
		c = a * b

On the final line, the result of a*b (note that this is element-wise multiplication; matrix multiplication is a @ b) is stored in the variable c. It might look as though c is simply overwritten in place, but that is not quite how variables work in Python. The value currently bound to c is a PyObject* pointing to a NumPy array. When the line c = a*b executes, first a*b is evaluated, which creates a new PyObject* pointing to a freshly allocated NumPy array. Then the assignment rebinds c, which essentially does the following:

PyObject *old = c;  // keep the old value alive for a moment
c = tmp;            // rebind c to the temporary holding the result of a*b
Py_DECREF(old);     // drop the variable's reference to the old value

At that decref, the reference count of the old array hits zero, so CPython immediately deallocates it (ultimately calling PyObject_Free). Thus this one line of Python code has essentially performed the following:

PyObject* tmp = PyArray_New(...)    //  calling PyMem_Alloc(nbytes), which might have been allocated using Python's arena
matmul(PyArray_DATA(tmp), PyArray_DATA(a), PyArray_DATA(b), ...);     // not actually, because a*b is coordinate-wise, but in principle
PyObject_Free(c)  //  calling PyMem_Free(nbytes) on former c buffer
c = tmp;
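The cost of this per-iteration churn can be observed directly by timing a fresh-allocation loop against one that reuses a preallocated output buffer via NumPy's out= parameter. A small sketch (the sizes and repeat counts here are arbitrary placeholders of mine, not the repo's benchmark settings):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((200, 200))
b = rng.random((200, 200))
out = np.empty((200, 200))

# Fresh allocation every iteration: each a @ b creates a new array object
t0 = time.perf_counter()
for _ in range(1000):
    c = a @ b
t1 = time.perf_counter()

# Reuse one preallocated buffer via the out= parameter: no per-iteration alloc/free
for _ in range(1000):
    np.matmul(a, b, out=out)
t2 = time.perf_counter()

print(f"fresh alloc: {t1 - t0:.3f}s, reused buffer: {t2 - t1:.3f}s")
```

The absolute difference depends on the BLAS backend and array size; for small matrices like the 10x10 ones in this benchmark, the allocation overhead is a much larger fraction of each iteration.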

To fix this problem, you need either to prevent the old result from being deallocated on every iteration, by pushing each result onto a list or something similar, or, better, to write the result of the matmul directly into an already allocated NumPy array. This is my suggestion:

results = np.empty(BENCHMARK_REPEAT, 10, 10)   # allocate an array big enough for the whole lot of results
for i in range(BENCHMARK_REPEAT):
    a = a_matrices[i]
    b = b_matrices[i]
    np.matmul(a, b, out=results[i, :, :])

Notice that I've been quite careful to make sure that the shape of the results array guarantees that results[i, :, :] is a contiguous block of memory in C ordering (row major), so the write should be as efficient as possible. (F ordering with results[:, :, i] would actually be better for the Fortran backend, but let's ignore that here.)
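As a quick sanity check of the layout claim, NumPy exposes contiguity through an array's flags attribute; the shapes below are illustrative only:

```python
import numpy as np

# C order (the default): leading-index slices are contiguous row-major blocks
results = np.empty((5, 10, 10))
print(results[0].flags['C_CONTIGUOUS'])  # True

# F order: trailing-index slices are contiguous column-major blocks,
# which is what a Fortran backend would prefer
results_f = np.empty((10, 10, 5), order='F')
print(results_f[:, :, 0].flags['F_CONTIGUOUS'])  # True
```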

I hope this makes sense.

@llewelld
Contributor

llewelld commented Mar 25, 2024

Nice analysis, thanks @inakleinbottle! I gave these changes a quick run through and the timing on the Intel i7 was definitely an improvement (22.02 seconds vs. 39.92 seconds).

I had to make some adjustments to get it to work, which I did with minimal thought:

results = []
for i in range(total):
	results.append(np.empty(c_matrices[i].shape))
start_time = time.process_time()
for count in range(BENCHMARK_REPEAT):
	for i in range(total):
		a = a_matrices[i]
		b = b_matrices[i]
		np.matmul(a, b, out=results[i])
end_time = time.process_time()
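For anyone wanting to reproduce this outside the repo, here is a self-contained sketch of the adjusted loop; the matrix sizes, counts, and random inputs are my own placeholders, not the benchmark's actual setup:

```python
import time
import numpy as np

BENCHMARK_REPEAT = 100   # placeholder values, not the repo's settings
total = 50
rng = np.random.default_rng(42)
a_matrices = [rng.random((10, 10)) for _ in range(total)]
b_matrices = [rng.random((10, 10)) for _ in range(total)]

# Preallocate one output buffer per matrix pair, outside the timed region
results = [np.empty(a_matrices[i].shape) for i in range(total)]

start_time = time.process_time()
for count in range(BENCHMARK_REPEAT):
    for i in range(total):
        np.matmul(a_matrices[i], b_matrices[i], out=results[i])
end_time = time.process_time()

print(f"elapsed: {end_time - start_time:.3f}s")
```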

If you'd like to make a PR against the repo with the correct changes, I'll be happy to run the i7 and M1 Pro benchmarks again (so they're done in the same environment as the other benchmarks).

@inakleinbottle
Author

inakleinbottle commented Mar 25, 2024

Hmm, I'm not sure why the empty setup that I used didn't work; probably the argument to empty needs to be a tuple, i.e. np.empty((BENCHMARK_REPEAT, 10, 10)). But this is pretty close. Note that this is not quite a like-for-like comparison, because the old code was using component-wise multiplication rather than matrix multiplication.

EDIT: My apologies, I didn't see that you'd actually used np.matrix, which is deprecated in favour of just using a 2-dimensional ndarray and @ for matrix multiplication. https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
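To make the distinction concrete: for plain ndarrays, * is component-wise while @ is matrix multiplication, whereas the deprecated np.matrix class overloads * as matrix multiplication. A small sketch with arbitrary values:

```python
import numpy as np

a = np.arange(4.0).reshape(2, 2)
b = np.ones((2, 2))

elementwise = a * b   # ndarray: '*' is component-wise
matmul = a @ b        # ndarray: '@' is matrix multiplication

# np.matrix overloads '*' as matrix multiplication, which is why the
# original benchmark's 'a * b' was a true matmul for np.matrix inputs
m = np.matrix(a) * np.matrix(b)
print(np.allclose(matmul, m))  # True
```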
