numpy.vectorize's implementation is essentially a for loop? #16763

Closed
FrankHui opened this issue Jul 6, 2020 · 6 comments
Labels
33 - Question Question about NumPy usage or development

Comments

@FrankHui

FrankHui commented Jul 6, 2020

https://numpy.org/doc/1.18/reference/generated/numpy.vectorize.html
The documentation notes that vectorize's implementation is essentially a for loop. But as far as I know, a vectorized function uses SIMD, so is it accurate to say that numpy.vectorize's implementation is essentially a for loop? If so, is it faster than the unvectorized function only because the loop is implemented in C?

Many thanks in advance.

@rkern
Member

rkern commented Jul 6, 2020

Yes. In the context of interpreted numerical array programming languages like Python (with numpy) and MATLAB™, we often use "vectorization" to refer to replacing explicit loops in the interpreted programming language with a function (or operator) that takes care of all of the looping logic internally. In numpy, the ufuncs implement this logic. This is unrelated to the usage of "vectorization" to refer to using SIMD CPU instructions that compute over multiple inputs concurrently, except that they both use a similar metaphor: they are like their "scalar" counterparts, but perform the computation over multiple input values with a single invocation.

With numpy.vectorize(), there is usually not much speed benefit over an explicit Python for loop. Its main purpose is to turn the Python function into a ufunc, which implements all of the broadcasting semantics and thus handles inputs of any size. Most of the time is still spent in the Python function that's being "vectorized" and in converting the raw value of each element to a Python object to pass to that function. You wouldn't expect np.vectorize(lambda x, y: x + y) to be as fast as the ufunc np.add, which is C both in the loop and the contents of the loop.
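To make the per-element behaviour concrete, here is a minimal sketch (not taken from the thread). The helper name add_scalar and the call counter are hypothetical, and otypes is passed so that np.vectorize skips the extra trial call it otherwise makes to infer the output type.

```python
import numpy as np

# Hypothetical counter, used only to show how often the Python function runs.
call_count = {"n": 0}

def add_scalar(x, y):
    """Plain Python function that only knows how to add two scalars."""
    call_count["n"] += 1
    return x + y

# Wrap the scalar function so that it accepts arrays and broadcasts them.
vadd = np.vectorize(add_scalar, otypes=[float])

a = np.arange(3.0).reshape(3, 1)  # shape (3, 1)
b = np.arange(4.0)                # shape (4,)

out = vadd(a, b)                  # broadcast to shape (3, 4)

print(out.shape)
print(call_count["n"])            # one Python call per output element
```

The wrapper supplies the broadcasting, but the addition itself still runs as one Python call per output element, which is why it should not be expected to approach np.add(a, b).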

@FrankHui
Author

FrankHui commented Jul 6, 2020

Thank you for your detailed explanation. To be clear, let me take an example.

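import pandas as pd
import numpy as np
# both columns must have the same length, otherwise the DataFrame constructor raises
df = pd.DataFrame({'a': range(100000), 'b': range(1, 100001)})
# method1: row-wise apply, a Python-level loop over the rows
df.loc[:, 'c'] = df.apply(lambda x: x['a'] + x['b'], axis=1)
# method2: np.vectorize wrapping a Python function
df.loc[:, 'c'] = np.vectorize(lambda x, y: x + y)(df['a'], df['b'])
# method3: the built-in ufunc
df.loc[:, 'c'] = np.add(df['a'], df['b'])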

So, based on your explanation, I guess:

| method | loop in C | loop content in C | use SIMD |
| --- | --- | --- | --- |
| 1 | × | × | × |
| 2 | ✓ | × | × |
| 3 | ✓ | ✓ | ✓ |

Right?

@rkern
Member

rkern commented Jul 6, 2020

np.add is faster than np.vectorize(lambda x, y: x + y) because it avoids converting C doubles into Python objects and the Python function call overhead. It's possible that it also uses SIMD instructions, depending on whether or not you have the AVX2 extensions, but that's not why it's faster.
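A rough way to see this for yourself is the timing sketch below. It is an illustration rather than a benchmark: the array size, the repeat count, and the helper names loop_add and vec_add are arbitrary choices made here, and the absolute numbers depend on the machine and the NumPy build.

```python
import timeit

import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = a + 1.0

def loop_add(x, y):
    """Explicit Python for loop over the elements."""
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = x[i] + y[i]
    return out

# Same scalar operation, wrapped by np.vectorize.
vec_add = np.vectorize(lambda x, y: x + y, otypes=[np.float64])

candidates = [
    ("python loop", loop_add),
    ("np.vectorize", vec_add),
    ("np.add", np.add),
]

# Best of three single runs for each approach.
timings = {
    name: min(timeit.repeat(lambda f=fn: f(a, b), number=1, repeat=3))
    for name, fn in candidates
}

for name, seconds in timings.items():
    print(f"{name:>12}: {seconds * 1e3:8.3f} ms")
```

All three compute the same values; only the per-element overhead differs.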

@FrankHui
Author

FrankHui commented Jul 6, 2020

> np.add is faster than np.vectorize(lambda x, y: x + y) because it avoids converting C doubles into Python objects and the Python function call overhead. It's possible that it also uses SIMD instructions, depending on whether or not you have the AVX2 extensions, but that's not why it's faster.

I got it. Thanks.

@bashtage
Contributor

bashtage commented Jul 6, 2020

You can use numba's vectorize to produce ufuncs that operate in parallel without Python overheads:

https://numba.pydata.org/numba-doc/latest/user/vectorize.html
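For illustration, a minimal sketch of what that could look like. It assumes the third-party numba package is installed; the function name add_compiled is hypothetical, and the fallback to np.add exists only so that the snippet still runs where numba is missing.

```python
import numpy as np

try:
    from numba import vectorize  # third-party package, assumed to be installed
except ImportError:
    vectorize = None

if vectorize is not None:
    # Compile the scalar function into a real ufunc for float64 inputs.
    @vectorize(["float64(float64, float64)"])
    def add_compiled(x, y):
        return x + y
else:
    # Fallback (an assumption of this sketch) so the example stays runnable.
    add_compiled = np.add

a = np.arange(5, dtype=np.float64)
b = np.ones(5)

result = add_compiled(a, b)
print(result)
```

The linked numba documentation also describes a target="parallel" option for the decorator, which is presumably what "operate in parallel" refers to here.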

rossbar added the "33 - Question" (Question about NumPy usage or development) label Jul 10, 2020
@mattip
Member

mattip commented Dec 19, 2021

Closing, as the question was answered.

mattip closed this as completed Dec 19, 2021