What is the most efficient way to iterate over Pandas's DataFrame row by row? #10334

Closed
zer0n opened this issue Jun 12, 2015 · 8 comments

@zer0n

zer0n commented Jun 12, 2015

I have tried df.iterrows(), but its performance is horrible. That is not surprising, given that iterrows() returns a Series with full schema and metadata for every row, not just the values (which is all that I need).

The second method that I have tried is for row in df.values, which is significantly faster. However, I have recently realized that df.values is not the internal data storage of the DataFrame, because df.values converts all dtypes to a common dtype. For example, one of my columns has dtype int64 but the dtype of df.values is all float64. So I suspect that df.values actually creates another copy of the internal data.

Also, another requirement is that the row iteration must return a list of values that preserve the original dtype of the data.

@TomAugspurger
Contributor

In Python, iterating over the rows is going to be (a lot) slower than doing vectorized operations.

The types are being converted in your second method because that's how NumPy arrays (which is what df.values is) work: an array has a single dtype. DataFrames are column based, so you can have a single DataFrame with multiple dtypes. Once you iterate through it row-wise, everything has to be upcast to a more general type that holds everything. In your case the ints go to float64.
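A minimal sketch of the upcasting described above. The two-column frame and its column names are made up for illustration; they are not from the original report.

```python
import pandas as pd

# Hypothetical frame: one integer column and one float column.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# Each column keeps its own dtype inside the DataFrame.
column_dtypes = {name: str(dtype) for name, dtype in df.dtypes.items()}

# df.values is a single 2-D NumPy array, so it needs one common dtype
# that can hold every column; here the ints are upcast to float64.
values_dtype = str(df.values.dtype)

print(column_dtypes)
print(values_dtype)
```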

If you describe your problem with a minimal working example, we might be able to help you vectorize it. You may also have luck on StackOverflow with the pandas tag.

@zer0n
Author

zer0n commented Jun 12, 2015

Basically, I want to do the following:

row_handler = RowHandler(sample_df)  # learn how to handle row from sample data
transformed_data = []
for row in df.values:
    transformed_data.append(row_handler.handle(row))
return transformed_data

I don't own the RowHandler class and hence can only operate row by row.

Another similar example is in machine learning, where you may have a model that has predict API at row level only.

@TomAugspurger
Contributor

Still a bit too vague to be helpful. But if RowHandler is really out of your control then you'll be out of luck. FWIW all of scikit-learn's APIs operate on arrays (so multiple rows).

@zer0n
Author

zer0n commented Jun 12, 2015

I don't see how it can be clearer. Yes, RowHandler is out of my control. What do you mean by out of luck? My question is for the most efficient way to iterate over rows while keeping the dtype of each element intact. Are you suggesting df.iterrows(), or something else?

sklearn is the exception, not the norm, in that it operates natively on pandas DataFrames. Not many machine learning libraries have APIs that accept a DataFrame.

@shoyer
Member

shoyer commented Jun 13, 2015

I think df.itertuples() is what you're looking for -- it's way faster than iterrows:

In [10]: x = pd.DataFrame({'x': range(10000)})

In [11]: %timeit list(x.iterrows())
1 loops, best of 3: 383 ms per loop

In [12]: %timeit list(x.itertuples())
1000 loops, best of 3: 1.39 ms per loop
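As a rough sketch of how this addresses the dtype requirement: itertuples() builds each row from the individual columns, so an integer column is not upcast to float the way it is in df.values. The frame below is hypothetical, and index=False and name=None are optional arguments that drop the index and yield plain tuples.

```python
import pandas as pd

# Hypothetical frame: one integer column and one float column.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# index=False omits the index from each row; name=None yields plain
# tuples instead of namedtuples. Each tuple is converted to a list.
rows = [list(row) for row in df.itertuples(index=False, name=None)]

for row in rows:
    print(row, [type(value).__name__ for value in row])
```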

@zer0n
Author

zer0n commented Jun 13, 2015

Thanks @shoyer! That's what I need.

@linehammer

Iterating through pandas DataFrame objects is generally slow, and it defeats the whole purpose of using a DataFrame. It is an anti-pattern and something you should only do when you have exhausted every other option. It is better to look for a vectorized solution, a list comprehension, or the DataFrame.apply() method.

Pandas DataFrame loop using list comprehension

result = [(x, y, z) for x, y, z in zip(df['Name'], df['Promoted'], df['Grade'])]

Pandas DataFrame loop using DataFrame.apply()

result = df.apply(lambda row: row["Name"] + " , " + str(row["TotalMarks"]) + " , " + row["Grade"], axis=1)
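For comparison, a sketch of the vectorized alternative mentioned above, producing the same strings as the apply() call with column-wise operations. The sample data is invented for illustration; only the column names come from the snippet above.

```python
import pandas as pd

# Hypothetical sample data using the column names from the apply() example.
df = pd.DataFrame({
    "Name": ["Ann", "Bob"],
    "TotalMarks": [91, 78],
    "Grade": ["A", "B"],
})

# Row-wise version: the lambda is called once per row.
applied = df.apply(
    lambda row: row["Name"] + " , " + str(row["TotalMarks"]) + " , " + row["Grade"],
    axis=1,
)

# Column-wise version: string concatenation on whole columns at once.
vectorized = df["Name"] + " , " + df["TotalMarks"].astype(str) + " , " + df["Grade"]

print(vectorized.tolist())
```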

@MarcoGorelli
Member

@linehammer no need to keep posting links on closed issues to what I presume is your website

@pandas-dev pandas-dev locked as spam and limited conversation to collaborators Aug 9, 2021