What is the most efficient way to iterate over Pandas's DataFrame row by row? #10334

zer0n · 2015-06-12T01:11:57Z

I have tried the function df.iterrows() but its performance is horrible, which is not surprising given that iterrows() returns a Series with full schema and metadata, not just the values (which is all that I need).

The second method that I have tried is for row in df.values, which is significantly faster. However, I have recently realized that df.values is not the internal data storage of the DataFrame, because df.values converts all dtypes to a common dtype. For example, one of my columns has dtype int64 but the dtype of df.values is all float64. So I suspect that df.values actually creates another copy of the internal data.

Also, another requirement is that the row iteration must return a list of values that preserve the original dtype of the data.


TomAugspurger · 2015-06-12T01:28:20Z

In Python, iterating over the rows is going to be (a lot) slower than doing vectorized operations.

The types are being converted in your second method because that's how numpy arrays (which is what df.values is) work. DataFrames are column based, so you can have a single DataFrame with multiple dtypes. Once you iterate through row-wise, everything has to be upcast to a more general type that holds everything. In your case the ints go to float64.
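A minimal sketch of that upcast, assuming a small made-up frame with one int64 column and one float64 column (the column names `a` and `b` are illustrative, not from the original report):

```python
import pandas as pd

# Hypothetical two-column frame: one integer column, one float column.
df = pd.DataFrame({
    "a": pd.Series([1, 2, 3], dtype="int64"),
    "b": pd.Series([0.5, 1.5, 2.5], dtype="float64"),
})

# Each column keeps its own dtype inside the DataFrame.
column_dtypes = df.dtypes.astype(str).to_dict()

# df.values must fit everything into one 2-D numpy array,
# so the integer column is upcast to the common dtype.
values_dtype = str(df.values.dtype)

print(column_dtypes)
print(values_dtype)
```

With a string column added, the common dtype would instead be `object`, which is another form of the same "one dtype for the whole array" constraint.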

If you describe your problem with a minimal working example, we might be able to help you vectorize it. You may also have luck on StackOverflow with the pandas tag.

zer0n · 2015-06-12T21:22:49Z

Basically, I want to do the following:

row_handler = RowHandler(sample_df)  # learn how to handle row from sample data
transformed_data = []
for row in df.values:
    transformed_data.append(row_handler.handle(row))
return transformed_data

I don't own the RowHandler class and hence can only operate row by row.

Another similar example is in machine learning, where you may have a model that exposes a predict API only at the row level.

TomAugspurger · 2015-06-12T23:12:28Z

Still a bit too vague to be helpful. But if RowHandler is really out of your control then you'll be out of luck. FWIW all of scikit-learn's APIs operate on arrays (so multiple rows).

zer0n · 2015-06-12T23:22:01Z

I don't see how it can be clearer. Yes, RowHandler is out of my control. What do you mean by out of luck? My question is for the most efficient way to iterate over rows while keeping the dtype of each element intact. Are you suggesting df.iterrows(), or something else?

sklearn, which operates natively on pandas DataFrames, is the exception, not the norm. Not many machine learning libraries have APIs that operate on DataFrames.

shoyer · 2015-06-13T00:13:48Z

I think df.itertuples() is what you're looking for -- it's way faster than iterrows:

In [10]: x = pd.DataFrame({'x': range(10000)})

In [11]: %timeit list(x.iterrows())
1 loops, best of 3: 383 ms per loop

In [12]: %timeit list(x.itertuples())
1000 loops, best of 3: 1.39 ms per loop
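As a rough sketch of how this addresses the dtype requirement from the original question: itertuples() builds each row from the individual columns rather than from a single 2-D array, so an integer value is not turned into a float. The frame and column names below are made up for illustration, and the `name=None` argument (plain tuples instead of namedtuples) assumes a reasonably recent pandas version.

```python
import pandas as pd

# Hypothetical mixed-dtype frame.
df = pd.DataFrame({
    "n": pd.Series([1, 2], dtype="int64"),
    "score": pd.Series([0.5, 1.5], dtype="float64"),
    "label": ["a", "b"],
})

# index=False drops the index from each row; name=None yields plain tuples.
rows = [list(row) for row in df.itertuples(index=False, name=None)]

for row in rows:
    # Each element keeps a type matching its column: integer, float, string.
    print(row, [type(value).__name__ for value in row])
```

Each `row` here is an ordinary Python list, so it could be passed to a row-level handler such as the RowHandler described earlier in the thread.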

zer0n · 2015-06-13T00:35:53Z

Thanks @shoyer! That's what I need.

linehammer · 2021-08-09T05:22:26Z

Iterating through pandas DataFrame objects is generally slow. Iteration defeats the whole purpose of using a DataFrame. It is an anti-pattern and is something you should only do when you have exhausted every other option. It is better to look for a list comprehension, a vectorized solution, or the DataFrame.apply() method for iterating through a DataFrame.

Pandas DataFrame loop using list comprehension

result = [(x, y, z) for x, y, z in zip(df['Name'], df['Promoted'], df['Grade'])]

Pandas DataFrame loop using DataFrame.apply()

result = df.apply(lambda row: row["Name"] + " , " + str(row["TotalMarks"]) + " , " + row["Grade"], axis=1)
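For comparison, here is a small sketch of the same string-building task done three ways: with apply(), with a list comprehension over zipped columns, and with vectorized column operations. The sample data is invented; only the column names Name, TotalMarks, and Grade come from the snippets above.

```python
import pandas as pd

# Made-up sample data using the column names from the snippets above.
df = pd.DataFrame({
    "Name": ["Ann", "Bob"],
    "TotalMarks": [91, 78],
    "Grade": ["A", "B"],
})

# Row-wise apply: builds a Series object for every row.
by_apply = df.apply(
    lambda row: row["Name"] + " , " + str(row["TotalMarks"]) + " , " + row["Grade"],
    axis=1,
)

# List comprehension over zipped columns: no per-row Series is created.
by_zip = [f"{n} , {m} , {g}" for n, m, g in zip(df["Name"], df["TotalMarks"], df["Grade"])]

# Vectorized: operate on whole columns at once.
vectorized = df["Name"] + " , " + df["TotalMarks"].astype(str) + " , " + df["Grade"]

print(by_apply.tolist())
print(by_zip)
print(vectorized.tolist())
```

All three produce the same strings; which one is fastest depends on the data and the pandas version, so it is worth timing them on the real workload.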

MarcoGorelli · 2021-08-09T09:34:41Z

@linehammer no need to keep posting links on closed issues to what I presume is your website

TomAugspurger added the Usage Question label Jun 12, 2015

TomAugspurger closed this as completed Jun 12, 2015

jorisvandenbossche mentioned this issue Jul 26, 2015

DOC: improve docs on iteration #10680

Merged

pandas-dev locked as spam and limited conversation to collaborators Aug 9, 2021
