PERF: df.unstack() is 500 times slower since pandas>=2.1 #58391

Open
2 of 3 tasks
sbonz opened this issue Apr 23, 2024 · 5 comments
Labels: Needs Triage (issue that has not been reviewed by a pandas team member), Performance (memory or execution speed performance)


sbonz commented Apr 23, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import time

import numpy as np
import pandas as pd

# A plain 10000 x 100 frame of random floats (flat index and columns).
df = pd.DataFrame(np.random.random(size=(10000, 100)))

st = time.perf_counter()
df.unstack()  # this operation takes 500x longer in pandas>=2.1
print(f"time {time.perf_counter() - st}")
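A single wall-clock measurement can be noisy, so it may help to repeat the call and keep the best time when comparing pandas versions. The following is a minimal sketch, not part of the original report: `time_unstack` is a hypothetical helper name, and the seed and repeat count are arbitrary choices.

```python
import timeit

import numpy as np
import pandas as pd


def time_unstack(n_rows: int, n_cols: int, repeat: int = 3) -> float:
    """Return the best wall-clock time (seconds) of df.unstack() over `repeat` runs."""
    rng = np.random.default_rng(0)  # fixed seed so every run sees the same data
    df = pd.DataFrame(rng.random(size=(n_rows, n_cols)))
    # number=1: each measurement is exactly one unstack() call
    timings = timeit.repeat(df.unstack, number=1, repeat=repeat)
    return min(timings)


# Same shape as the reproducible example above.
best = time_unstack(10_000, 100)
print(f"pandas {pd.__version__}: best of 3 = {best:.4f} s")
```

Running this once per installed pandas version gives directly comparable numbers, since the data and the measured call are identical each time.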

Installed Versions

INSTALLED VERSIONS
------------------
commit : bdc79c1
python : 3.11.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 2.2.1
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

The same code as above is 500x faster on pandas<=2.0.3.
The issue happens on Windows and Linux, with Python 3.10 and 3.12, and with both the numpy and pyarrow backends.
The slowdown seems to be in the initial loop of the stack_v3 function.
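One way to check where the time goes is to profile the call with the standard-library profiler. The sketch below is an illustration rather than the method used in the report: it uses a smaller frame (an assumption, chosen to keep the profiled run short) and makes no claim about which functions will appear at the top on a given pandas version.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Smaller than the frame in the report, so the profiled run stays short.
df = pd.DataFrame(np.random.default_rng(0).random(size=(2_000, 100)))

profiler = cProfile.Profile()
profiler.enable()
result = df.unstack()
profiler.disable()

# Sort by cumulative time and show the ten most expensive call paths.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)
print(buffer.getvalue())
```

Comparing this output between an older and a newer pandas installation should show whether the extra time is concentrated in a single internal function.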

sbonz added the Needs Triage and Performance labels on Apr 23, 2024
jbrockmendel (Member) commented

Cc @rhshadrach

asishm (Contributor) commented Apr 24, 2024

On main it's about 5x faster than on 2.2.2, but still extremely slow compared to 2.0.3:

2.0.3 -> 17 ms
2.2.2 -> 5.4 s
main -> 1.08 s
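For reference, the ratios implied by these three timings can be worked out directly. This is plain arithmetic on the numbers quoted above; on this machine the slowdown relative to 2.0.3 comes out at roughly 320x for 2.2.2 and roughly 64x for main.

```python
# Timings reported above, converted to seconds.
timings = {"2.0.3": 0.017, "2.2.2": 5.4, "main": 1.08}

baseline = timings["2.0.3"]
for version, seconds in timings.items():
    print(f"{version:>6}: {seconds:7.3f} s  ({seconds / baseline:6.1f}x the 2.0.3 time)")

# Speed-up of main relative to 2.2.2.
print(f"main vs 2.2.2: {timings['2.2.2'] / timings['main']:.1f}x faster")
```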

sam-baumann commented

take

sam-baumann commented

I looked into this. In the sample code from the original issue, the df used for testing is just random values, rather than the result of a stack(). The following code actually runs 2-3x faster on main than on 2.0.3 on my machine.

It seems the performance issue only comes up when the df is not in the form expected by unstack(). @sbonz, did you see this on real data?

import time

import numpy as np
import pandas as pd

data = np.random.randint(0, 100, size=(100000, 1000))
# stack() yields a Series with a two-level MultiIndex, the input unstack() expects
df = pd.DataFrame(data=data).stack()

st = time.perf_counter()
df.unstack()
print(f"time {time.perf_counter() - st}")
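To see the two cases side by side, both inputs can be timed in the same script. This is a rough sketch under stated assumptions: the sizes are reduced from the ones used in this thread so that both cases fit comfortably in memory, and `time_call` is a hypothetical helper. The printed timings will depend on the pandas version and the machine.

```python
import time

import numpy as np
import pandas as pd


def time_call(fn) -> float:
    """Time a single call of fn() in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


# Reduced sizes (an assumption) so both cases fit comfortably in memory.
data = np.random.default_rng(0).integers(0, 100, size=(10_000, 100))
df = pd.DataFrame(data)   # flat index, as in the original report
stacked = df.stack()      # Series with a two-level MultiIndex, as in the reply

print(f"DataFrame.unstack(): {time_call(df.unstack):.4f} s")
print(f"Series.unstack():    {time_call(stacked.unstack):.4f} s")
```

The first call exercises the path from the original report (a frame with a flat index), while the second exercises the stacked-Series path measured in the reply.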

sbonz (Author) commented Apr 28, 2024

@sam-baumann Yes, I noticed the slowdown because some tests (with real data) in our pipeline started timing out.
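For pipelines affected in the meantime, one possible workaround (not proposed by anyone in this thread, so treat it as an assumption) is to build the result by hand when both the index and the columns are flat. The helper name `unstack_flat` is hypothetical. Note that `to_numpy()` upcasts frames with mixed column dtypes to a common dtype, so this sketch is only meant for single-dtype frames.

```python
import numpy as np
import pandas as pd


def unstack_flat(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: column-major flattening of a frame with flat axes."""
    # Column-major order: all rows of the first column, then the second, ...
    values = df.to_numpy().ravel(order="F")
    # Outer level = column labels, inner level = row labels.
    index = pd.MultiIndex.from_product([df.columns, df.index])
    return pd.Series(values, index=index)


df = pd.DataFrame(np.random.default_rng(0).random(size=(1_000, 20)))
fast = unstack_flat(df)
reference = df.unstack()

# Check that values and index agree with the built-in result.
print(np.array_equal(fast.to_numpy(), reference.to_numpy()))
print(fast.index.equals(reference.index))
```

It is worth running the two equality checks on a sample of real data before relying on the helper, since frames with a MultiIndex on either axis are not covered.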
