Memory leak in to_parquet? #220
Comments
Given you have 44M rows, having 44K partitions seems way too high in any case, as it results in very small parquet files (which are also inefficient to read afterwards). How did you specify the partitioning? Having that many partitions might also explain the high memory usage. For example, until recently it would create a …
Edit: Ignore the above. That run used the same …
I foolishly used old code that called …
Dask: 2022.9.2

For data of this size, what do you recommend for the partition size or number of partitions?
I ran it again with 1K partitions and it finishes much faster and without the extreme memory usage. However, the memory footprint still grows steadily over time, and ~1.3 GB remains locked up after completion. I tried it with and without the metadata file.
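On the partition-size question: general Dask guidance is to aim for partitions on the order of ~100 MB each rather than a fixed count. A rough back-of-envelope sketch (the 200 bytes/row figure is an assumption for illustration, not measured from this dataset):

```python
def suggest_npartitions(n_rows, bytes_per_row=200, target_partition_bytes=100 * 1024**2):
    """Estimate a partition count so each partition lands near the target size.

    bytes_per_row is a hypothetical average; measure your own data
    (e.g. via df.memory_usage) for a real estimate.
    """
    total_bytes = n_rows * bytes_per_row
    return max(1, -(-total_bytes // target_partition_bytes))  # ceiling division

# For the 44.5M-feature dataset in this issue:
print(suggest_npartitions(44_500_000))  # -> 85
```

By that estimate, something in the tens-to-low-hundreds of partitions is a more typical target than 44.5K, which matches the improvement you saw at 1K.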
I have recently started working with some rather large vector data (44.5M features) and think that it has exposed a potential memory leak. Originally the data was stored in a geodatabase, but I converted it to parquet with `to_parquet`. This is where the leak seems to occur. The memory usage steadily grew as the data was saved to disk and was over 95 GB upon completion. The memory was not released, and I had to restart my IPython session in order to reclaim it. This may be an issue with the pyarrow backend, but I figured I would file an issue here to get the ball rolling. I'll post my `geopandas.show_versions()` info below; I'm using dask_geopandas v0.2.0. Please let me know if there is any more info I can provide.

Edit: I checked the number of partitions and it looks like the data was split into 44.5K partitions. I wonder if the large number of partitions is causing problems.
System Info
SYSTEM INFO
python : 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
executable : /home/fred/anaconda3/envs/rstools/bin/python3.8
machine : Linux-5.15.0-47-generic-x86_64-with-glibc2.10
GEOS, GDAL, PROJ INFO
GEOS : 3.11.0
GEOS lib : /home/fred/anaconda3/envs/rstools/lib/libgeos_c.so
GDAL : 3.5.2
GDAL data dir: /home/fred/anaconda3/envs/rstools/share/gdal
PROJ : 9.0.1
PROJ data dir: /home/fred/anaconda3/envs/rstools/share/proj
PYTHON DEPENDENCIES
geopandas : 0.11.1
numpy : 1.23.3
pandas : 1.5.0
pyproj : 3.4.0
shapely : 1.8.4
fiona : 1.8.21
geoalchemy2: None
geopy : None
matplotlib : 3.6.0
mapclassify: 2.4.3
pygeos : 0.13
pyogrio : v0.4.2
psycopg2 : None
pyarrow : 9.0.0
rtree : 1.0.0