Memory leak in to_parquet? #220

Open
fbunt opened this issue Oct 11, 2022 · 3 comments

@fbunt
Contributor

fbunt commented Oct 11, 2022

I have recently started working with some rather large vector data (44.5M features), and I think it has exposed a potential memory leak. The data was originally stored in a geodatabase, and I converted it to parquet with to_parquet. This is where the leak seems to occur: memory usage grew steadily as the data was saved to disk and was over 95 GB upon completion. The memory was not released, and I had to restart my IPython session to reclaim it.

This may be an issue with the pyarrow backend, but I figured I would file an issue here to get the ball rolling. I'll post my geopandas.show_versions() info below; I'm using dask_geopandas v0.2.0. Please let me know if there is any more info I can provide.

Edit: I checked the number of partitions and it looks like the data was split into 44.5K partitions. I wonder if the large number of partitions is causing problems.
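For reference, a minimal sketch of the conversion path described above (file names are hypothetical; the chunksize=1000 detail comes from later in this thread):

```python
import dask_geopandas

# Read the geodatabase in 1000-row chunks; with ~44.5M features this
# yields the ~44.5K partitions mentioned in the edit above.
ddf = dask_geopandas.read_file("data.gdb", chunksize=1000)

# Write to parquet; this is the step where memory reportedly grew past 95 GB.
ddf.to_parquet("data.parquet")
```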

SYSTEM INFO

python : 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
executable : /home/fred/anaconda3/envs/rstools/bin/python3.8
machine : Linux-5.15.0-47-generic-x86_64-with-glibc2.10

GEOS, GDAL, PROJ INFO

GEOS : 3.11.0
GEOS lib : /home/fred/anaconda3/envs/rstools/lib/libgeos_c.so
GDAL : 3.5.2
GDAL data dir: /home/fred/anaconda3/envs/rstools/share/gdal
PROJ : 9.0.1
PROJ data dir: /home/fred/anaconda3/envs/rstools/share/proj

PYTHON DEPENDENCIES

geopandas : 0.11.1
numpy : 1.23.3
pandas : 1.5.0
pyproj : 3.4.0
shapely : 1.8.4
fiona : 1.8.21
geoalchemy2: None
geopy : None
matplotlib : 3.6.0
mapclassify: 2.4.3
pygeos : 0.13
pyogrio : v0.4.2
psycopg2 : None
pyarrow : 9.0.0
rtree : 1.0.0

@jorisvandenbossche
Member

> Edit: I checked the number of partitions and it looks like the data was split into 44.5K partitions. I wonder if the large number of partitions is causing problems.

Given you have 44M rows, having 44K partitions seems way too high in any case, as it results in very small parquet files (which are also inefficient to read afterwards). How did you specify the partitioning?

But having that many partitions might also explain the high memory usage. For example, until recently dask would create a _metadata file by default, and to do this it needs to keep the parquet FileMetaData of each partition in memory so they can be combined and written to that file. With lots of partitions, that can cause memory issues, although I would expect it to add up to a couple of GBs, not necessarily 95. This can be turned off with write_metadata_file=False. Does that change anything if you try it?
Also, which version of dask are you using? (In the latest release, the default should already be to not write that file anymore.)
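A minimal sketch of that suggestion, assuming ddf is the dask-geopandas GeoDataFrame being written (the path is hypothetical; write_metadata_file is a dask.dataframe.to_parquet keyword that dask-geopandas forwards):

```python
# Skip writing the combined _metadata file, so the per-partition
# FileMetaData objects don't have to be accumulated in memory.
ddf.to_parquet("data.parquet", write_metadata_file=False)
```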

@fbunt
Contributor Author

fbunt commented Oct 12, 2022

I let it run again last night with 1K partitions, and the result was the same memory footprint, which was not released. I'll run it again today without the metadata file.

Edit: Ignore the above. That run used the same chunksize=1000 followed by a repartition, so the large number of partitions was still in the task graph (all ~44.5K read tasks still ran before being combined).

> How did you specify the partitioning?

I foolishly used old code that called read_file with chunksize=1000.

> But which version of dask are you using?

Dask: 2022.9.2

For data of this size, what do you recommend for the partition size or number of partitions?
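A sketch of one way to target larger partitions, using dask's byte-size target for repartition (the 100 MB figure is an illustrative assumption, not a recommendation from this thread):

```python
# Consolidate many small partitions into roughly 100 MB partitions
# before writing, instead of keeping 1000-row chunks from read_file.
ddf = ddf.repartition(partition_size="100MB")
ddf.to_parquet("data.parquet", write_metadata_file=False)
```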

@fbunt
Contributor Author

fbunt commented Oct 12, 2022

I ran it again with 1K partitions, and it finished much faster and without the extreme memory usage. The memory footprint still grew steadily over time, however, and ~1.3 GB was still locked up after completion. I tried it with and without the metadata file.
