
memory usage does not reflect size on disk #210

Open
dylanrstewart opened this issue Jul 15, 2022 · 5 comments
@dylanrstewart

This may be the expected behavior? I discovered this because I thought that setting the partition size manually with ddf.repartition(partition_size="100MB") to balance the partitions would also be reflected on disk (I was using the on-disk sizes to check that the partitions were balanced). If this is the expected behavior, maybe the repartition documentation could mention it in a couple of sentences?

Here is an example:

import dask_geopandas
ddf = dask_geopandas.read_parquet(in_file)
ddf.memory_usage_per_partition(deep=True).compute()
0     27158787
1      4173967
2     22696447
3     19924383
4     75885309
5     20496481
...
dtype: int64

When viewing the actual size of each of these partitions on disk they are:

0     179 MB
1     119.7 MB
2     146 MB
3     161.5 MB
4     446.2 MB
5     160.5 MB
...
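For reference, the on-disk numbers above can be collected with a small helper like this (a sketch: it assumes the dataset was written as one parquet file per partition into a single output directory; `partition_disk_sizes` is a hypothetical name, not a dask-geopandas API):

```python
from pathlib import Path


def partition_disk_sizes(out_dir):
    """Return {file name: size in bytes} for each parquet part in out_dir."""
    return {
        p.name: p.stat().st_size
        for p in sorted(Path(out_dir).glob("*.parquet"))
    }
```

Comparing this dict against `memory_usage_per_partition` makes the in-memory vs. on-disk gap easy to see per partition.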
@TomAugspurger
Contributor

This could be clearer in the dask documentation, but repartition, memory usage, etc. are all measuring the size of the objects in memory.

This will differ from the size on disk, depending on the file format, data types, compression, and probably other factors.

@martinfleis
Member

There is one related issue here that is specific to dask-geopandas. Pandas (used by dask to get memory usage) cannot see how large geometry objects are. It sees only a pointer to a C object, which is tiny compared to the actual memory footprint of a complex geometry. See:

import geopandas

df = geopandas.read_file(geopandas.datasets.get_path('nybb'))

df.memory_usage(deep=True)
Index         128
BoroCode       40
BoroName      326
Shape_Leng     40
Shape_Area     40
geometry       40
dtype: int64

In reality, the geometries in nybb are all really complex. There are 76063 coordinates, each composed of 2 float64 values, which takes 1217008 bytes in memory, not the 40 bytes that memory_usage suggests. We are aware of this, but it is not an easy issue to solve, unless we use tricks such as counting coordinates and estimating the size from that.
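The coordinate-counting trick mentioned above can be sketched with shapely 2.x (assuming shapely is installed; the polygon here is just a stand-in for a complex geometry). Each coordinate is two float64 values (16 bytes), so `count * 16` is a lower-bound estimate that ignores GEOS object overhead but is far closer to reality than the per-pointer count pandas reports.

```python
import shapely
from shapely.geometry import Point

# A polygon with many vertices, standing in for a complex real geometry
geom = Point(0, 0).buffer(1, quad_segs=64)

n_coords = shapely.count_coordinates(geom)
est_bytes = n_coords * 2 * 8  # 2 float64 values per coordinate

print(n_coords, "coordinates, roughly", est_bytes, "bytes")
```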

@FlorisCalkoen

FlorisCalkoen commented Apr 28, 2023

I think I'm running into quite a few issues because of unequal GeoParquet partitions, so now I'm trying to repartition my directories like this:

import dask_geopandas
import geopandas as gpd
import dask.dataframe as dd

df = gpd.GeoDataFrame(...)
ddf = dd.from_pandas(df, chunksize=1).repartition(partition_size="50MB")
ddf.to_parquet("/path/to/dir")

What do you think of this workaround? If you want I can provide a reproducible example, but I don't think it's required for this discussion.

@martinfleis
Member

It will do the trick, although I remember getting larger chunks than specified in some cases. But there have been some changes on the dask side in the meantime, so it may be better now. Just note that it is not super scalable.

@FlorisCalkoen

FlorisCalkoen commented Apr 28, 2023

Yes, the partitions are indeed sometimes larger than specified, yet much more equal than before. But regarding scalability I'm now experiencing new problems: I'm running into memory issues for the full dataset. That probably happens with the larger partitions (~1 GB) that are loaded into dask with chunksize=1 and then repartitioned. It's a memory usage error (first a warning, then an error):

 WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.27 GiB -- Worker memory limit: 7.82 GiB
2023-04-28 15:49:42,234 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:34751 (pid=22299) exceeded 95% memory budget.

I'm trying a different approach now, where I'm estimating the geometry size per row like this:

import sys

import dask
import dask.dataframe as dd
import numpy as np

# Estimate per-row geometry size from the WKB representation
get_geometry_memory_usage = lambda x: sys.getsizeof(x.geometry.wkb)
df["gmu"] = df.apply(get_geometry_memory_usage, axis=1)
df["gmu_cumsum"] = df.gmu.cumsum()
thresholds = np.arange(
    PARTITION_SIZE, df.gmu_cumsum.max() - PARTITION_SIZE, PARTITION_SIZE
)
# Positional cut points; include the start and end so no rows are dropped
partition_indices = (
    [0]
    + [df["gmu_cumsum"].searchsorted(threshold) for threshold in thresholds]
    + [len(df)]
)

partitions = [
    dask.delayed(df.iloc[start:end])
    for start, end in zip(partition_indices[:-1], partition_indices[1:])
]

ddf = dd.from_delayed(partitions)
ddf.to_parquet("/path/to/dir")

Unfortunately the first experiment still has similar memory warnings. Would there be a better way to partition geodataframes when you already have them in memory? I can add a reproducible example tomorrow if that's useful.
