
memory usage does not reflect size on disk #210

Open
dylanrstewart opened this issue Jul 15, 2022 · 5 comments
@dylanrstewart

This may be the expected behavior? I discovered this because I thought that setting the partition size manually with ddf.repartition(partition_size="100MB") to balance the partitions would also be reflected on disk (I was using the on-disk sizes to check that the partitions were balanced). If this is the expected behavior, maybe the repartition documentation could mention it in a couple of sentences?

Here is an example:

import dask_geopandas
ddf = dask_geopandas.read_parquet(in_file)
ddf.memory_usage_per_partition(deep=True).compute()
0     27158787
1      4173967
2     22696447
3     19924383
4     75885309
5     20496481
...
dtype: int64

When viewing the actual size of each of these partitions on disk they are:

0     179 MB
1     119.7 MB
2     146 MB
3     161.5 MB
4     446.2 MB
5     160.5 MB
...
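For reference, the on-disk numbers above can be collected with a small helper like this (a sketch: it assumes the dataset was written as one parquet file per partition into a single output directory; `partition_disk_sizes` is a hypothetical name, not a dask-geopandas API):

```python
from pathlib import Path


def partition_disk_sizes(out_dir):
    """Return {file name: size in bytes} for each parquet part in out_dir."""
    return {
        p.name: p.stat().st_size
        for p in sorted(Path(out_dir).glob("*.parquet"))
    }
```

Comparing this dict against `memory_usage_per_partition` makes the in-memory vs. on-disk gap easy to see per partition.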
@TomAugspurger
Contributor

This could be clearer in the dask documentation, but repartition, memory usage, etc. are all measuring the size of the objects in memory.

This will differ from the size on disk, depending on the file format, data types, compression, and probably other factors.

@martinfleis
Member

There is one related issue here that is specific to dask-geopandas. Pandas (used by dask to get memory usage) cannot see how large geometry objects are. It sees only a pointer to a C object, which is tiny compared to the actual memory footprint of a complex geometry. See:

import geopandas

df = geopandas.read_file(geopandas.datasets.get_path('nybb'))

df.memory_usage(deep=True)
Index         128
BoroCode       40
BoroName      326
Shape_Leng     40
Shape_Area     40
geometry       40
dtype: int64

In reality, the geometries in nybb are all really complex. There are 76063 coordinates, each composed of 2 float64 values, which takes 1217008 bytes in memory, not the 40 bytes that memory_usage suggests. We are aware of this, but it is not an easy issue to solve, unless we use tricks such as counting coordinates and estimating the size from that.
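The coordinate-counting trick mentioned above can be sketched with shapely 2.x (assuming shapely is installed; the polygon here is just a stand-in for a complex geometry). Each coordinate is two float64 values (16 bytes), so `count * 16` is a lower-bound estimate that ignores GEOS object overhead but is far closer to reality than the per-pointer count pandas reports.

```python
import shapely
from shapely.geometry import Point

# A polygon with many vertices, standing in for a complex real geometry
geom = Point(0, 0).buffer(1, quad_segs=64)

n_coords = shapely.count_coordinates(geom)
est_bytes = n_coords * 2 * 8  # 2 float64 values per coordinate

print(n_coords, "coordinates, roughly", est_bytes, "bytes")
```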

@FlorisCalkoen

FlorisCalkoen commented Apr 28, 2023

I think I'm running into quite a few issues because of unequal GeoParquet partitions, so now I'm trying to repartition my directories like this:

import dask_geopandas
import geopandas as gpd
import dask.dataframe as dd

df = gpd.GeoDataFrame(...)
ddf = dd.from_pandas(df, chunksize=1).repartition(partition_size="50MB")
ddf.to_parquet("/path/to/dir")

What do you think of this workaround? If you want I can provide a reproducible example, but I don't think it's required for this discussion.

@martinfleis
Member

It will do the trick, although I remember getting larger chunks than specified in some cases. But there have been some changes on the dask side in the meantime, so it may be better now. Just note that it is not super scalable.

@FlorisCalkoen

FlorisCalkoen commented Apr 28, 2023

Yes, the partitions are indeed sometimes larger than specified, yet much more equal than before. But regarding scalability I'm now experiencing new problems: I'm running into memory issues for the full dataset. That probably happens with the larger partitions (~1 GB) that are loaded into dask with chunksize=1 and then repartitioned. It's a memory usage error (first a warning, then an error):

 WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.27 GiB -- Worker memory limit: 7.82 GiB
2023-04-28 15:49:42,234 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:34751 (pid=22299) exceeded 95% memory budget.

I'm trying a different approach now, where I'm estimating the geometry size per row like this:

import sys

import dask
import dask.dataframe as dd
import numpy as np

# Estimate per-row geometry size from the WKB representation
get_geometry_memory_usage = lambda x: sys.getsizeof(x.geometry.wkb)
df["gmu"] = df.apply(get_geometry_memory_usage, axis=1)
df["gmu_cumsum"] = df.gmu.cumsum()
thresholds = np.arange(
    PARTITION_SIZE, df.gmu_cumsum.max() - PARTITION_SIZE, PARTITION_SIZE
)
# Positional cut points; include the start and end so no rows are dropped
partition_indices = (
    [0]
    + [df["gmu_cumsum"].searchsorted(threshold) for threshold in thresholds]
    + [len(df)]
)

partitions = [
    dask.delayed(df.iloc[start:end])
    for start, end in zip(partition_indices[:-1], partition_indices[1:])
]

ddf = dd.from_delayed(partitions)
ddf.to_parquet("/path/to/dir")

Unfortunately the first experiment still has similar memory warnings. Would there be a better way to partition geodataframes when you already have them in memory? I can add a reproducible example tomorrow if that's useful.
