memory usage does not reflect size on disk #210
This could be clearer in the dask documentation, but `repartition`, `memory_usage`, etc. all measure the size of the objects in memory. This will differ from the size on disk, depending on the file format, data types, compression, and probably other factors.
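To illustrate why the two numbers diverge, here is a toy sketch using `zlib` as a stand-in for Parquet's columnar encoding and compression (the sample data and its compression ratio are made up for illustration, not representative of real datasets):

```python
import zlib

# Repetitive data (like a low-cardinality column) compresses extremely well,
# so its size on disk is a fraction of its size in memory. This is one reason
# repartition targets (in-memory bytes) never match Parquet file sizes.
data = b"0123456789" * 100_000           # ~1 MB held in memory
in_memory = len(data)                    # 1_000_000 bytes
on_disk = len(zlib.compress(data))       # far smaller after compression
```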
There is one related issue to this that is specific to dask-geopandas. Pandas (used by dask to get memory usage) does not see how large geometry objects are. It sees only a pointer to a C object, which is tiny compared to the actual memory demands of a complex geometry. See:

```python
import geopandas

df = geopandas.read_file(geopandas.datasets.get_path('nybb'))
df.memory_usage(deep=True)
```

In reality, the geometries take far more memory than the reported pointer size.
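To see why the numbers from `memory_usage(deep=True)` are misleading here, a minimal sketch with a hypothetical `FakeGeometry` class standing in for a shapely geometry: `sys.getsizeof` (roughly what pandas relies on for object columns) only sees the small Python wrapper, not the coordinate data it owns:

```python
import sys

class FakeGeometry:
    """Hypothetical stand-in for a shapely geometry: a small Python wrapper
    around a large block of coordinate data (held in C by real GEOS objects)."""
    def __init__(self, n_coords):
        # ~16 bytes per (x, y) pair of float64 coordinates
        self._payload = b"\x00" * (n_coords * 16)

    @property
    def wkb(self):
        return self._payload

geom = FakeGeometry(10_000)                 # a 10,000-vertex geometry
wrapper_size = sys.getsizeof(geom)          # shallow size: just the tiny wrapper
payload_size = len(geom.wkb)                # the actual coordinate data: 160 kB
```

The gap between `wrapper_size` and `payload_size` is the same gap pandas hits with real geometries.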
I think I have quite some issues because of unequal GeoParquet partitions. So now I'm trying to repartition my directories like this:

```python
import dask_geopandas
import geopandas as gpd
import dask.dataframe as dd

df = gpd.GeoDataFrame(...)
ddf = dd.from_pandas(df, chunksize=1).repartition(partition_size="50MB")
ddf.to_parquet("/path/to/dir")
```

What do you think of this workaround? If you want I can provide a reproducible example, but I don't think it's required for this discussion.
It will do the trick, although I remember getting larger chunks than specified in some cases. But there have been some changes on the dask side in the meantime, so it may be better now. Just note that it is not super scalable.
Yes, the partitions are indeed sometimes larger than specified, yet much more equal than before. But regarding scalability I now experience new problems: I'm running into memory issues for the full dataset. Probably that happens with the larger partitions (~1 GB) that are loaded into dask:

```
WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 6.27 GiB -- Worker memory limit: 7.82 GiB
2023-04-28 15:49:42,234 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:34751 (pid=22299) exceeded 95% memory budget.
```

I'm trying a different approach now, where I'm estimating the geometry size per row like this:

```python
import sys

import dask
import dask.dataframe as dd
import numpy as np

get_geometry_memory_usage = lambda x: sys.getsizeof(x.geometry.wkb)
df["gmu"] = df.apply(get_geometry_memory_usage, axis=1)
df["gmu_cumsum"] = df.gmu.cumsum()
thresholds = np.arange(
    PARTITION_SIZE, df.gmu_cumsum.max() - PARTITION_SIZE, PARTITION_SIZE
)
# searchsorted returns positional indices, which is what iloc expects
partition_indices = [
    df["gmu_cumsum"].searchsorted(threshold) for threshold in thresholds
]
# include the first and last rows so no data is dropped
partition_indices = [0, *partition_indices, len(df)]
partitions = [
    dask.delayed(df.iloc[start:end])
    for start, end in zip(partition_indices[:-1], partition_indices[1:])
]
ddf = dd.from_delayed(partitions)
ddf.to_parquet("/path/to/dir")
```

Unfortunately the first experiment still has similar memory warnings. Would there be a better way to partition geodataframes when you already have them in memory? I can add a reproducible example tomorrow if that's useful.
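The cumulative-size idea above can also be sketched as a single pass that emits a slice boundary whenever the running total crosses the target; this pure-Python sketch (with made-up row sizes standing in for per-geometry WKB sizes) guarantees the first and last rows are always covered:

```python
# Hypothetical per-row geometry sizes in bytes; in practice these could come
# from len(geom.wkb) for each row, as in the snippet above.
row_sizes = [120, 4000, 80, 2500, 900, 3000, 50, 1800]
PARTITION_SIZE = 5000  # target bytes per partition

def split_by_cumulative_size(sizes, target):
    """Return (start, end) slice pairs so each chunk's total size
    stays near `target` (a chunk may overshoot by one row)."""
    boundaries = [0]
    running = 0
    for i, size in enumerate(sizes):
        running += size
        if running >= target:
            boundaries.append(i + 1)
            running = 0
    if boundaries[-1] != len(sizes):
        boundaries.append(len(sizes))  # don't drop a trailing partial chunk
    return list(zip(boundaries[:-1], boundaries[1:]))

chunks = split_by_cumulative_size(row_sizes, PARTITION_SIZE)
# Each (start, end) pair could then feed df.iloc[start:end] as above.
```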
This may be the expected behavior? I discovered this issue because I thought that setting the partition size manually with

```python
ddf.repartition(partition_size="100MB")
```

to try and balance these partitions would also be reflected on disk (and I was using that to check that partitions were being balanced). If this is the expected behavior, maybe there should be a couple of sentences in the repartition documentation? Here is an example:
When viewing the actual size of each of these partitions on disk they are: