Support variably-sized caches in 'RasterDataset' #1694

Open
pmaldonado opened this issue Oct 24, 2023 · 6 comments · May be fixed by #1695

@pmaldonado

Summary

Currently, 'RasterDataset' caches warp files in a fixed-size LRU cache of 128 elements. I propose supporting variably-sized caches for subclasses.

Rationale

When loading large raster files, the fixed-size cache consumes considerable memory. For a given machine, this fixed overhead restricts the number of parallel DataLoader workers that can be used.

In our application, training batch creation is limited by the number of parallel workers rather than data access speeds. If we could reduce the size of caches during training, we could spawn additional dataloader workers and remove the present bottleneck.

Implementation

We'd add a cache-size member to 'RasterDataset' and apply the LRU cache to '_load_warp_file' in the constructor.
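
A minimal sketch of that idea, assuming a hypothetical `cache_size` constructor argument (this is a stripped-down stand-in, not the actual class or the #1695 implementation):

```python
import functools

import rasterio
from rasterio.io import DatasetReader


class RasterDataset:
    """Stripped-down stand-in for torchgeo's RasterDataset."""

    def __init__(self, root: str, cache_size: int = 128) -> None:
        self.root = root
        # Wrap the bound method at construction time so the cache size can
        # vary per instance; a plain @functools.lru_cache method decorator
        # cannot see instance attributes.
        self._load_warp_file = functools.lru_cache(maxsize=cache_size)(
            self._load_warp_file
        )

    def _load_warp_file(self, filepath: str) -> DatasetReader:
        # Real code would build a WarpedVRT when reprojection is needed;
        # a plain rasterio.open is enough to illustrate the caching.
        return rasterio.open(filepath)
```

Each instance then gets its own cache, so a training-time dataset could use a much smaller maxsize than the current fixed 128.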

Alternatives

There may be others, but this approach plays (relatively) nicely with MyPy and works around the inability of method decorators to access class or instance members.

Additional information

No response

@calebrob6
Member

Do you mind expanding on the rationale section a bit (mainly for my curiosity)? E.g. how big are your files / how much RAM is being consumed, etc.

@pmaldonado
Author

We've been using a fork of 'RasterDataset' and have tweaked the cache size, batch size, and number of workers to get things to behave nicely. We now routinely use 60+ GB of RAM; before tweaking, we would run out of RAM on a machine configured with 125 GB. We're using NAIP quarter-quadrangle tiles downsampled to 1 m/pixel, which are ~160 MB each (or in some cases quartered again to be ~40 MB/image).

Allowing variably-sized caches, in effect, lets users optimize the ratio of worker processes to cache memory per process for batch loading on their platform. We've found that preparing batches (prior to transfer to the GPU) is compute-bound rather than IO-bound (thanks to the cache), but we would like to speed up batch loading by exchanging smaller caches for additional worker processes. The optimal point would be when batch loading again becomes IO-bound due to files getting rotated out of the cache.

@adamjstewart
Collaborator

Also relates to #1438 (@patriksabol) and #1578 (@trettelbach)

@adamjstewart adamjstewart added the datasets Geospatial or benchmark datasets label Oct 24, 2023
@adamjstewart
Collaborator

Curious if any GDAL config options (especially GDAL_CACHEMAX) help at all here.
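
For anyone trying this, a sketch of how the option could be set; GDAL_CACHEMAX is a real GDAL config option, but the 64 MB value and the file path below are placeholders:

```python
import os

# Option 1: environment variable, inherited by DataLoader worker processes.
# Small values are interpreted by GDAL as megabytes.
os.environ["GDAL_CACHEMAX"] = "64"

# Option 2: scoped via rasterio (which torchgeo uses for raster I/O).
import rasterio

with rasterio.Env(GDAL_CACHEMAX=64):
    with rasterio.open("example.tif") as src:  # placeholder path
        band = src.read(1)
```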

@pmaldonado
Author

> Also relates to #1438 (@patriksabol) and #1578 (@trettelbach)

We've observed a similar "sawtooth" pattern to memory usage.

[Image: memory usage over time, showing a sawtooth pattern]

This is when training with a significant number of dataloader workers with pin_memory=True and persistent_workers=False. The former reduces the number of page faults for the worker processes by page-locking the memory used to load their batches (preventing those pages from being swapped out). The latter allows worker processes to be killed between epochs, which introduces overhead at the start of each epoch or validation loop to create new worker processes (and refill their caches from empty).
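
For reference, a hypothetical DataLoader configuration matching the settings described above (the in-memory dataset is just a stand-in for a torchgeo RasterDataset plus sampler):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in practice this is a RasterDataset driven by a GeoSampler.
dataset = TensorDataset(torch.randn(256, 4, 64, 64))

loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=16,            # many parallel worker processes
    pin_memory=True,           # page-lock batch memory for faster host-to-GPU copies
    persistent_workers=False,  # workers (and their warp-file caches) are torn down each epoch
)
```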

Ideally, we'd have both pin_memory=True and persistent_workers=True to optimize batch loading for training. However, with both of those settings, we see a precipitous drop in available memory after the first validation loop/second training epoch. My interpretation has been that the warp-file caches are not yet full after the first epoch, so their memory usage continues to grow during subsequent epochs unless the workers are killed when the validation workers start. When persistent, the training workers grow their pinned memory until either their warp-file caches are saturated or the system OOMs (due to having more pinned memory than physical memory).

If we can decrease the cache size for each worker, then we should be able to have persistent workers whose maximum memory consumption is less than the system's physical memory constraints.

It's easy to do some back-of-the-envelope math to see how the workers' memory consumption explodes:

16 workers × 128 files per worker cache × ~100 MB per file > 200 GB

That's without considering any Python/process overhead.
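
Running the same worst-case estimate for a few hypothetical cache sizes shows how shrinking the cache frees headroom for more workers (file size and worker counts are illustrative only):

```python
# Worst-case warp-file cache memory across all workers, ignoring Python and
# process overhead.
FILE_MB = 100  # illustrative per-file size

for num_workers, cache_size in [(16, 128), (16, 32), (32, 32), (64, 8)]:
    total_gb = num_workers * cache_size * FILE_MB / 1024
    print(f"{num_workers} workers x {cache_size} cached files -> ~{total_gb:.0f} GB")
```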

@adamjstewart Is your idea that if we could adjust GDAL_CACHEMAX and use COGs, we could lower the effective memory usage of the fixed-size LRU cache? From that link you shared, it appears that GDAL_FORCE_CACHING is False by default. Do you have a sense whether the GDAL cache would be per-process or shared across the entire system? The cache defaults to 5%, so if per-process we'd expect it to consume a very large amount of physical memory.

@adamjstewart
Collaborator

> Is your idea that if we could adjust GDAL_CACHEMAX and use COGs, we could lower the effective memory usage of the fixed-size LRU cache?

Yes, wondering if the bug could be avoided by a simple environment variable.

> Do you have a sense whether the GDAL cache would be per-process or shared across the entire system?

I don't have a sense but that may actually explain the issue if it's per-process. Can you experiment with various GDAL_CACHEMAX and num_workers and see if you can find the answer to this question?
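
A rough sketch of such an experiment, with all values, the timing loop, and the in-memory stand-in dataset being placeholders rather than a prescribed benchmark (the stand-in would need to be swapped for a real RasterDataset for GDAL_CACHEMAX to matter):

```python
import os
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_one_epoch(cachemax_mb: int, num_workers: int) -> float:
    # Worker processes inherit the environment, so GDAL in each worker picks
    # up the config option.
    os.environ["GDAL_CACHEMAX"] = str(cachemax_mb)
    dataset = TensorDataset(torch.randn(256, 4, 64, 64))  # replace with a RasterDataset
    loader = DataLoader(dataset, batch_size=16, num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass
    return time.perf_counter() - start


for cachemax_mb in (64, 256, 1024):
    for num_workers in (4, 8, 16):
        elapsed = time_one_epoch(cachemax_mb, num_workers)
        print(f"GDAL_CACHEMAX={cachemax_mb} num_workers={num_workers}: {elapsed:.1f}s")
```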
