Support variably-sized caches in 'RasterDataset' #1694

Open
pmaldonado opened this issue Oct 24, 2023 · 6 comments · May be fixed by #1695

@pmaldonado

Summary

Currently, 'RasterDataset' caches warp files in a fixed-size LRU cache of 128 elements. I propose supporting variably-sized caches for subclasses.

Rationale

When loading large raster files, the fixed-size cache consumes considerable memory. For a given machine, this fixed overhead restricts the number of parallel DataLoader workers that can be used.

In our application, training batch creation is limited by the number of parallel workers rather than data access speeds. If we could reduce the size of caches during training, we could spawn additional dataloader workers and remove the present bottleneck.

Implementation

We'd add a cache-size member to 'RasterDataset' and apply the LRU cache to '_load_warp_file' in the constructor.
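
A minimal sketch of that idea, assuming a hypothetical `cache_size` constructor argument (this is a stripped-down stand-in, not the actual class or the #1695 implementation):

```python
import functools

import rasterio
from rasterio.io import DatasetReader


class RasterDataset:
    """Stripped-down stand-in for torchgeo's RasterDataset."""

    def __init__(self, root: str, cache_size: int = 128) -> None:
        self.root = root
        # Wrap the bound method at construction time so the cache size can
        # vary per instance; a plain @functools.lru_cache method decorator
        # cannot see instance attributes.
        self._load_warp_file = functools.lru_cache(maxsize=cache_size)(
            self._load_warp_file
        )

    def _load_warp_file(self, filepath: str) -> DatasetReader:
        # Real code would build a WarpedVRT when reprojection is needed;
        # a plain rasterio.open is enough to illustrate the caching.
        return rasterio.open(filepath)
```

Each instance then gets its own cache, so a training-time dataset could use a much smaller maxsize than the current fixed 128.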

Alternatives

There may be others, but this approach plays (relatively) nicely with MyPy and works around the inability of method decorators to access class or instance members.

Additional information

No response

@calebrob6
Member

Do you mind expanding on the rationale section a bit (mainly for my curiosity)? E.g. how big are your files / how much RAM is being consumed, etc.

@pmaldonado
Author

We've been using a fork of 'RasterDataset' and have tweaked the cache size, batch size, and number of workers to get things to behave nicely. We now routinely use 60+ GB of RAM; before tweaking, we would run out of RAM on a machine configured with 125 GB. We're using NAIP quarter-quadrangle tiles downsampled to 1 m/pixel, which are ~160 MB each (or in some cases quartered again to be ~40 MB/image).

Allowing variably-sized caches, in effect, lets users optimize the ratio of worker processes to cache memory per process for batch loading on their platform. We've found that preparing batches (prior to transfer to the GPU) is compute-bound rather than IO-bound (thanks to the cache), but we would like to speed up batch loading by exchanging smaller caches for additional worker processes. The optimal point would be when batch loading again becomes IO-bound due to files getting rotated out of the cache.

@adamjstewart
Collaborator

Also relates to #1438 (@patriksabol) and #1578 (@trettelbach)

@adamjstewart adamjstewart added the datasets Geospatial or benchmark datasets label Oct 24, 2023
@adamjstewart
Collaborator

Curious if any GDAL config options (especially GDAL_CACHEMAX) help at all here.
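
For anyone trying this, a sketch of how the option could be set; GDAL_CACHEMAX is a real GDAL config option, but the 64 MB value and the file path below are placeholders:

```python
import os

# Option 1: environment variable, inherited by DataLoader worker processes.
# Small values are interpreted by GDAL as megabytes.
os.environ["GDAL_CACHEMAX"] = "64"

# Option 2: scoped via rasterio (which torchgeo uses for raster I/O).
import rasterio

with rasterio.Env(GDAL_CACHEMAX=64):
    with rasterio.open("example.tif") as src:  # placeholder path
        band = src.read(1)
```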

@pmaldonado
Author

> Also relates to #1438 (@patriksabol) and #1578 (@trettelbach)

We've observed a similar "sawtooth" pattern to memory usage.

[Image: memory usage over time, showing a sawtooth pattern]

This is when training with a significant number of dataloader workers with pin_memory=True and persistent_workers=False. The former reduces the number of page faults for the worker processes by page-locking the memory used to load their batches (preventing those pages from being swapped out). The latter allows worker processes to be killed between epochs, which introduces overhead at the start of each epoch or validation loop to create new worker processes (and refill their caches from empty).
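
For reference, a hypothetical DataLoader configuration matching the settings described above (the in-memory dataset is just a stand-in for a torchgeo RasterDataset plus sampler):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in practice this is a RasterDataset driven by a GeoSampler.
dataset = TensorDataset(torch.randn(256, 4, 64, 64))

loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=16,            # many parallel worker processes
    pin_memory=True,           # page-lock batch memory for faster host-to-GPU copies
    persistent_workers=False,  # workers (and their warp-file caches) are torn down each epoch
)
```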

Ideally, we'd have both pin_memory=True and persistent_workers=True to optimize batch loading for training. However, with both of those settings, we see a precipitous drop in available memory after the first validation loop/second training epoch. My interpretation has been that the warp-file caches are not yet full after the first epoch, so their memory usage continues to grow during subsequent epochs unless the workers are killed when the validation workers start. When persistent, the training workers grow their pinned memory until either their warp-file caches are saturated or the system OOMs (due to having more pinned memory than physical memory).

If we can decrease the cache size for each worker, then we should be able to have persistent workers whose maximum memory consumption is less than the system's physical memory constraints.

It's easy to do some back-of-the-envelope math to see how the workers' memory consumption explodes:

16 workers × 128 files per worker cache × ~100 MB per file > 200 GB

That's without considering any Python/process overhead.
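
Running the same worst-case estimate for a few hypothetical cache sizes shows how shrinking the cache frees headroom for more workers (file size and worker counts are illustrative only):

```python
# Worst-case warp-file cache memory across all workers, ignoring Python and
# process overhead.
FILE_MB = 100  # illustrative per-file size

for num_workers, cache_size in [(16, 128), (16, 32), (32, 32), (64, 8)]:
    total_gb = num_workers * cache_size * FILE_MB / 1024
    print(f"{num_workers} workers x {cache_size} cached files -> ~{total_gb:.0f} GB")
```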

@adamjstewart Is your idea that if we could adjust GDAL_CACHEMAX and use COGs, we could lower the effective memory usage of the fixed-size LRU cache? From that link you shared, it appears that GDAL_FORCE_CACHING is False by default. Do you have a sense whether the GDAL cache would be per-process or shared across the entire system? The cache defaults to 5%, so if per-process we'd expect it to consume a very large amount of physical memory.

@adamjstewart
Collaborator

> Is your idea that if we could adjust GDAL_CACHEMAX and use COGs, we could lower the effective memory usage of the fixed-size LRU cache?

Yes, wondering if the bug could be avoided by a simple environment variable.

> Do you have a sense whether the GDAL cache would be per-process or shared across the entire system?

I don't have a sense but that may actually explain the issue if it's per-process. Can you experiment with various GDAL_CACHEMAX and num_workers and see if you can find the answer to this question?
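
A rough sketch of such an experiment, with all values, the timing loop, and the in-memory stand-in dataset being placeholders rather than a prescribed benchmark (the stand-in would need to be swapped for a real RasterDataset for GDAL_CACHEMAX to matter):

```python
import os
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_one_epoch(cachemax_mb: int, num_workers: int) -> float:
    # Worker processes inherit the environment, so GDAL in each worker picks
    # up the config option.
    os.environ["GDAL_CACHEMAX"] = str(cachemax_mb)
    dataset = TensorDataset(torch.randn(256, 4, 64, 64))  # replace with a RasterDataset
    loader = DataLoader(dataset, batch_size=16, num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass
    return time.perf_counter() - start


for cachemax_mb in (64, 256, 1024):
    for num_workers in (4, 8, 16):
        elapsed = time_one_epoch(cachemax_mb, num_workers)
        print(f"GDAL_CACHEMAX={cachemax_mb} num_workers={num_workers}: {elapsed:.1f}s")
```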
