
Implements variably-sized caches in 'RasterDataset' #1695

Open
wants to merge 1 commit into main

Conversation


@pmaldonado pmaldonado commented Oct 24, 2023

Closes #1694.

@github-actions github-actions bot added datasets Geospatial or benchmark datasets testing Continuous integration testing labels Oct 24, 2023
@pmaldonado
Author

@microsoft-github-policy-service agree

@pmaldonado
Author

The current error on tests / pytest (windows-latest, 3.10) appears to be unrelated to my changes:

unrar v5.30.0 [Approved]
unrar package files install completed. Performing other installation steps.
Attempt to get headers for http://www.rarlab.com/rar/unrarw32.exe failed.
  The remote file either doesn't exist, is unauthorized, or is forbidden for url 'http://www.rarlab.com/rar/unrarw32.exe'. Exception calling "GetResponse" with "0" argument(s): "The remote server returned an error: (404) Not Found."
Downloading unrar 
  from 'http://www.rarlab.com/rar/unrarw32.exe'
ERROR: The remote file either doesn't exist, is unauthorized, or is forbidden for url 'http://www.rarlab.com/rar/unrarw32.exe'. Exception calling "GetResponse" with "0" argument(s): "The remote server returned an error: (404) Not Found." 

@calebrob6
Member

Hey @pmaldonado, we're aware that the windows test is failing for unrelated reasons. Going to give it a day to see if it resolves, then look into it.

@adamjstewart
Collaborator

I wonder if we could somehow detect the total RAM size and set the default cache size based on that. I don't really want to add a new dependency on psutil though.

@pmaldonado
Author

I wonder if we could somehow detect the total RAM size and set the default cache size based on that. I don't really want to add a new dependency on psutil though.

Should "GeoDataset" change its behavior based on properties of the DataModule (the number of workers) or of the system (amount of RAM)?

A lightweight solution to our issue is to allow users to override the _cache_size member when subclassing "RasterDataset" (or to overwrite it at import time, if desirable). It's backwards compatible and lets users adjust memory usage to their needs, but doesn't require their judgement if the current behavior is sufficient.
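A rough sketch of that first option (all names here are illustrative, not torchgeo's actual internals): the class attribute sizes a per-instance `functools.lru_cache`, and a subclass simply overrides it.

```python
from functools import lru_cache

# Hypothetical sketch: `RasterDataset`, `_cache_size`, and `_load_file`
# stand in for the real torchgeo internals, which may differ.
class RasterDataset:
    # Class-level default cache size, overridable by subclasses.
    _cache_size: int = 128

    def __init__(self) -> None:
        # Build a per-instance LRU-cached loader sized by `_cache_size`.
        self._cached_load = lru_cache(maxsize=self._cache_size)(self._load_file)

    def _load_file(self, filepath: str) -> str:
        # Stand-in for the real warp/read logic.
        return f"loaded:{filepath}"


class SmallCacheDataset(RasterDataset):
    # A subclass shrinks the cache for a memory-constrained platform.
    _cache_size = 8
```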

An alternate implementation could be to allow the cache parameter to the constructor to be of type bool | int, interpreting an integer value as the desired cache size. This would also be backwards compatible, but would require updating all of the RasterDataset subclasses. The one clear benefit would be configuring the cache size upon dataset creation, rather than at definition or import time. This doesn't seem significant to me, since cache size should vary much more between applications/platforms than between datasets in a given application.

@adamjstewart
Collaborator

Should "GeoDataset" change its behavior based on properties of the DataModule (the number of workers) or of the system (amount of RAM)?

I think the filename LRU cache is shared across all processes so it isn't affected by the number of workers, but correct me if I'm wrong. That's just an assumption, I have no evidence.

cache size should vary much more between applications/platforms than between datasets in a given application.

This depends. If we know that one dataset uses COGs and one doesn't, that may affect how large we want the cache to be. But let's explore GDAL_CACHEMAX more first before making a decision here. It's possible that the filename LRU cache can be infinitely sized as long as GDAL_CACHEMAX is set appropriately.
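For reference, GDAL_CACHEMAX is a real GDAL configuration option that caps GDAL's internal raster block cache (small values are interpreted as megabytes). One way to experiment with it, sketched here, is to set the environment variable before GDAL performs any raster I/O:

```python
import os

# GDAL reads GDAL_CACHEMAX when its block cache is first used, so it must
# be set early, before any raster I/O happens. "512" here means 512 MB.
os.environ["GDAL_CACHEMAX"] = "512"
```

If a process-wide setting is undesirable, rasterio can also scope GDAL config options to a context via `rasterio.Env`, if I recall correctly.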

@calebrob6
Member

This is fine with me!

@adamjstewart
Collaborator

@pmaldonado have you tried GDAL_CACHEMAX to see if it's possible to control this via environment variable instead? See your original issue for my suggestions.

@adamjstewart
Collaborator

Ping @pmaldonado

@pmaldonado
Author

pmaldonado commented Jan 23, 2024

Thanks for the ping! Got pulled into a few other things. I'll give your suggestion with GDAL_CACHEMAX a shot later this week when I get some time.

Successfully merging this pull request may close these issues.

Support variably-sized caches in 'RasterDataset'