Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add verify_existing option to cache #630

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

cisaacstern
Copy link
Member

Adds the option to disable verification of files in the cache against their remote sizes.

For long-running / large caching operations which need to be restarted, simply verifying remote sizes can itself represent a large time cost. In such cases, it may be preferable to disable verification of remote sizes, to allow for more expedient continuation of the caching operation.

@cisaacstern cisaacstern added the test-integration Apply this label to run integration tests on a PR. label Sep 21, 2023
@cisaacstern
Copy link
Member Author

cisaacstern commented Sep 21, 2023

Copy link
Contributor

@moradology moradology left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

def cache_file(self, fname: str, secrets: Optional[dict], **open_kwargs) -> None:
# check and see if the file already exists in the cache
logger.info(f"Caching file '{fname}'")
if self.exists(fname):
exists = self.exists(fname)
if exists and self.verify_existing:
cached_size = self.size(fname)
remote_size = _get_url_size(fname, secrets, **open_kwargs)
if cached_size == remote_size:
# TODO: add checksumming here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference, do we have a way of doing this without downloading the file and computing the checksum? S3 provides that information but arbitrary HTTP endpoints can't be relied upon to provide the SHA in a predictable way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so digging into fsspec source a bit, it appears that fsspec's info method ought to be used here. It will, by default, avoid opening the file and instead attempt to get both size and checksum from a head request. It looks to me as though there's no benefit to caching with the current method if verify_existing is True, as it will get the size of an opened file rather than using a small HEAD request

Would you like me to amend this PR with some logic to lean on those methods? Probably good to provide a warning in case it is absolutely necessary to fully load the file up (as happens here) for size/checksum validation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test-integration Apply this label to run integration tests on a PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants