Datasets: add azcopy download support #2043

Closed
wants to merge 5 commits into from

adamjstewart
Collaborator

This PR adds an azcopy function to torchgeo.datasets.utils that makes it easier to download datasets from Azure Blob Storage (such as Source Cooperative). It's basically just a wrapper around subprocess.run, but with a more useful error message if azcopy isn't installed. It can be used as follows:

from torchgeo.datasets.utils import azcopy

azcopy("sync", "https://radiantearth.blob.core.windows.net/mlhub/nasa-tropical-storm-challenge", ".", "--recursive=true")
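For readers unfamiliar with the pattern, here is a minimal sketch of what "a wrapper around subprocess.run with a more useful error message" could look like. This is not the code from this PR: the name `run_cli`, the wording of the error message, and the choice of `FileNotFoundError` are all assumptions made for illustration.

```python
import shutil
import subprocess
from typing import Any


def run_cli(tool: str, *args: Any, **kwargs: Any) -> "subprocess.CompletedProcess[Any]":
    """Hypothetical wrapper: run `tool` with `args`, failing clearly if it is missing.

    Extra keyword arguments are passed through to subprocess.run.
    """
    # shutil.which returns None when the executable cannot be found on PATH
    if shutil.which(tool) is None:
        raise FileNotFoundError(
            f"'{tool}' is not installed or not on PATH; "
            "install it before downloading this dataset."
        )

    # Raise CalledProcessError on a non-zero exit code unless the caller opts out
    kwargs.setdefault("check", True)
    return subprocess.run([tool, *map(str, args)], **kwargs)
```

With such a helper, the call above would roughly correspond to `run_cli("azcopy", "sync", url, ".", "--recursive=true")`. Checking with `shutil.which` up front lets the error name the missing tool, rather than surfacing whatever the operating system reports when the executable cannot be launched.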

The hardest part was testing. We don't want our tests to require internet access or download massive datasets, so we need to use local fake data to test. But we also can't get full test coverage unless we actually attempt to "download" the data, and azcopy doesn't support local <-> local file transfers like rsync does. My solution was to create a fake azcopy command that can copy local files and inject this first in the PATH. I don't know of a reliable way to test when this command isn't available, so we may need to change CI a bit.
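The fake-command approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the test code from this PR: it assumes a POSIX system (executable bit and shebang line), the script body and the helper names `install_fake_azcopy` and `demo` are hypothetical, and the fake only understands the `sync SOURCE DESTINATION` form.

```python
import os
import shutil
import stat
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict

# Hypothetical stand-in for azcopy: copies a local directory tree.
# The shebang points at the current interpreter so the script is runnable as-is.
FAKE_AZCOPY = f"""#!{sys.executable}
import shutil
import sys

# Expected call: azcopy sync SOURCE DESTINATION [--flags]
positional = [a for a in sys.argv[2:] if not a.startswith("--")]
source, destination = positional[0], positional[1]
shutil.copytree(source, destination, dirs_exist_ok=True)
"""


def install_fake_azcopy(bin_dir: Path) -> Dict[str, str]:
    """Write the fake executable into bin_dir; return an env with bin_dir first in PATH."""
    exe = bin_dir / "azcopy"
    exe.write_text(FAKE_AZCOPY)
    exe.chmod(exe.stat().st_mode | stat.S_IXUSR)  # mark as executable (POSIX)
    env = dict(os.environ)
    env["PATH"] = str(bin_dir) + os.pathsep + env.get("PATH", "")
    return env


def demo() -> bool:
    """Copy a fake 'remote' directory to a 'local' one through the fake azcopy."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "bin").mkdir()
        (root / "remote").mkdir()
        (root / "remote" / "data.txt").write_text("fake")

        env = install_fake_azcopy(root / "bin")
        # Resolve against the modified PATH so the fake is found before any real azcopy
        exe = shutil.which("azcopy", path=env["PATH"])
        assert exe is not None
        subprocess.run(
            [exe, "sync", str(root / "remote"), str(root / "local"), "--recursive=true"],
            env=env,
            check=True,
        )
        return (root / "local" / "data.txt").read_text() == "fake"
```

In a pytest suite, the same effect could presumably be achieved with `monkeypatch.setenv("PATH", ...)` so that the code under test picks up the fake without any changes.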

Prerequisite for #1830
Closes #1887
Closes #1915

@Haimantika @darkblue-b Once this is reviewed and merged, I could use your help in porting our existing datasets to use this (full list in #1830). Unfortunately, many of the datasets appear to have completely changed their file hierarchy, so some of them may require more than a simple one-function update.

@adamjstewart adamjstewart added this to the 0.6.0 milestone May 3, 2024
@github-actions github-actions bot added datasets Geospatial or benchmark datasets testing Continuous integration testing labels May 3, 2024