
Datasets: add download_azure_container utility function #1887

Closed

Conversation

@Haimantika commented Feb 16, 2024

Needed for #1830

@github-actions bot added the datasets (Geospatial or benchmark datasets) and dependencies (Packaging and dependencies) labels Feb 16, 2024
@Haimantika (Author)

@Haimantika please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
    @microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
    @microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree

Two resolved review threads on pyproject.toml (outdated).
@adamjstewart (Collaborator)

@Haimantika are you still working on this? @darkblue-b may be willing to help finish this PR.

@Haimantika (Author)

> @Haimantika are you still working on this? @darkblue-b may be willing to help finish this PR.

Hi, I have tried wrapping my head around it, but I couldn't get past it. Apologies for not finishing it :(

@adamjstewart (Collaborator)

Do you mind if @darkblue-b takes it from here? Can he push directly to your PR branch?

@Haimantika (Author)

> Do you mind if @darkblue-b takes it from here? Can he push directly to your PR branch?

Yes please. I will follow the changes to learn :)
Thank you again for giving me the opportunity to try!

@github-actions bot added the documentation (Improvements or additions to documentation) label Mar 29, 2024
@adamjstewart (Collaborator)

I tested that the TropicalCyclone dataset can be downloaded using:

split = "train"
account_url = "https://radiantearth.blob.core.windows.net"
container_name = "mlhub"
name_starts_with = f"nasa-tropical-storm-challenge/{split}"

However, much of the dataset format has changed, so this dataset will require a large rewrite. I'm going to add tests and merge this just as a new utility function, then we can open separate PRs for each dataset.
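For context, a minimal sketch of how those values plug into the blob API, assuming anonymous access to the public container (this is illustration, not the code in the PR):

from azure.storage.blob import ContainerClient

split = "train"
account_url = "https://radiantearth.blob.core.windows.net"
container_name = "mlhub"
name_starts_with = f"nasa-tropical-storm-challenge/{split}"

# List the blobs that the utility would download.
client = ContainerClient(account_url, container_name)
for name in client.list_blob_names(name_starts_with=name_starts_with):
    print(name)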

@adamjstewart changed the title from "Migrating from Radiant MLHub to Source Cooperative" to "Datasets: add download_azure_container utility function" Mar 29, 2024
client = ContainerClient(*args, **kwargs)
for blob in client.list_blob_names(name_starts_with=name_starts_with):
Collaborator:

Could wrap this with multiprocessing to speed things up, tqdm to add a progress bar, and various other improvements. There is also an asynchronous version of the API, although I don't think that will benefit us here.
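For illustration, a sketch of that idea using a thread pool (likely a better fit than multiprocessing for I/O-bound downloads) with tqdm for progress; the helper names here are hypothetical, not from this PR:

from concurrent.futures import ThreadPoolExecutor, as_completed
import os

from azure.storage.blob import ContainerClient
from tqdm import tqdm

def _fetch(client: ContainerClient, root: str, name: str) -> None:
    # Recreate the blob's directory layout under root, then download it.
    path = os.path.join(root, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(client.download_blob(name).readall())

def download_parallel(client: ContainerClient, root: str, prefix: str) -> None:
    names = list(client.list_blob_names(name_starts_with=prefix))
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(_fetch, client, root, name) for name in names]
        for future in tqdm(as_completed(futures), total=len(futures)):
            future.result()  # re-raise any download error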

Collaborator:

TODO: compare download speeds to the CLI version and see if multiprocessing is required to make this comparable.
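A minimal harness for that comparison, assuming both sides download the same prefix to fresh directories (nothing here is in the PR):

import time

def timed(label: str, fn) -> None:
    # One-shot wall-clock timing; crude, but enough for an
    # order-of-magnitude comparison against the CLI.
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.1f}s")

# e.g., matching the PR's signature of (root, name_starts_with, *args):
# timed("blob loop", lambda: download_azure_container(root, prefix, account_url, container_name))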

Collaborator:

At first glance, this is orders of magnitude slower than azcopy sync. I'll try parallelizing it, but if it's still noticeably slower, I suggest we rely on the CLI tool instead. Downside is that it can't be pip install'ed, but being slow is a bigger sin.

root: str, name_starts_with: str, *args: Any, **kwargs: Any
) -> None:
"""Download a container from Azure blob storage.

Member:

I would just add some examples of how to actually instantiate ContainerClient here

Collaborator:

Would it help if we made account_url and container_name explicit parameters instead of putting them in args/kwargs? They are required by ContainerClient, but I didn't explicitly list them to keep the code as simple as possible. Either way, this function currently doesn't show up in the docs, as it's only intended for use in TorchGeo, not for use in other programs.
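For illustration, the explicit-parameter spelling under discussion might look like this sketch (not the code merged in this PR):

import os

from azure.storage.blob import ContainerClient

def download_azure_container(
    account_url: str, container_name: str, root: str, name_starts_with: str = ""
) -> None:
    """Download a container from Azure blob storage."""
    client = ContainerClient(account_url, container_name)
    for name in client.list_blob_names(name_starts_with=name_starts_with):
        # Mirror the blob hierarchy on the local filesystem.
        path = os.path.join(root, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(client.download_blob(name).readall())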

@github-actions bot added the testing (Continuous integration testing) label May 1, 2024
@adamjstewart marked this pull request as draft May 1, 2024 22:08
Comment on lines +27 to +29
# TODO: filehandle leak
f = open(path, "rb", buffering=0)
return f
Collaborator:

I don't really know what to do about this. On macOS, we're only allowed to open 256 files at a time. I don't think any of our text data will exceed this, but I don't know how else to code this to mimic the actual behavior.
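One hedged workaround, assuming the fixture files are small and callers only need a readable binary stream: copy the bytes into memory so no OS file descriptor stays open. Whether BytesIO mimics the real client closely enough is exactly the open question here; the function name is hypothetical.

import io

def open_fixture(path: str) -> io.BytesIO:
    # Read the file eagerly; BytesIO behaves like an unbuffered binary
    # stream but holds no OS-level descriptor, so the 256-file macOS
    # limit never comes into play.
    with open(path, "rb") as f:
        return io.BytesIO(f.read())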

@adamjstewart (Collaborator)

Unfortunately I think manual downloads with azure.storage.blob are simply too slow. I think we'll have to give up and use azcopy sync instead.
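A sketch of what delegating to the CLI might look like, assuming azcopy is installed and on PATH; the wrapper name is hypothetical, and `azcopy sync <source> <destination>` is standard azcopy usage rather than anything from this PR:

import shutil
import subprocess

def azcopy_sync(source_url: str, root: str) -> None:
    # azcopy must be installed separately; it cannot be pip-installed.
    if shutil.which("azcopy") is None:
        raise RuntimeError("azcopy not found on PATH")
    subprocess.run(["azcopy", "sync", source_url, root], check=True)

# e.g. azcopy_sync(
#     "https://radiantearth.blob.core.windows.net/mlhub/nasa-tropical-storm-challenge",
#     root,
# )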

Labels
  • datasets: Geospatial or benchmark datasets
  • dependencies: Packaging and dependencies
  • documentation: Improvements or additions to documentation
  • testing: Continuous integration testing

4 participants