
Datasets: add download_azure_container utility function #1887

Closed

Conversation

@Haimantika commented Feb 16, 2024

Needed for #1830

@github-actions bot added the datasets (Geospatial or benchmark datasets) and dependencies (Packaging and dependencies) labels Feb 16, 2024
@Haimantika (Author)

@Haimantika please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
    @microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
    @microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree

Two resolved review threads on pyproject.toml (outdated).
@adamjstewart (Collaborator)

@Haimantika are you still working on this? @darkblue-b may be willing to help finish this PR.

@Haimantika (Author)

> @Haimantika are you still working on this? @darkblue-b may be willing to help finish this PR.

Hi, I have tried wrapping my head around it, but I couldn't get past it. Apologies for not finishing it :(

@adamjstewart (Collaborator)

Do you mind if @darkblue-b takes it from here? Can he push directly to your PR branch?

@Haimantika (Author)

> Do you mind if @darkblue-b takes it from here? Can he push directly to your PR branch?

Yes please. I will follow the changes to learn :)
Thank you again for giving me the opportunity to try!

@github-actions bot added the documentation (Improvements or additions to documentation) label Mar 29, 2024
@adamjstewart (Collaborator)

I tested that the TropicalCyclone dataset can be downloaded using:

split = "train"
account_url = "https://radiantearth.blob.core.windows.net"
container_name = "mlhub"
name_starts_with = f"nasa-tropical-storm-challenge/{split}"

However, much of the dataset format has changed, so this dataset will require a large rewrite. I'm going to add tests and merge this just as a new utility function, then we can open separate PRs for each dataset.
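For context, a minimal sketch of how those values plug into the blob API, assuming anonymous access to the public container (this is illustration, not the code in the PR):

from azure.storage.blob import ContainerClient

split = "train"
account_url = "https://radiantearth.blob.core.windows.net"
container_name = "mlhub"
name_starts_with = f"nasa-tropical-storm-challenge/{split}"

# List the blobs that the utility would download.
client = ContainerClient(account_url, container_name)
for name in client.list_blob_names(name_starts_with=name_starts_with):
    print(name)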

@adamjstewart changed the title from "Migrating from Radiant MLHub to Source Cooperative" to "Datasets: add download_azure_container utility function" Mar 29, 2024
client = ContainerClient(*args, **kwargs)
for blob in client.list_blob_names(name_starts_with=name_starts_with):
Collaborator:

Could wrap this with multiprocessing to speed things up, tqdm to add a progress bar, and various other improvements. There is also an asynchronous version of the API, although I don't think that will benefit us here.
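For illustration, a sketch of that idea using a thread pool (likely a better fit than multiprocessing for I/O-bound downloads) with tqdm for progress; the helper names here are hypothetical, not from this PR:

from concurrent.futures import ThreadPoolExecutor, as_completed
import os

from azure.storage.blob import ContainerClient
from tqdm import tqdm

def _fetch(client: ContainerClient, root: str, name: str) -> None:
    # Recreate the blob's directory layout under root, then download it.
    path = os.path.join(root, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(client.download_blob(name).readall())

def download_parallel(client: ContainerClient, root: str, prefix: str) -> None:
    names = list(client.list_blob_names(name_starts_with=prefix))
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(_fetch, client, root, name) for name in names]
        for future in tqdm(as_completed(futures), total=len(futures)):
            future.result()  # re-raise any download error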

Collaborator:

TODO: compare download speeds to the CLI version and see if multiprocessing is required to make this comparable.
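A minimal harness for that comparison, assuming both sides download the same prefix to fresh directories (nothing here is in the PR):

import time

def timed(label: str, fn) -> None:
    # One-shot wall-clock timing; crude, but enough for an
    # order-of-magnitude comparison against the CLI.
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.1f}s")

# e.g., matching the PR's signature of (root, name_starts_with, *args):
# timed("blob loop", lambda: download_azure_container(root, prefix, account_url, container_name))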

Collaborator:

At first glance, this is orders of magnitude slower than azcopy sync. I'll try parallelizing it, but if it's still noticeably slower, I suggest we rely on the CLI tool instead. Downside is that it can't be pip install'ed, but being slow is a bigger sin.

root: str, name_starts_with: str, *args: Any, **kwargs: Any
) -> None:
"""Download a container from Azure blob storage.

Member:

I would just add some examples of how to actually instantiate ContainerClient here

Collaborator:

Would it help if we made account_url and container_name explicit parameters instead of putting them in args/kwargs? They are required by ContainerClient, but I didn't explicitly list them to keep the code as simple as possible. Either way, this function currently doesn't show up in the docs, as it's only intended for use in TorchGeo, not for use in other programs.
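For illustration, the explicit-parameter spelling under discussion might look like this sketch (not the code merged in this PR):

import os

from azure.storage.blob import ContainerClient

def download_azure_container(
    account_url: str, container_name: str, root: str, name_starts_with: str = ""
) -> None:
    """Download a container from Azure blob storage."""
    client = ContainerClient(account_url, container_name)
    for name in client.list_blob_names(name_starts_with=name_starts_with):
        # Mirror the blob hierarchy on the local filesystem.
        path = os.path.join(root, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(client.download_blob(name).readall())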

@github-actions bot added the testing (Continuous integration testing) label May 1, 2024
@adamjstewart marked this pull request as draft May 1, 2024 22:08
Comment on lines +27 to +29
# TODO: filehandle leak
f = open(path, "rb", buffering=0)
return f
Collaborator:

I don't really know what to do about this. On macOS, we're only allowed to open 256 files at a time. I don't think any of our text data will exceed this, but I don't know how else to code this to mimic the actual behavior.
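One hedged workaround, assuming the fixture files are small and callers only need a readable binary stream: copy the bytes into memory so no OS file descriptor stays open. Whether BytesIO mimics the real client closely enough is exactly the open question here; the function name is hypothetical.

import io

def open_fixture(path: str) -> io.BytesIO:
    # Read the file eagerly; BytesIO behaves like an unbuffered binary
    # stream but holds no OS-level descriptor, so the 256-file macOS
    # limit never comes into play.
    with open(path, "rb") as f:
        return io.BytesIO(f.read())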

@adamjstewart (Collaborator)

Unfortunately I think manual downloads with azure.storage.blob are simply too slow. I think we'll have to give up and use azcopy sync instead.
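A sketch of what delegating to the CLI might look like, assuming azcopy is installed and on PATH; the wrapper name is hypothetical, and `azcopy sync <source> <destination>` is standard azcopy usage rather than anything from this PR:

import shutil
import subprocess

def azcopy_sync(source_url: str, root: str) -> None:
    # azcopy must be installed separately; it cannot be pip-installed.
    if shutil.which("azcopy") is None:
        raise RuntimeError("azcopy not found on PATH")
    subprocess.run(["azcopy", "sync", source_url, root], check=True)

# e.g. azcopy_sync(
#     "https://radiantearth.blob.core.windows.net/mlhub/nasa-tropical-storm-challenge",
#     root,
# )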

Labels
  • datasets: Geospatial or benchmark datasets
  • dependencies: Packaging and dependencies
  • documentation: Improvements or additions to documentation
  • testing: Continuous integration testing

4 participants