Migrate from Radiant MLHub to Source Cooperative #1830

Open
10 tasks
adamjstewart opened this issue Jan 29, 2024 · 16 comments
Labels
backwards-incompatible Changes that are not backwards compatible datasets Geospatial or benchmark datasets good first issue A good issue for a new contributor to work on
Comments

@adamjstewart
Collaborator

adamjstewart commented Jan 29, 2024

Summary

We currently have several datasets from the recently defunct Radiant MLHub that we need to switch to Source Cooperative if we want to be able to automatically download them.

Rationale

Downloads are currently broken, and many of these datasets have completely changed their file structure.

Implementation

See #2068 for an example that converts Tropical Cyclone. The resulting implementation is actually significantly simpler than the original code.

If you would like to volunteer to convert a particular dataset, please comment on this issue to say that you're working on this.

Alternatives

We can also rehost most datasets (depending on license) on Hugging Face.

Additional information

No response

Footnotes

  1. Newer Source Cooperative datasets that have never had download support and are lower priority since they aren't currently "broken".

adamjstewart added the datasets label on Jan 29, 2024
adamjstewart added this to the 0.5.2 milestone on Jan 29, 2024
adamjstewart added the good first issue label on Jan 29, 2024
@Haimantika

Hi @adamjstewart, I am interested in contributing to this, but I am fairly new and will need more guidance. Where is a good place for me to start?

@adamjstewart
Collaborator Author

Hi @Haimantika, thanks for volunteering!

Let's pick a single dataset, maybe torchgeo/datasets/benin_cashews.py, and try to convert it to the new syntax. Once that's working, we can repeat for the other 7 datasets and remove all mention of radiant-mlhub.

This is the new dataset website: https://beta.source.coop/technoserve/cashews-benin/

If you create an account, log in, and click generate credentials, you'll see that the Azure URI is https://radiantearth.blob.core.windows.net/mlhub/technoserve-cashew-benin

We'll add a new dependency on azure-storage-blob in pyproject.toml, requirements/datasets.txt, and requirements/min-reqs.old. I can help determine the minimum supported version.

We'll probably add something similar to download_radiant_mlhub_dataset but for Azure blobs in torchgeo/datasets/utils.py. This can then be imported in torchgeo/datasets/benin_cashews.py and used in _download.
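
A minimal sketch of what such a helper might look like, assuming the azure-storage-blob package and a container that allows the access you have. The name `download_azure_prefix` and its parameters are hypothetical, not an existing TorchGeo API. A private container would need a credential (for example a SAS token), which is left as an optional argument here.

```python
import os
from typing import Any, Optional


def download_azure_prefix(
    account_url: str,
    container_name: str,
    root: str,
    name_starts_with: str = "",
    credential: Optional[Any] = None,
    container_client: Optional[Any] = None,
) -> list[str]:
    """Download every blob under a prefix into ``root`` (hypothetical helper).

    ``container_client`` can be injected for testing; otherwise a
    ``ContainerClient`` from azure-storage-blob is created.
    """
    if container_client is None:
        # Imported lazily so azure-storage-blob stays an optional dependency.
        from azure.storage.blob import ContainerClient

        container_client = ContainerClient(
            account_url=account_url,
            container_name=container_name,
            credential=credential,
        )

    paths = []
    for blob in container_client.list_blobs(name_starts_with=name_starts_with):
        # Blob names use "/" separators; mirror that layout under root.
        path = os.path.join(root, *blob.name.split("/"))
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            # Stream the blob contents straight into the local file.
            container_client.download_blob(blob.name).readinto(f)
        paths.append(path)
    return paths
```

For the Azure URI above, the call would presumably split as `download_azure_prefix("https://radiantearth.blob.core.windows.net", "mlhub", root, "technoserve-cashew-benin")`, where `root` is whatever local directory the dataset uses.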

Let me know if anything is unclear. The first dataset is going to be a bit of work, but once we have one working, the rest should be easy.

@Haimantika

This is very helpful. Thanks a lot. I will start working on it and get back to you with questions, if any.

@Haimantika

Hi @adamjstewart, I finally got some time to work on it. I see a PR has been raised. Is the issue solved already?

@adamjstewart
Collaborator Author

I have not seen any PRs that implement download support for Source Cooperative. Which PR are you referring to?

@Haimantika

My bad. This one just mentioned the issue.

@adamjstewart
Collaborator Author

Yes, this is a 9th dataset that will benefit from your contribution.

P.S. I reached out to the folks at Source Cooperative. One thing to note is that azure-storage-blob will copy raw files/directories, not zip/tar files. So there won't be an easy way to checksum these. For now, let's just focus on downloading and ignore checksumming.
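
If checksumming is revisited later, one possible workaround for the lack of a single archive is a per-file digest manifest over the downloaded directory. The sketch below only illustrates that idea; `checksum_tree` is a hypothetical name and SHA-256 is an arbitrary choice, not something decided in this issue.

```python
import hashlib
import os


def checksum_tree(root: str) -> dict[str, str]:
    """Map each file path (relative to root, "/"-separated) to its SHA-256 hex digest."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make traversal order deterministic
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large rasters are never loaded whole.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            digests[rel] = h.hexdigest()
    return digests
```

Comparing the returned mapping against a stored manifest would flag missing or corrupted files individually.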

@Haimantika

Hi, I did a bit of research, and the latest version of Source Cooperative that I could find was beta.source.coop.

Is that it, or am I missing something? I have made the changes and can open a PR for you to take a look.

@adamjstewart
Collaborator Author

Yes, that's the new website.

@Haimantika

@adamjstewart I have raised a PR. It may not be the solution you are looking for. However, I would like to give it one more try after your review, and then unassign myself if it does not work, to respect your time. :)

@darkblue-b

darkblue-b commented Mar 9, 2024

I reviewed Microsoft's azure-sdk-for-python, which includes examples like this. I also looked at the azcopy tool. Python is preferred for TorchGeo, and it is not clear how portable dependency management would work for azcopy: Spack and conda have hooks for this kind of binary tool dependency, but pip does not have good ones. Simply recommending azcopy and failing gracefully when it is not present was discussed briefly. This is not yet resolved.
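
The "recommend azcopy and fail gracefully" option could look roughly like the sketch below. It assumes azcopy is installed separately and available on PATH. `require_executable` and `azcopy_sync` are hypothetical helper names, and `azcopy sync ... --recursive=true` is one plausible invocation rather than a settled choice.

```python
import shutil
import subprocess


def require_executable(name: str = "azcopy") -> str:
    """Return the full path to ``name``, or raise a helpful error if it is missing."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"'{name}' was not found on PATH. Install it to enable automatic "
            "downloads, or download the dataset manually."
        )
    return path


def azcopy_sync(url: str, root: str) -> None:
    """Mirror a remote Azure location into ``root`` using the azcopy binary."""
    exe = require_executable("azcopy")
    # check=True raises CalledProcessError if azcopy exits with a nonzero status.
    subprocess.run([exe, "sync", url, root, "--recursive=true"], check=True)
```

Keeping the lookup in its own function means the error surfaces only when a download is actually requested, not at import time.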

@adamjstewart
Collaborator Author

We definitely don't need all of azure; azure-storage-blob would suffice.

@darkblue-b

This file appears to implement the basic functionality: https://github.com/kartAI/kartAI/blob/master/azure/blobstorage.py

@adamjstewart
Collaborator Author

@Haimantika @darkblue-b all preliminary work is now complete. If you want to claim 1 or more datasets from the above list and start working on them, #2068 will show you what is required to convert them. Note that most of the file changes in that PR are auto-generated by data.py. You really only need to change tests/data/foo/data.py to the new data structure and run it, and change torchgeo/datasets/foo.py to read and download the new data structure.

@Haimantika

Thanks Adam. I will take a look at the code and the dataset tonight and update you on which one I take up.

@adamjstewart
Collaborator Author

adamjstewart commented May 23, 2024

Pinging the original dataset contributors:

If any of you have time, would you be interested in revamping these datasets to download from Source Cooperative?
