Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cuda112 & add pytorch to ml-notebook & add cupy #345

Closed
wants to merge 17 commits into from

Conversation

ngam
Copy link
Contributor

@ngam ngam commented Jun 13, 2022

fixes #320
fixes #322

@rabernat @scottyhq this minimal PR takes tensorflow to cuda112, adds cupy, and adds pytorch in ml-notebook

Ideally, we should add jaxlib==*=*cuda* (and maybe even pytorch==*=*cuda* if you want to just have one "ML" notebook added a proposal!) but that will have to wait a little longer

Helpful notes for review:

  • pytorch-gpu is just a meta package and so don't worry about its disappearance.
  • openblas need not be present in the case of pytorch (the conda-forge pytorch only uses openblas --- in a bad way actually, long story --- for the case of m1 macs)

@github-actions
Copy link
Contributor

Binder 👈 Try on Mybinder.org!
Binder 👈 Try on Pangeo GCP Binder!
Binder 👈 Try on Pangeo AWS Binder!

@pangeo-bot
Copy link
Collaborator

/condalock
Automatically locking new conda environment, building, and testing images...

@ngam
Copy link
Contributor Author

ngam commented Jun 13, 2022

/condalock

@ngam
Copy link
Contributor Author

ngam commented Jun 13, 2022

/condalock

@ngam
Copy link
Contributor Author

ngam commented Jun 13, 2022

/condalock

@ngam
Copy link
Contributor Author

ngam commented Jun 13, 2022

/condalock

@ngam ngam changed the title try updates takes tensorflow to cuda112 Jun 13, 2022
@ngam
Copy link
Contributor Author

ngam commented Jun 13, 2022

/condalock

@ngam
Copy link
Contributor Author

ngam commented Jun 13, 2022

/condalock

@ngam ngam marked this pull request as ready for review June 13, 2022 22:37
@ngam
Copy link
Contributor Author

ngam commented Jun 13, 2022

/condalock

@ngam ngam changed the title takes tensorflow to cuda112 takes tensorflow to cuda112 & combine pytorch+tf in ML notebook Jun 13, 2022
@ngam ngam changed the title takes tensorflow to cuda112 & combine pytorch+tf in ML notebook tensorflow to cuda112 & combine pytorch+tf in ML notebook Jun 13, 2022
@ngam
Copy link
Contributor Author

ngam commented Jun 13, 2022

/condalock

@ngam
Copy link
Contributor Author

ngam commented Jun 13, 2022

/condalock

@ngam ngam mentioned this pull request Jun 13, 2022
@ngam ngam changed the title tensorflow to cuda112 & combine pytorch+tf in ML notebook cuda112 & add pytorch to ml-notebook & add cupy Jun 13, 2022
@ngam
Copy link
Contributor Author

ngam commented Jun 13, 2022

/condalock

@rabernat
Copy link
Member

Thank you so much for working on this! 🙏

Can you help me understand--if we are adding pytorch to ml-notebook, is there still any point to maintaining the separate pytorch-notebook?

@ngam
Copy link
Contributor Author

ngam commented Jun 14, 2022

if we are adding pytorch to ml-notebook, is there still any point to maintaining the separate pytorch-notebook?

No, there is no point. I simply didn't want to propose that big change here because I wasn't sure if it would be a welcome idea (but we can definitely do it!)

More generally, I don't see a need to worry about separating pytorch and tensorflow going forward. We will likely face some issues from time to time, but the tensorflow and pytorch feedstocks are currently maintained by the same set of volunteers, so the migrations happen around the same time. In fact, the trickiest package here is torchgeo for now (because it has dependencies on things like gdal).

In summary, if we can add a recent enough tensorflow on top of pangeo-notebook with all these geo packages, we should be able to add pytorch as well. As far as I am concerned, I would treat jax and tensorflow very similarly. I don't have that deep of an experience with the rapids-ai collection, but I can look into that next. My impression is that they're on the easier side.

I am still investigating what's blocking us from getting to tensorflow 2.8.1, but I think it is some deep conflict caused by icu and/or absl stuff. Once we get tensorflow 2.8.1, we will have higher performance than the NGC containers (in my testing) and we get the cuda version of jaxlib as well (they have similar migrations).

@ngam
Copy link
Contributor Author

ngam commented Jun 14, 2022

Could @weiji14 weigh on the pytorch discussion? Do you have any issues with merging the two notebooks?

@ngam
Copy link
Contributor Author

ngam commented Jun 14, 2022

Update: I would abandon this PR if people are happy to wait a few days for #346 --- that will bring everything to the very cutting edge. We are just waiting on rasterio now.

@weiji14
Copy link
Member

weiji14 commented Jun 14, 2022

Could @weiji14 weigh on the pytorch discussion? Do you have any issues with merging the two notebooks?

Nope, as long as pytorch is available somewhere! I do worry though about the larger docker file size (especially if the RAPIDS AI stuff is meant to be added later), but other than that, ok with simplifying maintenance down to just one ml-notebook image.

In fact, the trickiest package here is torchgeo for now (because it has dependencies on things like gdal).

Yeah, for better or worse, the dev version of torchgeo is also pinning upper versions of lots of packages (see https://github.com/microsoft/torchgeo/pull/557/files) which might make things more difficult going forward 🙂

@ngam
Copy link
Contributor Author

ngam commented Jun 14, 2022

Image sizes:

pangeo/ml-notebook        PR          5b99cf0c4873   Less than a second ago   10.2GB
pangeo/pytorch-notebook   PR          71b06612a1d4   1 second ago             8.84GB

Will likely grow by a little bit in #346 (up to 10.5GB, I would guess)

update: build from other PR:

pangeo/ml-notebook        PR          dd12e3f61112   Less than a second ago   10.4GB

@ngam
Copy link
Contributor Author

ngam commented Jun 16, 2022

I am going to close this PR now. All issues are resolved with the merge in rasterio feedstock. A seamless update should be possible without tinkering (in a few hours). I am happy to help, but maybe having the bots do it makes more sense for continuity?

Let me know if you face any issues. I recommend:

  1. taking off the pins on tensorflow, jaxlib, and pytorch, except for ==*=*cuda112*
  2. adding tensorflow, jaxlib, jax, pytorch, torchvision, torchgeo, etc. all under ml-notebook (with ==*=*cuda112* where appropriate)

Happy to produce another PR if desired.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add cupy to ml notebooks Use cudatoolkit=11 in both tensorflow and pytorch images
4 participants