
Using GPU-optimized NGC images as base for ML (Pytorch/Tensorflow) docker images #457

Open
weiji14 opened this issue May 14, 2023 · 2 comments


weiji14 (Member) commented May 14, 2023

Consolidating some of the discussion @ngam had around using NVIDIA GPU Cloud (NGC) containers as the base image for pytorch-notebook and ml-notebook, and potentially for a cupy image (#322).

Is your feature request related to a problem? Please describe.

For machine learning and data analytics work that relies on NVIDIA Graphics Processing Units (GPUs), there are several driver/hardware-related optimizations that can help to speed up processing workflows. Currently, the pytorch-notebook and ml-notebook docker images rely on CUDA libraries from conda-forge, which are less optimized than what exists on NGC.

Describe the solution you'd like

Refactor pytorch-notebook and ml-notebook to be based on NGC containers instead of the current base image. This might involve flipping the current installation pipeline from Pangeo-first/ML-second (base-notebook -> pangeo-notebook -> ml-notebook) to ML-first/Pangeo-second (ngc -> ml-notebook -> pangeo-notebook). The pangeo-notebook metapackage proposed in #359 could help with this.
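The flipped pipeline could be sketched roughly like this. It is only a sketch: the NGC tag and the package list are illustrative assumptions, not a tested configuration (actual NGC PyTorch tags follow a YY.MM scheme).

```dockerfile
# Hypothetical ml-notebook built directly on an NGC PyTorch base image,
# so the CUDA/cuDNN/PyTorch stack comes from NVIDIA's optimized builds.
# The tag below is an example only.
FROM nvcr.io/nvidia/pytorch:23.04-py3

# Add the Jupyter tooling that base-notebook currently provides
RUN pip install --no-cache-dir jupyterlab jupyterhub

# A pangeo-notebook image would then layer the Pangeo stack on top, e.g.:
#   FROM ml-notebook
#   RUN mamba install -y pangeo-notebook  # the metapackage proposed in #359
```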

Describe alternatives you've considered

Spin things off into a different repository (pangeo-gpu-docker-images?), or have a separate build chain (ngc-pytorch-notebook, ngc-ml-notebook) from the current CI/CD infrastructure.

Additional context

One benefit of changing the build order to ML-first/Pangeo-second is that ML folks who don't need all of the heavy Climate/Ocean packages in pangeo-notebook can get a slimmer ml-notebook. For example, if they're deploying a model to some server API, they can base their docker image on ngc-ml-notebook instead of the current heavy ml-notebook.
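For that deployment use case, the downstream Dockerfile could be as small as the following sketch (ngc-ml-notebook is the hypothetical image name from the alternatives above; model.pt and serve.py are placeholder files):

```dockerfile
# Hypothetical deployment image for a model-serving API, based on the
# slimmer ML-first image rather than the full Pangeo stack.
FROM pangeo/ngc-ml-notebook:latest

WORKDIR /app
COPY model.pt serve.py ./
CMD ["python", "serve.py"]
```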

The disadvantage is that the refactoring will require some effort, and we need to be careful to ensure it doesn't affect existing JupyterHub deployments.

ngam (Contributor) commented May 15, 2023

Just to note:

  • I think the tensorflow image was as good as (if not better than) the NGC one
  • I recently came across this effort which may help in the infrastructure quite a bit https://github.com/rapidsai/mambaforge-cuda
  • I am slightly out of sync due to urgent "standard climate modeling" needs (i.e., no ML), so I could be missing some updates --- I think the general truth remains that when it comes to tensorflow and PyTorch, our efforts in conda-forge are more delayed than we'd like due to all sorts of issues (inability to build on public CI, the licensing around NVIDIA products, etc.; there are positive updates on all fronts, but it simply takes time...)

weiji14 (Member, Author) commented May 15, 2023

I recently came across this effort which may help in the infrastructure quite a bit https://github.com/rapidsai/mambaforge-cuda

Yes, I noticed that too just yesterday :D That image looks to be built on top of nvidia/cuda and comes with mamba pre-installed, which is pretty much what we have here:

# Install latest mambaforge in ${CONDA_DIR}
RUN echo "Installing Mambaforge..." \
&& URL="https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh" \
&& wget --quiet ${URL} -O installer.sh \
&& /bin/bash installer.sh -u -b -p ${CONDA_DIR} \
&& rm installer.sh \
&& mamba install conda-lock -y \
&& mamba clean -afy \
# After installing the packages, we cleanup some unnecessary files
# to try reduce image size - see https://jcristharif.com/conda-docker-tips.html
# Although we explicitly do *not* delete .pyc files, as that seems to slow down startup
# quite a bit unfortunately - see https://github.com/2i2c-org/infrastructure/issues/2047
&& find ${CONDA_DIR} -follow -type f -name '*.a' -delete

That rapidsai/mambaforge-cuda image will also be super helpful if we decide to have an image for the cupy/RAPIDSAI stack :D
