
feat: update notebook server images + support ARM64 #7357

Merged

Conversation

thesuperzapper
Member

This PR significantly updates and improves all our example notebook server images.

Key changes

  1. Support for ARM64 in addition to AMD64:
    • NOTE: the CUDA images are not currently built for ARM64, as I have no way to test them.
      • PyTorch: I don't think that pre-compiled versions of PyTorch with CUDA on ARM are available.
      • TensorFlow: The official NVIDIA Ubuntu repos for CUDA are a bit sparse on ARM for CUDA 11.8 (the latest version that TensorFlow supports).
  2. Much cleaner build system and Makefiles:
    • Build any image locally by going to its folder and running `make docker-build-dep`, which also builds all the base images that the image depends on (see the sketch after this list).
  3. Caching in GitHub Actions builds:
    • We now use the ghcr.io/kubeflow/kubeflow/notebook-servers/build-cache image to store build caches, which should significantly speed up builds.
  4. TensorFlow 2.0:
    • We have updated to TensorFlow 2.13.0 by default.
    • We have updated to CUDA 11.8 in the TensorFlow CUDA images.
  5. PyTorch 2.0:
    • We have updated to PyTorch 2.1.0 by default.
    • We have updated to CUDA 12.1 in the PyTorch CUDA images.
  6. JupyterLab 4.0:
  7. Python 3.11:
    • We have updated to Python 3.11.6 by default.
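
For anyone who wants to try the new build system, a local build looks roughly like this (a minimal sketch; `jupyter-scipy` is just one example folder, and the exact target names come from the Makefiles in this PR):

```bash
# from the repository root
cd components/example-notebook-servers/jupyter-scipy

# build this image plus every base image it depends on, for the local architecture
make docker-build-dep
```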

I have tested the images in real-world use cases for TensorFlow and PyTorch (including on GPUs), but we will need to gather more feedback after we release Kubeflow 1.8 with these images.

Other Notes

  • We still don't have a sensible way to test-build the images on PRs (we have to split the builds up because they are so big, and we don't want to push random users' PRs to any container registries).
  • Each commit to master that updates anything under the components/example-notebook-servers/ folder will trigger a build of all notebook servers (which should be fast because of the caching, unless the PR changes the base images); see the build-cache sketch after this list.
  • Previously, we were not publishing the intermediate images (like base, jupyter, etc.) to DockerHub; this PR changes that, and now all images are always pushed.
  • The CUDA images now have their own folders named:
    • example-notebook-servers/jupyter-pytorch-cuda
    • example-notebook-servers/jupyter-pytorch-cuda-full
    • example-notebook-servers/jupyter-tensorflow-cuda
    • example-notebook-servers/jupyter-tensorflow-cuda-full
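
To illustrate the caching notes above, the published build cache could in principle also be pulled into a local buildx build, roughly like this (a hedged sketch only; the real wiring lives in the Makefiles and GitHub Actions workflows, and the cache tag name is assumed from the build logs later in this thread):

```bash
# requires a buildx builder that supports registry caches (docker-container driver)
docker buildx create --use --name notebooks-builder 2>/dev/null || true

# build the jupyter image, seeding the layer cache from the published build-cache image
# (depending on the image, a BASE_IMG build-arg may also be required -- see the Makefiles)
docker buildx build \
  --platform linux/amd64 \
  --cache-from "type=registry,ref=ghcr.io/kubeflow/kubeflow/notebook-servers/build-cache:jupyter" \
  --tag example/jupyter:dev \
  --load \
  components/example-notebook-servers/jupyter
```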

Next steps


@alekseyolg
Contributor

alekseyolg commented Oct 18, 2023

@thesuperzapper
I took the liberty of looking at your Dockerfiles and found a few issues. For example, kubectl is downloaded in one layer and its permissions are changed in another, so the file ends up being stored in two layers because it was modified.
I modified your Dockerfile from example-notebook-servers a little and got a size reduction of about 50 megabytes!
I also found some unnecessary commands, such as saving the checksum to the filesystem and then deleting it; that is not needed.
There is also no need to run the apt-get clean command, as it is executed automatically.
You can run the build yourself, here is the code:

```dockerfile
#
# NOTE: Use the Makefiles to build this image correctly.
#

ARG BASE_IMG=<ubuntu>
FROM $BASE_IMG

ARG TARGETARCH

# common environment variables
ENV NB_USER=jovyan \
    NB_UID=1000 \
    NB_PREFIX=/ \
    HOME=/home/jovyan \
    SHELL=/bin/bash

# args - software versions
ARG KUBECTL_VERSION=v1.27.6
ARG S6_VERSION=v3.1.5.0

# set shell to bash
SHELL ["/bin/bash", "-c"]

# install - useful linux packages
RUN export DEBIAN_FRONTEND=noninteractive \
 && apt-get -yq update \
 && apt-get -yq install --no-install-recommends \
    apt-transport-https \
    bash \
    bzip2 \
    ca-certificates \
    curl \
    git \
    gnupg \
    gnupg2 \
    locales \
    lsb-release \
    nano \
    software-properties-common \
    tzdata \
    unzip \
    vim \
    wget \
    xz-utils \
    zip \
 && rm -rf /var/lib/apt/lists/*

# install - s6 overlay
RUN case "${TARGETARCH}" in \
      amd64) S6_ARCH="x86_64" ;; \
      arm64) S6_ARCH="aarch64" ;; \
      ppc64le) S6_ARCH="ppc64le" ;; \
      *) echo "Unsupported architecture: ${TARGETARCH}"; exit 1 ;; \
    esac \
 && wget -q "https://github.com/just-containers/s6-overlay/releases/download/${S6_VERSION}/s6-overlay-noarch.tar.xz" \
 && echo $(curl -fsSL "https://github.com/just-containers/s6-overlay/releases/download/${S6_VERSION}/s6-overlay-noarch.tar.xz.sha256") | sha256sum -c - \
 && wget -q "https://github.com/just-containers/s6-overlay/releases/download/${S6_VERSION}/s6-overlay-${S6_ARCH}.tar.xz" \
 && echo $(curl -fsSL "https://github.com/just-containers/s6-overlay/releases/download/${S6_VERSION}/s6-overlay-${S6_ARCH}.tar.xz.sha256") | sha256sum -c - \
 && tar -C / -Jxpf s6-overlay-noarch.tar.xz \
 && tar -C / -Jxpf s6-overlay-${S6_ARCH}.tar.xz \
 && rm *.tar.xz

# create user and set required ownership, install kubectl
RUN useradd -M -s /bin/bash -N -u ${NB_UID} ${NB_USER} \
 && mkdir -p ${HOME} \
 && chown -R ${NB_USER}:users ${HOME} \
 && cd /usr/local/bin \
 && wget -q "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/${TARGETARCH}/kubectl" \
 && echo $(curl -fsSL "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/${TARGETARCH}/kubectl.sha256") kubectl | sha256sum -c - \
 && chmod +x kubectl \
 && chown -R ${NB_USER}:users ./*

ENV LANG=en_US.UTF-8 \
    LANGUAGE=en_US.UTF-8 \
    LC_ALL=en_US.UTF-8

# set locale configs
RUN echo "en_US.UTF-8 UTF-8" > /etc/locale.gen \
 && locale-gen

USER $NB_UID

ENTRYPOINT ["/init"]
```
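
If you want to check the size difference yourself, one way (purely illustrative; the image tags below are hypothetical, assuming both variants have been built locally) is to compare the per-layer breakdown of the two base images:

```bash
# list both locally built variants of the base image and their total sizes
docker image ls notebook-base

# per-layer sizes show where the duplicated kubectl layer disappears
docker history notebook-base:original
docker history notebook-base:consolidated
```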

@thesuperzapper
Member Author

@alekseyolg let's discuss your changes in a follow-up PR (you can make one once we merge this), as they don't impact the functionality of the images, and we need to get these merged so we can test them with Kubeflow 1.8.

Also, my general principle is that layers aren't really that much of a concern, but clarity (and the ability for people to modify them easily, e.g. to remove kubectl) is more important.

@thesuperzapper
Member Author

@kimwnasptd I think this is ready to merge, I just made one small final change in the code-server image with 99c7f6a

See the [custom images guide](#custom-images) to learn how to extend them with your own packages.
```mermaid
graph TD
Base[<a href='https://github.com/thesuperzapper/kubeflow/tree/master/components/example-notebook-servers/base'>Base</a>] --> Jupyter[<a href='https://github.com/thesuperzapper/kubeflow/tree/master/components/example-notebook-servers/jupyter'>Jupyter</a>]
```
Contributor

It'd be better if we point to the kubeflow/kubeflow repository here.

Member Author

Ah, good catch, will quickly fix that.

Member Author

Fixed in e38abb7

@thesuperzapper
Member Author

Just want to remind any reviewers about the pre-built images from this PR that people can test with.

They are linked in #7357 (comment)

@kimwnasptd
Member

@thesuperzapper I tried running the build on my M2 and saw the following:

```
ARCH=linux/arm64/v8 make docker-build-multi-arch
...
------------------------------------------------------------------------------
Building 'jupyter-pytorch-cuda' image for 'linux/arm64/v8'...
------------------------------------------------------------------------------
...
#7 [2/4] RUN python3 -m pip install --quiet --no-cache-dir --index-url https://download.pytorch.org/whl/cu121     torch==2.1.0     torchvision==0.16.0     torchaudio==2.1.0
#7 1.344 ERROR: Could not find a version that satisfies the requirement torch==2.1.0 (from versions: 2.0.0, 2.0.1)
#7 1.344 ERROR: No matching distribution found for torch==2.1.0
#7 ERROR: executor failed running [/bin/bash -c python3 -m pip install --quiet --no-cache-dir --index-url https://download.pytorch.org/whl/cu121     torch==${PYTORCH_VERSION}     torchvision==${TORCHVISION_VERSION}     torchaudio==${TORCHAUDIO_VERSION}]: exit code: 1
------
 > importing cache manifest from ghcr.io/kubeflow/kubeflow/notebook-servers/build-cache:jupyter-pytorch-cuda:
------
------
 > [2/4] RUN python3 -m pip install --quiet --no-cache-dir --index-url https://download.pytorch.org/whl/cu121     torch==2.1.0     torchvision==0.16.0     torchaudio==2.1.0:
#7 1.344 ERROR: Could not find a version that satisfies the requirement torch==2.1.0 (from versions: 2.0.0, 2.0.1)
#7 1.344 ERROR: No matching distribution found for torch==2.1.0
------
ERROR: failed to solve: executor failed running [/bin/bash -c python3 -m pip install --quiet --no-cache-dir --index-url https://download.pytorch.org/whl/cu121     torch==${PYTORCH_VERSION}     torchvision==${TORCHVISION_VERSION}     torchaudio==${TORCHAUDIO_VERSION}]: exit code: 1
make[1]: *** [../common.mk:88: docker-build-multi-arch] Error 1
make[1]: Leaving directory '/home/ubuntu/Code/git/kubeflow/components/example-notebook-servers/jupyter-pytorch-cuda'
make: *** [Makefile:41: docker-build-multi-arch--jupyter-pytorch-cuda] Error 2
```

```yaml
needs: [ base_images ]
secrets: inherit
with:
  build_arch: linux/amd64,linux/arm64
```
Member

Shouldn't we instead use linux/arm64/v8 here for the M2 architecture?

Member Author

That is the implied default, and I don't know why we are explicitly specifying it as v8 in the other ones.

Member

Hmm, interesting. I remember that when I checked, this wasn't the default, but I'm also not 100% sure whether I saw that in the official Docker docs or somewhere else.

Member Author

@kimwnasptd Either way, I have never had an issue with images built for linux/arm64, and I actually think that v7 is 32-bit.
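
For what it's worth, a quick way to check which ARM variant a local image was actually built for (illustrative only; the image name is a placeholder):

```bash
# prints e.g. "linux/arm64" (64-bit ARMv8); "linux/arm/v7" would be the 32-bit variant
docker image inspect example/jupyter-scipy:dev \
  --format '{{.Os}}/{{.Architecture}}{{if .Variant}}/{{.Variant}}{{end}}'
```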

```yaml
needs: [ base_images ]
secrets: inherit
with:
  build_arch: linux/amd64,linux/arm64
```
Member

Have you also tried this build in GitHub runners? I'm afraid that by building both architectures at the same time we'll exhaust the runner resources, which is why we ended up building serially in the other workflows:
https://github.com/kubeflow/kubeflow/blob/master/.github/workflows/poddefaults_docker_publish.yaml#L46-L48

Member Author

@kimwnasptd yep, the workflows are designed very carefully to not exceed resource limits, and even when the build cache is empty, they still run successfully.

See the most recent few runs on my thesuperzapper/kubeflow repo: https://github.com/thesuperzapper/kubeflow/actions

@thesuperzapper
Member Author

> @thesuperzapper I tried running the build on my M2 and saw the following:
>
> ARCH=linux/arm64/v8 make docker-build-multi-arch
> ...
> ERROR: Could not find a version that satisfies the requirement torch==2.1.0 (from versions: 2.0.0, 2.0.1)
> ERROR: No matching distribution found for torch==2.1.0

@kimwnasptd The CUDA images don't support ARM (and in the CI/CD workflows they are only built for AMD64).

Also, on an unrelated note, most of the time you will want to use `docker-build-multi-arch-dep` rather than `docker-build-multi-arch`, as this also ensures that the images it depends on are up to date.
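
For example, a multi-arch build of one of the non-CUDA images, including its parent images, looks roughly like this (a sketch based on the targets mentioned in this thread):

```bash
cd components/example-notebook-servers/jupyter-pytorch

# rebuilds this image and all of the images it depends on (base, jupyter, ...);
# ARCH can optionally be set to override the target platform, as in the log above
make docker-build-multi-arch-dep
```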

@kimwnasptd
Member

The changes look good, and I also tried running a couple of notebooks. @thesuperzapper this is solid work! Exciting to see these images simplified and ARM support added.

As mentioned in the CM, I didn't do a full-blown review due to time constraints. Let's keep an eye on any user feedback in case issues come up, and I'll try to help with any review necessary.

/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label Oct 24, 2023
@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kimwnasptd

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit a63cf23 into kubeflow:master Oct 24, 2023
2 checks passed
@thesuperzapper
Member Author

thesuperzapper commented Oct 24, 2023

@kimwnasptd There is still a small chance that the first build fails, because I was not testing with pushing to DockerHub, but I will quickly follow up if anything goes wrong.

(It might take a while for the first build, so let it finish before cherry-picking.)

DnPlas pushed a commit to DnPlas/kubeflow that referenced this pull request Oct 25, 2023
* feat: update example notebook servers

* docs: update example notebook servers readme

* feat: update code-server notebook image start args

* docs: update links to use kubeflow/kubeflow repo
google-oss-prow bot pushed a commit that referenced this pull request Oct 25, 2023
* feat: update example notebook servers

* docs: update example notebook servers readme

* feat: update code-server notebook image start args

* docs: update links to use kubeflow/kubeflow repo

Co-authored-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>