
feat: update notebook server images + support ARM64 #7357

Merged

Conversation

thesuperzapper
Member

This PR significantly updates and improves all our example notebook server images.

Key changes

  1. Support for ARM64 in addition to AMD64:
    • NOTE: the CUDA images are not currently built for ARM64, as I have no way to test them.
      • PyTorch: I don't think that pre-compiled versions of PyTorch with CUDA on ARM are available.
      • TensorFlow: The official NVIDIA Ubuntu repos for CUDA are a bit sparse on ARM for CUDA 11.8 (the latest version that TensorFlow supports).
  2. Much cleaner build system and Makefiles:
    • Build any image locally by going to its folder and running `make docker-build-dep`, which also builds all the base images that the image depends on (see the sketch after this list).
  3. Caching in GitHub Actions builds:
    • We now use the ghcr.io/kubeflow/kubeflow/notebook-servers/build-cache image to store build caches, which should significantly speed up builds.
  4. TensorFlow 2.0:
    • We have updated to TensorFlow 2.13.0 by default.
    • We have updated to CUDA 11.8 in the TensorFlow CUDA images.
  5. PyTorch 2.0:
    • We have updated to PyTorch 2.1.0 by default.
    • We have updated to CUDA 12.1 in the PyTorch CUDA images.
  6. JupyterLab 4.0:
  7. Python 3.11:
    • We have updated to Python 3.11.6 by default.
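
For anyone who wants to try the new build system, a local build looks roughly like this (a minimal sketch; `jupyter-scipy` is just one example folder, and the exact target names come from the Makefiles in this PR):

```bash
# from the repository root
cd components/example-notebook-servers/jupyter-scipy

# build this image plus every base image it depends on, for the local architecture
make docker-build-dep
```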

I have tested the images in real-world use cases for TensorFlow and PyTorch (including on GPUs), but we will need to gather more feedback after we release Kubeflow 1.8 with these images.

Other Notes

  • We still don't have a sensible way to test-build the images on PRs (we have to split the builds up because they are so big, and we don't want to push random users' PRs to any container registries).
  • Each commit to master that updates anything under the components/example-notebook-servers/ folder will trigger a build of all notebook servers (which should be fast because of the caching, unless the PR changes the base images); see the build-cache sketch after this list.
  • Previously, we were not publishing the intermediate images (like base, jupyter, etc.) to DockerHub; this PR changes that, and now all images are always pushed.
  • The CUDA images now have their own folders named:
    • example-notebook-servers/jupyter-pytorch-cuda
    • example-notebook-servers/jupyter-pytorch-cuda-full
    • example-notebook-servers/jupyter-tensorflow-cuda
    • example-notebook-servers/jupyter-tensorflow-cuda-full
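
To illustrate the caching notes above, the published build cache could in principle also be pulled into a local buildx build, roughly like this (a hedged sketch only; the real wiring lives in the Makefiles and GitHub Actions workflows, and the cache tag name is assumed from the build logs later in this thread):

```bash
# requires a buildx builder that supports registry caches (docker-container driver)
docker buildx create --use --name notebooks-builder 2>/dev/null || true

# build the jupyter image, seeding the layer cache from the published build-cache image
# (depending on the image, a BASE_IMG build-arg may also be required -- see the Makefiles)
docker buildx build \
  --platform linux/amd64 \
  --cache-from "type=registry,ref=ghcr.io/kubeflow/kubeflow/notebook-servers/build-cache:jupyter" \
  --tag example/jupyter:dev \
  --load \
  components/example-notebook-servers/jupyter
```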

Next steps


@alekseyolg
Contributor

alekseyolg commented Oct 18, 2023

@thesuperzapper
I took the liberty of looking at your Dockerfiles and found a few issues. For example, kubectl is downloaded in one layer and its permissions are changed in another, so the file ends up being stored in two layers because it was modified.
I modified your Dockerfile from example-notebook-servers a little and got a size reduction of about 50 megabytes!
I also found some unnecessary commands, such as saving the checksum to the filesystem and then deleting it; that is not needed.
There is also no need to run the apt-get clean command, as it is executed automatically.
You can run the build yourself, here is the code:

```dockerfile
#
# NOTE: Use the Makefiles to build this image correctly.
#

ARG BASE_IMG=<ubuntu>
FROM $BASE_IMG

ARG TARGETARCH

# common environment variables
ENV NB_USER=jovyan \
    NB_UID=1000 \
    NB_PREFIX=/ \
    HOME=/home/jovyan \
    SHELL=/bin/bash

# args - software versions
ARG KUBECTL_VERSION=v1.27.6
ARG S6_VERSION=v3.1.5.0

# set shell to bash
SHELL ["/bin/bash", "-c"]

# install - useful linux packages
RUN export DEBIAN_FRONTEND=noninteractive \
 && apt-get -yq update \
 && apt-get -yq install --no-install-recommends \
    apt-transport-https \
    bash \
    bzip2 \
    ca-certificates \
    curl \
    git \
    gnupg \
    gnupg2 \
    locales \
    lsb-release \
    nano \
    software-properties-common \
    tzdata \
    unzip \
    vim \
    wget \
    xz-utils \
    zip \
 && rm -rf /var/lib/apt/lists/*

# install - s6 overlay
RUN case "${TARGETARCH}" in \
      amd64) S6_ARCH="x86_64" ;; \
      arm64) S6_ARCH="aarch64" ;; \
      ppc64le) S6_ARCH="ppc64le" ;; \
      *) echo "Unsupported architecture: ${TARGETARCH}"; exit 1 ;; \
    esac \
 && wget -q "https://github.com/just-containers/s6-overlay/releases/download/${S6_VERSION}/s6-overlay-noarch.tar.xz" \
 && echo $(curl -fsSL "https://github.com/just-containers/s6-overlay/releases/download/${S6_VERSION}/s6-overlay-noarch.tar.xz.sha256") | sha256sum -c - \
 && wget -q "https://github.com/just-containers/s6-overlay/releases/download/${S6_VERSION}/s6-overlay-${S6_ARCH}.tar.xz" \
 && echo $(curl -fsSL "https://github.com/just-containers/s6-overlay/releases/download/${S6_VERSION}/s6-overlay-${S6_ARCH}.tar.xz.sha256") | sha256sum -c - \
 && tar -C / -Jxpf s6-overlay-noarch.tar.xz \
 && tar -C / -Jxpf s6-overlay-${S6_ARCH}.tar.xz \
 && rm *.tar.xz

# create user and set required ownership, install kubectl
RUN useradd -M -s /bin/bash -N -u ${NB_UID} ${NB_USER} \
 && mkdir -p ${HOME} \
 && chown -R ${NB_USER}:users ${HOME} \
 && cd /usr/local/bin \
 && wget -q "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/${TARGETARCH}/kubectl" \
 && echo $(curl -fsSL "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/${TARGETARCH}/kubectl.sha256") kubectl | sha256sum -c - \
 && chmod +x kubectl \
 && chown -R ${NB_USER}:users ./*

ENV LANG=en_US.UTF-8 \
    LANGUAGE=en_US.UTF-8 \
    LC_ALL=en_US.UTF-8

# set locale configs
RUN echo "en_US.UTF-8 UTF-8" > /etc/locale.gen \
 && locale-gen

USER $NB_UID

ENTRYPOINT ["/init"]
```
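
If you want to check the size difference yourself, one way (purely illustrative; the image tags below are hypothetical, assuming both variants have been built locally) is to compare the per-layer breakdown of the two base images:

```bash
# list both locally built variants of the base image and their total sizes
docker image ls notebook-base

# per-layer sizes show where the duplicated kubectl layer disappears
docker history notebook-base:original
docker history notebook-base:consolidated
```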

@thesuperzapper
Member Author

@alekseyolg let's discuss your changes in a follow-up PR (you can make one once we merge this), as they don't impact the functionality of the images, and we need to get these merged so we can test them with Kubeflow 1.8.

Also, my general principle is that layers aren't really that much of a concern, but clarity (and the ability for people to modify them easily, e.g. to remove kubectl) is more important.

@thesuperzapper
Member Author

@kimwnasptd I think this is ready to merge, I just made one small final change in the code-server image with 99c7f6a

See the [custom images guide](#custom-images) to learn how to extend them with your own packages.
```mermaid
graph TD
Base[<a href='https://github.com/thesuperzapper/kubeflow/tree/master/components/example-notebook-servers/base'>Base</a>] --> Jupyter[<a href='https://github.com/thesuperzapper/kubeflow/tree/master/components/example-notebook-servers/jupyter'>Jupyter</a>]
```
Contributor

It'd be better if we point to the kubeflow/kubeflow repository here.

Member Author

Ah, good catch, will quickly fix that.

Member Author

Fixed in e38abb7

@thesuperzapper
Member Author

Just want to remind any reviewers about the pre-built images from this PR that people can test with.

They are linked in #7357 (comment)

@kimwnasptd
Member

@thesuperzapper I tried running the build on my M2 and saw the following:

```
ARCH=linux/arm64/v8 make docker-build-multi-arch
...
------------------------------------------------------------------------------
Building 'jupyter-pytorch-cuda' image for 'linux/arm64/v8'...
------------------------------------------------------------------------------
...
#7 [2/4] RUN python3 -m pip install --quiet --no-cache-dir --index-url https://download.pytorch.org/whl/cu121     torch==2.1.0     torchvision==0.16.0     torchaudio==2.1.0
#7 1.344 ERROR: Could not find a version that satisfies the requirement torch==2.1.0 (from versions: 2.0.0, 2.0.1)
#7 1.344 ERROR: No matching distribution found for torch==2.1.0
#7 ERROR: executor failed running [/bin/bash -c python3 -m pip install --quiet --no-cache-dir --index-url https://download.pytorch.org/whl/cu121     torch==${PYTORCH_VERSION}     torchvision==${TORCHVISION_VERSION}     torchaudio==${TORCHAUDIO_VERSION}]: exit code: 1
------
 > importing cache manifest from ghcr.io/kubeflow/kubeflow/notebook-servers/build-cache:jupyter-pytorch-cuda:
------
------
 > [2/4] RUN python3 -m pip install --quiet --no-cache-dir --index-url https://download.pytorch.org/whl/cu121     torch==2.1.0     torchvision==0.16.0     torchaudio==2.1.0:
#7 1.344 ERROR: Could not find a version that satisfies the requirement torch==2.1.0 (from versions: 2.0.0, 2.0.1)
#7 1.344 ERROR: No matching distribution found for torch==2.1.0
------
ERROR: failed to solve: executor failed running [/bin/bash -c python3 -m pip install --quiet --no-cache-dir --index-url https://download.pytorch.org/whl/cu121     torch==${PYTORCH_VERSION}     torchvision==${TORCHVISION_VERSION}     torchaudio==${TORCHAUDIO_VERSION}]: exit code: 1
make[1]: *** [../common.mk:88: docker-build-multi-arch] Error 1
make[1]: Leaving directory '/home/ubuntu/Code/git/kubeflow/components/example-notebook-servers/jupyter-pytorch-cuda'
make: *** [Makefile:41: docker-build-multi-arch--jupyter-pytorch-cuda] Error 2
```

```yaml
needs: [ base_images ]
secrets: inherit
with:
  build_arch: linux/amd64,linux/arm64
```
Member

Shouldn't we instead use linux/arm64/v8 here for the M2 architecture?

Member Author

That is the implied default, and I don't know why we are explicitly specifying it as v8 in the other ones.

Member

Hmm, interesting. I remember that when I checked, this wasn't the default, but I'm also not 100% sure whether I saw that in the official Docker docs or somewhere else.

Member Author

@kimwnasptd Either way, I have never had an issue with images built for linux/arm64, and I actually think that v7 is 32-bit.
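
For what it's worth, a quick way to check which ARM variant a local image was actually built for (illustrative only; the image name is a placeholder):

```bash
# prints e.g. "linux/arm64" (64-bit ARMv8); "linux/arm/v7" would be the 32-bit variant
docker image inspect example/jupyter-scipy:dev \
  --format '{{.Os}}/{{.Architecture}}{{if .Variant}}/{{.Variant}}{{end}}'
```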

```yaml
needs: [ base_images ]
secrets: inherit
with:
  build_arch: linux/amd64,linux/arm64
```
Member

Have you also tried this build in GitHub runners? I'm afraid that by building both architectures at the same time we'll exhaust the runner resources, which is why we ended up building serially in the other workflows:
https://github.com/kubeflow/kubeflow/blob/master/.github/workflows/poddefaults_docker_publish.yaml#L46-L48

Member Author

@kimwnasptd yep, the workflows are designed very carefully to not exceed resource limits, and even when the build cache is empty, they still run successfully.

See the most recent few runs on my thesuperzapper/kubeflow repo: https://github.com/thesuperzapper/kubeflow/actions

@thesuperzapper
Member Author

> @thesuperzapper I tried running the build on my M2 and saw the following:
>
> ARCH=linux/arm64/v8 make docker-build-multi-arch
> ...
> ERROR: Could not find a version that satisfies the requirement torch==2.1.0 (from versions: 2.0.0, 2.0.1)
> ERROR: No matching distribution found for torch==2.1.0

@kimwnasptd The CUDA images don't support ARM (and in the CI/CD workflows they are only built for AMD64).

Also, on an unrelated note, most of the time you will want to use `docker-build-multi-arch-dep` rather than `docker-build-multi-arch`, as this also ensures that the images it depends on are up to date.
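
For example, a multi-arch build of one of the non-CUDA images, including its parent images, looks roughly like this (a sketch based on the targets mentioned in this thread):

```bash
cd components/example-notebook-servers/jupyter-pytorch

# rebuilds this image and all of the images it depends on (base, jupyter, ...);
# ARCH can optionally be set to override the target platform, as in the log above
make docker-build-multi-arch-dep
```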

@kimwnasptd
Member

The changes look good, and I also tried running a couple of notebooks. @thesuperzapper this is solid work! Exciting to see these images simplified and ARM support added.

As mentioned in the CM, I didn't do a full-blown review due to time constraints. Let's keep an eye on any user feedback in case issues come up, and I'll try to help with any review necessary.

/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label Oct 24, 2023
@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kimwnasptd

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit a63cf23 into kubeflow:master Oct 24, 2023
2 checks passed
@thesuperzapper
Member Author

thesuperzapper commented Oct 24, 2023

@kimwnasptd There is still a small chance that the first build fails, because I was not testing with pushing to DockerHub, but I will quickly follow up if anything goes wrong.

(It might take a while for the first build, so let it finish before cherry-picking.)

DnPlas pushed a commit to DnPlas/kubeflow that referenced this pull request Oct 25, 2023
* feat: update example notebook servers

* docs: update example notebook servers readme

* feat: update code-server notebook image start args

* docs: update links to use kubeflow/kubeflow repo
google-oss-prow bot pushed a commit that referenced this pull request Oct 25, 2023
* feat: update example notebook servers

* docs: update example notebook servers readme

* feat: update code-server notebook image start args

* docs: update links to use kubeflow/kubeflow repo

Co-authored-by: Mathew Wicks <5735406+thesuperzapper@users.noreply.github.com>