AI on GKE Assets

This repository contains assets related to AI/ML workloads on Google Kubernetes Engine (GKE).

Overview

Run optimized AI/ML workloads with Google Kubernetes Engine (GKE) platform orchestration capabilities. A robust AI/ML platform considers the following layers:

Infrastructure orchestration that support GPUs and TPUs for training and serving workloads at scale
Flexible integration with distributed computing and data processing frameworks
Support for multiple teams on the same infrastructure to maximize utilization of resources

Infrastructure

The AI-on-GKE application modules assumes you already have a functional GKE cluster. If not, follow the instructions under infrastructure/README.md to install a Standard or Autopilot GKE cluster.

.
├── LICENSE
├── README.md
├── infrastructure
│   ├── README.md
│   ├── backend.tf
│   ├── main.tf
│   ├── outputs.tf
│   ├── platform.tfvars
│   ├── variables.tf
│   └── versions.tf
├── modules
│   ├── gke-autopilot-private-cluster
│   ├── gke-autopilot-public-cluster
│   ├── gke-standard-private-cluster
│   ├── gke-standard-public-cluster
│   ├── jupyter
│   ├── jupyter_iap
│   ├── jupyter_service_accounts
│   ├── kuberay-cluster
│   ├── kuberay-logging
│   ├── kuberay-monitoring
│   ├── kuberay-operator
│   └── kuberay-serviceaccounts
└── tutorial.md

To deploy new GKE cluster update the platform.tfvars file with the appropriate values and then execute below terraform commands:

terraform init
terraform apply -var-file platform.tfvars

Applications

The repo structure looks like this:

.
├── LICENSE
├── Makefile
├── README.md
├── applications
│   ├── jupyter
│   └── ray
├── contributing.md
├── dcgm-on-gke
│   ├── grafana
│   └── quickstart
├── gke-a100-jax
│   ├── Dockerfile
│   ├── README.md
│   ├── build_push_container.sh
│   ├── kubernetes
│   └── train.py
├── gke-batch-refarch
│   ├── 01_gke
│   ├── 02_platform
│   ├── 03_low_priority
│   ├── 04_high_priority
│   ├── 05_compact_placement
│   ├── 06_jobset
│   ├── Dockerfile
│   ├── README.md
│   ├── cloudbuild-create.yaml
│   ├── cloudbuild-destroy.yaml
│   ├── create-platform.sh
│   ├── destroy-platform.sh
│   └── images
├── gke-disk-image-builder
│   ├── README.md
│   ├── cli
│   ├── go.mod
│   ├── go.sum
│   ├── imager.go
│   └── script
├── gke-dws-examples
│   ├── README.md
│   ├── dws-queues.yaml
│   ├── job.yaml
│   └── kueue-manifests.yaml
├── gke-online-serving-single-gpu
│   ├── README.md
│   └── src
├── gke-tpu-examples
│   ├── single-host-inference
│   └── training
├── indexed-job
│   ├── Dockerfile
│   ├── README.md
│   └── mnist.py
├── jobset
│   └── pytorch
├── modules
│   ├── gke-autopilot-private-cluster
│   ├── gke-autopilot-public-cluster
│   ├── gke-standard-private-cluster
│   ├── gke-standard-public-cluster
│   ├── jupyter
│   ├── jupyter_iap
│   ├── jupyter_service_accounts
│   ├── kuberay-cluster
│   ├── kuberay-logging
│   ├── kuberay-monitoring
│   ├── kuberay-operator
│   └── kuberay-serviceaccounts
├── saxml-on-gke
│   ├── httpserver
│   └── single-host-inference
├── training-single-gpu
│   ├── README.md
│   ├── data
│   └── src
├── tutorial.md
└── tutorials
    ├── e2e-genai-langchain-app
    ├── finetuning-llama-7b-on-l4
    └── serving-llama2-70b-on-l4-gpus

Jupyter Hub

This repository contains a Terraform template for running JupyterHub on Google Kubernetes Engine. We've also included some example notebooks ( under applications/ray/example_notebooks), including one that serves a GPT-J-6B model with Ray AIR (see here for the original notebook). To run these, follow the instructions at applications/ray/README.md to install a Ray cluster.

This jupyter module deploys the following resources, once per user:

JupyterHub deployment
User namespace
Kubernetes service accounts

Learn more about JupyterHub on GKE here

Ray

This repository contains a Terraform template for running Ray on Google Kubernetes Engine.

This module deploys the following, once per user:

User namespace
Kubernetes service accounts
Kuberay cluster
Prometheus monitoring
Logging container

Learn more about Ray on GKE here

Important Considerations

Make sure to configure terraform backend to use GCS bucket, in order to persist terraform state across different environments.

Licensing

The use of the assets contained in this repository is subject to compliance with Google's AI Principles
See LICENSE

Name		Name	Last commit message	Last commit date
Latest commit History 795 Commits
.github/workflows		.github/workflows
applications		applications
benchmarks		benchmarks
best-practices		best-practices
gke-batch-refarch		gke-batch-refarch
infrastructure		infrastructure
modules		modules
ray-on-gke		ray-on-gke
tools		tools
tpu-provisioner		tpu-provisioner
tutorials-and-examples		tutorials-and-examples
.gitignore		.gitignore
CODEOWNERS		CODEOWNERS
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
cloudbuild.yaml		cloudbuild.yaml
contributing.md		contributing.md
jupyter-on-gke		jupyter-on-gke

License

GoogleCloudPlatform/ai-on-gke

Folders and files

Latest commit

History

Repository files navigation

AI on GKE Assets

Overview

Infrastructure

Applications

Jupyter Hub

Ray

Important Considerations

Licensing

About

Resources

License

Stars

Watchers

Forks

Languages