AI Operator

The AI Operator is a Kubernetes operator that simplifies running AI fine-tuning jobs on Kubernetes clusters. It manages the resources needed to fine-tune large language models (LLMs), including model downloads, persistent volumes, and Hugging Face authentication.

Overview

This operator introduces a Custom Resource Definition (CRD) called Job in the ai.re-cinq.com API group (not to be confused with the built-in batch/v1 Job). It wraps and automates:

- Setting up persistent storage for model files
- Managing Hugging Face authentication tokens securely
- Downloading models from Hugging Face
- Running fine-tuning jobs with GPU support
- Managing the lifecycle of the resources it creates

Installation

Prerequisites

- Kubernetes cluster with GPU support
- kubectl configured to access your cluster
- NVIDIA runtime configured on the GPU nodes

Install the operator:

kubectl apply -f https://raw.githubusercontent.com/re-cinq/ai-operator/main/dist/install.yaml

Usage

1. Create a Hugging Face Token Secret

apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: default
data:
  token: <base64-encoded-token>
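
The `data` field of a Secret expects base64-encoded values. The following is a minimal sketch of how the manifest above could be generated from a plain-text token; the token value `hf_example_token` and the output file name `hf-token-secret.yaml` are placeholders chosen for illustration, not values from this project.

```shell
# Placeholder token for illustration; substitute your real Hugging Face token.
HF_TOKEN="hf_example_token"

# printf avoids the trailing newline that echo would add to the encoded value;
# tr removes any line wrapping that base64 applies to long input.
ENCODED_TOKEN="$(printf '%s' "$HF_TOKEN" | base64 | tr -d '\n')"

# Write the Secret manifest with the encoded token filled in.
# The file name is a hypothetical choice.
cat > hf-token-secret.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: default
data:
  token: ${ENCODED_TOKEN}
EOF
```

The resulting file can be applied with `kubectl apply -f hf-token-secret.yaml`. Alternatively, `kubectl create secret generic hf-token --from-literal=token=<your-token>` creates an equivalent Secret and handles the encoding for you.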

2. Create an AI Job

apiVersion: ai.re-cinq.com/v1
kind: Job
metadata:
  name: finetune-job
spec:
  # NVIDIA runtime class for GPU access
  runtimeClassName: "nvidia"

  # Container image with training code
  image: "silentehrec/torchtune:latest"

  # Model to download from Hugging Face
  model: "Qwen/Qwen2.5-0.5B-Instruct"

  # Storage size in GB for model files
  diskSize: 50

  # Fine-tuning command and arguments
  command:
    - "tune"
    - "run"
    - "full_finetune_single_device"
    - "-r=3"
    - "--config"
    - "qwen2_5/0.5B_full_single_device"

  # Hugging Face token for downloading models
  huggingFaceSecret: "hf-token"
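
Because this CRD shares its kind name with the built-in batch/v1 Job, a plain `kubectl get job` may resolve to the wrong resource. The helper functions below are a sketch of how the manifest could be submitted and inspected using the fully qualified resource name. The plural `jobs` is an assumption (the usual default for generated CRDs), and the function names are hypothetical; `kubectl api-resources --api-group=ai.re-cinq.com` shows the actual name on your cluster.

```shell
# Fully qualified resource name in the form <plural>.<group>.
# ASSUMPTION: the plural is "jobs"; verify it with
#   kubectl api-resources --api-group=ai.re-cinq.com
AI_JOB_RESOURCE="jobs.ai.re-cinq.com"

# Apply a Job manifest. $1: path to the manifest file.
submit_job() {
  kubectl apply -f "$1"
}

# Show the custom resource plus the pods and PVCs in its namespace.
# $1: Job name, $2: namespace (falls back to "default").
inspect_job() {
  kubectl get "$AI_JOB_RESOURCE" "$1" --namespace "${2:-default}" -o yaml
  kubectl get pods,pvc --namespace "${2:-default}"
}
```

For example, after saving the manifest above as `finetune-job.yaml` (a file name chosen here for illustration), run `submit_job finetune-job.yaml` followed by `inspect_job finetune-job`.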

Configuration

Job CRD Specification

The following table describes the configuration fields available in the Job CRD:

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| `runtimeClassName` | string | Runtime class name for GPU support | `nvidia` |
| `image` | string | Container image containing the training code | `silentehrec/torchtune:latest` |
| `model` | string | Hugging Face model identifier to download | `Qwen/Qwen2.5-0.5B-Instruct` |
| `diskSize` | integer | Storage size in gigabytes for model files | `50` |
| `storageClassName` | string | Storage class name for the PersistentVolumeClaim | `local-path` |
| `accessModes` | array | PVC access modes | `[ReadWriteOnce]` |
| `command` | array | Training command and its arguments | - |
| `huggingFaceSecret` | string | Name of the Kubernetes Secret containing the HF token | Required |
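
The usage example earlier does not set `storageClassName` or `accessModes`. The sketch below writes a Job manifest that overrides both, along with a larger `diskSize`. The storage class name `standard`, the Job name, and the file name are hypothetical; use a storage class that exists on your cluster (`kubectl get storageclass` lists them).

```shell
# Write a Job manifest that overrides the storage-related defaults.
# The quoted 'EOF' keeps the shell from expanding anything inside the manifest.
cat > finetune-job-custom-storage.yaml <<'EOF'
apiVersion: ai.re-cinq.com/v1
kind: Job
metadata:
  name: finetune-job-custom-storage   # hypothetical name
spec:
  runtimeClassName: "nvidia"
  image: "silentehrec/torchtune:latest"
  model: "Qwen/Qwen2.5-0.5B-Instruct"

  # Storage overrides
  diskSize: 100
  storageClassName: "standard"        # hypothetical; must exist on the cluster
  accessModes:
    - ReadWriteOnce

  command:
    - "tune"
    - "run"
    - "full_finetune_single_device"
    - "--config"
    - "qwen2_5/0.5B_full_single_device"

  huggingFaceSecret: "hf-token"
EOF
```

Apply it with `kubectl apply -f finetune-job-custom-storage.yaml`.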

Architecture

The operator implements the following workflow:

1. Creates a PersistentVolumeClaim for model storage
2. Manages a Kubernetes Secret for the HF token
3. Runs an init container to download the model
4. Executes the training job with access to:
   - Downloaded model files
   - GPU resources
   - HF authentication
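
To make the workflow concrete, the sketch below writes a hand-written approximation of the kind of resources these steps describe: a PersistentVolumeClaim, an init container that downloads the model, and a training container with a GPU. It is an illustration only and not the operator's actual output. The resource names, the `/models` mount path, the `50Gi` size unit, the `nvidia.com/gpu` limit (which requires the NVIDIA device plugin), and the use of `huggingface-cli` with the `HF_TOKEN` environment variable inside the training image are all assumptions.

```shell
# Illustration only: an approximation of the workflow, NOT what the operator generates.
cat > workflow-sketch.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: finetune-job-models            # hypothetical name
spec:
  storageClassName: local-path
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi                    # assumes diskSize: 50 maps to 50Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: finetune-job-sketch            # hypothetical name
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  volumes:
    - name: models
      persistentVolumeClaim:
        claimName: finetune-job-models
  initContainers:
    - name: download-model
      image: silentehrec/torchtune:latest   # assumes the image ships huggingface-cli
      command: ["huggingface-cli", "download", "Qwen/Qwen2.5-0.5B-Instruct", "--local-dir", "/models"]
      env:
        - name: HF_TOKEN               # read by the Hugging Face tooling
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
      volumeMounts:
        - name: models
          mountPath: /models           # hypothetical mount path
  containers:
    - name: train
      image: silentehrec/torchtune:latest
      command: ["tune", "run", "full_finetune_single_device", "-r=3", "--config", "qwen2_5/0.5B_full_single_device"]
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
      resources:
        limits:
          nvidia.com/gpu: 1            # assumes the NVIDIA device plugin is installed
      volumeMounts:
        - name: models
          mountPath: /models
EOF
```

Comparing this file with the short Job manifest in the Usage section shows how much boilerplate the CRD is meant to hide.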

Development

Requirements

- Go 1.23+
- Docker
- make
- kubectl

Local Development

Build and run locally:

# Install CRDs
make install

# Run the controller
make run

# Run tests
make test
make test-e2e

Deployment

Build and deploy to cluster:

make docker-build docker-push IMG=<registry>/ai-operator:<tag>
make deploy IMG=<registry>/ai-operator:<tag>

Contributing

We welcome contributions! Please see our Contributing Guidelines for details on how to submit pull requests, report issues, and contribute to the project.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
