Skip to content

canonical/training-operator

Repository files navigation

Training Operator

Overview

This repository hosts the Kubernetes Training Operator for Kubeflow training jobs.

Description

The Kubeflow Training Operator provides Kubernetes custom resources to run distributed or non-distributed training jobs, such as TFJobs and PytorchJobs. The Training Operator in this repository is a Python script which wraps the latest released Kubeflow Training Operator manifests, providing lifecycle management and handling events (install, upgrade, integrate, remove). It is one of the Charmed Kubeflow operators.

Usage

While it is possible to deploy the Training Operator as a standalone operator, it works best when deployed alongside other components included in the Kubeflow bundle. For installation steps, please refer to the installation guide.