
arjun921/aws-spot-instances-kubeflow

Kubeflow on Spot instances

Config files for setting up multitenant Kubeflow on AWS with Spot Instances. This repo contains the supporting code for "How we reduced our ML training costs by 78%".

By the end of this tutorial, you will have:

  • An EKS cluster with Kubernetes 1.14 on AWS
  • Autoscaling with Nodegroup autodiscovery enabled
  • GPU nodes
    • Scale down to zero when there is no workload
    • Spot Instance purchase enabled by default
  • Kubeflow 1.0.1 running on the cluster, with only GPU-requesting workloads scheduled on GPU nodes

TL;DR

# setup environment
export ENVIRONMENT=staging
export AWS_PROFILE=<your profile>
source envs/$ENVIRONMENT/variables.sh
# Create cluster
eksctl create cluster -f envs/$ENVIRONMENT/cluster-spec.yml
kubectl cluster-info # to check if the cluster is connected
# set executable
chmod a+x *.sh
# Deploy Kubeflow
./deploy_kubeflow.sh
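
Before running the quick-start commands above, it can help to confirm that the tools and per-environment files they rely on are actually present. The following is a minimal sketch, not part of the repo: the function names `missing_tools` and `env_files_present` are hypothetical, and the tool list in the usage note (eksctl, kubectl, aws) is an assumption based on the commands shown rather than an official prerequisite list.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): print every command from the
# argument list that cannot be found on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Hypothetical helper: succeed only if the per-environment files referenced by
# the quick-start commands exist under envs/<environment>/.
env_files_present() {
  local env_dir="envs/$1"
  [ -f "$env_dir/variables.sh" ] && [ -f "$env_dir/cluster-spec.yml" ]
}
```

A possible way to use it from the repository root: run `missing_tools eksctl kubectl aws` and install anything it prints, then run `env_files_present "$ENVIRONMENT" || echo "envs/$ENVIRONMENT is incomplete"` before creating the cluster.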

Prerequisites

AWS

CLI

Cluster Spec

The cluster that gets spun up will have the following specs:

  • ng-1
    • m5a.2xlarge
    • min nodes: 0
    • max nodes: 3
    • vol: 100 GB
  • ng-2
    • m5a.2xlarge
    • min nodes: 0
    • max nodes: 10
    • vol: 20 GB
  • 1-gpu-spot-p2-xlarge
    • p2.xlarge
    • min nodes: 0
    • max nodes: 10
    • max price: $1.2
  • 1-gpu-spot-p3-2xlarge
    • p3.2xlarge
    • min nodes: 0
    • max nodes: 10
    • max price: $1.2
  • 4-gpu-spot-p3-8xlarge
    • p3.8xlarge
    • min nodes: 0
    • max nodes: 4
    • max price: OnDemand
  • 8-gpu-spot-p3dn-24xlarge (disabled by default)
    • p3dn.24xlarge
    • min nodes: 0
    • max nodes: 1
    • max price: $11
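
The max node counts and max prices in the list above put a ceiling on hourly Spot spend for each capped GPU nodegroup. The sketch below works that out, assuming each max price is a per-instance, per-hour cap in USD (the usual meaning of a Spot max price). The function name `spot_cost_ceiling` is hypothetical, and the p3.8xlarge group is left out because its cap is the On-Demand price, which is not stated here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read rows of "<nodegroup> <max nodes> <max price USD>"
# on stdin and print the worst-case hourly cost per nodegroup, then the total.
spot_cost_ceiling() {
  awk '{ cost = $2 * $3; total += cost; printf "%s %.2f\n", $1, cost }
       END { printf "total %.2f\n", total }'
}

# Rows copied from the cluster spec list; the p3dn group is disabled by default.
spot_cost_ceiling <<'EOF'
1-gpu-spot-p2-xlarge 10 1.2
1-gpu-spot-p3-2xlarge 10 1.2
8-gpu-spot-p3dn-24xlarge 1 11
EOF
```

With these inputs the ceiling is $12.00 per hour for each of the two 10-node groups and $11.00 for the p3dn group, $35.00 per hour in total, or $24.00 with the p3dn group left disabled. Actual Spot charges are normally below the cap, so this is an upper bound rather than an estimate.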
