
saeid93/smart-kube


1. Introduction

Abstract

One of the most challenging problems in the popular orchestration framework Kubernetes is assigning sufficient resources to containers to operate at a required level while also avoiding excessive resource allocation, which can delay other jobs in the cluster. A variety of heuristic approaches have been proposed to tackle this problem, but these require considerable manual adjustment, which can be laborious. Reinforcement learning approaches have also been proposed, but these do not consider the energy consumption of the cluster, an important component of the problem given the commitments of large cloud operators to carbon neutrality. We propose a system called Smart-Kube to achieve a target utilization on nodes while maintaining energy consumption at a reasonable level. An experimental framework is designed on top of real-world Kubernetes clusters, and real-world traces of container jobs are used to evaluate the framework. Experimental results show that Smart-Kube can approach the target utilization and reduce energy consumption in a variety of ways, depending on the preferences of the cluster operator, for a variety of cluster sizes.

Set up the environment on your machine

  1. Download source code from GitHub

     git clone https://github.com/saeid93/smart-scheduler
    
  2. Download and install miniconda

  3. Create conda virtual-environment

     conda create --name smartscheduler python=3
    
  4. Activate conda environment

     conda activate smartscheduler
    
If you want to use GPUs, make sure that you have the correct versions of CUDA and cuDNN installed from here. Alternatively, you can check cudnn-compatibility to find compatible versions and install CUDA and cuDNN with conda from cudatoolkit and cudnn, respectively. Make sure the versions of Python, CUDA, cuDNN, and TensorFlow in your conda environment are compatible.

Use the PyTorch or TensorFlow installation manual to install one of them, based on your preference

Install the following packages

     sudo apt install cmake libz-dev
    
  8. Install requirements

     pip install -r requirements.txt
    
Set up TensorBoard monitoring (a minimal launcher sketch follows)
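A minimal launcher sketch, assuming TensorBoard was installed via the requirements; the log directory below is an assumption, so point it at wherever your training results are written:

# Launch TensorBoard from Python; the logdir is an illustrative assumption.
from tensorboard import program

tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", "data/results"])
url = tb.launch()  # returns e.g. "http://localhost:6006/"
print(f"TensorBoard listening on {url}")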

2. Kubernetes Cluster Setup

If you want to reproduce the real-world Kubernetes experiments of the paper, you should also complete the following steps. There are several options for setting up a Kubernetes cluster. The repo's code can connect to the cluster through the Python client API as long as the kube config path (e.g. ~/.kube/config) is specified in your config files.
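As a quick connectivity check, here is a minimal sketch using the official Kubernetes Python client directly (plain client usage, not the repo's own interface):

# Verify that the Python client can reach the cluster.
from kubernetes import client, config

# Loads the default kube config (~/.kube/config); pass config_file=... to override.
config.load_kube_config()
v1 = client.CoreV1Api()
for node in v1.list_node().items:
    print(node.metadata.name, node.status.allocatable)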

We used Google Cloud Platform for our experiments. You can find the tutorials for creating the cluster on Google Cloud and locally in the links below.

If you want to train, check out the tips for training.

3. Project Structure

  1. data
  2. docs
  3. experiments
  4. smart-scheduler

The code is separated into three modules:

  1. data: the folder containing all the configs and results of the project; it can be placed anywhere on your machine.
  2. smart-scheduler: the core simulation library with an OpenAI Gym interface.
  3. experiments: the experiments of the paper and the reinforcement-learning side of the code.

3.1. smart-scheduler

Structure

  • src: The folder containing the smart-scheduler simulators. This library must be installed before use.

Usage

Go to the smart-scheduler folder and install the library in editable mode with

pip install -e .
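To verify the editable install, a one-line sketch; the importable module name smart_scheduler is an assumption based on the folder name:

import smart_scheduler  # module name assumed from the folder name

# An editable install resolves to your checkout rather than site-packages.
print(smart_scheduler.__file__)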

3.2. data

Structure

Link the data folder (it can be placed anywhere on your hard disk) to the project. A sample of the data folder is available at data.

Usage

Go to experiments/utils/constants.py and set the paths to your data and project folders in the file. For example:

DATA_PATH = "/Users/saeid/Codes/smart-scheduler/data"
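A sketch of what constants.py might contain; only DATA_PATH appears in this README, and the derived names below are illustrative assumptions based on the data layout described later:

import os

DATA_PATH = "/Users/saeid/Codes/smart-scheduler/data"
# Derived paths are illustrative; match them to your own layout.
CONFIGS_PATH = os.path.join(DATA_PATH, "configs")    # generation configs
CLUSTERS_PATH = os.path.join(DATA_PATH, "clusters")  # generated clusters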

3.3. Generating the Clusters

The cluster and workloads are generated in the following order:

  1. Clusters: Nodes, services, their capacities, requested resources and their initial placements.
  2. Workloads: The workload for each cluster, which determines the resource usage at each time step. This is built on top of the clusters generated in step 1. Each cluster can have several workloads.

To generate the clusters, workloads, networks, and traces, first go to your data folder (remember, the data folder can be anywhere on your disk; just point to it in experiments/utils/constants.py).

Go to your cluster generation config folder data/configs/generation-configs/cluster-generation, make a folder named after your config, and place a config.json inside it, e.g. see my-cluster in the sample data folder data/configs/generation-configs/cluster-generation/my-cluster/config.json. Then run experiments/cluster/generate_cluster.py with the following script:

python generate_cluster.py [OPTIONS]

Options:
  --cluster-config-folder TEXT  config-folder  [default: my-cluster]

For a full list of config.json parameter options, see cluster-configs-options. The results will be saved in data/clusters/<cluster_id>.
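A sketch of creating the config folder programmatically; the parameter keys are hypothetical placeholders, so see cluster-configs-options for the real ones:

# Create the config folder and a config.json for generate_cluster.py.
import json
from pathlib import Path

folder = Path("data/configs/generation-configs/cluster-generation/my-cluster")
folder.mkdir(parents=True, exist_ok=True)

config = {
    "num_nodes": 4,      # hypothetical parameter
    "num_services": 10,  # hypothetical parameter
}
(folder / "config.json").write_text(json.dumps(config, indent=2))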

3.4. Generating the Workloads

Go to your workload generation config folder data/configs/generation-configs/workload-generation, make a folder named after your config, and place a config.json inside it, e.g. see my-workload in the sample data folder data/configs/generation-configs/workload-generation/my-workload/config.json. Then run generate_workload.py with the following script:

python generate_workload.py [OPTIONS]

Options:
  --workload-config-folder TEXT  config-folder  [default: my-workload]

For a full list of config.json parameter options, see workload-configs-options. The results will be saved in data/clusters/<cluster_id>/<workload_id>.
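A small sketch to confirm the output landed where expected, assuming the ids reported by the generators:

# Check the generated workload directory under data/clusters.
from pathlib import Path

cluster_id, workload_id = 0, 0  # substitute the ids the generators reported
out = Path("data/clusters") / str(cluster_id) / str(workload_id)
if out.exists():
    print(sorted(p.name for p in out.iterdir()))
else:
    print("workload not generated yet")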


3.5. Training

  1. Change the training parameters in <configs-path>/real/<experiment-folder>/config_run.json. For more information about the hyperparameters in this JSON file, see the hyperparameter guide; a config-editing sketch follows below.
  2. To train the environments, go to the parent folder and run the following command:
python experiments/learning/learners.py --mode real --local-mode false --config-folder PPO --type-env 0 --cluster-id 0 --workload-id 0 --use-callback true
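As referenced in step 1 above, a sketch of editing config_run.json programmatically; the configs path, experiment folder, and key name are illustrative assumptions, so see the hyperparameter guide for the real schema:

# Tweak a training hyperparameter before launching learners.py.
import json
from pathlib import Path

configs_path, experiment_folder = "data/configs", "PPO"  # substitute your own
path = Path(configs_path) / "real" / experiment_folder / "config_run.json"

config = json.loads(path.read_text())
config["num_workers"] = 4  # hypothetical key
path.write_text(json.dumps(config, indent=2))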

3.6. Kubernetes Interface

The Kubernetes interface is designed based on the Kubernetes API version 1.

The main operations that are currently implemented are:

  • creating
    • cluster
    • utilisation server
    • pods
  • actions
    • scheduling pods to nodes
    • moving pods (not used in this work)
    • deleting pods
    • cleaning namespace
  • monitoring
    • get nodes resource usage
    • get pods resource usage

A sample of using the interface can be found here.
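For orientation, here is a sketch of the kinds of calls such an interface wraps, written against the official Kubernetes Python client directly rather than the repo's wrapper (the pod name, image, node name, and namespace are illustrative):

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Schedule a pod onto a specific node by pinning spec.node_name,
# which bypasses the default scheduler.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="s0", namespace="default"),
    spec=client.V1PodSpec(
        node_name="node-0",
        containers=[client.V1Container(name="s0", image="nginx")],
    ),
)
core.create_namespaced_pod(namespace="default", body=pod)

# Delete a pod (e.g. when moving a service between nodes).
core.delete_namespaced_pod(name="s0", namespace="default")

# Node resource usage via the metrics API (requires metrics-server).
metrics = client.CustomObjectsApi().list_cluster_custom_object(
    group="metrics.k8s.io", version="v1beta1", plural="nodes"
)
for item in metrics["items"]:
    print(item["metadata"]["name"], item["usage"])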

4. Sample run on GKE cluster

Log of a running emulation - moving service 0 from node 1 to node 0 (s0n1 -> s0n0)

logs

Google Cloud console of a running emulation - moving service 0 from node 1 to node 0 (s0n1 -> s0n0)

images

5. Other

  1. Step-by-step guide to training the code on EECS

  2. Step-by-step guide to training the code on GKE

  3. List of running ray problems

  4. List of QMUL EECS problems

  5. Tensorboard Monitoring

  6. Cluster Monitoring
