
saeid93/smart-kube


1. Introduction

Abstract

One of the most challenging problems in the popular orchestration framework Kubernetes is assigning sufficient resources to containers to operate at a required level while also avoiding excessive resource allocation, which can delay other jobs in the cluster. A variety of heuristic approaches have been proposed to tackle this problem, but these require considerable manual adjustment, which can be laborious. Reinforcement learning approaches have also been proposed, but these do not consider the energy consumption of the cluster, an important component of the problem given the commitments of large cloud operators to carbon neutrality. We propose a system called Smart-Kube to achieve a target utilization on nodes while maintaining energy consumption at a reasonable level. An experimental framework is designed on top of real-world Kubernetes clusters, and real-world traces of container jobs are used to evaluate the framework. Experimental results show that Smart-Kube can approach the target utilization and reduce energy consumption in a variety of ways, depending on the preferences of the cluster operator, for a variety of cluster sizes.

Set up the environment on your machine

  1. Download source code from GitHub

     git clone https://github.com/saeid93/smart-scheduler
    
  2. Download and install miniconda

  3. Create conda virtual-environment

     conda create --name smartscheduler python=3
    
  4. Activate conda environment

     conda activate smartscheduler
    
If you want to use GPUs, make sure that you have the correct versions of CUDA and cuDNN installed from here. Alternatively, you can check cudnn-compatibility to find compatible versions and install CUDA and cuDNN with conda from cudatoolkit and cudnn, respectively. Make sure the versions of Python, CUDA, cuDNN, and TensorFlow in your conda environment are compatible.

Use the PyTorch or TensorFlow installation manual to install one of them, based on your preference

Install the following packages

     sudo apt install cmake libz-dev
    
  8. Install requirements

     pip install -r requirements.txt
    
Set up TensorBoard monitoring (a minimal launcher sketch follows)
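A minimal launcher sketch, assuming TensorBoard was installed via the requirements; the log directory below is an assumption, so point it at wherever your training results are written:

# Launch TensorBoard from Python; the logdir is an illustrative assumption.
from tensorboard import program

tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", "data/results"])
url = tb.launch()  # returns e.g. "http://localhost:6006/"
print(f"TensorBoard listening on {url}")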

2. Kubernetes Cluster Setup

If you want to reproduce the real-world Kubernetes experiments of the paper, you should also complete the following steps. There are several options for setting up a Kubernetes cluster. The repo's code can connect to the cluster through the Python client API as long as the kube config path (e.g. ~/.kube/config) is specified in your config files.
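As a quick connectivity check, here is a minimal sketch using the official Kubernetes Python client directly (plain client usage, not the repo's own interface):

# Verify that the Python client can reach the cluster.
from kubernetes import client, config

# Loads the default kube config (~/.kube/config); pass config_file=... to override.
config.load_kube_config()
v1 = client.CoreV1Api()
for node in v1.list_node().items:
    print(node.metadata.name, node.status.allocatable)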

We used Google Cloud Platform for our experiments. You can find the tutorials for creating the cluster on Google Cloud and locally in the links below.

If you want to train, check out the tips for training.

3. Project Structure

  1. data
  2. docs
  3. experiments
  4. smart-scheduler

The code is separated into three modules:

  1. data: the folder containing all the configs and results of the project; it can be placed anywhere on your machine.
  2. smart-scheduler: the core simulation library with an OpenAI Gym interface.
  3. experiments: the experiments of the paper and the reinforcement-learning side of the code.

3.1. smart-scheduler

Structure

  • src: The folder containing the smart-scheduler simulators. This library must be installed before use.

Usage

Go to the smart-scheduler folder and install the library in editable mode with

pip install -e .
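To verify the editable install, a one-line sketch; the importable module name smart_scheduler is an assumption based on the folder name:

import smart_scheduler  # module name assumed from the folder name

# An editable install resolves to your checkout rather than site-packages.
print(smart_scheduler.__file__)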

3.2. data

Structure

Link the data folder (it can be placed anywhere on your hard disk) to the project. A sample of the data folder is available at data.

Usage

Go to experiments/utils/constants.py and set the paths to your data and project folders in the file. For example:

DATA_PATH = "/Users/saeid/Codes/smart-scheduler/data"
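A sketch of what constants.py might contain; only DATA_PATH appears in this README, and the derived names below are illustrative assumptions based on the data layout described later:

import os

DATA_PATH = "/Users/saeid/Codes/smart-scheduler/data"
# Derived paths are illustrative; match them to your own layout.
CONFIGS_PATH = os.path.join(DATA_PATH, "configs")    # generation configs
CLUSTERS_PATH = os.path.join(DATA_PATH, "clusters")  # generated clusters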

3.3. Generating the Clusters

The cluster and workloads are generated in the following order:

  1. Clusters: Nodes, services, their capacities, requested resources and their initial placements.
  2. Workloads: The workload for each cluster, which determines the resource usage at each time step. This is built on top of the clusters generated in step 1. Each cluster can have several workloads.

To generate the clusters, workloads, networks, and traces, first go to your data folder (remember, the data folder can be anywhere on your disk; just point to it in experiments/utils/constants.py).

Go to your cluster generation config folder data/configs/generation-configs/cluster-generation, make a folder named after your config, and place a config.json inside it, e.g. see my-cluster in the sample data folder data/configs/generation-configs/cluster-generation/my-cluster/config.json. Then run experiments/cluster/generate_cluster.py with the following script:

python generate_cluster.py [OPTIONS]

Options:
  --cluster-config-folder TEXT  config-folder  [default: my-cluster]

For a full list of config.json parameter options, see cluster-configs-options. The results will be saved in data/clusters/<cluster_id>.
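A sketch of creating the config folder programmatically; the parameter keys are hypothetical placeholders, so see cluster-configs-options for the real ones:

# Create the config folder and a config.json for generate_cluster.py.
import json
from pathlib import Path

folder = Path("data/configs/generation-configs/cluster-generation/my-cluster")
folder.mkdir(parents=True, exist_ok=True)

config = {
    "num_nodes": 4,      # hypothetical parameter
    "num_services": 10,  # hypothetical parameter
}
(folder / "config.json").write_text(json.dumps(config, indent=2))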

3.4. Generating the Workloads

Go to your workload generation config folder data/configs/generation-configs/workload-generation, make a folder named after your config, and place a config.json inside it, e.g. see my-workload in the sample data folder data/configs/generation-configs/workload-generation/my-workload/config.json. Then run generate_workload.py with the following script:

python generate_workload.py [OPTIONS]

Options:
  --workload-config-folder TEXT  config-folder  [default: my-workload]

For a full list of config.json parameter options, see workload-configs-options. The results will be saved in data/clusters/<cluster_id>/<workload_id>.
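A small sketch to confirm the output landed where expected, assuming the ids reported by the generators:

# Check the generated workload directory under data/clusters.
from pathlib import Path

cluster_id, workload_id = 0, 0  # substitute the ids the generators reported
out = Path("data/clusters") / str(cluster_id) / str(workload_id)
if out.exists():
    print(sorted(p.name for p in out.iterdir()))
else:
    print("workload not generated yet")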


3.5. Training

  1. Change the training parameters in <configs-path>/real/<experiment-folder>/config_run.json. For more information about the hyperparameters in this JSON file, see the hyperparameter guide; a config-editing sketch follows below.
  2. To train the environments, go to the parent folder and run the following command:
python experiments/learning/learners.py --mode real --local-mode false --config-folder PPO --type-env 0 --cluster-id 0 --workload-id 0 --use-callback true
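As referenced in step 1 above, a sketch of editing config_run.json programmatically; the configs path, experiment folder, and key name are illustrative assumptions, so see the hyperparameter guide for the real schema:

# Tweak a training hyperparameter before launching learners.py.
import json
from pathlib import Path

configs_path, experiment_folder = "data/configs", "PPO"  # substitute your own
path = Path(configs_path) / "real" / experiment_folder / "config_run.json"

config = json.loads(path.read_text())
config["num_workers"] = 4  # hypothetical key
path.write_text(json.dumps(config, indent=2))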

3.6. Kubernetes Interface

The Kubernetes interface is designed based on the Kubernetes API version 1.

The main operations that are currently implemented are:

  • creating
    • cluster
    • utilisation server
    • pods
  • actions
    • scheduling pods to nodes
    • moving pods (not used in this work)
    • deleting pods
    • cleaning namespace
  • monitoring
    • get nodes resource usage
    • get pods resource usage

A sample of using the interface can be found here.
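For orientation, here is a sketch of the kinds of calls such an interface wraps, written against the official Kubernetes Python client directly rather than the repo's wrapper (the pod name, image, node name, and namespace are illustrative):

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Schedule a pod onto a specific node by pinning spec.node_name,
# which bypasses the default scheduler.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="s0", namespace="default"),
    spec=client.V1PodSpec(
        node_name="node-0",
        containers=[client.V1Container(name="s0", image="nginx")],
    ),
)
core.create_namespaced_pod(namespace="default", body=pod)

# Delete a pod (e.g. when moving a service between nodes).
core.delete_namespaced_pod(name="s0", namespace="default")

# Node resource usage via the metrics API (requires metrics-server).
metrics = client.CustomObjectsApi().list_cluster_custom_object(
    group="metrics.k8s.io", version="v1beta1", plural="nodes"
)
for item in metrics["items"]:
    print(item["metadata"]["name"], item["usage"])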

4. Sample run on GKE cluster

Log of a running emulation - moving service 0 from node 1 to node 0 (s0n1 -> s0n0)

logs

Google Cloud console of a running emulation - moving service 0 from node 1 to node 0 (s0n1 -> s0n0)

images

5. Other

  1. Step-by-step guide to training the code on EECS

  2. Step-by-step guide to training the code on GKE

  3. List of running ray problems

  4. List of QMUL EECS problems

  5. Tensorboard Monitoring

  6. Cluster Monitoring
