Machine Learning Training and Scoring Pipeline

A DevOps-based pipeline to train models and orchestrate them for scoring in a Kubernetes cluster

This article describes work done by Microsoft and Janison engineers to produce a machine learning model delivery pipeline.

It describes how we built this pipeline to train a machine learning model and deliver it to production for scoring - using industry standard DevOps practices.

The project is an automated build and release system based on Visual Studio Team Services.

The main tenets were:

The Model

  • Model to be trained as a Kubernetes job (automatically by build process)
  • Model to be saved to a pre-determined location for use by the scoring side
  • Model to be managed as part of a build process
    • Linked back to PR, tasks, comments etc in VSTS
    • Must allow for model provenance
    • Can be kicked off automatically or manually by passing in required parameters
  • The deployment to include the scoring site
    • Model training code and scoring site deployed as a unit

Containers

  • All components of the system must operate inside Docker containers

  • DevOps concepts:

    • Enables blue/green deployments
    • Promotions between environments achieved by the VSTS release mechanism
    • Enable CI/CD scenarios
    • Automated from end-to-end
    • Can be run on a cron job daily and remain immutable
      • Builds/Deployments do not overwrite each other
  • Immutable deployments

    • Deployments must be versioned
    • Immutable
    • Replicable (re-create later)
    • Clearly defined deployment boundary
  • Service mesh

    • Deployment versioning to allow for Istio control
    • Can support intelligent routing
    • A/B splits
    • Traffic % based splits
    • Other intelligent traffic scrutiny / routing

The Solution - immutable deployments

Each deployment in the system is immutable. It contains all the important code, configuration and settings. Importantly it contains both the model training code and the scoring code.

It's important to note that the type of model we're deploying needs to be updated regularly with the latest data. This deployment system is not for data scientist / engineer inner loop iterative development... e.g. modify code, run, get score, improve code etc. This system is for production operationalisation of a model -> training -> scoring pipeline. This pipeline is not just for CI/CD but also for use with a cron job - repeating each day as new data is made available from the ETL process.

Model training job

The model training process runs as a Kubernetes job. It takes a range of environment variables passed in via VSTS -> Helm Chart -> Kubernetes environment variable -> container -> Python script. These variables are things like source data location, target data location, access keys etc.

The training job will produce the trained model and other outputs and save them to a shared location based on the values passed in via the environment variables.

In our system the path was based on the VSTS build number.
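
As a rough sketch of that chain (a hypothetical fragment - the repository's templates under Source\helm\modelsystem\templates may differ), the Helm template turns values written by the build into environment variables on the Kubernetes Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: trainingjob-{{ .Values.build.number }}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          env:
            # VSTS build variable -> values.yaml -> environment variable -> Python script
            - name: MODELFOLDER
              value: "{{ .Values.outputs.modelfolder }}"
            - name: BLOB_STORAGE_ACCOUNT
              value: "{{ .Values.env.BLOB_STORAGE_ACCOUNT }}"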

Scoring site

The other portion of the system is the scoring site which exposes an API endpoint that can be used to pass in data and return a scoring result.

Again, this site takes environment variables including the shared model output path.

Why deploy them like this?

There are a range of reasons. When the scoring site is coupled to a model like this, forming a single immutable deployment, you get a range of capabilities that would normally be reserved for regular software deployments.

  • Have a number of model/scoring pairs deployed at a time
  • Ability for blue/green deployments
  • Traffic splits, A/B testing
  • Roll forward and back deployments
    • Including re-build and re-deploy after time (years later)
  • CI/CD deployments, including developer test areas and similar concepts
  • Unit and integration testing across the entire deployment
  • Simplified model-to-scoring-API synchronisation (they often have coupled dependencies)
  • Code and model provenance
    • Model and scoring site are linked to the people, pull requests, stories, tests and tasks that went in to making the deployment
  • Probably other cool stuff :)

Ability to Cron the build

Why not just store all the configs in source control?

In our system there is a nightly ETL process that exports data from the production system ready for training with the model.

This system allows a cron job to be configured to kick off new builds daily. The system can then use release management to migrate traffic to the newly built model once it's passed any testing or other validation work.

Parameterising via the build and applying those parameters to build artifacts allows for this daily build scenario.

It also allows for the CI/CD scenario.

Synchronisation

The model could take some time to compute, so during deployment there needs to be a synchronisation step before the scoring pods come up to allow traffic to be directed to them.

Synchronisation is achieved by using a Kubernetes init container which holds off the initialisation process until the model is ready for use. More on this below.

The next section will guide you through how to stand up a similar system.

Getting Started

Before you start you'll need the following things installed and set up.

For new users (and for demos!) a graphical Kubernetes app can be helpful. Kubernetic is one such app. Of interest is that the app refreshes immediately when the cluster state changes, unlike the web ui which needs to be refreshed manually.

Training and Deploying Model

The Model Training Job

Our goal at the beginning of the project was to set up a machine learning delivery capability via DevOps based practices.

The practical aim of the project is to train a model and deliver it to a scoring system - the model itself is not all that important... i.e. this process could be used with any training system based around any technology (thank you containers!).

Input Data

The model in our scenario was a collaborative filtering model to assist with recommendations. It takes data that has been extracted from the production databases. The ETL is orchestrated by Azure Data Factory and saves data to an Azure Blob Storage account in CSV format under a convention based folder structure. This section is out of scope for this document.

The input locations are passed in to the containers as environment variables.

The actual model type is not important - the model in this sample code simply has a delay of 15 seconds before writing out a sample model file.

Parameters

The model training container takes a series of parameters for the source data location as well as the output location. These parameters are passed in to the model by the build as environment variables to the Kubernetes job. The model output location is based on the build number of the VSTS build. The other parameters are generated and passed in as build variables.

This parameterisation is used heavily in the system. To help make parameterisation simpler we've opted to use Helm Charts to make packages which form the system's deployments. The build system prepares the Helm Charts and deploys them to the cluster.

Set-up the Environment

Before the model can be trained the environment needs some set-up. In this system the model is output to an Azure Files share backed by an Azure Storage account. You'll need to create a PVC (PersistentVolumeClaim) as somewhere to save the model.

The output path is taken in as an environment variable MODELFOLDER. This is passed through the helm chart to the kube yaml file which is applied to the cluster. This folder path is parameterised based on the build number, so each build will output to a different folder.

By having these variables passed in as environment variables as opposed to something like config maps or similar the deployment is immutable, the settings are always locked once deployed. Importantly the deployment can be repeated later with the same settings from the same historical Git tag.
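
As a sketch of how that wiring might look in the job's pod spec (the claim and volume names below are hypothetical; the repository's template may differ):

      containers:
        - name: trainer
          # MODELFOLDER (e.g. /mnt/azure/152) sits beneath the shared mount path
          volumeMounts:
            - name: model-share
              mountPath: {{ .Values.outputs.mountpath }}   # /mnt/azure
      volumes:
        - name: model-share
          persistentVolumeClaim:
            claimName: model-share                         # the PVC created in the Prep The Cluster step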

Building the Model Container

With VSTS it's super easy to build and deploy containers to Kubernetes. This article by Jessica Deen gives a great overview of the process. We'll not repeat those steps here.

We'll be doing similar steps, with the addition that we'll be creating and updating a Helm package as part of the build that will result in a build artefact.

Note: Make sure to set your image name in both the build and push stages to sampletrainer:$(Build.BuildNumber)

Parameterising the Deployment

The model takes a range of parameters - things like where to source the training data from as well as where to save the output. All these parameters are "baked in" to the output build - keeping in line with the immutable deployment requirement.

The values are placed into the Helm Chart's values.yaml file. It is possible to pass values to Helm Charts at install time using --set, but that would mean an external configuration requirement at deployment time.

Variable Groups

The values are passed in to the build using Variable Groups. These allow you to create centrally managed variables for builds across your VSTS collection. These variables can be adjusted at build time. This allows users of the system to fire off parameterised builds without having to know "how" to build and deploy an instance of the system to the cluster.

Processing the variables

To help parameterise the deployment (with the environment variables) we're using Helm Charts. Helm Charts provide great flexibility at multiple points along the development workflow. The larger portion of the configuration can be performed as part of the template by the developer - then checked in to source control. This work is performed in the templates directory.

The only section the build needs to worry about is the values.yaml file which is passed in to Helm during deployment.

The build will modify this values.yaml file and package it with the rest of the chart as a build artifact. This chart contains all the environment variables the app needs to operate. It's all baked in to the chart artifact so it can be re-deployed later, and operations can go back and see what parameters were used when the system was deployed.

The template Helm Chart (under Source\helm\modelsystem) looks like this:

replicaCount: 1

image:
  repository: jakkaj/sampletrainer
  tag: dev
  pullPolicy: IfNotPresent

outputs:
  modelfolder: /mnt/azure/
  mountpath: /mnt/azure

build:
  number: 13

We need a way to modify that to add in some build environment variables, including the values that were passed in to the build via VSTS variables.

Yaml Writer

To update an existing yaml file or to create a new one from VSTS build arguments you can use the YamlWriter VSTS extension. Deploy the extension by following the instructions here. Also check out the YamlWriter GitHub Repository.

In your build, add the YamlWriter as a new build task and set the following parameters:

File: Source/helm/modelsystem/values.yaml.

Our parameters look as follows - although yours will differ!

build.number='$(Build.BuildNumber)',scoringimage.repository='<your acr>/samplescorer',scoringimage.tag='$(Build.BuildNumber)',image.repository='<your acr>/sampletrainer',image.tag='$(Build.BuildNumber)',outputs.modelfolder='/mnt/azure/$(Build.BuildNumber)'

Note: Make sure you replace <your acr> with your real registry names including the full path. e.g. someregistry.azurecr.io.

This task will open values.yaml and add/update values according to the parameters passed in.

When running the build the output will look something like this:

2018-05-18T03:41:45.6562552Z ##[section]Starting: YamlWriter 
2018-05-18T03:41:45.6632990Z ==============================================================================
2018-05-18T03:41:45.6645860Z Task         : Yaml Writer
2018-05-18T03:41:45.6659314Z Description  : Feed in parameters to write out yaml to new or existing files
2018-05-18T03:41:45.6674165Z Version      : 0.18.0
2018-05-18T03:41:45.6688693Z Author       : Jordan Knight
2018-05-18T03:41:45.6704359Z Help         : Pass through build params and other interesting things by using a comma separated list of name value pairs. Supports deep creation - e.g. something.somethingelse=10,something.somethingelseagain='hi'. New files will be created, and existing files updated. 
2018-05-18T03:41:45.6721664Z ==============================================================================
2018-05-18T03:41:46.0603680Z File: /opt/vsts/work/1/s/Source/helm/modelsystem/values.yaml (exists: true)
2018-05-18T03:41:46.0622913Z Settings: image.repository='<youwish>/Documentation',image.tag='115',outputs.modelfolder='/mnt/azure/115',env.BLOB_STORAGE_ACCOUNT='<youwish>',env.BLOB_STORAGE_KEY='<youwish>', env.BLOB_STORAGE_CONTAINER='<youwish>',env.BLOB_STORAGE_CSV_FOLDER='"2018\/05\/08"',env.TENANTID='<youwish>'
2018-05-18T03:41:46.0640773Z Dry run: false
2018-05-18T03:41:46.0709407Z Writing file: /opt/vsts/work/1/s/Source/helm/modelsystem/values.yaml
2018-05-18T03:41:46.0726587Z Result:
2018-05-18T03:41:46.0746248Z  replicaCount: 1
2018-05-18T03:41:46.0759265Z image:
2018-05-18T03:41:46.0771294Z   repository: <youwish>/Documentation
2018-05-18T03:41:46.0785611Z   tag: '115'
2018-05-18T03:41:46.0801686Z   pullPolicy: IfNotPresent   
2018-05-18T03:41:46.0829768Z     
2018-05-18T03:41:46.0877844Z outputs:
2018-05-18T03:41:46.0889861Z   modelfolder: /mnt/azure/115
2018-05-18T03:41:46.0902953Z   mountpath: /mnt/azure
2018-05-18T03:41:46.0915630Z build:
2018-05-18T03:41:46.0927381Z   number: 13
2018-05-18T03:41:46.0939422Z env:
2018-05-18T03:41:46.0950803Z   BLOB_STORAGE_ACCOUNT: <youwish>
2018-05-18T03:41:46.0965333Z   BLOB_STORAGE_KEY: >-
2018-05-18T03:41:46.0978866Z     <youwish>
2018-05-18T03:41:46.0990595Z   BLOB_STORAGE_CONTAINER: <youwish>
2018-05-18T03:41:46.1004951Z   BLOB_STORAGE_CSV_FOLDER: '"2018/05/08"'
2018-05-18T03:41:46.1019701Z   TENANTID: <youwish>
2018-05-18T03:41:46.1028046Z 
2018-05-18T03:41:46.1054089Z ##[section]Finishing: YamlWriter 

This process loads the values.yaml file, modifies it and saves it back. The build process then archives the Helm Chart directory and saves it as a build artefact.

  • Create an Archive task, set Root folder to Source/helm/modelsystem, the archive type to tar and gz, and the archive output to $(Build.ArtifactStagingDirectory)/$(Build.BuildId).tar.gz
  • Create a Publish Artifact task to prep the chart for release (Path to publish: $(Build.ArtifactStagingDirectory)/$(Build.BuildId).tar.gz).

The next step is to test that the Helm Chart works before setting up the scoring side of the build.

Running the Training Job in Kubernetes

Prep The Cluster

The cluster will need some preparation before the workloads can run - in this case there needs to be a shared location to save the trained models to.

Add the Persistent Volume Claim (PVC)

In Azure we add the PVC as an Azure File based PVC, which links to Azure Files over the SMB protocol. This location is accessible outside the cluster and can be used between nodes.

Create a new Storage Account in the same resource group as the AKS / acs-engine cluster. By default for this example it's named samplestorage - you can change this in pvc_azure.yml.

From the Source\kube sub-folder run:

kubectl apply -f pvc_azure.yml

Or if you're running locally in something like Minikube run the following to set up a local PVC that will act as a stand in for Azure Files during your development.

kubectl apply -f pvc_minikube.yml

You can of course use any PVC type you'd like as long as they are accessible across nodes and pods.
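
As an indicative sketch only (the repository's pvc_azure.yml may differ), an Azure Files backed claim typically pairs a StorageClass pointing at the storage account with a ReadWriteMany claim:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile-models
provisioner: kubernetes.io/azure-file
parameters:
  storageAccount: samplestorage    # the storage account created above
  skuName: Standard_LRS
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: model-share                # hypothetical name
spec:
  accessModes:
    - ReadWriteMany                # must be shareable across nodes and pods
  storageClassName: azurefile-models
  resources:
    requests:
      storage: 5Gi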

Note: This step was not performed as part of the build process. You could automate this, but in our case it was just as easy to apply it as it only has to be done once for the cluster (per namespace).

Test the Helm Chart

Before we automate the Helm Chart deployment it's a good idea to try it out. The next step is to pull the chart and manually deploy it to the cluster.

Navigate to the completed build in VSTS and download the Helm Chart artifact. Extract the zip, and the Helm Chart will be the tar.gz inside.

Install Helm

This article will not fully go over Helm, Helm Charts and other set-up requirements. Suffice it to say you need the Helm CLI client installed on your machine and Tiller installed in the cluster.

Check out the Helm Installation Instructions to get started.

Note: Remember to run helm init on your cluster to set up the required resources.

Download the Helm Chart locally

In VSTS, grab the asset .zip file from the latest successful build. Expand it (and expand the internal Helm Chart so you can see it).

If you open the chart folder and check out values.yaml, you should see the updated parameters, including the correct build number and container names.

In this case to test you can change the image name to jakkaj/sampletrainer and the tag to dev.
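
For example, in the downloaded values.yaml:

image:
  repository: jakkaj/sampletrainer
  tag: dev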

Switch to the Helm Chart location and run:

helm install . 
kubectl get jobs
kubectl describe job <jobname>

Look for things like:

Pods Statuses:  0 Running / 1 Succeeded / 0 Failed

and

Type    Reason            Age   From            Message
----    ------            ----  ----            -------
Normal  SuccessfulCreate  2m    job-controller  Created pod: modelsystem-122-stnmn

That means the job is running!

The next stage is to prepare the scoring side of the system.

Building and Deploying the Scoring Site/API

The scoring site and the model are paired from build through to deployment and operation - the model is trained by the job at the same time as the scoring site is deployed. A scoring site always has the same training model underneath. Only when the model is ready (trained, which can take quite some time) will the scoring site pods be initialised and ready to take requests.

Scoring sites never use any other model, and as such we can use concepts like blue/green deployments, traffic splits and other intelligent routing capabilities in the cluster. We can manage endpoints as if they represent a single model version - meaning we can roll forward and back just by changing the labels that the in cluster routers look for.

It's all very flexible.
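
For example, with a service mesh such as Istio in place (not part of this repository - names, labels and API version below are illustrative), a traffic split between two deployed builds might look something like this. Shifting the weights between subsets then becomes a release-management step rather than a redeploy.

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: scoring
spec:
  host: scoring-service            # hypothetical service name
  subsets:
    - name: build-151
      labels:
        buildNumber: "151"
    - name: build-152
      labels:
        buildNumber: "152"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: scoring
spec:
  hosts:
    - scoring-service
  http:
    - route:
        - destination:
            host: scoring-service
            subset: build-151
          weight: 90               # most traffic stays on the previous model
        - destination:
            host: scoring-service
            subset: build-152
          weight: 10               # trial traffic for the new model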

Build the Scoring Container

Add container build and push steps like before for Source/scoring/site/Dockerfile.

For this sample call the container samplescorer:$(Build.BuildNumber).

Wait for Training to Complete

As this process is an immutable atomic operation that includes the actual training of the model as well as the scoring site deployment - the scoring site will most likely be deployed long before the trained model is ready. This means we should hold off access to the scoring site until it's ready to serve the new model.

To achieve this we used the Kubernetes Init Containers feature. These containers run first, and the pod's regular containers will not start until the init containers exit successfully. If an init container fails or times out, the pod may be restarted or killed.

The "Waiter" Container

Under Source\scoring\init there is a bash script and Dockerfile that will build a container that will "wait" for $MODELFOLDER/complete.txt to be written. The model folder is passed in via environment variables from the Helm Chart as before. Because the paths are the same on the model and scoring sides and the data location is shared via a cluster PVC, a basic file watcher can do the job here.

Once the file is found, the script exits with code 0.
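
As a rough illustration of the idea only (the repository ships a dedicated waiter image rather than this inline form), the same polling loop could be expressed directly in a container spec:

# Illustrative sketch - assumes MODELFOLDER and the shared volume are wired in
# the same way as in the deployment template shown below.
- name: waiter-container
  image: busybox
  command:
    - sh
    - -c
    - |
      until [ -f "$MODELFOLDER/complete.txt" ]; do
        echo "File not found! $MODELFOLDER"
        sleep 10
      done
      echo "File found"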

Note: There is a line in the script that has been commented out for next time - it's a way to log the trained model score to Prometheus... that's a story for another day.

Making the Scoring Pod Wait

Running the init container is achieved by adding an initContainers section to the pod spec for the deployment in the Helm Chart. This can be seen in Source\helm\modelsystem\templates\scoring_deployment.yaml

initContainers:
      - name: waiter-container
        image: jakkaj/modelwaiter:dev
        ...

With this in place, the scoring pod will wait until the file (complete.txt) is found in the correct path (as supplied by the MODELFOLDER environment variable).
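
A fuller sketch of the wiring that section needs - the waiter must see the same MODELFOLDER value and shared volume as the trainer (field values here are illustrative; the repository's scoring_deployment.yaml may differ):

initContainers:
  - name: waiter-container
    image: jakkaj/modelwaiter:dev
    env:
      - name: MODELFOLDER
        value: "{{ .Values.outputs.modelfolder }}"
    volumeMounts:
      - name: model-share              # hypothetical volume name; must match the pod's volumes section
        mountPath: {{ .Values.outputs.mountpath }}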

Run the build and download the Helm Chart and test it locally again. Remember to replace the image names with jakkaj/sampletrainer and jakkaj/samplescorer and the tag settings to dev.

Push the Wait Container to your Registry

It's a good idea to push the waiter container to your own registry. You can either build the container from the Dockerfile, or you can:

docker pull jakkaj/modelwaiter:dev
docker tag jakkaj/modelwaiter:dev <youracr>.azurecr.io/modelwaiter:dev
docker push <youracr>.azurecr.io/modelwaiter:dev

That will re-tag the image with the naming convention required to upload it to your own container registry. Remember to update the image in the initContainers section to this new image location.

Logs for the Wait

When testing your Helm Chart, you'll need to see if the waiter is working, and ensure that the scoring pod eventually comes up.

Get all pods with this command. The -a switch ensures stopped pods are included, as the training job may have ended and exited.

kubectl get pods -a
NAME                                                              READY     STATUS      RESTARTS   AGE
frontend-service-5787b48745-lzw6p                                 1/1       Running     0          1d
frontend-service-5787b48745-tfms4                                 1/1       Running     0          1d
scoring-deployment-152-65fc95dc48-l4wld                           1/1       Running     0          5m
scoring-deployment-20180523.5-1e8b7415-7c2e-e411-80bb-0015qj9ps   1/1       Running     0          1d
scoring-deployment-20180523.5-1e8b7415-7c2e-e411-80bb-0015rqdht   1/1       Running     0          1d
scoring-deployment-20180523.7-1e8b7415-7c2e-e411-80bb-0015g2kd8   1/1       Running     0          1d
scoring-deployment-20180523.7-1e8b7415-7c2e-e411-80bb-0015t8w7c   1/1       Running     0          1d
trainerjob-20180511.1-th628                                       0/1       Completed   0          13d
trainerjob-20180514.1-jgg8s                                       0/1       Completed   0          10
trainingjob-152-mh7kk                                             0/1       Completed   0          5m

Now view the logs for the training job:

kubectl logs trainingjob-152-mh7kk
Taking some time to render a pretend model :)
Model Folder: /mnt/azure/152
Build Number: 152
Wrote: /mnt/azure/152/complete.txt
Random Score: 0.6716458949225429

Now log the scoring side. First show the logs for the init container with the -c switch.

kubectl logs scoring-deployment-152-65fc95dc48-l4wld -c waiter-container
File not found! /mnt/azure/152
File not found! /mnt/azure/152
File not found! /mnt/azure/152
File not found! /mnt/azure/152
File not found! /mnt/azure/152
File found

You can see the container is waiting for the file to be written. It then finds it and exits.

Now view the logs for the scoring site itself:

kubectl logs scoring-deployment-152-65fc95dc48-l4wld
Example app listening on port 3000!

That's it, the system waited, and now it's hosting!

Deployment

With manual testing of the build completed, it's time to release it via proper means.

Next -> Setting up the release.

Links

Microsoft Stuff

Kubernetes and Container Stuff

Next

Release
