An example of how to apply DevOps concepts like CI/CD to machine learning.
The repo shows basic concepts, such as:
- Saving and loading a trained model
- Versioning models
- scikit-learn pipelines to bundle preprocessing and modeling (see the sketch after this list)
- Serving a model from a web service
- Hosting the web service on Heroku
- Pull request / review concepts, like reviewing a model build, reviewing service changes, and basic testing for the service
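To make the first three bullets concrete, here's a minimal sketch of the pattern, assuming a simple text classifier; the estimator choices, toy data, and `model.joblib` file name are illustrative, not the repo's actual training code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import joblib

# Bundle preprocessing and modeling so serving can't drift from training.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

texts = ["good movie", "bad movie"]  # placeholder training data
labels = [1, 0]
pipeline.fit(texts, labels)

# Save the whole pipeline as one artifact; version this file (e.g. with DVC).
joblib.dump(pipeline, "model.joblib")

# In the service, load the same artifact and predict on raw inputs.
model = joblib.load("model.joblib")
print(model.predict(["great movie"]))
```

Because the vectorizer and model travel together in one artifact, the serving side can't accidentally preprocess features differently than training did.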
The same patterns could work with Jenkins instead of GitHub Actions, or with AWS instead of Heroku.
This repo uses DVC for data and model versioning, in contrast to the other one, which uses git-lfs.
You change:
- Code or data under `training/`
Before merging, automation handles:
- Training the model on the full data, which verifies that it works
- Evaluating the model
- Updating the dependencies for the service
- Testing that it works in the service
After merging, automation handles:
- Deploying the new model
You change:
- Code and data under `training/`
- Code under `serving/`
Before merging, automation handles:
- Training the model on the full data, which verifies that it works
- Evaluating the model
- Updating the dependencies for the service
- Testing that it works in the service
After merging, automation handles:
- Deploying the new model
One example is that we want human-readable labels for the model's output. The output labels might even have categories or URLs that point to definitions. In this case, you'd maintain your metadata under `serving/data`, possibly as a JSON file.
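For instance, a hypothetical `serving/data/labels.json` could map raw class IDs to display names and definition URLs, and the service could load it at startup; the file name and fields here are illustrative, not part of the repo:

```python
import json

# Hypothetical metadata file, e.g.:
# {"0": {"name": "negative", "url": "https://example.com/negative"},
#  "1": {"name": "positive", "url": "https://example.com/positive"}}
with open("serving/data/labels.json") as f:
    LABELS = json.load(f)

def to_display_label(class_id):
    """Map a raw model output to its human-readable metadata."""
    return LABELS[str(class_id)]
```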
You change:
- Code and data under `serving/`
Before merging, automation handles:
- Testing that it works (see the test sketch below)
After merging, automation handles:
- Deploying the new service
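To give a flavor of the "testing that it works" steps, here's a hypothetical smoke test using Flask's test client; the import path, endpoint, and response shape are assumptions, not the repo's actual test suite:

```python
# test_service.py -- hypothetical smoke test for the prediction endpoint.
# Adjust the import path and route to match the actual service under serving/.
from serving.app import app  # assumed Flask app object

def test_predict_returns_a_label():
    client = app.test_client()
    response = client.post("/predict", json={"text": "a test input"})
    assert response.status_code == 200
    assert "label" in response.get_json()
```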
- All code has been tested and reviewed before reaching production
- All models have been tested and reviewed before reaching production
- Code and data are versioned together
- The pipeline to train a model is versioned, including data sources
- The environment needed to train a model is versioned and reviewed like the rest of the code
I've seen these kinds of issues happen in industry:
- Training code issues
  - The code works on one person's computer but nobody else's
  - The code isn't held to the same standards as production code, so it can be very messy
  - The code might not even be in version control and might only exist on one person's computer
  - Some degree of manual review is needed for evaluation due to changes in the way it's evaluated, such as the testing data or error handling in any metrics
- Coordination issues
  - The service broke because it's using the wrong version of a dependency
  - The service runs, but the predictions are poor quality due to differences in feature preprocessing between training and serving
  - Model X is only compatible with code versions Y-Z, so we have to maintain a compatibility table because we need to continue supporting old code versions
  - The production code was a re-implementation of the training code and implemented some things wrong
Issues I've heard about:
- Someone reverted the new code, but that didn't revert the new model, and the old code doesn't work with the new model, so we can't revert; we just need to fix the bug ASAP. (The same thing can happen the other way around.)
- Someone deployed a model update to the database, and it went to production without any tests or verification
You'll need to set up your GitHub Actions secrets for the Heroku configuration. `deploy_service.yml` needs two secrets:
- `HEROKU_API_KEY`: You can create a key by running `heroku authorizations:create`
- `HEROKU_APP`: You can create an app by running `heroku create`. This is the app name, like "dancing-bear-1234"
Also:
- You'll need to create an IAM access key and secret and put them in your GitHub secrets.
- Create the S3 bucket that you want to use. You can go into the infrastructure directory, edit the bucket prefix, and run `terraform init` then `terraform apply`. It'll show the name of the bucket that was created. Then you'll need to configure that bucket as your DVC remote, either via the command line or `.dvc/config`.
- Remove the old way of tracking metrics
- The Swagger/OpenAPI documentation site doesn't work on Heroku because of the HTTPS proxy. This might fix it: https://werkzeug.palletsprojects.com/en/1.0.x/middleware/proxy_fix/
- The training data download might be hitting scikit-learn's servers too much. Switch to DVC-tracked data to protect their servers and provide a more realistic example
- What's the max size of downloaded resources, like word vectors, on GitHub Actions?
- What's the max Docker image size for Heroku?
- GitHub Actions jobs are limited to 6 hours of runtime. If you have a bigger training job, you'd want to set up a short-lived, high-powered instance in the cloud, such as AWS/Azure/GCP/etc.
- Heroku free tier limits you to 512MB of RAM
- Heroku free tier goes to sleep automatically and doesn't handle autoscaling, so you wouldn't want to use this for production deployments
- Dev/staging/prod environments
- All the nice things on the serving side: blue/green deployments, autoscaling, authentication, monitoring, logging
- Test mode for the training code
- Example output for the training code, ELI5 explanations, or similar
- API versioning
- OpenAPI/Swagger documentation
- Many standard practices, like:
  - Auto-formatter, linter
If you work with sensitive data, you may also need to take care not to leak anything sensitive into your model, such as n-grams containing personal information. You may also need to swap out GitHub Actions and Heroku for platforms that comply with your requirements.
You can add a GitHub Action on a schedule for this!
- DevOps
- MLOps
- Switch from Heroku to AWS ECR + ECS
- Switch the model from scikit-learn to TensorFlow, PyTorch, or Keras. Bonus points for transfer learning with pretrained embeddings or models.
- Fix some of the limitations
- Switch serving from Flask + Docker + Heroku to Cortex.ai + AWS
- Switch serving from Flask + Docker + Heroku to AWS Lambda + API Gateway + SAM
- Add an endpoint for batch processing of predictions (see the sketch after this list)
- Have custom code for your model that needs to be imported in both training and serving, for example a custom tokenizer
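As a starting point for the batch-processing idea, here's a hypothetical Flask endpoint; the route, payload shape, and artifact name are assumptions, not existing code in the repo:

```python
from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)
model = joblib.load("model.joblib")  # assumed pipeline artifact

@app.route("/predict/batch", methods=["POST"])
def predict_batch():
    # Expects a payload like {"texts": ["first input", "second input"]}
    texts = request.get_json()["texts"]
    predictions = model.predict(texts)
    return jsonify({"predictions": [int(p) for p in predictions]})
```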