Guidance for Low Latency, High Throughput Inference using Efficient Compute on Amazon EKS

The guidance-for-machine-learning-inference-on-aws repository contains an end-to-end automation framework example for running model inference locally on Docker or at scale on an Amazon EKS cluster. It supports EKS compute nodes based on CPU, GPU, AWS Graviton and AWS Inferentia processor architectures and can pack multiple models in a single processor core for improved cost efficiency. While this example focuses on one processor architecture at a time, iterating over the steps below for various CPU/GPU Efficient Compute and Inferentia architectures enables hybrid deployments where the best processor or accelerator serves each model depending on its resource consumption profile. This sample repository uses a bert-base NLP model from huggingface.co; however, the project structure and workflow are generic and can be adapted for use with other models.

Fig. 1 - Sample Amazon EKS cluster infrastructure and the deployment, running, and testing of ML inference workloads

The ML inference workloads in this project are deployed on CPU, GPU, or Inferentia based EKS compute nodes, as shown in Fig. 1. The control scripts may run in any location that has full access to the cluster Kubernetes API. To eliminate latency concerns related to the EKS cluster ingress, load tests run in pods deployed within the same cluster and send requests to the models directly through the cluster pod network.

The Amazon EKS cluster has several node groups, with one Amazon EC2 instance family for each node group. Each node group can support different instance types, such as CPU (C5, C6i, C7gn), GPU (G4dn), and AWS Inferentia (Inf1, Inf2), and can pack multiple models on each EKS node to maximize the number of served ML models running in a node group. Model bin packing is used to maximize compute and memory utilization of the Amazon EC2 instances in the cluster node groups.
The open-source natural language processing (NLP) PyTorch model from Hugging Face, the serving application, and the ML framework dependencies are built by users as container images using an automation framework. These images are uploaded to Amazon Elastic Container Registry (Amazon ECR).
Using the automation framework, the model container images are pulled from Amazon ECR and deployed to an Amazon EKS cluster using generated deployment and service manifests through the Kubernetes API (exposed through Elastic Load Balancing (ELB)). Model deployments are customized for each target EKS compute node instance type through settings in the central configuration file.
Following the best practice of separating model data from the containers that run it, the ML model microservice design can scale out to a large number of models. In the sample project, model containers pull data from Amazon Simple Storage Service (Amazon S3) and other public model data sources each time they are initialized.
Using the automation framework, the test container images are deployed to an Amazon EKS cluster using generated deployment and service manifests through the Kubernetes API. Test deployments are customized for each deployment target EKS compute node instance type through settings in the central configuration file. Load or scale testing is performed by sending simultaneous requests to the model service pool from test pods. Performance test results and metrics are obtained, recorded, and aggregated.

Fig. 2 - ML Inference video walkthrough

Please watch this end-to-end accelerated video walkthrough (7 min) or follow the instructions below to build and run your own inference solution.

Prerequisites

This sample can be run on a single machine using Docker, or at scale on an Amazon EKS cluster.

It is assumed that the following basic tools are present: docker, kubectl, envsubst, kubetail, bc.
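As a quick sanity check before starting, the tool list above can be verified with a few lines of shell. This is a minimal sketch and not part of the repository; the function name `missing_tools` is hypothetical, and it only checks that each tool is on the PATH, not its version.

```shell
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools listed in the Prerequisites section.
required="docker kubectl envsubst kubetail bc"

# $required is intentionally unquoted so it splits into separate names.
missing=$(missing_tools $required)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All prerequisite tools found."
fi
```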

Operation

The project is operated through a set of action scripts as described below. To complete a full cycle from beginning to end, first configure the project, then follow steps 1 through 5, executing the corresponding action scripts. Each of the action scripts has a help screen, which can be invoked by passing "help" as an argument: <script>.sh help
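The five numbered steps can be chained so that a failure stops the cycle early. The sketch below is an illustration only: the function name `run_cycle` is hypothetical, and it assumes that each action script exits with a non-zero status on failure and runs non-interactively when called without arguments. It does not push images, which the steps below describe as required for Kubernetes deployments.

```shell
# Run the numbered action scripts in order from the given project directory,
# stopping at the first one that fails.
# The body runs in a subshell, so the caller's working directory is unchanged.
run_cycle() (
    project_dir=${1:-.}
    cd "$project_dir" || exit 1
    for step in build trace pack deploy test; do
        echo "==> ./$step.sh"
        "./$step.sh" || { echo "step '$step' failed" >&2; exit 1; }
    done
)
```

From the repository root, `run_cycle .` would then execute ./build.sh through ./test.sh in sequence. Keep in mind that the trace step has to run on an instance with the target processor available.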

Optional - Provision an EKS cluster with 3 node groups

To provision this "opinionated" EKS cluster infrastructure optimized for running this guidance, run the ./provision.sh script. Alternatively, you can use an existing EKS cluster or provision a new one using one of the Terraform EKS blueprints that contains node groups of the desired target instance types.

./provision.sh

This command executes a script that creates a CloudFormation stack, which deploys an EC2 "management" instance in your default AWS region. That instance contains a userData script that provisions an EKS cluster in the us-west-2 region, pre-defined per specification based on the following template which is a part of another Git repo project. After the EKS cluster is provisioned, it is fully accessible from the EC2 "management" instance, and this repository is copied there as well, ready for the next steps.

Configure

./config.sh

A centralized configuration file, config.properties, contains all settings that are customizable for the project. This file comes pre-configured with reasonable defaults that work out of the box. To set the processor target or any other setting, edit the config file or execute the config.sh script. Configuration changes take effect immediately upon execution of the next action script.
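To inspect a single setting without opening the file, a small helper can print its current value. This is a sketch under the assumption that config.properties consists of plain KEY=VALUE lines; the helper name `get_prop` is hypothetical, and the key name `processor` used in the usage line is also an assumption made for illustration.

```shell
# Print the value of a key from a KEY=VALUE properties file.
# Lines starting with '#' never match; if a key appears more than once,
# the last assignment wins.
get_prop() {
    # $1 = properties file, $2 = key
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}
```

For example, `get_prop config.properties processor` would print the configured processor target if such a key exists, and nothing otherwise.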

1. Build

./build.sh

This step builds a base container for the selected processor. A base container is required for any of the subsequent steps. This step can be executed on any instance type, regardless of processor target.

Optionally, if you'd like to push the base image to a container registry, execute ./build.sh push. Pushing the base image to a container registry is required if you are planning to run the test step against models deployed to Kubernetes. If you are using a private registry and you need to log in before pushing, execute ./login.sh. This script logs in to Amazon ECR; other private registry implementations can be added to the script as needed.

2. Trace

./trace.sh

Compiles the model into a TorchScript serialized graph file (.pt). This step requires the model to run on the target processor. Therefore it is necessary to run this step on an instance that has the target processor available.

Upon successful compilation, the model will be saved in a local folder named trace-{model_name}.
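Before moving on to the pack step, it can be useful to confirm that the trace output exists. The following sketch assumes that the .pt file is written somewhere inside the trace-{model_name} folder; the function name `traced_model_files` is hypothetical, and the model name passed to it is whatever name your configuration uses.

```shell
# Succeed if trace-<model_name> under base_dir holds at least one .pt file,
# and print the files found. Runs in a subshell so 'exit' only ends the check.
traced_model_files() (
    model_name=$1
    base_dir=${2:-.}
    trace_dir="$base_dir/trace-$model_name"
    [ -d "$trace_dir" ] || { echo "no folder $trace_dir" >&2; exit 1; }
    found=$(find "$trace_dir" -type f -name '*.pt')
    [ -n "$found" ] || { echo "no .pt files in $trace_dir" >&2; exit 1; }
    echo "$found"
)
```

A call such as `traced_model_files <model_name>` from the project root returns a non-zero status when the folder or the serialized model is missing.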

Note

It is recommended to use the AWS Deep Learning AMI to launch the instance where your model will be traced.

To trace a model for GPU, run the trace step on a GPU instance launched with the AWS DLAMI.
To trace a model for Inferentia, run the trace step on an Inferentia instance launched with the AWS DLAMI with Neuron and activate the Neuron compiler conda environment.

3. Pack

./pack.sh

Packs the model in a container with FastAPI, also allowing for multiple models to be packed within the same container. FastAPI is used here as an example for simplicity and performance; however, it can be replaced with any other model server. For the purpose of this project, we pack several instances of the same model in the container; a natural extension of the same concept is to pack different models in the same container.

To push the model container image to a registry, execute ./pack.sh push. The model container must be pushed to a registry if you are deploying your models to Kubernetes.

4. Deploy

./deploy.sh

This script runs your models on the configured runtime. The project has built-in support for both local Docker runtimes and Kubernetes. The deploy script also has several sub-commands that facilitate the management of the full lifecycle of your model server containers.

./deploy.sh run - (default) runs model server containers
./deploy.sh status [number] - show container / pod / service status. Optionally show only specified instance number
./deploy.sh logs [number] - tail container logs. Optionally tail only specified instance number
./deploy.sh exec <number> - open bash into model server container with the specified instance number
./deploy.sh stop - stop and remove deployed model containers from runtime

5. Test

./test.sh

The test script helps run a number of tests against the model servers deployed in your runtime environment.

./test.sh build - build test container image
./test.sh push - push test image to container registry
./test.sh pull - pull the current test image from the container registry if one exists
./test.sh run - run a test client container instance for advanced testing and exploration
./test.sh exec - open shell in test container
./test.sh status - show status of test container
./test.sh stop - stop test container
./test.sh help - list the available test commands
./test.sh run seq - run sequential test. One request at a time submitted to each model server and model in sequential order.
./test.sh run rnd - run random test. One request at a time submitted to a randomly selected server and model at a preset frequency.
./test.sh run bmk - run benchmark test client to measure throughput and latency under load with random requests
./test.sh run bma - run benchmark analysis - aggregate and average stats from logs of all completed benchmark containers
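The sub-commands above can be combined into a benchmark session. The sketch below shows one possible ordering and is an assumption, not a sequence prescribed by the project; the helper names `run` and `benchmark_session` and the `DRY_RUN` variable are hypothetical. By default it only prints the commands, so it is safe to try outside the repository.

```shell
# Echo commands when DRY_RUN=1 (the default here); execute them otherwise.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One possible benchmark session, built only from the sub-commands listed above.
benchmark_session() {
    run ./test.sh build      # build the test container image
    run ./test.sh push       # push it to the container registry
    run ./test.sh run bmk    # measure throughput and latency under load
    run ./test.sh run bma    # aggregate stats from completed benchmark logs
    run ./test.sh stop       # stop the test containers
}

benchmark_session
```

Setting `DRY_RUN=0` would execute the commands instead of printing them. Because the analysis step works on logs of completed benchmark containers, in practice you may need to wait for the benchmark to finish before running it.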

Clean up

You can uninstall the sample code for this Guidance using the AWS Command Line Interface. You must also delete the EKS cluster if it was deployed using references from this Guidance, since removal of the scale testing framework does not automatically delete the cluster and its resources.

To stop or uninstall scale inference test job(s), run the following command:

./test.sh stop

It should delete all scale test pods and jobs from the specified EKS K8s namespace.

To stop or uninstall inference model services, run the following command:

./deploy.sh stop

It should delete all model deployments, pods, and services from the specified EKS K8s namespace.

If you provisioned an EKS cluster when setting up your prerequisites for the project as described in the "Optional - Provision an EKS cluster with 3 node groups" section above, you can clean up the cluster and all resources associated with it by running this script:

./remove.sh

It should delete the EKS cluster compute node groups first, then the IAM service account used in that cluster, then the cluster itself and, finally, the ManagementInstance EC2 instance, each through its corresponding CloudFormation stack. You may sometimes need to run the command a few times because individual stack deletion commands can time out; rerunning it should not create any problem.
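Rerunning the removal by hand can be replaced with a small retry loop. This is a generic sketch, not part of the repository: the function name `retry` and its arguments are hypothetical, and it assumes that ./remove.sh exits with a non-zero status when a stack deletion times out.

```shell
# Run a command up to max_attempts times, pausing between attempts,
# and succeed as soon as one attempt exits with status 0.
retry() {
    max_attempts=$1
    delay_seconds=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay_seconds}s" >&2
        sleep "$delay_seconds"
        attempt=$((attempt + 1))
    done
}
```

For example, `retry 3 60 ./remove.sh` would try the removal up to three times with a one-minute pause between attempts.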
