This repository provides a working example of how NGINX Plus Ingress Controller can provide secure external access, as well as load balancing, to a Kubernetes-hosted NVIDIA Triton Inference Server cluster. The repository is based on forks of both the NVIDIA Triton Inference Server repo and the NGINX Plus Ingress Controller repo. It includes a Helm chart along with instructions for installing a scalable NVIDIA Triton Inference Server and NGINX+ Ingress Controller in an on-premises or cloud-based Kubernetes cluster.
This guide assumes you already have Helm installed (see Installing Helm for instructions). For more information on Helm and Helm charts, visit the Helm documentation. Please note the following requirements:
- The Triton server requires access to a model repository via an external NFS server. If you already have an NFS server to host the model repository, you may use that with this Helm chart. If you do not, an NFS server (k8s manifest) is included which may be deployed and loaded with the included model repository.
- To deploy Prometheus and Grafana to collect and display Triton metrics, your cluster must contain sufficient CPU resources to support these services.
- Triton-server works with both CPUs and GPUs. To use GPUs for inferencing, your cluster must be configured to contain the desired number of GPU nodes, with support for the NVIDIA driver and CUDA version required by the version of the inference server you are using.
- To enable autoscaling, your cluster's kube-apiserver must have the aggregation layer enabled. This allows the horizontal pod autoscaler to read custom metrics from the prometheus adapter (see the check below).
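If you are unsure whether aggregated APIs are available in your cluster, you can list the registered APIService resources; this is a general Kubernetes check, not something specific to this chart:

# List the APIService resources registered with the kube-apiserver.
# Aggregated APIs, such as the custom metrics API added by the
# prometheus-adapter in a later step (v1beta1.custom.metrics.k8s.io),
# will appear in this list once installed.
kubectl get apiservices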
Want to get a feel for it before putting hands to keys? Here is a Deployment Walkthrough Video.
First, clone this repository to a local machine.
git clone https://github.com/f5devcentral/Triton-Server-NGINX-Plus-Ingress-Controller.git
cd Triton-Server-NGINX-Plus-Ingress-Controller
You will need your NGINX Ingress Controller subscription JWT token to pull the NGINX Plus Ingress Controller image. Create a secret that will be referenced by the NGINX Ingress Controller deployment, allowing the image to be accessed and pulled automatically.
kubectl create secret docker-registry regcred --docker-server=private-registry.nginx.com --docker-username=<JWT Token> --docker-password=none [-n nginx-ingress]
kubectl create secret tls tls-secret --cert=<path/to/tls.cert> --key=<path/to/tls.key>
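If you do not already have a certificate and key for the TLS secret above, a self-signed pair can be generated for testing. The hostname below is only an example, matching the FQDN used later in this guide; replace it with your own service FQDN.

# Generate a self-signed certificate and key for testing purposes only.
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout tls.key -out tls.crt \
  -subj "/CN=triton-http.f5demo.net"

The resulting tls.crt and tls.key files can then be passed to the kubectl create secret tls command above.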
Triton Server needs a repository of models that it will make available for inferencing. If you already have a model repository, you may use that with this Helm chart. If you do not, you can use the local copy located at /model_repository in this repo to create an example model repository.
For this example, we are using an existing NFS server and placing our model files there. Copy the local model_repository directory onto your NFS server. Then, add the URL or IP address of your NFS server and the server path of your model repository to deploy/values.yaml (a sketch is shown below).
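As a rough illustration, the relevant entries in deploy/values.yaml might look like the following. Only modelRepositoryServer is mentioned elsewhere in this guide; the path key name and the example address are assumptions, so check the comments in the chart's values file for the authoritative names.

# deploy/values.yaml (illustrative excerpt)
modelRepositoryServer: 10.0.194.248   # URL or IP address of your NFS server
modelRepositoryPath: /                # path to the model repository on that server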
If you do not have an NFS server currently available, you can deploy the included NFS server (k8s manifest) and load it with the included model repository.
cd Triton-Server-NGINX-Plus-Ingress-Controller
kubectl apply -f nfs-server.yaml
Connect to the NFS server pod, clone the repo onto the container and move the model_repository directory.
kubectl exec <nfs-server POD name> --stdin --tty -- /bin/bash
yum install git wget -y
git clone https://github.com/f5devcentral/Triton-Server-NGINX-Plus-Ingress-Controller.git
mv /Triton-Server-NGINX-Plus-Ingress-Controller/model_repository /exports
exit
The inference server metrics are collected by Prometheus and viewable through Grafana. The inference server Helm chart assumes that Prometheus and Grafana are available so this step must be followed even if you do not want to use Grafana.
Use the kube-prometheus-stack Helm chart to install these components. The serviceMonitorSelectorNilUsesHelmValues flag is needed so that Prometheus can find the inference server metrics in the example release deployed in a later section.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install example-metrics --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false prometheus-community/kube-prometheus-stack
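As an optional sanity check, confirm that the Prometheus and Grafana pods from the example-metrics release are running before moving on:

# Pods created by the kube-prometheus-stack release should reach Running state.
kubectl get pods | grep example-metrics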
To enable autoscaling, ensure that the autoscaling tag in deploy/values.yaml is set to true.
This will do two things:
- Deploy a Horizontal Pod Autoscaler that will scale replicas of the triton-inference-server based on the information included in deploy/values.yaml.
- Install the prometheus-adapter helm chart, allowing the Horizontal Pod Autoscaler to scale based on custom metrics from prometheus.
The included configuration will scale Triton pods based on the average queue time, as described in this blog post. To customize this, you may replace or add to the list of custom rules in deploy/values.yaml. If you change the custom metric, be sure to change the values in autoscaling.metrics as well (a sketch of these sections follows below).
If autoscaling is disabled, the number of Triton server pods is set to the minReplicas variable in deploy/values.yaml.
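As a rough illustration only (not the chart's exact schema), the autoscaling-related sections of deploy/values.yaml could look something like this. The key names, thresholds, and the prometheus-adapter rule below are assumptions based on Triton's standard Prometheus metrics (nv_inference_queue_duration_us and nv_inference_request_success); use the shipped values file as the source of truth.

# deploy/values.yaml (illustrative excerpt; key names and values are assumptions)
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Pods
      pods:
        metric:
          name: avg_time_queue_us        # custom metric served by the prometheus-adapter
        target:
          type: AverageValue
          averageValue: 50               # target average queue time per pod, in microseconds

prometheus-adapter:
  rules:
    custom:
      # Average queue time per successful inference request over the last 30s.
      - seriesQuery: 'nv_inference_queue_duration_us{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "nv_inference_queue_duration_us"
          as: "avg_time_queue_us"
        metricsQuery: 'avg(delta(nv_inference_queue_duration_us{<<.LabelMatchers>>}[30s])/(1+delta(nv_inference_request_success{<<.LabelMatchers>>}[30s]))) by (<<.GroupBy>>)'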
Before deploying the inference server and NGINX+ Ingress Controller, update deploy/values.yaml, specifying your modelRepositoryServer IP and path (default is '/'), service FQDNs, and autoscaling preference (see below).
With the deploy/values.yaml file updated, you are ready to deploy the Helm chart. Deploy the inference server and NGINX Plus Ingress Controller using the default configuration with the following commands. Here, and in the following commands, we use the name mytest for our chart; this name will be prepended to all resources created during the helm installation.
cd <directory containing Chart.yaml>/deploy
helm install mytest .
Use kubectl to see status and wait until the inference server pods are running.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
mytest-triton-inference-server-5f74b55885-n6lt7 1/1 Running 0 2m21s
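If autoscaling is enabled, you can also confirm that the Horizontal Pod Autoscaler was created; the exact resource name depends on your release name, so the simplest check is to list all HPAs:

# List Horizontal Pod Autoscalers in the current namespace.
kubectl get hpa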
Now that the inference server is running, you can send HTTP or gRPC requests to it to perform inferencing.
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.0.0.1 <none> 443/TCP 10d
mytest-nginx-ingress-controller LoadBalancer 10.0.179.216 20.252.89.78 80:31336/TCP,443:31862/TCP 39m
mytest-triton-inference-server ClusterIP 10.0.231.100 <none> 8000/TCP,8001/TCP,8002/TCP 39m
mytest-triton-inference-server-metrics ClusterIP 10.0.21.98 <none> 8080/TCP 39m
nfs-service ClusterIP 10.0.194.248 <none> 2049/TCP,20048/TCP,111/TCP 123m...
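As a quick connectivity check, Triton's standard HTTP readiness endpoint can be queried through the ingress, assuming your DNS (or hosts file) resolves the ingress FQDN to the LoadBalancer external IP shown above. The hostname here is the example FQDN used later in this guide.

# Query Triton's readiness endpoint through the NGINX Plus ingress.
# -k skips certificate verification, which is useful with a self-signed certificate.
curl -ik https://triton-http.f5demo.net/v2/health/ready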
Enable port forwarding from the Grafana service so you can access it from your local browser.
kubectl port-forward service/example-metrics-grafana 8088:80
Now you should be able to navigate in your browser to 127.0.0.1:8088 and see the Grafana login page. Use username=admin and password=prom-operator to log in.
An example Grafana dashboard (dashboard.json) is available in the repo. Use the import function in Grafana to import and view this dashboard (see below).
Enable port forwarding from the NGINX Ingress Controller pod to view service access metrics.
kubectl port-forward <NGINX ingress controller pod name> 8080:8080
The NGINX+ dashboard can be reached at 127.0.0.1:8080/dashboard.html (see below).
If the included sample models are loaded, you can test connectivity to the Triton Inference server(s) by running the included simple_http_infer_client.py python script. After running the script a few times, you can return to the NGINX+ and Grafana dashboards to monitor.
python3 simple_http_infer_client.py -u <triton server http URL> --ssl --insecure
Example: python3 simple_http_infer_client.py -u triton-http.f5demo.net --ssl --insecure
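To generate enough traffic to make the dashboards interesting, you can simply run the client in a loop; the FQDN is the same example hostname as above.

# Send 100 inference requests in a row to generate load for the dashboards.
for i in $(seq 1 100); do
  python3 simple_http_infer_client.py -u triton-http.f5demo.net --ssl --insecure
done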
After you have finished using the inference server, you can use Helm and kubectl to delete the deployment.
helm list
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
example-metrics default 1 2024-04-15 18:56:42.479571 -0700 PDT deployed kube-prometheus-stack-58.1.3 v0.73.
mytest default 1 2024-04-15 19:01:31.772857 -0700 PDT deployed triton-inference-server-1.0.0 1.0
helm uninstall example-metrics mytest
kubectl delete -f nfs-server.yaml
kubectl delete secret tls-secret regcred