
Releases: triton-inference-server/server

Release 1.11.0, corresponding to NGC container 20.02

26 Feb 21:47

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.11.0

  • The TensorRT backend is improved to have significantly better performance. Improvements include reducing thread contention, using pinned memory for faster CPU<->GPU transfers, and increasing compute and memory copy overlap on GPUs.
  • Reduced memory usage of TensorRT models in many cases by sharing weights across multiple model instances.
  • Boolean data-type and shape tensors are now supported for TensorRT models.
  • A new model configuration option allows the dynamic batcher to create “ragged” batches for custom backend models. A ragged batch is a batch where one or more of the input/output tensors have different shapes in different batch entries.
  • Local S3 storage endpoints are now supported for model repositories. A local S3 endpoint is specified as 's3://host:port/path/to/repository' (a launch sketch follows this list).
  • The Helm chart showing an example Kubernetes deployment is updated to include Prometheus and Grafana support so that inference server metrics can be collected and visualized.
  • The inference server container no longer sets LD_LIBRARY_PATH; instead, the server uses RUNPATH to locate its shared libraries.
  • Python 2 has reached end-of-life, so all Python 2 support has been removed. Python 3 is still supported.
  • Ubuntu 18.04 with January 2020 updates
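
As a rough illustration of the local S3 endpoint support, the sketch below launches the server against a repository hosted on a local S3 endpoint. The --model-repository flag and the AWS_* credential environment variables are assumptions based on common usage and are not taken from these notes; verify them against the server documentation for this release.

    import os
    import subprocess

    # Credentials for the local S3 endpoint. Assumption: the server reads the
    # standard AWS environment variables; adjust for your deployment.
    env = dict(os.environ,
               AWS_ACCESS_KEY_ID="minioadmin",
               AWS_SECRET_ACCESS_KEY="minioadmin",
               AWS_DEFAULT_REGION="us-east-1")

    # Point the server at a repository on a local S3 endpoint using the
    # 's3://host:port/path/to/repository' form described above. The
    # --model-repository flag is assumed; check `trtserver --help`.
    subprocess.run(
        ["trtserver", "--model-repository=s3://localhost:9000/models"],
        env=env,
        check=True,
    )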

Known Issues

  • TensorRT reformat-free I/O is not supported.
  • Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.11.0_ubuntu1604.clients.tar.gz and v1.11.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.11.0_ubuntu1604.custombackend.tar.gz and v1.11.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

Release 1.10.0, corresponding to NGC container 20.01

28 Jan 17:56

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.10.0

  • Server status can be requested in JSON format using the HTTP/REST API. Use endpoint /api/status?format=json (see the sketch after this list).
  • The dynamic batcher now has an option to preserve the ordering of batched requests when there are multiple model instances. See model_config.proto for more information.
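
A minimal sketch of the JSON status request, assuming the server's HTTP endpoint is reachable on the default port 8000 (adjust host and port for your deployment):

    import json
    import requests

    # Request server status in JSON format via the HTTP/REST API.
    response = requests.get("http://localhost:8000/api/status",
                            params={"format": "json"})
    response.raise_for_status()

    # Pretty-print the returned status document.
    print(json.dumps(response.json(), indent=2))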

Known Issues

  • TensorRT reformat-free I/O is not supported.
  • Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.10.0_ubuntu1604.clients.tar.gz and v1.10.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.10.0_ubuntu1604.custombackend.tar.gz and v1.10.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

Release 1.9.0, corresponding to NGC container 19.12

21 Dec 01:24

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.9.0

  • The model configuration now includes a model warmup option. This option provides the ability to tune and optimize the model before inference requests are received, avoiding initial inference delays. This option is especially useful for frameworks like TensorFlow that perform network optimization in response to the initial inference requests. Models can be warmed up with one or more synthetic or realistic workloads before they become ready in the server (a configuration sketch follows this list).
  • An enhanced sequence batcher now has multiple scheduling strategies. A new Oldest strategy integrates with the dynamic batcher to enable improved inference performance for models that don’t require all inference requests in a sequence to be routed to the same batch slot.
  • The perf_client now has an option to generate requests using a realistic Poisson distribution or a user-provided distribution.
  • A new repository API (available in the shared library API, HTTP, and GRPC) returns an index of all models available in the model repositories visible to the server. This index can be used to see what models are available for loading onto the server.
  • The server status returned by the server status API now includes the timestamp of the last inference request received for each model.
  • Inference server tracing capabilities are now documented in the Optimization section of the User Guide. Tracing support is enhanced to provide trace for ensembles and the contained models.
  • A community contributed Dockerfile is now available to build the TensorRT Inference Server clients on CentOS.
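
A minimal sketch of adding a warmup sample to a model's config.pbtxt follows. The field names (model_warmup, batch_size, zero_data) and the INPUT tensor name are assumptions based on later revisions of model_config.proto, not taken from these notes; verify them against the model_config.proto shipped with this release.

    # Hypothetical warmup stanza; field names are assumptions (see note above).
    warmup_config = """
    model_warmup [
      {
        name: "zero_value_warmup"
        batch_size: 1
        inputs {
          key: "INPUT"
          value {
            data_type: TYPE_FP32
            dims: [ 16 ]
            zero_data: true
          }
        }
      }
    ]
    """

    # Append the warmup section to an existing model configuration
    # (the repository path and model name are illustrative).
    with open("model_repository/mymodel/config.pbtxt", "a") as f:
        f.write(warmup_config)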

Known Issues

  • The beta of the custom backend API version 2 has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
    • The signature of the CustomGetNextInputV2Fn_t function adds the memory_type_id argument.
    • The signature of the CustomGetOutputV2Fn_t function adds the memory_type_id argument.
  • The beta of the inference server library API has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
    • The signature and operation of the TRTSERVER_ResponseAllocatorAllocFn_t function have changed. See src/core/trtserver.h for a description of the new behavior.
    • The signature of the TRTSERVER_InferenceRequestProviderSetInputData function adds the memory_type_id argument.
    • The signature of the TRTSERVER_InferenceResponseOutputData function adds the memory_type_id argument.
  • TensorRT reformat-free I/O is not supported.
  • Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.9.0_ubuntu1604.clients.tar.gz and v1.9.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.9.0_ubuntu1604.custombackend.tar.gz and v1.9.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

Release 1.8.0, corresponding to NGC container 19.11

27 Nov 18:13

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.8.0

  • Shared-memory support is expanded to include CUDA shared memory.
  • Improved efficiency of pinned memory used for ensemble models.
  • The perf_client application has been improved with easier-to-use
    command-line arguments (while maintaining compatibility with existing
    arguments).
  • Support for string tensors added to perf_client.
  • Documentation contains a new “Optimization” section discussing some common
    optimization strategies and how to use perf_client to explore these
    strategies.

Deprecated Features

  • The asynchronous inference API has been modified in the C++ and Python client libraries.
    • In the C++ library:
      • The non-callback version of the AsyncRun function was removed.
      • The GetReadyAsyncRequest function was removed.
      • The signature of the GetAsyncRunResults function was changed to remove the is_ready and wait arguments.
    • In the Python library:
      • The non-callback version of the async_run function was removed.
      • The get_ready_async_request function was removed.
      • The signature of the get_async_run_results function was changed to remove the wait argument.

Known Issues

  • The beta of the custom backend API version 2 has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
    • The signature of the CustomGetNextInputV2Fn_t function adds the memory_type_id argument.
    • The signature of the CustomGetOutputV2Fn_t function adds the memory_type_id argument.
  • The beta of the inference server library API has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
    • The signature and operation of the TRTSERVER_ResponseAllocatorAllocFn_t function have changed. See src/core/trtserver.h for a description of the new behavior.
    • The signature of the TRTSERVER_InferenceRequestProviderSetInputData function adds the memory_type_id argument.
    • The signature of the TRTSERVER_InferenceResponseOutputData function adds the memory_type_id argument.
  • TensorRT reformat-free I/O is not supported.
  • Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.8.0_ubuntu1604.clients.tar.gz and v1.8.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.8.0_ubuntu1604.custombackend.tar.gz and v1.8.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

Release 1.7.0, corresponding to NGC container 19.10

30 Oct 00:03

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.7.0

  • A Client SDK container is now provided on NGC in addition to the inference server container. The client SDK container includes the client libraries and examples.

  • TensorRT optimization may now be enabled for any TensorFlow model by enabling the feature in the optimization section of the model configuration (a configuration sketch follows this list).

  • The ONNXRuntime backend now includes the TensorRT and OpenVINO execution providers. These providers are enabled in the optimization section of the model configuration.

  • Automatic configuration generation (--strict-model-config=false) now works correctly for TensorRT models with variable-sized inputs and/or outputs.

  • Multiple model repositories may now be specified on the command line. Optional command-line options can be used to explicitly load specific models from each repository.

  • Ensemble models are now pruned dynamically so that only models needed to calculate the requested outputs are executed.

  • The example clients now include a simple Go example that uses the GRPC API.
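
As a rough sketch of enabling TensorRT optimization for a TensorFlow model, the optimization stanza below uses the execution-accelerator form found in later versions of model_config.proto; the exact field names for this release are an assumption and should be checked against the shipped model_config.proto.

    # Hypothetical optimization stanza for a TensorFlow model; field names
    # are assumptions (see note above).
    trt_optimization = """
    optimization {
      execution_accelerators {
        gpu_execution_accelerator : [ { name : "tensorrt" } ]
      }
    }
    """

    # Append to the model's configuration (path and model name are illustrative).
    with open("model_repository/my_tf_model/config.pbtxt", "a") as f:
        f.write(trt_optimization)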

Known Issues

  • In TensorRT 6.0.1, reformat-free I/O is not supported.

  • Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.7.0_ubuntu1604.clients.tar.gz and v1.7.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.7.0_ubuntu1604.custombackend.tar.gz and v1.7.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

Release 1.6.0, corresponding to NGC container 19.09

27 Sep 21:50

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.6.0

  • Added TensorRT 6 support, which includes support for TensorRT dynamic
    shapes.

  • Shared memory support is added as an alpha feature in this release. This
    support allows input and output tensors to be communicated via shared
    memory instead of over the network. Currently only system (CPU) shared
    memory is supported.

  • Amazon S3 is now supported as a remote file system for model repositories.
    Use the s3:// prefix on model repository paths to reference S3 locations.

  • The inference server library API is available as a beta in this release.
    The library API allows you to link against libtrtserver.so so that you can
    include all the inference server functionality directly in your application.

  • GRPC endpoint performance improvement. The inference server’s GRPC endpoint
    now uses significantly less memory while delivering higher performance.

  • The ensemble scheduler is now more flexible in allowing batching and
    non-batching models to be composed together in an ensemble.

  • The ensemble scheduler will now keep tensors in GPU memory between models
    when possible. Doing so significantly increases performance of some ensembles
    by avoiding copies to and from system memory.

  • The performance client, perf_client, now supports models with variable-sized
    input tensors.

Known Issues

  • The ONNX Runtime backend could not be updated to the 0.5.0 release due to multiple performance and correctness issues with that release.

  • In TensorRT 6:

    • Reformat-free I/O is not supported.
    • Only models that have a single optimization profile are currently supported.
  • Google Kubernetes Engine (GKE) version 1.14 contains a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.6.0_ubuntu1604.clients.tar.gz and v1.6.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.6.0_ubuntu1604.custombackend.tar.gz and v1.6.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

Release 1.5.0, corresponding to NGC container 19.08

03 Sep 23:53

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.5.0

  • Added a new execution mode that allows the inference server to start
    without loading any models from the model repository. Model loading and
    unloading are then controlled by a new GRPC/HTTP model control API.

  • Added a new instance-group mode that allows TensorFlow models that
    explicitly distribute inferencing across multiple GPUs to run in that
    manner in the inference server.

  • Improved input/output tensor reshape to allow variable-sized dimensions in
    tensors being reshaped.

  • Added a C++ wrapper around the custom backend C API to simplify the creation
    of custom backends. This wrapper is included in the custom backend SDK.

  • Improved the accuracy of the compute statistic reported for inference
    requests. Previously the compute statistic included some additional time
    beyond the actual compute time.

  • The performance client, perf_client, now reports more information for ensemble
    models, including statistics for all contained models and the entire ensemble.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.5.0_ubuntu1604.clients.tar.gz and v1.5.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.5.0_ubuntu1604.custombackend.tar.gz and v1.5.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

Release 1.4.0, corresponding to NGC container 19.07

30 Jul 23:06

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.4.0

  • Added libtorch as a new backend. PyTorch models manually decorated or automatically traced to produce TorchScript can now be run directly by the inference server.

  • The build system has been converted from Bazel to CMake. The new CMake-based build system is more transparent, portable, and modular.

  • To simplify the creation of custom backends, a Custom Backend SDK and improved documentation are now available.

  • Improved AsyncRun API in C++ and Python client libraries.

  • perf_client can now use user-supplied input data (previously perf_client could only use random or zero input data).

  • perf_client now reports latency at multiple confidence percentiles (p50, p90, p95, p99) as well as a user-supplied percentile that is also used to stabilize latency results.

  • Improvements to automatic model configuration creation (--strict-model-config=false).

  • C++ and Python client libraries now allow additional HTTP headers to be specified when using the HTTP protocol.

Known Issues

  • Google Cloud Storage (GCS) support has been restored in this release.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.4.0_ubuntu1604.clients.tar.gz and v1.4.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.4.0_ubuntu1604.custombackend.tar.gz and v1.4.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

Release 1.3.0, corresponding to NGC container 19.06

28 Jun 16:36

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.3.0

  • The ONNX Runtime (github.com/Microsoft/onnxruntime) is now integrated into the inference server. ONNX models can now be used directly in a model repository.

  • The HTTP health port may be specified independently of the inference and status HTTP port with the --http-health-port flag (a readiness-check sketch follows this list).

  • Fixed bug in perf_client that caused high CPU usage that could lower the measured inference/sec in some cases.
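
A minimal readiness-check sketch, assuming the server was started with a separate health port (for example, --http-health-port=8080) while inference and status remain on the default port 8000; the /api/health/ready path is an assumption based on this server generation's HTTP API and should be verified against the documentation for this release.

    import requests

    # Poll the readiness endpoint on the separately configured health port.
    HEALTH_URL = "http://localhost:8080/api/health/ready"

    response = requests.get(HEALTH_URL)
    print("ready" if response.status_code == 200 else "not ready")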

Known Issues

  • Google Cloud Storage (GCS) support is not available in the 19.06 release. Support for GCS is available on the master branch and will be re-enabled in the 19.07 release.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.3.0_ubuntu1604.clients.tar.gz and v1.3.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Release 1.2.0, corresponding to NGC container 19.05

24 May 16:20

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.2.0

  • Ensembling is now available. An ensemble represents a pipeline of one or more models and the connection of input and output tensors between those models. A single inference request to an ensemble will trigger the execution of the entire pipeline.

  • Added Helm chart that deploys a single TensorRT Inference Server into a Kubernetes cluster.

  • The client Makefile now supports building for both Ubuntu 16.04 and Ubuntu 18.04. The Python wheel produced from the build is now compatible with both Python 2 and Python 3.

  • The perf_client application now has a --percentile flag that can be used to report latencies instead of reporting average latency (which remains the default). For example, using --percentile=99 causes perf_client to report the 99th percentile latency (a combined usage sketch follows this list).

  • The perf_client application now has a -z option to use zero-valued input tensors instead of random values.

  • Improved error reporting of incorrect input/output tensor names for TensorRT models.

  • Added --allow-gpu-metrics option to enable/disable reporting of GPU metrics.
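
A combined usage sketch for the new perf_client flags described above; --percentile=99 and -z come from these notes, while the -m flag for selecting the model (and the model name itself) are assumptions, so check perf_client --help before relying on them.

    import subprocess

    # Report 99th percentile latency and use zero-valued input tensors.
    # The -m (model name) flag and "mymodel" are illustrative assumptions;
    # --percentile and -z are described in the notes above.
    subprocess.run(
        ["perf_client", "-m", "mymodel", "--percentile=99", "-z"],
        check=True,
    )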

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.2.0_ubuntu1604.clients.tar.gz and v1.2.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.