Documentation - API Reference - Changelog - Bug reports - Discord
⚠️ Nitro is currently in Development: Expect breaking changes and bugs!
Nitro TensorRT-LLM is an experimental implementation of Nitro that runs LLMs using Nvidia's TensorRT-LLM on Windows.
- Pure C++ inference server on top of TensorRT-LLM's C++ Runtime
- OpenAI-compatible API with `/chat/completion` and `loadmodel` endpoints
- Packageable as a single runnable package (e.g. `nitro.exe`) to run seamlessly on bare metal on Windows
- Can be embedded in Windows Desktop apps
You can try this in Jan using the TensorRT-LLM Extension.
Read more about Nitro at https://nitro.jan.ai/
NOTE: Nvidia Driver >=535 and CUDA Toolkit >=12.2 are prerequisites; they are often pre-installed on systems with Nvidia GPUs.
- NVIDIA Driver for your specific GPU >=535
- CUDA toolkit >=12.2
```bash
# Verify prerequisites
nvidia-smi     # Nvidia Driver
nvcc --version # CUDA Toolkit
```
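As a sketch of the version check the two commands above perform, the helpers below parse the tools' output and compare against the minimums stated above. The parsing functions take raw output strings, so they work without a GPU present; the exact output formats assumed here are illustrative.

```python
# Sketch: check driver and CUDA versions against the minimums above
# (Driver >= 535, CUDA Toolkit >= 12.2). The parsers take the raw tool
# output, so they can be tested without a GPU installed.

def parse_driver_version(nvidia_smi_query: str) -> tuple:
    """Parse output like '546.01' from
    `nvidia-smi --query-gpu=driver_version --format=csv,noheader`."""
    return tuple(int(p) for p in nvidia_smi_query.strip().split("."))

def parse_cuda_version(nvcc_output: str) -> tuple:
    """Parse the 'release X.Y' token from `nvcc --version` output."""
    for line in nvcc_output.splitlines():
        if "release" in line:
            token = line.split("release")[1].split(",")[0].strip()
            return tuple(int(p) for p in token.split("."))
    raise ValueError("could not find CUDA release in nvcc output")

def meets_prerequisites(driver: tuple, cuda: tuple) -> bool:
    # Tuple comparison handles multi-part versions correctly.
    return driver >= (535,) and cuda >= (12, 2)
```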
We have compiled Nitro TensorRT-LLM into a single Windows package that can run seamlessly on bare metal, without needing manual installation of dependencies.
| Package | Size | Download |
|---|---|---|
| Nitro TensorRT-LLM (zipped) | 336mb | Download |
Note: The unzipped Nitro TensorRT-LLM package is approximately 730mb. This excludes the TensorRT-LLM engine for the model.
The Nitro TensorRT-LLM package contains `nitro.exe` and its dependent `.dll` files.
| Contents | Purpose | Size |
|---|---|---|
| `nitro.exe` | Nitro | Negligible |
| `tensorrt_llm.dll` | TensorRT-LLM | ~450mb |
| `nvinfer.dll` | TensorRT-LLM | ~200mb |
| `nvinfer_plugin_tensorrt_llm.dll` | TensorRT-LLM | Negligible |
| `cudnn_ops_infer64_8.dll` | cuDNN | ~80mb |
| `cudnn64_8.dll` | cuDNN | Negligible |
| `msmpi.dll` | Microsoft MPI | Negligible |
| `zlib.dll` | zlib | Negligible |
| **Total** | | ~730mb |
Models in TensorRT-LLM are compiled to TensorRT-LLM Engines for your GPU and Operating System.
Jan has fine-tuned LlamaCorn-1.1b, a small model that can be run even on laptop GPUs with <6 GB of VRAM.
- Based on TinyLlama-1.1b
- Fine-tuned to handle simple tasks with acceptable conversational quality
| Model | OS | Size | Architecture | GPU Supported | Download |
|---|---|---|---|---|---|
| Llamacorn 1.1b | Windows | | Ampere | >3050 | Download |
| Llamacorn 1.1b | Windows | ~2.05gb | Ada | >4050 | Download |
| OpenHermes 7b | Windows | | Ampere | 3090 | Download |
| OpenHermes 7b | Windows | | Ada | 4090 | Download |
You can also build the TensorRT Engine directly on your machine, using your preferred model.
- This process can take upwards of 1 hour.
- See Building a TensorRT-LLM Engine instructions below.
```powershell
# Go to the folder containing nitro.exe
.\nitro.exe [thread_num] [host] [port] [uploads_folder_path]

# Example
.\nitro.exe 1 http://0.0.0.0 3928
```
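For scripting the launch, a minimal sketch of assembling the positional arguments shown above (`thread_num`, `host`, `port`, and optional `uploads_folder_path`) — the `build_nitro_command` helper is hypothetical, not part of Nitro:

```python
from typing import Optional

# Sketch: assemble the nitro.exe command line from the positional
# arguments documented above: [thread_num] [host] [port] [uploads_folder_path].

def build_nitro_command(thread_num: int = 1,
                        host: str = "http://0.0.0.0",
                        port: int = 3928,
                        uploads_folder: Optional[str] = None) -> list:
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    cmd = [".\\nitro.exe", str(thread_num), host, str(port)]
    if uploads_folder is not None:
        cmd.append(uploads_folder)  # trailing positional argument is optional
    return cmd

# subprocess.Popen(build_nitro_command()) would start the server.
```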
```powershell
# Powershell
Invoke-WebRequest -Uri "http://localhost:3928/inferences/tensorrtllm/loadmodel" `
  -Method Post `
  -ContentType "application/json" `
  -Body "{ `
    `"engine_path`": `"./openhermes-7b`", `
    `"ctx_len`": 512, `
    `"ngl`": 100 `
  }"
```
```bash
# WSL
curl --location 'http://localhost:3928/inferences/tensorrtllm/loadmodel' \
  --header 'Content-Type: application/json' \
  --data '{
    "engine_path": "./llamacorn-1.1b",
    "ctx_len": 512,
    "ngl": 100
  }'
```
| Parameter | Type | Description |
|---|---|---|
| `engine_path` | String | The file path to the TensorRT-LLM engine. |
| `ctx_len` | Integer | The context length for engine operations. |
| `ngl` | Integer | The number of GPU layers to use. |
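The same `loadmodel` call can be made from Python. The sketch below uses only the standard library; the endpoint path and the `engine_path`/`ctx_len`/`ngl` parameters are those documented above, while the helper names and engine path are illustrative:

```python
import json
import urllib.request

# Sketch: call the loadmodel endpoint from Python using only the
# standard library. Requires a running nitro.exe to actually connect.

def build_loadmodel_payload(engine_path: str, ctx_len: int = 512,
                            ngl: int = 100) -> dict:
    """Build the JSON body documented in the parameter table above."""
    return {"engine_path": engine_path, "ctx_len": ctx_len, "ngl": ngl}

def load_model(engine_path: str,
               base_url: str = "http://localhost:3928") -> dict:
    body = json.dumps(build_loadmodel_payload(engine_path)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/inferences/tensorrtllm/loadmodel",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# load_model("./llamacorn-1.1b")  # engine path is a placeholder
```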
Nitro TensorRT-LLM offers a drop-in replacement for OpenAI's `/chat/completions` endpoint, including streaming responses.

Note: the `model` field is a placeholder for OpenAI compatibility. It is not used, as Nitro TensorRT-LLM currently loads only one model at a time.
```powershell
# Powershell
$url = "http://localhost:3928/v1/chat/completions"
$headers = @{
    "Content-Type" = "application/json"
    "Accept" = "text/event-stream"
    "Access-Control-Allow-Origin" = "*"
}
$body = @{
    "messages" = @(
        @{
            "content" = "Hello there 👋"
            "role" = "assistant"
        },
        @{
            "content" = "Write a long story about NVIDIA!!!!"
            "role" = "user"
        }
    )
    "stream" = $true
    "model" = "openhermes-mistral"
    "max_tokens" = 2048
} | ConvertTo-Json

Invoke-RestMethod -Uri $url -Method Post -Headers $headers -Body $body -UseBasicParsing -TimeoutSec 0
```
```bash
# WSL
curl --location 'http://0.0.0.0:3928/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header 'Accept: text/event-stream' \
  --header 'Access-Control-Allow-Origin: *' \
  --data '{
    "messages": [
      {
        "content": "Hello there 👋",
        "role": "assistant"
      },
      {
        "content": "Write a long story about NVIDIA!!!!",
        "role": "user"
      }
    ],
    "stream": true,
    "model": "<NON-NULL STRING>",
    "max_tokens": 2048
  }'
```
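When `"stream": true` is set, the response body arrives as server-sent events. Assuming the OpenAI-compatible framing implied by the drop-in claim above (`data: {...}` lines terminated by `data: [DONE]`), a minimal parser sketch:

```python
import json

# Sketch: parse an OpenAI-style SSE stream body into the generated text.
# Each event line looks like `data: {...}`; the stream ends with
# `data: [DONE]` (assumed OpenAI-compatible framing).

def extract_stream_text(sse_body: str) -> str:
    """Concatenate the delta content tokens from an SSE response body."""
    out = []
    for line in sse_body.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        out.append(delta.get("content", ""))
    return "".join(out)
```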
NOTE: Jan will be releasing a TensorRT-LLM Extension that wraps Nitro TensorRT-LLM. The steps below are only needed if you want to set it up manually.
1. Download Jan for Windows.

2. Navigate to the `~/jan/engines` folder and edit `openai.json`.

3. Modify `openai.json` to point to the URL of your Nitro TensorRT-LLM API endpoint:

   ```json
   {"full_url": "http://localhost:3928/v1/chat/completions", "api_key": ""}
   ```

4. In `~/jan/models`, duplicate the `gpt-4` folder. Name the new folder `your-model-name-tensorrt-llm`.

5. In this folder, edit the `model.json` file so that:
   - `id` matches `your-model-name`
   - `name` is any vanity name you want to call your TensorRT Engine
   - `format` is set to `api`
   - `engine` is set to `openai`
```json
{
  "sources": [
    {
      "url": "http://localhost:3928/v1/chat/completions"
    }
  ],
  "id": "llamacorn-1.1b-tensorrt-llm",
  "object": "model",
  "name": "Llamacorn-1.1b (TensorRT-LLM)",
  "version": "1.0",
  "description": "TensorRT-LLM is extremely good",
  "format": "api",
  "settings": {},
  "parameters": {},
  "metadata": {
    "author": "Nvidia",
    "tags": ["General", "Big Context Length"]
  },
  "engine": "openai"
}
```
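To generate this file programmatically, a sketch that fills in the fields required by the steps above (`format` set to `api`, `engine` set to `openai`, `id` matching your model folder name) — the `make_model_json` helper and the remaining field values are illustrative:

```python
# Sketch: build a Jan model.json dict for a Nitro TensorRT-LLM endpoint,
# following the field requirements listed above. Only format="api",
# engine="openai", and the id/name/url semantics come from the docs;
# the other values are illustrative defaults.

def make_model_json(model_id: str, display_name: str,
                    url: str = "http://localhost:3928/v1/chat/completions") -> dict:
    return {
        "sources": [{"url": url}],
        "id": model_id,            # must match your model folder name
        "object": "model",
        "name": display_name,      # any vanity name for your engine
        "version": "1.0",
        "description": "Nitro TensorRT-LLM engine served over an OpenAI-compatible API",
        "format": "api",           # must be "api"
        "settings": {},
        "parameters": {},
        "metadata": {"author": "Nvidia", "tags": ["General"]},
        "engine": "openai",        # must be "openai"
    }
```

Write the result to `~/jan/models/your-model-name-tensorrt-llm/model.json` with `json.dump`.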
6. Restart the app.

7. Create a new chat thread. Select `Remote` and your engine's `name`.
The actual Nitro code lives in a subfolder, which is then used in the build process. We have chosen to work off a fork of TensorRT-LLM, given the need to keep in sync with the fast pace of upstream development.
```
+-- cpp
|   +-- tensorrt_llm
|   |   +-- nitro
|   |   |   +-- nitro_deps
|   |   |   +-- main.cc
|   |   |   +-- ...
|   |   +-- CMakeLists.txt
```
- TODO
- For support, please file a GitHub ticket.
- For questions, join our Discord here.
- For long-form inquiries, please email hello@jan.ai.