TensorRT-LLM Requests #632
Comments
Please add CohereAI!! CohereForAI/c4ai-command-r-plus
Llama 3 would be great (both 8B and 70B): #1470. Maybe quantized to 8 or even 4 bit.
Currently, Llama 3 throws a bunch of errors when converting to TensorRT-LLM. Any idea about support for Llama 3?
Phi-3-mini should be amazing! Such a small 3.8B model could run quantized on many GPUs, with as little as 4 GB VRAM.
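The VRAM estimate in the comment above can be sanity-checked with back-of-the-envelope arithmetic (weights only; KV cache and activations are extra, and the byte-per-parameter figures are the usual ones for these formats, not TensorRT-LLM-specific numbers):

```python
# Rough weight-memory footprint of a 3.8B-parameter model at
# different precisions. Ignores KV cache, activations, and runtime
# overhead, so treat the results as lower bounds.
PARAMS = 3.8e9
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

for fmt, b in BYTES_PER_PARAM.items():
    print(f"{fmt}: ~{weight_gb(PARAMS, b):.1f} GB for weights")
```

At 4-bit that is roughly 1.9 GB of weights, which is why a 4 GB card is plausible for a 3.8B model.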
+1 for Phi-3 |
+1 for Command R Plus! CohereForAI/c4ai-command-r-plus |
Hi all, this issue will track the feature requests you've made to TensorRT-LLM & provide a place to see what TRT-LLM is currently working on.
Last update:
Jan 14th, 2024
🚀 = in development
Models
Decoder Only
Encoder / Encoder-Decoder
Multi-Modal
Other
Features & Optimizations
implementation done - documentation in progress
KV Cache
Quantization
Sampling
- `frequency_penalty` - Support for `frequency_penalty` #275
- `repetition` & `presence` penalties - Support for combining `repetition_penalty`, `presence_penalty` #274
Workflow
Front-ends
Integrations
Usage / Installation
Platform Support
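The sampling penalties tracked above (`frequency_penalty`, `presence_penalty`, `repetition_penalty`) can be sketched in plain Python. This is a minimal illustration of the standard penalty scheme these parameters usually refer to, not TensorRT-LLM's actual kernel:

```python
def apply_penalties(logits, generated_ids, presence_penalty=0.0,
                    frequency_penalty=0.0, repetition_penalty=1.0):
    """Adjust next-token logits based on previously generated tokens.

    Sketch only: additive presence/frequency penalties in the
    OpenAI style, and the multiplicative repetition penalty from
    CTRL. Real implementations run this on-GPU per request.
    """
    counts = {}
    for t in generated_ids:
        counts[t] = counts.get(t, 0) + 1
    out = list(logits)
    for t, c in counts.items():
        # presence: flat penalty if the token appeared at all
        out[t] -= presence_penalty
        # frequency: penalty scales with occurrence count
        out[t] -= frequency_penalty * c
        # repetition: divide positive logits, multiply negative ones
        if out[t] > 0:
            out[t] /= repetition_penalty
        else:
            out[t] *= repetition_penalty
    return out
```

With all penalties at their neutral values (0.0, 0.0, 1.0) the logits pass through unchanged, which is the property #274 cares about when combining them.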