24 May 10:55

Narsil

8f22cb9

v2.0.4 Latest

Main changes

AMD MI300 compatibility by @fxmarty in #1764
Many bugfixes.

What's Changed

OpenAI function calling compatible support by @phangiabao98 in #1888
Fixing types. by @Narsil in #1906
Types. by @Narsil in #1909
Fixing signals. by @Narsil in #1910
Removing some unused code. by @Narsil in #1915
MI300 compatibility by @fxmarty in #1764
Add TGI monitoring guide through Grafana and Prometheus by @fxmarty in #1908
Update grafana template by @fxmarty in #1918
Fix TunableOp bug by @fxmarty in #1920
Fix TGI issues with ROCm by @fxmarty in #1921
Fixing the download strategy for ibm-fms by @Narsil in #1917
ROCm: make CK FA2 default instead of Triton by @fxmarty in #1924
docs: Fix grafana dashboard url by @edwardzjl in #1925
feat: include token in client test like server tests by @drbh in #1932
Creating doc automatically for supported models. by @Narsil in #1929
fix: use path inside of speculator config by @drbh in #1935
feat: add train medusa head tutorial by @drbh in #1934
reenable xpu for tgi by @sywangyi in #1939
Fixing some legacy behavior (big swapout of serverless on legacy stuff). by @Narsil in #1937
Add completion route to client and add stop parameter where it's missing by @thomas-schillaci in #1869
Improving the logging system. by @Narsil in #1938
Fixing codellama loads by using purely AutoTokenizer. by @Narsil in #1947

New Contributors

@phangiabao98 made their first contribution in #1888
@edwardzjl made their first contribution in #1925
@thomas-schillaci made their first contribution in #1869

Full Changelog: v2.0.3...v2.0.4

Contributors

Narsil, thomas-schillaci, and 5 other contributors

16 May 05:05

Narsil

v2.0.3

40213c9

Important changes

Add: Support for the Falcon2 by @Nilabhra in #1886
New speculation method MLPSpeculator. by @JRosenkranz in #1865
Pali gemma modeling by @drbh in #1895

What's Changed

Fix: "Fixing" double BOS for mistral too. by @Narsil in #1843
Adding scripts to prepare load data. by @Narsil in #1841
Remove misleading warning (not that important nowadays anyway). by @Narsil in #1848
feat: prefer huggingface_hub in docs and show image api by @drbh in #1844
Updating Phi3 (long context). by @Narsil in #1849
Add router name to /info endpoint by @Wauplin in #1854
Upgrading to rust 1.78. by @Narsil in #1851
update xpu docker image and use public ipex wheel by @sywangyi in #1860
Refactor layers. by @Narsil in #1866
Granite support? by @Narsil in #1882
Add: Support for the Falcon2 11B architecture by @Nilabhra in #1886
MLPSpeculator. by @JRosenkranz in #1865
Fixing truncation. by @Narsil in #1890
Correct 'using guidance' link by @brandon-lockaby in #1892
Add GPT-2 with flash attention by @danieldk in #1889
Removing accepted ids in the regular info logs, downgrade to debug. by @Narsil in #1898
feat: add deprecation warning to clients by @drbh in #1855
[Bug Fix] Update torch import reference in bnb quantization by @DhruvSrikanth in #1902
Pali gemma modeling by @drbh in #1895

New Contributors

@Nilabhra made their first contribution in #1886
@brandon-lockaby made their first contribution in #1892
@danieldk made their first contribution in #1889
@DhruvSrikanth made their first contribution in #1902

Full Changelog: v2.0.2...v2.0.3

Contributors

danieldk, Narsil, and 7 other contributors

01 May 07:22

Narsil

v2.0.2

6073ece

TL;DR

New models (Idefics2, Phi3)
Cleaner VLM support in the OpenAI-compatible layer
Upgraded to PyTorch 2.3.0

What's Changed

Make --cuda-graphs 0 work as expected (bis) by @fxmarty in #1768
fix typos in docs and add small clarifications by @MoritzLaurer in #1790
Add attribute descriptions for GenerateParameters by @Wauplin in #1798
feat: allow null eos and bos tokens in config by @drbh in #1791
Phi3 support by @Narsil in #1797
Idefics2. by @Narsil in #1756
fix: avoid frequency and repetition penalty on padding tokens by @drbh in #1765
Adding support for HF_HUB_OFFLINE support in the router. by @Narsil in #1789
feat: improve temperature logic in chat by @drbh in #1749
Updating the benchmarks so everyone uses openai compat layer. by @Narsil in #1800
Update guidance docs to reflect grammar support in API by @dr3s in #1775
Use the generation config. by @Narsil in #1808
2nd round of benchmark modifications (tiny adjustments to avoid overloading the host). by @Narsil in #1816
Adding new env variables for TPU backends. by @Narsil in #1755
add intel xpu support for TGI by @sywangyi in #1475
Blunder by @Narsil in #1815
Fixing qwen2. by @Narsil in #1818
Dummy CI run. by @Narsil in #1817
Changing the waiting_served_ratio default (stack more aggressively by default). by @Narsil in #1820
Better graceful shutdown. by @Narsil in #1827
Add the missing tool_prompt parameter to Python client by @maziyarpanahi in #1825
Small CI cleanup. by @Narsil in #1801
Add reference to TPU support by @brandonroyal in #1760
fix: use get_speculate to the number of layers by @OlivierDehaene in #1737
feat: add how it works section by @drbh in #1773
Fixing frequency penalty by @martinigoyanes in #1811
feat: add vlm docs and simple examples by @drbh in #1812
Handle images in chat api by @drbh in #1828
chore: update torch by @OlivierDehaene in #1730
(chore): torch 2.3.0 by @Narsil in #1833

New Contributors

@MoritzLaurer made their first contribution in #1790
@dr3s made their first contribution in #1775
@maziyarpanahi made their first contribution in #1825
@brandonroyal made their first contribution in #1760
@martinigoyanes made their first contribution in #1811

Full Changelog: v2.0.1...v2.0.2

Contributors

dr3s, Narsil, and 9 other contributors

18 Apr 15:22

OlivierDehaene

v2.0.1

2d0a717

What's Changed

feat: improve tools to include name and add tests by @drbh in #1693
Update response type for /v1/chat/completions and /v1/completions by @Wauplin in #1747
accept list as prompt for OpenAI API by @drbh in #1702
fix ROCm docker image

Full Changelog: v2.0.0...v2.0.1

Contributors

drbh and Wauplin

12 Apr 16:44

OlivierDehaene

v2.0.0

c38a7d7

TGI is back to Apache 2.0!

Highlights

The license was reverted to Apache 2.0.
CUDA graphs are now used by default. They substantially improve latency on high-end nodes.
Llava-Next was added. It is the second multimodal model available in TGI, after Idefics.
Cohere Command R+ support. TGI is the fastest open-source backend for Command R+.
FP8 support.
We now share the vocabulary across all Medusa heads, greatly improving latency and memory use.

Try out Command R+ with Medusa heads on 4xA100s with:

model=text-generation-inference/commandrplus-medusa
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --speculate 3 --num-shard 4
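Once the container is up, a single request is a quick way to confirm the server responds. This is a minimal sketch that assumes the `-p 8080:80` port mapping from the command above and TGI's `/generate` route; the prompt and the `max_new_tokens` value are illustrative only.

```shell
#!/usr/bin/env bash
# Send one generation request to the TGI container started above.
# Assumption: the server is published on localhost:8080 (from `-p 8080:80`).
TGI_URL="${TGI_URL:-http://127.0.0.1:8080}"

# Request body for the /generate route: a prompt plus generation parameters.
# The prompt and max_new_tokens value are illustrative, not recommendations.
payload='{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'

# POST the JSON body; -s hides curl's progress meter.
curl -s "$TGI_URL/generate" \
  -X POST \
  -H 'Content-Type: application/json' \
  -d "$payload"
```

The reply should be a JSON object with a `generated_text` field. For token-by-token output, the `/generate_stream` route returns server-sent events with the same request body.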

What's Changed

Add cuda graphs sizes and make it default. by @Narsil in #1703
Pickle conversion now requires --trust-remote-code. by @Narsil in #1704
Push users to streaming in the readme. by @Narsil in #1698
Fixing cohere tokenizer. by @Narsil in #1697
Force weights_only (before fully breaking pickle files anyway). by @Narsil in #1710
Regenerate ld.so.cache by @oOraph in #1708
Revert license to Apache 2.0 by @OlivierDehaene in #1714
Automatic quantization config. by @Narsil in #1719
Adding Llava-Next (Llava 1.6) with full support. by @Narsil in #1709
fix: fix CohereForAI/c4ai-command-r-plus by @OlivierDehaene in #1707
Update libraries by @abhishekkrthakur in #1713
Dev/mask ldconfig output v2 by @oOraph in #1716
Fp8 Support by @Narsil in #1726
Upgrade EETQ (Fixes the cuda graphs). by @Narsil in #1729
fix(router): fix a possible deadlock in next_batch by @OlivierDehaene in #1731
chore(cargo-toml): apply lto fat and codegen-units of one by @somehowchris in #1651
Improve the defaults for the launcher by @Narsil in #1727
feat: medusa shared by @OlivierDehaene in #1734
Fix typo in guidance.md by @eltociear in #1735

New Contributors

@somehowchris made their first contribution in #1651

Full Changelog: v1.4.5...v2.0.0

Contributors

Narsil, abhishekkrthakur, and 4 other contributors

29 Mar 18:18

OlivierDehaene

v1.4.5

4ee0a0c

Highlights

DBRX support (#1685). See #1679 for how to prompt the model.

What's Changed

fix: adjust logprob response logic by @drbh in #1682
fix: handle batches with and without grammars by @drbh in #1676
feat: Add dbrx support by @OlivierDehaene in #1685

Full Changelog: v1.4.4...v1.4.5

Contributors

drbh and OlivierDehaene

22 Mar 17:45

OlivierDehaene

v1.4.4

6c4496a

Highlights

CohereForAI/c4ai-command-r-v01 model support

What's Changed

Handle concurrent grammar requests by @drbh in #1610
Fix idefics default. by @Narsil in #1614
Fix async client timeout by @hugoabonizio in #1617
accept legacy request format and response by @drbh in #1527
add missing stop parameter for chat request by @drbh in #1619
correctly index into mask when applying grammar by @drbh in #1618
Use a better model for the quick tour by @lewtun in #1639
Upgrade nix version from 0.27.1 to 0.28.0 by @yuanwu2017 in #1638
Update peft + transformers + accelerate + bnb + safetensors by @abhishekkrthakur in #1646
Fix index in ChatCompletionChunk by @Wauplin in #1648
Fixing minor typo in documentation: supported hardware section by @SachinVarghese in #1632
bump minijinja and add test for core templates by @drbh in #1626
support force downcast after FastRMSNorm multiply for Gemma by @drbh in #1658
prefer spaces url over temp url by @drbh in #1662
improve tool type, bump pydantic and outlines by @drbh in #1650
Remove unnecessary cuda graph. by @Narsil in #1664
Repair idefics integration tests. by @Narsil in #1663
fix: LlamaTokenizerFast to AutoTokenizer at flash_mistral.py by @SeongBeomLEE in #1637
Inline images for multimodal models. by @Narsil in #1666

New Contributors

@hugoabonizio made their first contribution in #1617
@yuanwu2017 made their first contribution in #1638
@abhishekkrthakur made their first contribution in #1646
@Wauplin made their first contribution in #1648
@SachinVarghese made their first contribution in #1632
@SeongBeomLEE made their first contribution in #1637

Full Changelog: v1.4.3...v1.4.4

Contributors

Narsil, abhishekkrthakur, and 7 other contributors

28 Feb 15:14

OlivierDehaene

v1.4.3

e6bb3ff

Highlights

Add support for Starcoder 2
Add support for Qwen2

What's Changed

fix openapi schema by @OlivierDehaene in #1586
avoid default message by @drbh in #1579
Revamp medusa implementation so that every model can benefit. by @Narsil in #1588
Support tools by @drbh in #1587
Fixing x-compute-time. by @Narsil in #1606
Fixing guidance docs. by @Narsil in #1607
starcoder2 by @OlivierDehaene in #1605
Qwen2 by @Jason-CKY in #1608

Full Changelog: v1.4.2...v1.4.3

Contributors

Narsil, drbh, and 2 other contributors

Assets 2

21 Feb 13:52

OlivierDehaene

v1.4.2

9c1cb81

Highlights

Add support for Google Gemma models

What's Changed

Fix mistral with length > window_size for long prefills (rotary doesn't create long enough cos, sin). by @Narsil in #1571
improve endpoint support by @drbh in #1577
refactor syntax to correctly include structs by @drbh in #1580
fix openapi and add jsonschema validation by @OlivierDehaene in #1578
add support for Gemma by @OlivierDehaene in #1583

Full Changelog: v1.4.1...v1.4.2

Contributors

Narsil, drbh, and OlivierDehaene

16 Feb 16:53

OlivierDehaene

v1.4.1

4139054

Highlights

Mamba support by @drbh in #1480 and by @Narsil in #1552
Experimental support for cuda graphs by @OlivierDehaene in #1428
Outlines guided generation by @drbh in #1539
Added name field to OpenAI compatible API Messages by @amihalik in #1563
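Guided generation (#1539) is exposed through a `grammar` field in the request parameters. This sketch constrains the output to a small JSON schema. It assumes a server running TGI 1.4.1 or later on localhost:8080 and the `{"type": "json", "value": <schema>}` shape described in TGI's guidance documentation; the prompt and schema are hypothetical examples.

```shell
#!/usr/bin/env bash
# Constrain generation to a JSON schema via the `grammar` request parameter.
# Assumption: a TGI 1.4.1+ server is reachable on localhost:8080.
TGI_URL="${TGI_URL:-http://127.0.0.1:8080}"

# Hypothetical prompt and schema; the model's output must match the schema.
payload=$(cat <<'EOF'
{
  "inputs": "Extract the place and the animals: I walked my dog and saw two ducks in Paris.",
  "parameters": {
    "max_new_tokens": 100,
    "grammar": {
      "type": "json",
      "value": {
        "properties": {
          "location": {"type": "string"},
          "animals": {"type": "array", "items": {"type": "string"}}
        },
        "required": ["location", "animals"]
      }
    }
  }
}
EOF
)

curl -s "$TGI_URL/generate" \
  -X POST \
  -H 'Content-Type: application/json' \
  -d "$payload"
```

A regex constraint follows the same shape, with `"type": "regex"` and a pattern string as `value`.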

What's Changed

Fixing top_n_tokens. by @Narsil in #1497
Sending compute type from the environment instead of hardcoded string by @Narsil in #1504
Create the compute type at launch time (if not provided in the env). by @Narsil in #1505
Modify default for max_new_tokens in python client by @freitng in #1336
feat: eetq gemv optimization when batch_size <= 4 by @dtlzhuangz in #1502
fix: improve messages api docs content and formatting by @drbh in #1506
GPTNeoX: Use static rotary embedding by @dwyatte in #1498
Hotfix the / health - route. by @Narsil in #1515
fix: tokenizer config should use local model path when possible by @drbh in #1518
Updating tokenizers. by @Narsil in #1517
[docs] Fix link to Install CLI by @pcuenca in #1526
feat: add ie update to message docs by @drbh in #1523
feat: use existing add_generation_prompt variable from config in temp… by @drbh in #1533
Update to peft 0.8.2 by @Stillerman in #1537
feat(server): add frequency penalty by @OlivierDehaene in #1541
chore: bump ci rust version by @drbh in #1543
ROCm AWQ support by @IlyasMoutawwakil in #1514
feat(router): add max_batch_size by @OlivierDehaene in #1542
feat: add deserialize_with that handles strings or objects with content by @drbh in #1550
Fixing glibc version in the runtime. by @Narsil in #1556
Upgrade intermediary layer for nvidia too. by @Narsil in #1557
Improving mamba runtime by using updates by @Narsil in #1552
Small cleanup. by @Narsil in #1560
Bugfix: eos and bos tokens positions are inconsistent by @amihalik in #1567
chore: add pre-commit by @OlivierDehaene in #1569
feat: add chat template struct to avoid tuple ordering errors by @OlivierDehaene in #1570
v1.4.1 by @OlivierDehaene in #1568

New Contributors

@freitng made their first contribution in #1336
@dtlzhuangz made their first contribution in #1502
@dwyatte made their first contribution in #1498
@pcuenca made their first contribution in #1526
@Stillerman made their first contribution in #1537
@IlyasMoutawwakil made their first contribution in #1514
@amihalik made their first contribution in #1563

Full Changelog: v1.4.0...v1.4.1

Contributors

Narsil, pcuenca, and 8 other contributors

Releases: huggingface/text-generation-inference
