Releases: huggingface/text-generation-inference
Releases · huggingface/text-generation-inference
v2.0.4
Main changes
What's Changed
- OpenAI function calling compatible support by @phangiabao98 in #1888
- Fixing types. by @Narsil in #1906
- Types. by @Narsil in #1909
- Fixing signals. by @Narsil in #1910
- Removing some unused code. by @Narsil in #1915
- MI300 compatibility by @fxmarty in #1764
- Add TGI monitoring guide through Grafana and Prometheus by @fxmarty in #1908
- Update grafana template by @fxmarty in #1918
- Fix TunableOp bug by @fxmarty in #1920
- Fix TGI issues with ROCm by @fxmarty in #1921
- Fixing the download strategy for ibm-fms by @Narsil in #1917
- ROCm: make CK FA2 default instead of Triton by @fxmarty in #1924
- docs: Fix grafana dashboard url by @edwardzjl in #1925
- feat: include token in client test like server tests by @drbh in #1932
- Creating doc automatically for supported models. by @Narsil in #1929
- fix: use path inside of speculator config by @drbh in #1935
- feat: add train medusa head tutorial by @drbh in #1934
- reenable xpu for tgi by @sywangyi in #1939
- Fixing some legacy behavior (big swapout of serverless on legacy stuff). by @Narsil in #1937
- Add completion route to client and add stop parameter where it's missing by @thomas-schillaci in #1869
- Improving the logging system. by @Narsil in #1938
- Fixing codellama loads by using purely
AutoTokenizer
. by @Narsil in #1947
New Contributors
- @phangiabao98 made their first contribution in #1888
- @edwardzjl made their first contribution in #1925
- @thomas-schillaci made their first contribution in #1869
Full Changelog: v2.0.3...v2.0.4
v2.0.3
Important changes
- Add: Support for the Falcon2 by @Nilabhra in #1886
- New speculation method MLPSpeculator. by @JRosenkranz in #1865
- Pali gemma modeling by @drbh in #1895
What's Changed
- Fix: "Fixing" double BOS for mistral too. by @Narsil in #1843
- Adding scripts to prepare load data. by @Narsil in #1841
- Remove misleading warning (not that important nowadays anyway). by @Narsil in #1848
- feat: prefer huggingface_hub in docs and show image api by @drbh in #1844
- Updating Phi3 (long context). by @Narsil in #1849
- Add router name to /info endpoint by @Wauplin in #1854
- Upgrading to rust 1.78. by @Narsil in #1851
- update xpu docker image and use public ipex whel by @sywangyi in #1860
- Refactor layers. by @Narsil in #1866
- Granite support? by @Narsil in #1882
- Add: Support for the Falcon2 11B architecture by @Nilabhra in #1886
- MLPSpeculator. by @JRosenkranz in #1865
- Fixing truncation. by @Narsil in #1890
- Correct 'using guidance' link by @brandon-lockaby in #1892
- Add GPT-2 with flash attention by @danieldk in #1889
- Removing accepted ids in the regular info logs, downgrade to debug. by @Narsil in #1898
- feat: add deprecation warning to clients by @drbh in #1855
- [Bug Fix] Update torch import reference in bnb quantization by @DhruvSrikanth in #1902
- Pali gemma modeling by @drbh in #1895
New Contributors
- @Nilabhra made their first contribution in #1886
- @brandon-lockaby made their first contribution in #1892
- @danieldk made their first contribution in #1889
- @DhruvSrikanth made their first contribution in #1902
Full Changelog: v2.0.2...v2.0.3
v2.0.2
Tl;dr
- New models (idefics2, phi3)
- Cleaner VLM support in the openai layer
- Upgraded to pytorch 2.3.0
What's Changed
- Make
--cuda-graphs 0
work as expected (bis) by @fxmarty in #1768 - fix typos in docs and add small clarifications by @MoritzLaurer in #1790
- Add attribute descriptions for
GenerateParameters
by @Wauplin in #1798 - feat: allow null eos and bos tokens in config by @drbh in #1791
- Phi3 support by @Narsil in #1797
- Idefics2. by @Narsil in #1756
- fix: avoid frequency and repetition penalty on padding tokens by @drbh in #1765
- Adding support for
HF_HUB_OFFLINE
support in the router. by @Narsil in #1789 - feat: improve temperature logic in chat by @drbh in #1749
- Updating the benchmarks so everyone uses openai compat layer. by @Narsil in #1800
- Update guidance docs to reflect grammar support in API by @dr3s in #1775
- Use the generation config. by @Narsil in #1808
- 2nd round of benchmark modifications (tiny adjustements to avoid overloading the host). by @Narsil in #1816
- Adding new env variables for TPU backends. by @Narsil in #1755
- add intel xpu support for TGI by @sywangyi in #1475
- Blunder by @Narsil in #1815
- Fixing qwen2. by @Narsil in #1818
- Dummy CI run. by @Narsil in #1817
- Changing the waiting_served_ratio default (stack more aggressively by default). by @Narsil in #1820
- Better graceful shutdown. by @Narsil in #1827
- Add the missing
tool_prompt
parameter to Python client by @maziyarpanahi in #1825 - Small CI cleanup. by @Narsil in #1801
- Add reference to TPU support by @brandonroyal in #1760
- fix: use get_speculate to the number of layers by @OlivierDehaene in #1737
- feat: add how it works section by @drbh in #1773
- Fixing frequency penalty by @martinigoyanes in #1811
- feat: add vlm docs and simple examples by @drbh in #1812
- Handle images in chat api by @drbh in #1828
- chore: update torch by @OlivierDehaene in #1730
- (chore): torch 2.3.0 by @Narsil in #1833
New Contributors
- @MoritzLaurer made their first contribution in #1790
- @dr3s made their first contribution in #1775
- @maziyarpanahi made their first contribution in #1825
- @brandonroyal made their first contribution in #1760
- @martinigoyanes made their first contribution in #1811
Full Changelog: v2.0.1...v2.0.2
v2.0.1
v2.0.0
TGI is back to Apache 2.0!
Highlights
- License was reverted to Apache 2.0
- Cuda graphs are now used by default. They improve latency substancially on high end nodes.
- Llava-next was added. It is the second multimodal model available on TGI after Idefics.
- Cohere Command R+ support. TGI is the fastest open source backend for Command R+
- FP8 support.
- We now share the vocabulary for all medusa heads, greatly improving latency and memory use.
Try out Command R+ with Medusa heads on 4xA100s with:
model=text-generation-inference/commandrplus-medusa
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --speculate 3 --num-shard 4
What's Changed
- Add cuda graphs sizes and make it default. by @Narsil in #1703
- Pickle conversion now requires
--trust-remote-code
. by @Narsil in #1704 - Push users to streaming in the readme. by @Narsil in #1698
- Fixing cohere tokenizer. by @Narsil in #1697
- Force weights_only (before fully breaking pickle files anyway). by @Narsil in #1710
- Regenerate ld.so.cache by @oOraph in #1708
- Revert license to Apache 2.0 by @OlivierDehaene in #1714
- Automatic quantization config. by @Narsil in #1719
- Adding Llava-Next (Llava 1.6) with full support. by @Narsil in #1709
- fix: fix CohereForAI/c4ai-command-r-plus by @OlivierDehaene in #1707
- Update libraries by @abhishekkrthakur in #1713
- Dev/mask ldconfig output v2 by @oOraph in #1716
- Fp8 Support by @Narsil in #1726
- Upgrade EETQ (Fixes the cuda graphs). by @Narsil in #1729
- fix(router): fix a possible deadlock in next_batch by @OlivierDehaene in #1731
- chore(cargo-toml): apply lto fat and codegen-units of one by @somehowchris in #1651
- Improve the defaults for the launcher by @Narsil in #1727
- feat: medusa shared by @OlivierDehaene in #1734
- Fix typo in guidance.md by @eltociear in #1735
New Contributors
- @somehowchris made their first contribution in #1651
Full Changelog: v1.4.5...v2.0.0
v.1.4.5
Highlights
What's Changed
- fix: adjust logprob response logic by @drbh in #1682
- fix: handle batches with and without grammars by @drbh in #1676
- feat: Add dbrx support by @OlivierDehaene in #1685
Full Changelog: v1.4.4...v1.4.5
v.1.4.4
Highlights
- CohereForAI/c4ai-command-r-v01 model support
What's Changed
- Handle concurrent grammar requests by @drbh in #1610
- Fix idefics default. by @Narsil in #1614
- Fix async client timeout by @hugoabonizio in #1617
- accept legacy request format and response by @drbh in #1527
- add missing stop parameter for chat request by @drbh in #1619
- correctly index into mask when applying grammar by @drbh in #1618
- Use a better model for the quick tour by @lewtun in #1639
- Upgrade nix version from 0.27.1 to 0.28.0 by @yuanwu2017 in #1638
- Update peft + transformers + accelerate + bnb + safetensors by @abhishekkrthakur in #1646
- Fix index in ChatCompletionChunk by @Wauplin in #1648
- Fixing minor typo in documentation: supported hardware section by @SachinVarghese in #1632
- bump minijina and add test for core templates by @drbh in #1626
- support force downcast after FastRMSNorm multiply for Gemma by @drbh in #1658
- prefer spaces url over temp url by @drbh in #1662
- improve tool type, bump pydantic and outlines by @drbh in #1650
- Remove unecessary cuda graph. by @Narsil in #1664
- Repair idefics integration tests. by @Narsil in #1663
- fix: LlamaTokenizerFast to AutoTokenizer at flash_mistral.py by @SeongBeomLEE in #1637
- Inline images for multimodal models. by @Narsil in #1666
New Contributors
- @hugoabonizio made their first contribution in #1617
- @yuanwu2017 made their first contribution in #1638
- @abhishekkrthakur made their first contribution in #1646
- @Wauplin made their first contribution in #1648
- @SachinVarghese made their first contribution in #1632
- @SeongBeomLEE made their first contribution in #1637
Full Changelog: v1.4.3...v1.4.4
v1.4.3
Highlights
- Add support for Starcoder 2
- Add support for Qwen2
What's Changed
- fix openapi schema by @OlivierDehaene in #1586
- avoid default message by @drbh in #1579
- Revamp medusa implementation so that every model can benefit. by @Narsil in #1588
- Support tools by @drbh in #1587
- Fixing x-compute-time. by @Narsil in #1606
- Fixing guidance docs. by @Narsil in #1607
- starcoder2 by @OlivierDehaene in #1605
- Qwen2 by @Jason-CKY in #1608
Full Changelog: v1.4.2...v1.4.3
v1.4.2
Highlights
- Add support for Google Gemma models
What's Changed
- Fix mistral with length > window_size for long prefills (rotary doesn't create long enough cos, sin). by @Narsil in #1571
- improve endpoint support by @drbh in #1577
- refactor syntax to correctly include structs by @drbh in #1580
- fix openapi and add jsonschema validation by @OlivierDehaene in #1578
- add support for Gemma by @OlivierDehaene in #1583
Full Changelog: v1.4.1...v1.4.2
v1.4.1
Highlights
- Mamba support by @drbh in #1480 and by @Narsil in #1552
- Experimental support for cuda graphs by @OlivierDehaene in #1428
- Outlines guided generation by @drbh in #1539
- Added
name
field to OpenAI compatible API Messages by @amihalik in #1563
What's Changed
- Fixing top_n_tokens. by @Narsil in #1497
- Sending compute type from the environment instead of hardcoded string by @Narsil in #1504
- Create the compute type at launch time (if not provided in the env). by @Narsil in #1505
- Modify default for max_new_tokens in python client by @freitng in #1336
- feat: eetq gemv optimization when batch_size <= 4 by @dtlzhuangz in #1502
- fix: improve messages api docs content and formatting by @drbh in #1506
- GPTNeoX: Use static rotary embedding by @dwyatte in #1498
- Hotfix the / health - route. by @Narsil in #1515
- fix: tokenizer config should use local model path when possible by @drbh in #1518
- Updating tokenizers. by @Narsil in #1517
- [docs] Fix link to Install CLI by @pcuenca in #1526
- feat: add ie update to message docs by @drbh in #1523
- feat: use existing add_generation_prompt variable from config in temp… by @drbh in #1533
- Update to peft 0.8.2 by @Stillerman in #1537
- feat(server): add frequency penalty by @OlivierDehaene in #1541
- chore: bump ci rust version by @drbh in #1543
- ROCm AWQ support by @IlyasMoutawwakil in #1514
- feat(router): add max_batch_size by @OlivierDehaene in #1542
- feat: add deserialize_with that handles strings or objects with content by @drbh in #1550
- Fixing glibc version in the runtime. by @Narsil in #1556
- Upgrade intermediary layer for nvidia too. by @Narsil in #1557
- Improving mamba runtime by using updates by @Narsil in #1552
- Small cleanup. by @Narsil in #1560
- Bugfix: eos and bos tokens positions are inconsistent by @amihalik in #1567
- chore: add pre-commit by @OlivierDehaene in #1569
- feat: add chat template struct to avoid tuple ordering errors by @OlivierDehaene in #1570
- v1.4.1 by @OlivierDehaene in #1568
New Contributors
- @freitng made their first contribution in #1336
- @dtlzhuangz made their first contribution in #1502
- @dwyatte made their first contribution in #1498
- @pcuenca made their first contribution in #1526
- @Stillerman made their first contribution in #1537
- @IlyasMoutawwakil made their first contribution in #1514
- @amihalik made their first contribution in #1563
Full Changelog: v1.4.0...v1.4.1