Releases: vectorch-ai/ScaleLLM
v0.1.1
What's Changed
- [feat] added cuda 11.8 devel image to build cpp release image by @guocuimi in #194
- [fix] fix workflow format by @guocuimi in #195
- [CI] fix docker run options by @guocuimi in #196
- fix: make build pass with gcc-9 by @guocuimi in #197
- ci: bump version and build with new manylinux image (gcc-9) by @guocuimi in #198
- [python] added more examples and fix requirements version by @guocuimi in #199
- feat: moved scheduler wait logic from python into scheduler run_until_complete function by @guocuimi in #200
- feat: added multiple threads support for LLMHandler by @guocuimi in #201
- fix: use a proper epsilon to avoid division by zero error for rejection sampler by @guocuimi in #202
- feat: added batch support for llm handler by @guocuimi in #204
- ci: publish wheels to whl index repo by @guocuimi in #205
Full Changelog: v0.1.0...v0.1.1
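The rejection-sampler fix in #202 concerns the acceptance test used in speculative decoding, where the ratio of target to draft probabilities can divide by zero when the draft model assigns a token (numerically) zero probability. A minimal sketch of the standard acceptance rule with an epsilon-clamped denominator; the function name and epsilon default are illustrative, not ScaleLLM's actual implementation:

```python
import random

def accept_draft_token(p_target: float, p_draft: float,
                       eps: float = 1e-9, rng=random.random) -> bool:
    # Accept the drafted token with probability min(1, p_target / p_draft).
    # Clamping the denominator with eps avoids a division-by-zero when the
    # draft model assigned (numerically) zero probability to the token.
    ratio = p_target / max(p_draft, eps)
    return rng() < min(1.0, ratio)
```

On rejection, the standard scheme resamples from the normalized residual distribution max(0, p_target - p_draft).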
v0.1.0
Major changes:
- Added python wrapper and published scalellm package to PyPI.
- Supported an OpenAI-compatible REST API server: `python3 -m scalellm.serve.api_server`
- Install scalellm with pip: `pip install scalellm`
- Added examples for offline inference and async stream.
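Since the server is OpenAI-compatible, clients can send standard chat-completion requests. A minimal sketch of the request payload, following the OpenAI chat-completions schema; the endpoint path and host/port are assumptions, not taken from ScaleLLM docs:

```python
import json

def chat_completion_request(model: str, prompt: str, stream: bool = False) -> str:
    # Build the JSON body for POST /v1/chat/completions on an
    # OpenAI-compatible server (e.g. http://localhost:8080, assumed).
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return json.dumps(payload)
```

Any OpenAI-compatible client library should also work by pointing its base URL at the local server.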
What's Changed
- [fix] use the pybind11 from libtorch and fix model download issue. by @guocuimi in #167
- [misc] upgrade torch to 2.3 and use gcc-12 by @guocuimi in #168
- [feat] added python rest api server skeleton by @guocuimi in #169
- [refactor] combine sequence and request outputs by @guocuimi in #170
- [feat] added python LLMEngine skeleton by @guocuimi in #171
- [refactor] move proto definitions into proto namespace by @guocuimi in #173
- [feat] implement async llm engine for python wrapper by @guocuimi in #172
- [refactor] consolidate handlers to share llm_handler between python rest api server and grpc server by @guocuimi in #174
- [python] move request handling logic into separate file from api server by @guocuimi in #175
- [python] added model check for rest api by @guocuimi in #176
- [feat] added status handling for grpc server by @guocuimi in #177
- [misc] some changes to cmake file by @guocuimi in #180
- [kernel] change head_dim list to reduce binary size by @guocuimi in #181
- [CI] added base docker image for python wheel build by @guocuimi in #182
- [ci] build python wheels by @guocuimi in #183
- [CI] fix docker image issues and build wheel for different python, pytorch versions by @guocuimi in #184
- [fix] added manylinux support by @guocuimi in #185
- [fix] added cuda 11.8 support for manylinux by @guocuimi in #186
- [feat] added version suffix to include cuda and torch version by @guocuimi in #187
- [CI] Upload wheels to release as assets by @guocuimi in #188
- [fix] fix extension typo for wheel publish workflow by @guocuimi in #189
- [python] added LLM for offline inference and stream examples for chat and complete by @guocuimi in #190
- [python] added requirements into package by @guocuimi in #191
- [Release] prepare 0.1.0 release by @guocuimi in #192
- [Release] added workflow to publish whls to PyPI by @guocuimi in #193
Full Changelog: v0.0.9...v0.1.0
v0.0.9
Major Changes
- Enabled speculative decoding and updated README
What's Changed
- [refactor] add implicit conversion between slice and vector by @guocuimi in #134
- [refactor] change tokenizer special tokens from token to token + id. by @guocuimi in #135
- [feat] support tensor parallelism for MQA/GQA models when num_kv_heads < world_size by @guocuimi in #137
- [refactor] refactoring for sequence by @guocuimi in #140
- [unittest] added more unittests for speculative decoding by @guocuimi in #141
- [unittest] added more unittests for pos_embedding, sampler and rejection_sampler. by @guocuimi in #142
- [feat] added support for kv_cache with different strides. by @guocuimi in #143
- [feat] enable speculative decoding and update readme by @guocuimi in #145
Full Changelog: v0.0.8...v0.0.9
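For the tensor-parallelism support in #137: when a model has fewer KV heads than tensor-parallel ranks (the MQA/GQA case), a common scheme is to replicate each KV head across a contiguous group of ranks so every rank still holds one KV head copy. A sketch of that mapping, under the assumption of this replication scheme (not necessarily ScaleLLM's exact layout):

```python
def kv_head_for_rank(rank: int, num_kv_heads: int, world_size: int) -> int:
    # With num_kv_heads < world_size, each KV head is replicated across
    # world_size // num_kv_heads consecutive ranks; return the KV head
    # index that a given tensor-parallel rank holds.
    assert world_size % num_kv_heads == 0
    replication = world_size // num_kv_heads
    return rank // replication
```

Query heads remain fully sharded; only the smaller KV projection is duplicated.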
v0.0.8
Major changes
- Added Meta Llama3 and Google Gemma support
- Added cuda graph support for decoding
What's Changed
- [model] added support for google Gemma-2b model by @936187425 in #103
- [feat] added rms norm residual kernel by @guocuimi in #125
- [fix] fix data accuracy issue for gemma by @guocuimi in #126
- [refactor] added options for LLMEngine, SpeculativeEngine and Scheduler. by @guocuimi in #127
- [feat] enable cuda graph for decoding by @guocuimi in #129
- [bugfix] fix cuda graph capture issue for tensor parallelism by @guocuimi in #130
- [feat] optimize batch size for cuda graph by @guocuimi in #132
New Contributors
- @936187425 made their first contribution in #103
Full Changelog: v0.0.7...v0.0.8
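The CUDA-graph work in #129/#132 relies on graphs being captured for a fixed set of batch sizes; at decode time the live batch is padded up to the nearest captured size, or falls back to eager execution when no graph fits. A sketch of that bucketing, with illustrative capture sizes (the real set depends on engine options):

```python
import bisect
from typing import Optional

# Hypothetical captured sizes; the actual set depends on engine configuration.
CAPTURED_BATCH_SIZES = [1, 2, 4, 8, 16, 32]

def padded_batch_size(batch_size: int) -> Optional[int]:
    # Find the smallest captured batch size >= the live batch size;
    # return None (eager fallback) when the batch exceeds all captures.
    i = bisect.bisect_left(CAPTURED_BATCH_SIZES, batch_size)
    return CAPTURED_BATCH_SIZES[i] if i < len(CAPTURED_BATCH_SIZES) else None
```

Padding slots are filled with dummy sequences whose outputs are discarded.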
v0.0.7
Major changes
- Dynamic prefix cache
- Dynamic split-fuse scheduler
- Speculative decoding
What's Changed
- [feat] add support for cudagraph and its unit test. by @liutongxuan in #79
- [feat] add block id lifecycle management for block sharing scenarios. by @guocuimi in #85
- [feat] added prefix cache to share kv cache across sequences. by @guocuimi in #86
- [feat] enable prefix cache in block manager by @guocuimi in #87
- [feat] added LRU policy into prefix cache. by @guocuimi in #89
- [refactor] move batch related logic into a class by @guocuimi in #90
- [fix] replace submodules git path with https path to avoid permission issue. by @guocuimi in #92
- [feat] add max tokens to process to support dynamic split-fuse by @guocuimi in #93
- [feat] return prompt string directly in echo mode to avoid decode cost and avoid showing appended prefix tokens. by @guocuimi in #94
- [fix] added small page size support for flash attention. by @guocuimi in #95
- [fix] adjust kv_cache_pos to give at least one token to generate logits by @guocuimi in #96
- added layernorm benchmark by @dongxianzhe in #97
- [feat] added dynamic split-fuse support in continuous scheduler by @guocuimi in #98
- [refactor] move model output process logic into batch by @guocuimi in #99
- [feat] added engine type to allow LLM and SSM share sequence. by @guocuimi in #100
- [feat] added speculative engine class without implementation. by @guocuimi in #101
- [refactor] moved top_k and top_p from sampler to logits process. by @guocuimi in #102
- [workflow] added clang-format workflow by @guocuimi in #105
- [fix] only run git-clang-format against c/c++ files by @guocuimi in #106
- [feat] added prompt blocks sharing across n sequences by @guocuimi in #107
- [feat] Added selected tokens to return logits from model execution. by @guocuimi in #109
- [feat] added rejection sampler for speculative decoding. by @guocuimi in #112
- [feat] enable speculative decoding for simple server by @guocuimi in #113
- [feat] mask out rejected tokens with -1 in Rejection Sampler by @guocuimi in #114
- [feat] added sampling support for multiple query decoding by @guocuimi in #115
- [feat] added stream support for n > 1 scenarios by @guocuimi in #116
- [feat] enable speculative decoding for scalellm. by @guocuimi in #117
- [feat] cancel request if rpc is not ok by @guocuimi in #118
- [fix] put finish reason into a separate response by @guocuimi in #119
- [feat] added skip_special_tokens support for tokenizers by @guocuimi in #120
New Contributors
- @dongxianzhe made their first contribution in #97
Full Changelog: v0.0.6...v0.0.7
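The dynamic prefix cache introduced in #86-#89 lets sequences that share a prompt prefix reuse the same KV-cache blocks, with least-recently-used prefixes evicted first. A toy sketch of the idea; the keying, block management, and eviction granularity here are assumptions, not ScaleLLM's actual data structures:

```python
from collections import OrderedDict

class PrefixCache:
    """Toy LRU cache mapping a token-id prefix to its KV-cache block ids."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._cache = OrderedDict()  # prefix tuple -> list of block ids

    def lookup(self, prefix_tokens):
        key = tuple(prefix_tokens)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as recently used
            return self._cache[key]
        return None

    def insert(self, prefix_tokens, block_ids):
        key = tuple(prefix_tokens)
        self._cache[key] = block_ids
        self._cache.move_to_end(key)
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least-recently-used prefix
```

In the real engine, shared blocks are reference-counted so eviction only frees blocks no live sequence still uses.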
v0.0.6
Major changes:
- Introduced new kernels aimed at enhancing efficiency.
- Implemented an initial Python wrapper, simplifying integration and extending accessibility.
- Incorporated new models such as Baichuan2 and ChatGLM.
- Added support for Jinja chat templates, enhancing customization and user interaction.
- Added usage statistics into responses, ensuring compatibility with OpenAI APIs.
- Enabled ccache to speed up builds and shorten development cycles.
What's Changed
- add timestamp into ccache cache key by @guocuimi in #42
- use ${GITHUB_SHA} in cache key by @guocuimi in #43
- replace GITHUB_SHA with ${{ github.sha }} by @guocuimi in #44
- encapsulate class of time for performance tracking. by @liutongxuan in #46
- upgrade paged_attention kernel to v0.2.7 by @guocuimi in #47
- [feat] add speculative decoding. by @liutongxuan in #50
- added a new attention kernel for speculative decoding by @guocuimi in #52
- added support for small page size. by @guocuimi in #53
- enable flash decoding for both prefill and decode phase. by @guocuimi in #54
- enable split-k for flash decoding and fix bugs. by @guocuimi in #59
- [ut] add unit tests for speculative scheduler. by @liutongxuan in #57
- added a custom command to generate instantiation for flashinfer by @guocuimi in #61
- add custom command to generate instantiation for flash-attn by @guocuimi in #62
- added gpu memory profiling to decide kv cache size precisely. by @guocuimi in #63
- moved attention related files into attention subfolder by @guocuimi in #65
- add pybind11 to support python user interface. by @liutongxuan in #64
- added support to build python wrapper with installed pytorch (pre-cxx11 ABI) by @guocuimi in #66
- merge huggingface tokenizers and safetensors rust projects into one. by @guocuimi in #67
- more changes to support python wrapper by @guocuimi in #68
- [feat] added attention handler for different implementations by @guocuimi in #71
- [perf] enabled speed up for GQA and MQA decoding. by @guocuimi in #72
- [perf] use a separate cuda stream for kv cache by @guocuimi in #73
- [models] added baichuan/baichuan2 model support. by @liutongxuan in #70
- [minor] cleanup redundant code for models. by @liutongxuan in #74
- [feat] moved rope logic into attention handler to support applying positional embedding on the fly by @guocuimi in #76
- [refactor] replace dtype and device with options since they are used together usually by @guocuimi in #77
- [refactor] move cutlass and flashinfer into third_party folder by @guocuimi in #78
- [refactor] split model forward function into two: 1> get hidden states 2> get logits from hidden states by @guocuimi in #80
- [models] support both baichuan and baichuan2 by @guocuimi in #81
- [models] fix chatglm model issue. by @guocuimi in #82
Full Changelog: v0.0.5...v0.0.6
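The GPU memory profiling added in #63 is used to size the KV cache precisely: after model weights and activations are accounted for, the remaining memory (scaled by a utilization factor) is divided by the footprint of one cache block. A sketch of that arithmetic; the parameter names and the utilization-factor approach are assumptions:

```python
def num_kv_cache_blocks(free_gpu_bytes: int, util: float, block_size: int,
                        num_layers: int, num_kv_heads: int, head_dim: int,
                        dtype_bytes: int = 2) -> int:
    # One cache block stores key AND value tensors (factor of 2) for
    # block_size token slots across every layer.
    bytes_per_block = 2 * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes
    return int(free_gpu_bytes * util) // bytes_per_block
```

For example, 1 GiB of headroom at 90% utilization with 16-token blocks, 32 layers, 8 KV heads, head_dim 128, and fp16 yields 460 blocks.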
v0.0.5
Major changes
- Added Qwen, ChatGLM and Phi2 support.
- Added tiktoken tokenizer support.
- Enabled more custom kernels for sampling.
What's Changed
- [docs] add speculative decoding design docs. by @liutongxuan in #33
- [docs] add devel image in CONTRIBUTING.md. by @liutongxuan in #35
- [refactor] rename Executor to ThreadPool. by @liutongxuan in #36
New Contributors
- @liutongxuan made their first contribution in #33
Full Changelog: v0.0.4...v0.0.5
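The sampling kernels mentioned above implement filters like top-k and nucleus (top-p) sampling. A CPU sketch of the filtering logic on an already-normalized probability vector; the real kernels operate on GPU logits and this function is purely illustrative:

```python
def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    # Keep the highest-probability tokens until top_k tokens are taken or
    # their cumulative probability reaches top_p; zero out the rest and
    # renormalize. top_k=0 disables the top-k limit.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for n, i in enumerate(order):
        if top_k and n >= top_k:
            break
        keep.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]
```

Moving this filtering into logits processing (as in #102) keeps the sampler itself a simple categorical draw.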
v0.0.4
v0.0.3
- Added support for Yi Chat Model.
- Added args overrider support.
- Replaced libevhtp with boost asio for http server to fix epoll_wait not implemented error on old linux kernels.
Full Changelog: v0.0.2...v0.0.3-fix
v0.0.2
Major changes
- Added Yi series models support.
- Upgraded paged attention kernel to v2.
- Added chat templates for mistral, aquila and internlm.
What's Changed
- load dtype from config by @guocuimi in #14
- fixed top_k tensor type and added unittests. by @guocuimi in #15
Full Changelog: v0.0.1...v0.0.2