Releases: vectorch-ai/ScaleLLM

v0.1.1

23 May 22:06
57a5730

What's Changed

  • [feat] added cuda 11.8 devel image to build cpp release image by @guocuimi in #194
  • [fix] fix workflow format by @guocuimi in #195
  • [CI] fix docker run options by @guocuimi in #196
  • fix: make build pass with gcc-9 by @guocuimi in #197
  • ci: bump version and build with new manylinux image (gcc-9) by @guocuimi in #198
  • [python] added more examples and fix requirements version by @guocuimi in #199
  • feat: moved scheduler wait logic from python into scheduler run_until_complete function by @guocuimi in #200
  • feat: added multiple threads support for LLMHandler by @guocuimi in #201
  • fix: use a proper epsilon to avoid division by zero error for rejection sampler by @guocuimi in #202
  • feat: added batch support for llm handler by @guocuimi in #204
  • ci: publish wheels to whl index repo by @guocuimi in #205

Full Changelog: v0.1.0...v0.1.1

v0.1.0

17 May 16:26

Major changes

  • Added a Python wrapper and published the scalellm package to PyPI.
  • Added an OpenAI-compatible REST API server: 'python3 -m scalellm.serve.api_server'
  • Install scalellm with pip: 'pip install scalellm'
  • Added examples for offline inference and async streaming (see the sketch below).
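
The offline-inference examples mentioned above follow the pattern sketched below. This is a minimal sketch only: the LLM and SamplingParams names, their constructor arguments, and the placeholder model id are assumptions about the scalellm Python package rather than a verbatim copy of the shipped examples.

```python
# Minimal offline-inference sketch; class and argument names are assumptions
# about the scalellm Python API, and the model id is only a placeholder.
from scalellm import LLM, SamplingParams  # assumed package exports

llm = LLM(model="gpt2")  # hypothetical model id
params = SamplingParams(temperature=0.7, max_tokens=64)  # assumed parameter names

prompts = ["Explain speculative decoding in one sentence."]
outputs = llm.generate(prompts, params)
for output in outputs:
    print(output)
```

For the server path, install with 'pip install scalellm' and start the OpenAI-compatible endpoint with 'python3 -m scalellm.serve.api_server' as listed above.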

What's Changed

  • [fix] use the pybind11 from libtorch and fix model download issue. by @guocuimi in #167
  • [misc] upgrade torch to 2.3 and use gcc-12 by @guocuimi in #168
  • [feat] added python rest api server skeleton by @guocuimi in #169
  • [refactor] combine sequence and request outputs by @guocuimi in #170
  • [feat] added python LLMEngine skeleton by @guocuimi in #171
  • [refactor] move proto definitions into proto namespace by @guocuimi in #173
  • [feat] implement async llm engine for python wrapper by @guocuimi in #172
  • [refactor] consolidate handlers to share llm_handler between python rest api server and grpc server by @guocuimi in #174
  • [python] move request handling logic into separate file from api server by @guocuimi in #175
  • [python] added model check for rest api by @guocuimi in #176
  • [feat] added status handling for grpc server by @guocuimi in #177
  • [misc] some changes to cmake file by @guocuimi in #180
  • [kernel] change head_dim list to reduce binary size by @guocuimi in #181
  • [CI] added base docker image for python wheel build by @guocuimi in #182
  • [ci] build python wheels by @guocuimi in #183
  • [CI] fix docker image issues and build wheel for different python, pytorch versions by @guocuimi in #184
  • [fix] added manylinux support by @guocuimi in #185
  • [fix] added cuda 11.8 support for manylinux by @guocuimi in #186
  • [feat] added version suffix to include cuda and torch version by @guocuimi in #187
  • [CI] Upload wheels to release as assets by @guocuimi in #188
  • [fix] fix extension typo for wheel publish workflow by @guocuimi in #189
  • [python] added LLM for offline inference and stream examples for chat and complete by @guocuimi in #190
  • [python] added requirements into package by @guocuimi in #191
  • [Release] prepare 0.1.0 release by @guocuimi in #192
  • [Release] added workflow to publish wheels to PyPI by @guocuimi in #193

Full Changelog: v0.0.9...v0.1.0

v0.0.9

28 Apr 03:50
638e616

Major changes

  • Enabled speculative decoding and updated the README (a rejection-sampling sketch follows below)
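
For context on the rejection sampler used by speculative decoding (see also #112 above and the epsilon fix #202 in v0.1.1), the sketch below shows the standard accept/resample rule. It is purely illustrative, written against PyTorch's public API; it is not ScaleLLM's CUDA kernel.

```python
import torch

def rejection_sample(draft_probs, target_probs, draft_token, eps=1e-10):
    """Accept or resample one drafted token (illustrative, not ScaleLLM's kernel).

    draft_probs / target_probs are 1-D probability tensors over the vocabulary
    for the same position; draft_token is the id proposed by the draft model.
    """
    # Accept with probability min(1, p_target / p_draft); eps guards against
    # division by zero, mirroring the fix referenced in v0.1.1 (#202).
    p = target_probs[draft_token]
    q = draft_probs[draft_token].clamp_min(eps)
    if torch.rand(()) < torch.clamp(p / q, max=1.0):
        return int(draft_token)
    # On rejection, resample from the renormalized residual max(0, p - q),
    # which keeps the overall samples distributed as the target model.
    residual = (target_probs - draft_probs).clamp_min(0.0)
    total = residual.sum()
    if total <= 0:  # distributions identical; keep the drafted token
        return int(draft_token)
    return int(torch.multinomial(residual / total, 1))
```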

What's Changed

  • [refactor] add implicit conversion between slice and vector by @guocuimi in #134
  • [refactor] change tokenizer special tokens from token to token + id. by @guocuimi in #135
  • [feat] support tensor parallelism for MQA/GQA models when num_kv_heads < world_size by @guocuimi in #137
  • [refactor] refactoring for sequence by @guocuimi in #140
  • [unittest] added more unittests for speculative decoding by @guocuimi in #141
  • [unittest] added more unittests for pos_embedding, sampler and rejection_sampler. by @guocuimi in #142
  • [feat] added support for kv_cache with different strides. by @guocuimi in #143
  • [feat] enable speculative decoding and update readme by @guocuimi in #145

Full Changelog: v0.0.8...v0.0.9

v0.0.8

19 Apr 05:37

Major changes

  • Added Meta Llama 3 and Google Gemma support
  • Added CUDA graph support for decoding (see the sketch below)
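
CUDA graph support captures the fixed-shape decode step once and replays it each step to cut kernel-launch overhead. The sketch below shows the general pattern using PyTorch's public CUDA graph API; ScaleLLM's implementation lives in C++, so the model and shapes here are stand-ins.

```python
import torch

# Illustrative use of PyTorch's CUDA graph API for a fixed-shape decode step.
# Requires a CUDA-capable GPU; the linear layer is a stand-in for a decode pass.
assert torch.cuda.is_available()

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.zeros(1, 4096, device="cuda")  # fixed batch/shape buffer

# Warm up on a side stream before capture (required by CUDA graphs).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    with torch.no_grad():
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one decode step into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    with torch.no_grad():
        static_output = model(static_input)

# Each decode step: copy new activations into the static buffer and replay.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()
print(static_output.shape)
```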

What's Changed

  • [model] added support for google Gemma-2b model by @936187425 in #103
  • [feat] added rms norm residual kernel by @guocuimi in #125
  • [fix] fix data accuracy issue for gemma by @guocuimi in #126
  • [refactor] added options for LLMEngine, SpeculativeEngine and Scheduler. by @guocuimi in #127
  • [feat] enable cuda graph for decoding by @guocuimi in #129
  • [bugfix] fix cuda graph capture issue for tensor parallelism by @guocuimi in #130
  • [feat] optimize batch size for cuda graph by @guocuimi in #132

Full Changelog: v0.0.7...v0.0.8

v0.0.7

06 Apr 04:10
e550882

Major changes

  • Dynamic prefix cache (see the toy sketch below)
  • Dynamic split-fuse scheduler
  • Speculative decoding
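
To make the prefix-cache idea concrete, the toy class below maps a prompt's token prefix to kv-cache block ids with LRU eviction, roughly the behavior described in #86 and #89 below. It is an illustration of the concept, not ScaleLLM's block manager.

```python
from collections import OrderedDict

class ToyPrefixCache:
    """Toy LRU prefix cache: maps a tuple of prompt tokens to kv-cache block ids
    so sequences with a shared prefix can reuse blocks. Illustrative only."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()  # token prefix (tuple) -> block ids

    def match(self, tokens: list) -> list:
        """Return block ids for the longest cached prefix of `tokens`."""
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self._entries:
                self._entries.move_to_end(key)  # mark as recently used
                return self._entries[key]
        return []

    def insert(self, tokens: list, block_ids: list) -> None:
        self._entries[tuple(tokens)] = block_ids
        self._entries.move_to_end(tuple(tokens))
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

cache = ToyPrefixCache(capacity=2)
cache.insert([1, 2, 3, 4], block_ids=[0])
print(cache.match([1, 2, 3, 4, 5, 6]))  # -> [0]: the shared prefix hits
```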

What's Changed

  • [feat] add support for cudagraph and its unit test. by @liutongxuan in #79
  • [feat] add block id lifecycle management for block sharing scenarios. by @guocuimi in #85
  • [feat] added prefix cache to share kv cache across sequences. by @guocuimi in #86
  • [feat] enable prefix cache in block manager by @guocuimi in #87
  • [feat] added LRU policy into prefix cache. by @guocuimi in #89
  • [refactor] move batch related logic into a class by @guocuimi in #90
  • [fix] replace submodules git path with https path to avoid permission issue. by @guocuimi in #92
  • [feat] add max tokens to process to support dynamic split-fuse by @guocuimi in #93
  • [feat] return prompt string directly in echo mode to avoid decode cost and avoid showing appended prefix tokens. by @guocuimi in #94
  • [fix] added small page size support for flash attention. by @guocuimi in #95
  • [fix] adjust kv_cache_pos to give at least one token to generate logits by @guocuimi in #96
  • added layernorm benchmark by @dongxianzhe in #97
  • [feat] added dynamic split-fuse support in continuous scheduler by @guocuimi in #98
  • [refactor] move model output process logic into batch by @guocuimi in #99
  • [feat] added engine type to allow LLM and SSM share sequence. by @guocuimi in #100
  • [feat] added speculative engine class without implementation. by @guocuimi in #101
  • [refactor] moved top_k and top_p from sampler to logits process. by @guocuimi in #102
  • [workflow] added clang-format workflow by @guocuimi in #105
  • [fix] only run git-clang-format against c/c++ files by @guocuimi in #106
  • [feat] added prompt blocks sharing across n sequences by @guocuimi in #107
  • [feat] Added selected tokens to return logits from model execution. by @guocuimi in #109
  • [feat] added rejection sampler for speculative decoding. by @guocuimi in #112
  • [feat] enable speculative decoding for simple server by @guocuimi in #113
  • [feat] mask out rejected tokens with -1 in Rejection Sampler by @guocuimi in #114
  • [feat] added sampling support for multiple query decoding by @guocuimi in #115
  • [feat] added stream support for n > 1 scenarios by @guocuimi in #116
  • [feat] enable speculative decoding for scalellm. by @guocuimi in #117
  • [feat] cancel request if rpc is not ok by @guocuimi in #118
  • [fix] put finish reason into a separate response by @guocuimi in #119
  • [feat] added skip_special_tokens support for tokenizers by @guocuimi in #120

Full Changelog: v0.0.6...v0.0.7

v0.0.6

13 Mar 18:16

Major changes

  • Introduced new kernels to improve efficiency.
  • Implemented an initial Python wrapper, simplifying integration and extending accessibility.
  • Incorporated new models such as Baichuan2 and ChatGLM.
  • Added support for Jinja chat templates (see the example below).
  • Added usage statistics to responses for compatibility with the OpenAI API.
  • Enabled ccache to speed up builds.
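
The Jinja chat-template support renders a list of role/content messages into a single prompt string. The example below uses the jinja2 package with a generic template in the common Hugging Face style; the template text itself is an illustration, not one shipped with ScaleLLM.

```python
from jinja2 import Template

# A generic chat template in the Hugging Face style; illustrative only.
chat_template = Template(
    "{% for m in messages %}"
    "<|{{ m.role }}|>\n{{ m.content }}\n"
    "{% endfor %}"
    "<|assistant|>\n"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(chat_template.render(messages=messages))
```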

What's Changed

  • add timestamp into ccache cache key by @guocuimi in #42
  • use ${GITHUB_SHA} in cache key by @guocuimi in #43
  • replace GITHUB_SHA with ${{ github.sha }} by @guocuimi in #44
  • encapsulate class of time for performance tracking. by @liutongxuan in #46
  • upgrade paged_atten kernel to v0.2.7 by @guocuimi in #47
  • [feat] add speculative decoding. by @liutongxuan in #50
  • added a new attention kernel for speculative decoding by @guocuimi in #52
  • added support for small page size. by @guocuimi in #53
  • enable flash decoding for both prefill and decode phase. by @guocuimi in #54
  • enable split-k for flash decoding and fix bugs. by @guocuimi in #59
  • [ut] add unit tests for speculative scheduler. by @liutongxuan in #57
  • added a custom command to generate instantiation for flashinfer by @guocuimi in #61
  • add custom command to generate instantiation for flash-attn by @guocuimi in #62
  • added gpu memory profiling to decide kv cache size precisely. by @guocuimi in #63
  • moved attention related files into attention subfolder by @guocuimi in #65
  • add pybind11 to support python user interface. by @liutongxuan in #64
  • added support to build python wrapper with installed pytorch (pre-cxx11 ABI) by @guocuimi in #66
  • merge huggingface tokenizers and safetensors rust projects into one. by @guocuimi in #67
  • more changes to support python wrapper by @guocuimi in #68
  • [feat] added attention handler for different implementations by @guocuimi in #71
  • [perf] enabled speedup for GQA and MQA decoding. by @guocuimi in #72
  • [perf] use a separate cuda stream for kv cache by @guocuimi in #73
  • [models] added baichuan/baichuan2 model support. by @liutongxuan in #70
  • [minor] cleanup redundant code for models. by @liutongxuan in #74
  • [feat] moved rope logic into attention handler to support applying positional embedding on the fly by @guocuimi in #76
  • [refactor] replace dtype and device with options since they are used together usually by @guocuimi in #77
  • [refactor] move cutlass and flashinfer into third_party folder by @guocuimi in #78
  • [refactor] split model forward function into two: 1> get hidden states 2> get logits from hidden states by @guocuimi in #80
  • [models] support both baichuan and baichuan2 by @guocuimi in #81
  • [models] fix chatglm model issue. by @guocuimi in #82

Full Changelog: v0.0.5...v0.0.6

v0.0.5

03 Jan 12:21

Major changes

  • Added Qwen, ChatGLM and Phi2 support.
  • Added tiktoken tokenizer support.
  • Enabled more custom kernels for sampling.

Full Changelog: v0.0.4...v0.0.5

v0.0.4

03 Dec 00:25

Major changes

  • Added Docker image build for CUDA 11.8.
  • Added exception-handling logic to the HTTP server.

Full Changelog: v0.0.3-fix...v0.0.4

v0.0.3

23 Nov 11:51

Major changes

  • Added support for the Yi chat model.
  • Added args overrider support.
  • Replaced libevhtp with Boost.Asio for the HTTP server to fix the "epoll_wait not implemented" error on old Linux kernels.

Full Changelog: v0.0.2...v0.0.3-fix

v0.0.2

15 Nov 07:38

Major changes

  • Added support for the Yi series of models.
  • Upgraded the paged attention kernel to v2.
  • Added chat templates for Mistral, Aquila, and InternLM.

Full Changelog: v0.0.1...v0.0.2