Releases: vectorch-ai/ScaleLLM

v0.1.1

23 May 22:06
57a5730

What's Changed

  • [feat] added cuda 11.8 devel image to build cpp release image by @guocuimi in #194
  • [fix] fix workflow format by @guocuimi in #195
  • [CI] fix docker run options by @guocuimi in #196
  • fix: make build pass with gcc-9 by @guocuimi in #197
  • ci: bump version and build with new manylinux image (gcc-9) by @guocuimi in #198
  • [python] added more examples and fix requirements version by @guocuimi in #199
  • feat: moved scheduler wait logic from python into scheduler run_until_complete function by @guocuimi in #200
  • feat: added multiple threads support for LLMHandler by @guocuimi in #201
  • fix: use a proper epsilon to avoid division by zero error for rejection sampler by @guocuimi in #202
  • feat: added batch support for llm handler by @guocuimi in #204
  • ci: publish wheels to whl index repo by @guocuimi in #205

Full Changelog: v0.1.0...v0.1.1

v0.1.0

17 May 16:26

Major changes

  • Added a Python wrapper and published the scalellm package to PyPI.
  • Added an OpenAI-compatible REST API server: 'python3 -m scalellm.serve.api_server'
  • Install scalellm with pip: 'pip install scalellm'
  • Added examples for offline inference and async streaming (see the sketch below).
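
The offline-inference examples mentioned above follow the pattern sketched below. This is a minimal sketch only: the LLM and SamplingParams names, their constructor arguments, and the placeholder model id are assumptions about the scalellm Python package rather than a verbatim copy of the shipped examples.

```python
# Minimal offline-inference sketch; class and argument names are assumptions
# about the scalellm Python API, and the model id is only a placeholder.
from scalellm import LLM, SamplingParams  # assumed package exports

llm = LLM(model="gpt2")  # hypothetical model id
params = SamplingParams(temperature=0.7, max_tokens=64)  # assumed parameter names

prompts = ["Explain speculative decoding in one sentence."]
outputs = llm.generate(prompts, params)
for output in outputs:
    print(output)
```

For the server path, install with 'pip install scalellm' and start the OpenAI-compatible endpoint with 'python3 -m scalellm.serve.api_server' as listed above.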

What's Changed

  • [fix] use the pybind11 from libtorch and fix model download issue. by @guocuimi in #167
  • [misc] upgrade torch to 2.3 and use gcc-12 by @guocuimi in #168
  • [feat] added python rest api server skeleton by @guocuimi in #169
  • [refactor] combine sequence and request outputs by @guocuimi in #170
  • [feat] added python LLMEngine skeleton by @guocuimi in #171
  • [refactor] move proto definitions into proto namespace by @guocuimi in #173
  • [feat] implement async llm engine for python wrapper by @guocuimi in #172
  • [refactor] consolidate handlers to share llm_handler between python rest api server and grpc server by @guocuimi in #174
  • [python] move request handling logic into separate file from api server by @guocuimi in #175
  • [python] added model check for rest api by @guocuimi in #176
  • [feat] added status handling for grpc server by @guocuimi in #177
  • [misc] some changes to cmake file by @guocuimi in #180
  • [kernel] change head_dim list to reduce binary size by @guocuimi in #181
  • [CI] added base docker image for python wheel build by @guocuimi in #182
  • [ci] build python wheels by @guocuimi in #183
  • [CI] fix docker image issues and build wheel for different python, pytorch versions by @guocuimi in #184
  • [fix] added manylinux support by @guocuimi in #185
  • [fix] added cuda 11.8 support for manylinux by @guocuimi in #186
  • [feat] added version suffix to include cuda and torch version by @guocuimi in #187
  • [CI] Upload wheels to release as assets by @guocuimi in #188
  • [fix] fix extension typo for wheel publish workflow by @guocuimi in #189
  • [python] added LLM for offline inference and stream examples for chat and complete by @guocuimi in #190
  • [python] added requirements into package by @guocuimi in #191
  • [Release] prepare 0.1.0 release by @guocuimi in #192
  • [Release] added workflow to publish wheels to PyPI by @guocuimi in #193

Full Changelog: v0.0.9...v0.1.0

v0.0.9

28 Apr 03:50
638e616

Major changes

  • Enabled speculative decoding and updated the README (a rejection-sampling sketch follows below)
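
For context on the rejection sampler used by speculative decoding (see also #112 above and the epsilon fix #202 in v0.1.1), the sketch below shows the standard accept/resample rule. It is purely illustrative, written against PyTorch's public API; it is not ScaleLLM's CUDA kernel.

```python
import torch

def rejection_sample(draft_probs, target_probs, draft_token, eps=1e-10):
    """Accept or resample one drafted token (illustrative, not ScaleLLM's kernel).

    draft_probs / target_probs are 1-D probability tensors over the vocabulary
    for the same position; draft_token is the id proposed by the draft model.
    """
    # Accept with probability min(1, p_target / p_draft); eps guards against
    # division by zero, mirroring the fix referenced in v0.1.1 (#202).
    p = target_probs[draft_token]
    q = draft_probs[draft_token].clamp_min(eps)
    if torch.rand(()) < torch.clamp(p / q, max=1.0):
        return int(draft_token)
    # On rejection, resample from the renormalized residual max(0, p - q),
    # which keeps the overall samples distributed as the target model.
    residual = (target_probs - draft_probs).clamp_min(0.0)
    total = residual.sum()
    if total <= 0:  # distributions identical; keep the drafted token
        return int(draft_token)
    return int(torch.multinomial(residual / total, 1))
```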

What's Changed

  • [refactor] add implicit conversion between slice and vector by @guocuimi in #134
  • [refactor] change tokenizer special tokens from token to token + id. by @guocuimi in #135
  • [feat] support tensor parallelism for MQA/GQA models when num_kv_heads < world_size by @guocuimi in #137
  • [refactor] refactoring for sequence by @guocuimi in #140
  • [unittest] added more unittests for speculative decoding by @guocuimi in #141
  • [unittest] added more unittests for pos_embedding, sampler and rejection_sampler. by @guocuimi in #142
  • [feat] added support for kv_cache with different strides. by @guocuimi in #143
  • [feat] enable speculative decoding and update readme by @guocuimi in #145

Full Changelog: v0.0.8...v0.0.9

v0.0.8

19 Apr 05:37

Major changes

  • Added Meta Llama 3 and Google Gemma support
  • Added CUDA graph support for decoding (see the sketch below)
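
CUDA graph support captures the fixed-shape decode step once and replays it each step to cut kernel-launch overhead. The sketch below shows the general pattern using PyTorch's public CUDA graph API; ScaleLLM's implementation lives in C++, so the model and shapes here are stand-ins.

```python
import torch

# Illustrative use of PyTorch's CUDA graph API for a fixed-shape decode step.
# Requires a CUDA-capable GPU; the linear layer is a stand-in for a decode pass.
assert torch.cuda.is_available()

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.zeros(1, 4096, device="cuda")  # fixed batch/shape buffer

# Warm up on a side stream before capture (required by CUDA graphs).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    with torch.no_grad():
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one decode step into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    with torch.no_grad():
        static_output = model(static_input)

# Each decode step: copy new activations into the static buffer and replay.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()
print(static_output.shape)
```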

What's Changed

  • [model] added support for google Gemma-2b model by @936187425 in #103
  • [feat] added rms norm residual kernel by @guocuimi in #125
  • [fix] fix data accuracy issue for gemma by @guocuimi in #126
  • [refactor] added options for LLMEngine, SpeculativeEngine and Scheduler. by @guocuimi in #127
  • [feat] enable cuda graph for decoding by @guocuimi in #129
  • [bugfix] fix cuda graph capture issue for tensor parallelism by @guocuimi in #130
  • [feat] optimize batch size for cuda graph by @guocuimi in #132

Full Changelog: v0.0.7...v0.0.8

v0.0.7

06 Apr 04:10
e550882

Major changes

  • Dynamic prefix cache (see the toy sketch below)
  • Dynamic split-fuse scheduler
  • Speculative decoding
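
To make the prefix-cache idea concrete, the toy class below maps a prompt's token prefix to kv-cache block ids with LRU eviction, roughly the behavior described in #86 and #89 below. It is an illustration of the concept, not ScaleLLM's block manager.

```python
from collections import OrderedDict

class ToyPrefixCache:
    """Toy LRU prefix cache: maps a tuple of prompt tokens to kv-cache block ids
    so sequences with a shared prefix can reuse blocks. Illustrative only."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()  # token prefix (tuple) -> block ids

    def match(self, tokens: list) -> list:
        """Return block ids for the longest cached prefix of `tokens`."""
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self._entries:
                self._entries.move_to_end(key)  # mark as recently used
                return self._entries[key]
        return []

    def insert(self, tokens: list, block_ids: list) -> None:
        self._entries[tuple(tokens)] = block_ids
        self._entries.move_to_end(tuple(tokens))
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

cache = ToyPrefixCache(capacity=2)
cache.insert([1, 2, 3, 4], block_ids=[0])
print(cache.match([1, 2, 3, 4, 5, 6]))  # -> [0]: the shared prefix hits
```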

What's Changed

  • [feat] add support for cudagraph and its unit test. by @liutongxuan in #79
  • [feat] add block id lifecycle management for block sharing scenarios. by @guocuimi in #85
  • [feat] added prefix cache to share kv cache across sequences. by @guocuimi in #86
  • [feat] enable prefix cache in block manager by @guocuimi in #87
  • [feat] added LRU policy into prefix cache. by @guocuimi in #89
  • [refactor] move batch related logic into a class by @guocuimi in #90
  • [fix] replace submodules git path with https path to avoid permission issue. by @guocuimi in #92
  • [feat] add max tokens to process to support dynamic split-fuse by @guocuimi in #93
  • [feat] return prompt string directly in echo mode to avoid decode cost and avoid showing appended prefix tokens. by @guocuimi in #94
  • [fix] added small page size support for flash attention. by @guocuimi in #95
  • [fix] adjust kv_cache_pos to give at least one token to generate logits by @guocuimi in #96
  • added layernorm benchmark by @dongxianzhe in #97
  • [feat] added dynamic split-fuse support in continuous scheduler by @guocuimi in #98
  • [refactor] move model output process logic into batch by @guocuimi in #99
  • [feat] added engine type to allow LLM and SSM share sequence. by @guocuimi in #100
  • [feat] added speculative engine class without implementation. by @guocuimi in #101
  • [refactor] moved top_k and top_p from sampler to logits process. by @guocuimi in #102
  • [workflow] added clang-format workflow by @guocuimi in #105
  • [fix] only run git-clang-format against c/c++ files by @guocuimi in #106
  • [feat] added prompt blocks sharing across n sequences by @guocuimi in #107
  • [feat] Added selected tokens to return logits from model execution. by @guocuimi in #109
  • [feat] added rejection sampler for speculative decoding. by @guocuimi in #112
  • [feat] enable speculative decoding for simple server by @guocuimi in #113
  • [feat] mask out rejected tokens with -1 in Rejection Sampler by @guocuimi in #114
  • [feat] added sampling support for multiple query decoding by @guocuimi in #115
  • [feat] added stream support for n > 1 scenarios by @guocuimi in #116
  • [feat] enable speculative decoding for scalellm. by @guocuimi in #117
  • [feat] cancel request if rpc is not ok by @guocuimi in #118
  • [fix] put finish reason into a separate response by @guocuimi in #119
  • [feat] added skip_special_tokens support for tokenizers by @guocuimi in #120

Full Changelog: v0.0.6...v0.0.7

v0.0.6

13 Mar 18:16

Major changes

  • Introduced new kernels to improve efficiency.
  • Implemented an initial Python wrapper, simplifying integration and extending accessibility.
  • Incorporated new models such as Baichuan2 and ChatGLM.
  • Added support for Jinja chat templates (see the example below).
  • Added usage statistics to responses for compatibility with the OpenAI API.
  • Enabled ccache to speed up builds.
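
The Jinja chat-template support renders a list of role/content messages into a single prompt string. The example below uses the jinja2 package with a generic template in the common Hugging Face style; the template text itself is an illustration, not one shipped with ScaleLLM.

```python
from jinja2 import Template

# A generic chat template in the Hugging Face style; illustrative only.
chat_template = Template(
    "{% for m in messages %}"
    "<|{{ m.role }}|>\n{{ m.content }}\n"
    "{% endfor %}"
    "<|assistant|>\n"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(chat_template.render(messages=messages))
```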

What's Changed

  • add timestamp into ccache cache key by @guocuimi in #42
  • use ${GITHUB_SHA} in cache key by @guocuimi in #43
  • replace GITHUB_SHA with ${{ github.sha }} by @guocuimi in #44
  • encapsulate class of time for performance tracking. by @liutongxuan in #46
  • upgrade paged_atten kernel to v0.2.7 by @guocuimi in #47
  • [feat] add speculative decoding. by @liutongxuan in #50
  • added a new attention kernel for speculative decoding by @guocuimi in #52
  • added support for small page size. by @guocuimi in #53
  • enable flash decoding for both prefill and decode phase. by @guocuimi in #54
  • enable split-k for flash decoding and fix bugs. by @guocuimi in #59
  • [ut] add unit tests for speculative scheduler. by @liutongxuan in #57
  • added a custom command to generate instantiation for flashinfer by @guocuimi in #61
  • add custom command to generate instantiation for flash-attn by @guocuimi in #62
  • added gpu memory profiling to decide kv cache size precisely. by @guocuimi in #63
  • moved attention related files into attention subfolder by @guocuimi in #65
  • add pybind11 to support python user interface. by @liutongxuan in #64
  • added support to build python wrapper with installed pytorch (pre-cxx11 ABI) by @guocuimi in #66
  • merge huggingface tokenizers and safetensors rust projects into one. by @guocuimi in #67
  • more changes to support python wrapper by @guocuimi in #68
  • [feat] added attention handler for different implementations by @guocuimi in #71
  • [perf] enabled speedup for GQA and MQA decoding. by @guocuimi in #72
  • [perf] use a separate cuda stream for kv cache by @guocuimi in #73
  • [models] added baichuan/baichuan2 model support. by @liutongxuan in #70
  • [minor] cleanup redundant code for models. by @liutongxuan in #74
  • [feat] moved rope logic into attention handler to support applying positional embedding on the fly by @guocuimi in #76
  • [refactor] replace dtype and device with options since they are used together usually by @guocuimi in #77
  • [refactor] move cutlass and flashinfer into third_party folder by @guocuimi in #78
  • [refactor] split model forward function into two: 1> get hidden states 2> get logits from hidden states by @guocuimi in #80
  • [models] support both baichuan and baichuan2 by @guocuimi in #81
  • [models] fix chatglm model issue. by @guocuimi in #82

Full Changelog: v0.0.5...v0.0.6

v0.0.5

03 Jan 12:21

Major changes

  • Added Qwen, ChatGLM and Phi2 support.
  • Added tiktoken tokenizer support.
  • Enabled more custom kernels for sampling.

Full Changelog: v0.0.4...v0.0.5

v0.0.4

03 Dec 00:25

Major changes

  • Added Docker image build for CUDA 11.8.
  • Added exception-handling logic to the HTTP server.

Full Changelog: v0.0.3-fix...v0.0.4

v0.0.3

23 Nov 11:51

Major changes

  • Added support for the Yi chat model.
  • Added args overrider support.
  • Replaced libevhtp with Boost.Asio for the HTTP server to fix the "epoll_wait not implemented" error on old Linux kernels.

Full Changelog: v0.0.2...v0.0.3-fix

v0.0.2

15 Nov 07:38

Major changes

  • Added support for the Yi series of models.
  • Upgraded the paged attention kernel to v2.
  • Added chat templates for Mistral, Aquila, and InternLM.

Full Changelog: v0.0.1...v0.0.2