Releases: alibaba/rtp-llm

v0.1.13

30 Apr 06:38

feat

  • support gte-Qwen1.5-7B-instruct
  • support Qwen1.5-MoE

fix

  • fix V100 performance
  • fix MULTI_TASK_PROMPT and MULTI_TASK_PROMPT_STR env
  • fix starcoder-7b load failure
  • fix llava renderer sep
  • fix split_k_factor

v0.1.12

21 Apr 11:08

feature

  • support new models: llama3, code-qwen2, cohere

bug fix

  • bloom weight loading error
  • temperature not taking effect

v0.1.11

12 Apr 09:50

fix

  • int4 tp issue

v0.1.10

07 Apr 15:05

feat

  • sp support TP
  • support tie_word_embeddings option in hf config.json
  • update transformers version to 4.39.3

refactor

  • add logs for weight loading: lora apply success / missing weight

fix

  • support lora when one of the q/k/v weights is missing

docs

  • add Quantization docs

v0.1.9

01 Apr 03:42

feat

  • support awq
  • mv attention mask when using FMHA
  • support sparse & roberta embedding; support similarity calculation

refactor

  • use asyncio.Future to avoid resource exclusivity
  • mv asyncio lock to asyncmodel

fix

  • tmp fix filelock version
  • moe model size
  • add headers for image downloading
  • update whl version
  • cutlass interface

docs

  • update pipeline usage

v0.1.8

25 Mar 13:32

feat

  • support qwen2 gptq
  • update multi_task_prompt create
  • speculative support tp
  • support roberta

refactor

  • refactor multimodal model process

fix

  • fix kv cache int8 bug: add dequantization method in reuse block scenario
  • fix stream output stop words
  • fix lora

v0.1.7

19 Mar 02:53

features

  • support int4 (experimental) on Qwen GPTQ
  • support V100 fmha
  • support Bert
  • optimize ViT engine with TensorRT

refactor

  • refactor scheduling strategy; allocate kv cache when scheduling a new stream
  • refactor MOE

docs

  • update supported models

v0.1.6

09 Mar 07:06

features

  • support starcoder2
  • support gemma

fixes

  • fix lora merge
  • fix num_return_sequences 1
  • fix query cancel not releasing resources
  • fix tp block num sync
  • fix rotary embedding dim 64 for some models

v0.1.5

01 Mar 09:25

features

  • refactor a large amount of server code

fixes

  • fix inference server concurrency limit not decreasing
  • cancel requests correctly when the client disconnects
  • fix ptuning with separate path

v0.1.4

26 Feb 06:15

features

  • support qwen 2
  • support qwen 1b8 vl
  • add throughput test

fixes

  • chatglm3 not producing correct output
  • potential error when pydantic>=2.6.0
  • concurrency controller not working correctly