
fix: restore backwards compatibility with sm_60 (P100 and GP100) #444

Merged
2 commits merged into dev from fix/sm_60 on Apr 30, 2024

Conversation

AlpinDale (Member) commented Apr 29, 2024

Looks like sm_60 can't do a 4-element dot product. There's a better solution, but this will work for now; I may implement an sm_60-only dot product function here.

resolves #413
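
For context, the 4-element dot product that sm_60 lacks is presumably the `__dp4a` int8 intrinsic, which only exists from compute capability 6.1 onward. Below is a minimal sketch of what an sm_60-only fallback could look like, using a hypothetical `dp4a_compat` helper; this is not the code in the merged patch.

```cuda
#include <cstdint>

// 4-way int8 dot product with accumulate: uses the __dp4a instruction where
// available (sm_61 and newer) and a scalar unpack-and-accumulate loop on sm_60.
__device__ __forceinline__ int dp4a_compat(int a, int b, int c) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 610
  return __dp4a(a, b, c);
#else
  int result = c;
#pragma unroll
  for (int i = 0; i < 4; ++i) {
    // Extract the i-th signed byte of each packed operand and multiply-accumulate.
    int8_t va = static_cast<int8_t>((a >> (8 * i)) & 0xFF);
    int8_t vb = static_cast<int8_t>((b >> (8 * i)) & 0xFF);
    result += static_cast<int>(va) * static_cast<int>(vb);
  }
  return result;
#endif
}
```

Per the comment above, the merged fix takes the simpler workaround for now; something like the fallback branch here is what a dedicated sm_60 dot product could later provide.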

dirkson commented Apr 30, 2024

I was able to build and run this on 1x and 4x P100s more or less flawlessly with the provided runtime. I tested turboderp's Llama 3 8B EXL2 quants and MaziyarPanahi's 8B and 70B GPTQ quants.

There was an issue with --context-shift, which is apparently due to Triton currently being limited to CUDA compute capability 7.0 (see the sketch after this comment). The GPTQ models generated endlessly, but that's a known issue with the models tested, so I suspect it was unrelated.

Otherwise I really wasn't able to find any issues. Everything I tried either just worked in my setup or gave reasonable error messages about requiring a higher CUDA compute capability.

Performance, once I adjusted to the way Aphrodite reports it, was pretty solid: the 70B model with -tp 4 ran at 12 t/s, up from 3.5 t/s under tabbyapi. The 8B model ran at 50 t/s on a single P100, or 25 t/s across multiple; it ran at around 40 t/s under tabbyapi.
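
On the --context-shift note above: Triton currently requires compute capability 7.0 (Volta) or newer, so Triton-backed features are expected to be unavailable on the P100's sm_60. A hypothetical host-side guard for that kind of feature gating is sketched below; the function name is made up for illustration and is not part of this PR.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Returns true if the given device meets a minimum compute capability,
// e.g. (7, 0) for Triton-backed features such as context shifting.
static bool meets_min_compute_capability(int device, int major, int minor) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
  return prop.major > major || (prop.major == major && prop.minor >= minor);
}

int main() {
  bool ok = meets_min_compute_capability(/*device=*/0, /*major=*/7, /*minor=*/0);
  std::printf("Triton-backed features: %s on this GPU\n",
              ok ? "supported" : "unsupported");
  return 0;
}
```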

AlpinDale merged commit 50a9176 into dev on Apr 30, 2024
AlpinDale deleted the fix/sm_60 branch on April 30, 2024 at 11:09
AlpinDale added a commit that referenced this pull request on May 11, 2024:
* add new logits processor

* clean up sampler

* stop workflows on dev

* pipe in logitproc in lora

* compute logits in model_runner

* simplify sampler in llama

* support command-r+ model

Co-authored-by: o_0 <pkircher82@gmail.com>

* logitproc for cohere

* add some imports

* add logit scale for command-r

* fix gptq for cohere

* conflicts with _is_neuron()

* fix query shape in logits processor

* fix pydantic serializer warning

* add warning for mismatch in vocab size

* fix quants for gemma

* do not remove duplicate params for qwen2

* refactor: neuron support

* fix and re-enable custom all-reduce

* add scheduler delay factor

* add reorder scheduler policy

* fix logprobs serializer warnings

* improve detokenization performance; improve logprobs

* LockFile -> SoftLockFile

* fix tied embeddings in falcon

* fix query shape in moe models

* rope: get_device() -> device

* fix formatting

* feat: attention refactor part 2

* min_tokens

* add logprob ranks

* don't output two stop strings in api

* vision model support

* yapf

* optimize logprob ranks

* add stop_reason

* ipv6 fix

* KVCache type in llava

* add python nccl wrapper, remove cupy

* ruff

* support gemma 1.1 models with approximate gelu

* fix nccl path

* add dbrx support

* formatting

* add v2 block manager

* ruff

* add qwen2moe support (needs transformers git)

* Chunked Prefill Part 2: data update

* optional vision language config for neuron

* enable multi-node inference

* add exllamav2 tensor parallel, fused MoE for GPTQ/AWQ

* formatting and typing

* this is kinda dumb if you ask me

* logprobs fixes

* do not use codespell on kernels

* fix codespell path

* rccl path for ROCm

* fix case when API request to top_k is 0

* cache tokenizer len

* small fixes (#393)

* Pin torch to 2.2.0

* Improve nccl errors.

* directly use  in forward pass

* CMake build system (#395)

* add cmake

* update setup.py

* clean up + pin versions

* add hadamard kernels for compilation

* formatting

* fix requirement

* restore nvcc

* dev libraries for cuda

* Fix cohere for command-r+ (#394)

* Fix cohere for command-r+

* fix smoothquant+

---------

Co-authored-by: AlpinDale <alpindale@gmail.com>

* proper typing

* fix build for 7.5

* feat: add optimized layernorm kernels (#398)

* simplify tokenizer.py

* Speculative Decoding Part 4: Lookahead scheduling (#402)

Co-authored-by: Cade Daniel <edacih@gmail.com>

* feat: Intel CPU support (#403)

* add CPU types

Co-authored-by: bigPYJ1151 <jiang1.li@intel.com>

* port all relevant kernels to CPU

Co-authored-by: bigPYJ1151 <jiang1.li@intel.com>

* make cpu build work

* add SDPA backend

* working CPU backend

* make this work on the API

* remove unnecessary file

---------

Co-authored-by: bigPYJ1151 <jiang1.li@intel.com>

* fix minor cuda version mismatch with runtime

* fix spec_decode and block imports

* add speculative config and arg for later

* better recognize cpu build

* add dict merging util

* refactor scheduler for chunked prefill, remove reorder policy for now

* feat: FP8 E4M3 KV Cache (#405)

* Add FP8 E4M3 kernels for AMD GPUs

Courtesy of the AMD team.
Co-authored-by: Adrian Abeyta <adabeyta@amd.com>

* refactor kv cache support in attention kernel

* ops and pybind

* fix compilation errors

* update cmake

* support fp8 e4m3 in the engine

* fix AMD build

* do not compile exl2 for amd (for now)

* amd compatibility fixes for align block size

* maybe fix layernorm compiles

* fix moe softmax kernel

* fix multi-gpu ray tokenizer for trust_remote_code

* enable hf_transfer if installed

* fix CPU build

* make detokenization optional

* update torch to 2.2.1

* make nccl wrapper more robust

* split requirements

* add sampling param for left-truncating prompt tokens

* fix: Improve cohere model. (#404)

* feat: add chunked prefill scheduler (#406)

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* allow out-of-tree model registry

* use the get_len() method instead of manual len calculation

* fix TP for llava

* enable attention bias support in llama

* disable new layernorm kernels for CUDA < 12.0

* separate init_distributed_environment from worker

* feat: Triton flash attention backend for ROCm (#407)

* refactor executor classes and workers

* KeyError for GPT-NeoX

* add triton flash-attention kernel for ROCm

* fix types in merge_dict

* skip rows in logits added for the prompt tokens

* head_size 256 for gemma in triton FA

* no key sorting for outlines

* fix case where hf_config==None

* move megatron to a top-level directory

* fix micromamba url

* fix outlines requirements

* why was that not committed?

* fully working chunked prefill

* fix docstrings

* roll back chunked prefill changes to SDPA, isolate cpu worker

* make init_distributed_environment compatible with init_process_group

* fix echo

* fix formatting for previous commit

* fix stop strings not being excluded from outputs

* move merge_async_iterators to common utils

* fix type hint

* enable custom_all_reduce by default in llm.py

* triton compile error for flash_attn

* debug logging for distributed_init_method

* incorrect use of monotonic time in metrics logger

* fix neuron

* cache the p2p access check for memory saving

* feat: EETQ quantization (#408)

* Support arbitrary model in GGUF. (#381)

* Support arbitrary model in GGUF.

* Update gguf to support cohere and dbrx.

* yapf

* formatting

* fix: Allow setting config-path when converting ggufs. (#410)

* incorrect comparison for hadamard and punica checks

* fix: max_num_batched_tokens for chunked_prefill (#412)

* Fix max_num_batched_tokens for chunked_prefill.

* Remove accelerate as dependency.

* add args to benchmark script

* change chunk size to 768 default

---------

Co-authored-by: AlpinDale <alpindale@gmail.com>

* feat: support twe lm_head for quantized weights (#409)

* Support twe lm_head for quantized weights.

* Remove unused class.

* separate api server args into another file

* add CLI app

* fix: split the exl2 weight loading and SQ+ init (#423)

* Split the exl2 weight ASAP.

* Shard exl2 weights on cpu. reducing vram bubbles.

* feat: support sharded ggufs (#420)

* Support sharded ggufs.

* Fix .gguf ext detection.

* log tokenizer conversion message once

---------

Co-authored-by: AlpinDale <alpindale@gmail.com>

* chore: port sampler+metadata changes from main to dev (#427)

* Port sampler/metadata changes over from main

* Update old logitprocs to support serial processing

* fix fine-grained seeding. Again.

* formatting

---------

Co-authored-by: AlpinDale <alpindale@gmail.com>

* suppress import error for eetq

* fix CPU blocks logger for CPU backend

* simplify model_executor logic

* fix the nsight profiling with ray

* fix engine_use_ray=True

* feat: LM Format Enforcer support (#428)

* feat: add lm-format-enforcer support for guided decoding

Co-authored-by: Noam Gat <noamgat@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>

* clean up

---------

Co-authored-by: Noam Gat <noamgat@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>

* fix: abort requests when the connection to /v1/completions is interrupted (#431)

* fix: linear bias of qkv layers in models (#430)

* feat: Speculative Decoding using a draft model (#432)

* feat: support speculative decoding using draft model

* benchmark script update; add attribution

Co-authored-by: Cade Daniel <edacih@gmail.com>

---------

Co-authored-by: Cade Daniel <edacih@gmail.com>

* feat: add ngram prompt lookup decoding for speculative decoding (#438)

Co-authored-by: Lei Wen <wenlei03@qiyi.com>

* fix: rope scaling for cohere and qwen (#436)

* Fix incorrect auto RoPE of cohere.

* Remove vocab padding from cohere and qwen.

---------

Co-authored-by: AlpinDale <alpindale@gmail.com>

* fix: shard exl2 weights more evenly between ranks (#437)

Co-authored-by: AlpinDale <alpindale@gmail.com>

* chore: update Kobold Lite Embed (#433)

* Add files via upload

* Update api_server.py

Update Kobold Version

* fix: logging in the API server

* fix: options requests in the api (#439)

* fix: cpu executor

* fix serial impl of bias logproc

* tiny fixes (#441)

* Temporary fix of chat api.

* Fix indentation of bnb to be within _set_default_torch_dtype.

* Revert the is_neox_style change.

* Better handle tokenizer vocab size.

* vllm #4270

* vllm #4280

* fix: use BiasLogitsProcessor for OpenAI endpoint (#443)

Sterilize bias dict token range according to active model.

* fix: llama3 generations with EOS

* fix: rope_style in quants

* fix: rope style in eetq

* fix: incorrect module in quip

* fix: restore backwards compatibility with sm_60 (P100 and GP100) (#444)

* fix: compatibility with sm_60 arch

* fix min requirement for gguf

* fix: revision arg not being correctly passed (#456)

* chore: Fix minor bugs in outlines and lmfe. (#449)

* fix: lora errors (#462)

* fix: possibly unbreak loras

* fix lora logit processor

* increase support vocab size for lora to 128k

* 15k dim

* 43k dim

* compile less dtypes for punica

* fix compile error

* fix lm_head in lora

* Refactor: Quantization (#454)

* split quant ops

* isolate the quantization modeling code

* properly set the _quant_C module

* import error fixes

* raise importerror only when the specific quant is called

* missed one

* roll back cc for hadamard

* address comments

* raise error in __init__ class

* missed it in gptq

* ruff

* Bump `torch` to 2.3.0 (#467)

* bump to torch 2.3.0

* fix triton

* fix: Navi support (#466)

Co-authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>

* Dockerfile: permission update, configurable build jobs, torch 2.3.0 (#465)

* topk as linear write

* better variable naming

* yapf considers this space to be CRITICAL

* Overhauled SamplingTensors construction.
Fix multiple bugs in sampler flags.
Restored the functioning logitproc format.
Fixed sudden NaNs in quadratic smoothing.
Rewrote mirostat to work with seeds and other samplers.
Removed branches from some samplers.

* Fix logitproc for logit_bias in OAI endpoints.

* merge main

* fix memory pinning conditional

* Missed .items() and assert

* fix: kobold api /tokencount (#424)

* /tokencount uses mismatched value types

* reduce output to only 'value'

---------

Co-authored-by: Adrian Wells <neoturi@hotmail.com>

* fixes #458

* #453, #458

* revert irrelevant changes

* properly this time

---------

Co-authored-by: 50h100a <all_awful@proton.me>
Co-authored-by: 50h100a <136940546+50h100a@users.noreply.github.com>
Co-authored-by: Krovius <neoturi@gmail.com>
Co-authored-by: Adrian Wells <neoturi@hotmail.com>
Co-authored-by: AlpinDale <alpindale@gmail.com>

* ruff

* ruff again

* codespell

* yapf

* add sm_60 to build ci

* restore punica dtypes

* compile extra dtypes for punica

* bump torch in environment.yaml

* update ROCm dockerfile

* add TODO for build script

* bump version to 0.5.3

---------

Co-authored-by: o_0 <pkircher82@gmail.com>
Co-authored-by: sgsdxzy <sgsdxzy@gmail.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: bigPYJ1151 <jiang1.li@intel.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: 50h100a <136940546+50h100a@users.noreply.github.com>
Co-authored-by: Noam Gat <noamgat@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Pyroserenus <142424797+Pyroserenus@users.noreply.github.com>
Co-authored-by: 50h100a <all_awful@proton.me>
Co-authored-by: Houman <houmie@gmail.com>
Co-authored-by: Naomiusearch <121130001+Naomiusearch@users.noreply.github.com>
Co-authored-by: The Objective Dad <63609026+theobjectivedad@users.noreply.github.com>
Co-authored-by: Krovius <neoturi@gmail.com>
Co-authored-by: Adrian Wells <neoturi@hotmail.com>
Successfully merging this pull request may close these issues.

[Feature]: Is there a reason CUDA 6.1 is the minimum? Would CUDA 6.0 on the P100 not work?