[Bug]: loading model with int8 kv cache chokes #346

Open
BlairSadewitz opened this issue Mar 19, 2024 · 1 comment
Labels
bug Something isn't working

Comments

BlairSadewitz commented Mar 19, 2024

Your current environment

PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (conda-forge gcc 11.3.0-19) 11.3.0
Clang version: Could not collect 
CMake version: version 3.27.6
Libc version: glibc-2.35
Python version: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.5.0-15-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000

Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      40 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             64
On-line CPU(s) list:                0-63
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC Processor
CPU family:                         23
Model:                              1
Thread(s) per core:                 2
Core(s) per socket:                 32
Socket(s):                          1
Stepping:                           2
BogoMIPS:                           4890.76
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat npt nrip_save
Virtualization:                     AMD-V
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          1 MiB (32 instances)
L1i cache:                          2 MiB (32 instances)
L2 cache:                           16 MiB (32 instances)
L3 cache:                           64 MiB (8 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT vulnerable
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.0
[pip3] triton==2.2.0
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: 0.5.2
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

🐛 Describe the bug

(aphrodite-runtime) root@C.10151121:~/aphrodite-engine$ python -m aphrodite.endpoints.openai.api_server -tp 2 --model ParasiticRogue/Merged-Vicuna-RP-Stew-34B --kv-cache-dtype int8

2024-03-19 19:52:14,449 WARNING utils.py:575 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2024-03-19 19:52:14,450 WARNING utils.py:587 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 30.71999 to 30.
2024-03-19 19:52:14,649 INFO worker.py:1724 -- Started a local Ray instance.
INFO:     Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO:     Model = 'ParasiticRogue/Merged-Vicuna-RP-Stew-34B'
INFO:     DataType = torch.bfloat16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 2
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = None
INFO:     Context Length = 32768
INFO:     Enforce Eager Mode = False
INFO:     KV Cache Data Type = int8
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 599, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 113, in __init__
    self._init_workers_ray(placement_group)
  File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 268, in _init_workers_ray
    self.driver_worker = Worker(
                         ^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/worker.py", line 60, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 92, in __init__
    self.kv_quant_params = (self.load_kv_quant_params(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 116, in load_kv_quant_params
    kv_quant_params.append(kv_quant_param)
                           ^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'kv_quant_param' where it is not associated with a value
2024-03-19 19:52:19,750 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerAphrodite.init_worker() (pid=26429, ip=172.17.0.2, actor_id=537d7fe532ba3d411a06c1f001000000, repr=<aphrodite.engine.ray_tools.RayWorkerAphrodite object at 0x7f34058b5b50>)
  File "/root/aphrodite-engine/aphrodite/engine/ray_tools.py", line 22, in init_worker
    self.worker = worker_init_fn()
                  ^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 252, in <lambda>
    lambda rank=rank, local_rank=local_rank: Worker(
                                             ^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/worker.py", line 60, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 92, in __init__
    self.kv_quant_params = (self.load_kv_quant_params(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 116, in load_kv_quant_params
    kv_quant_params.append(kv_quant_param)
                           ^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'kv_quant_param' where it is not associated with a value
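
From the traceback, kv_quant_param in load_kv_quant_params is referenced before it is ever assigned, presumably because nothing populates it when no KV cache params path is given (the config above shows KV Cache Params Path = None). The snippet below is just a minimal, self-contained reproduction of that error class; it is not the actual aphrodite source, and the function body is my guess:

# Minimal reproduction of the UnboundLocalError pattern from the traceback.
# NOTE: this is a guess at the failure mode, not the real model_runner.py code.
def load_kv_quant_params(per_layer_scales=None):
    kv_quant_params = []
    for scale in (per_layer_scales or []):  # empty when no params path is supplied
        kv_quant_param = scale              # only bound if the loop body runs
    kv_quant_params.append(kv_quant_param)  # UnboundLocalError when it never ran
    return kv_quant_params

load_kv_quant_params()  # raises: cannot access local variable 'kv_quant_param' ...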

Well, it turns out that I didn't have enough VRAM to load the model in 16-bit anyway, but I just tried it with --load-in-4bit and the failure is the same. Without the int8 KV cache, the model loads fine:

(aphrodite-runtime) root@C.10151121:~/aphrodite-engine$  python -m aphrodite.endpoints.openai.api_server -tp 2 --model ParasiticRogue/Merged-Vicuna-RP-Stew-34B --load-in-4bit 
WARNING:  bnb quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-03-19 20:03:18,803 WARNING utils.py:575 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2024-03-19 20:03:18,804 WARNING utils.py:587 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 30.71999 to 30.
2024-03-19 20:03:18,984 INFO worker.py:1724 -- Started a local Ray instance.
INFO:     Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO:     Model = 'ParasiticRogue/Merged-Vicuna-RP-Stew-34B'
INFO:     DataType = torch.bfloat16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 2
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = bnb
INFO:     Context Length = 32768
INFO:     Enforce Eager Mode = False
INFO:     KV Cache Data Type = auto
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING:  Custom allreduce is disabled because your platform lacks GPU P2P capability. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerAphrodite pid=36344) WARNING:  Custom allreduce is disabled because your platform lacks GPU P2P capability. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO:     Downloading model weights ['*.safetensors']
(RayWorkerAphrodite pid=36344) INFO:     Downloading model weights ['*.safetensors']

INFO:     Memory allocated for converted model: 9.17 GiB
INFO:     Memory reserved for converted model: 9.26 GiB
INFO:     Model weights loaded. Memory usage: 9.17 GiB x 2 = 18.34 GiB

With --kv-cache-dtype fp8_e5m2 and --load-in-4bit, it also works.
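
For reference, that working combination is just the failing invocation with the KV cache dtype switched to fp8_e5m2 (both flags are already shown in the logs above):

python -m aphrodite.endpoints.openai.api_server -tp 2 --model ParasiticRogue/Merged-Vicuna-RP-Stew-34B --load-in-4bit --kv-cache-dtype fp8_e5m2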

BlairSadewitz added the bug label on Mar 19, 2024
BlairSadewitz (Author) commented:

Oops, never mind. I didn't read the documentation. Sorry, lol. You might want to put that in boldface or something on the main page where you mention it.
