2024-03-19 19:52:14,449 WARNING utils.py:575 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2024-03-19 19:52:14,450 WARNING utils.py:587 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 30.71999 to 30.
2024-03-19 19:52:14,649 INFO worker.py:1724 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO: Model = 'ParasiticRogue/Merged-Vicuna-RP-Stew-34B'
INFO: DataType = torch.bfloat16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = None
INFO: Context Length = 32768
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = int8
INFO: KV Cache Params Path = None
INFO: Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/root/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 599, in <module>
engine = AsyncAphrodite.from_engine_args(engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 113, in __init__
self._init_workers_ray(placement_group)
File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 268, in _init_workers_ray
self.driver_worker = Worker(
^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/worker.py", line 60, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 92, in __init__
self.kv_quant_params = (self.load_kv_quant_params(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 116, in load_kv_quant_params
kv_quant_params.append(kv_quant_param)
^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'kv_quant_param' where it is not associated with a value
2024-03-19 19:52:19,750 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerAphrodite.init_worker() (pid=26429, ip=172.17.0.2, actor_id=537d7fe532ba3d411a06c1f001000000, repr=<aphrodite.engine.ray_tools.RayWorkerAphrodite object at 0x7f34058b5b50>)
File "/root/aphrodite-engine/aphrodite/engine/ray_tools.py", line 22, in init_worker
self.worker = worker_init_fn()
^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 252, in <lambda>
lambda rank=rank, local_rank=local_rank: Worker(
^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/worker.py", line 60, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 92, in __init__
self.kv_quant_params = (self.load_kv_quant_params(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 116, in load_kv_quant_params
kv_quant_params.append(kv_quant_param)
^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'kv_quant_param' where it is not associated with a value
Well, it turns out that I didn't have enough VRAM to load the model in 16-bit, but I just tried it with --load-in-4bit, and the failure is the same. Without the int8 KV cache, the model loads fine:
(aphrodite-runtime) root@C.10151121:~/aphrodite-engine$ python -m aphrodite.endpoints.openai.api_server -tp 2 --model ParasiticRogue/Merged-Vicuna-RP-Stew-34B --load-in-4bit
WARNING: bnb quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-03-19 20:03:18,803 WARNING utils.py:575 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2024-03-19 20:03:18,804 WARNING utils.py:587 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 30.71999 to 30.
2024-03-19 20:03:18,984 INFO worker.py:1724 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO: Model = 'ParasiticRogue/Merged-Vicuna-RP-Stew-34B'
INFO: DataType = torch.bfloat16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = bnb
INFO: Context Length = 32768
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING: Custom allreduce is disabled because your platform lacks GPU P2P capability. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerAphrodite pid=36344) WARNING: Custom allreduce is disabled because your platform lacks GPU P2P capability. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO: Downloading model weights ['*.safetensors']
(RayWorkerAphrodite pid=36344) INFO: Downloading model weights ['*.safetensors']
INFO: Memory allocated for converted model: 9.17 GiB
INFO: Memory reserved for converted model: 9.26 GiB
INFO: Model weights loaded. Memory usage: 9.17 GiB x 2 = 18.34 GiB
With kv-cache-dtype=fp8_e5m2 and load-in-4bit, it also works.
Oops, never mind. I didn't read the documentation. Sorry, lol. You might want to put that in boldface or something on the main page where you mention it.
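For anyone who lands here from the traceback: an UnboundLocalError like the one above is what Python raises when a local name is only assigned on some code paths and then read on a path where it was never assigned. The log shows `KV Cache Params Path = None` alongside `KV Cache Data Type = int8`, so the likely trigger is that no quantization parameters were available to load. Below is a minimal sketch of that pattern and of an up-front check that would give a clearer message. It is not the actual Aphrodite code: the function names, the `params_path` argument, and the placeholder `(scale, zero_point)` values are all hypothetical, made up for illustration.

```python
def load_params_buggy(params_path, num_layers):
    """Hypothetical stand-in that reproduces the failure pattern."""
    params = []
    for _ in range(num_layers):
        if params_path is not None:
            # Placeholder (scale, zero_point) pair; real values would be read from disk.
            param = (1.0, 0.0)
        # When params_path is None, `param` was never assigned on this path.
        params.append(param)
    return params


def load_params_checked(params_path, num_layers):
    """Hypothetical variant that fails early with a readable message."""
    if params_path is None:
        raise ValueError(
            "int8 KV cache was requested but no quantization params path was given"
        )
    return [(1.0, 0.0) for _ in range(num_layers)]
```

Calling `load_params_buggy(None, 2)` raises UnboundLocalError on the first iteration, while `load_params_checked(None, 2)` raises a ValueError that names the missing input.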
Your current environment
🐛 Describe the bug
(aphrodite-runtime) root@C.10151121:~/aphrodite-engine$ python -m aphrodite.endpoints.openai.api_server -tp 2 --model ParasiticRogue/Merged-Vicuna-RP-Stew-34B --kv-cache-dtype int8