[Bug]: Converting gguf to state_dict #411
Comments
It is recommended to use exl2, gptq or awq over gguf. Support for gguf (especially sharded gguf) is unfinished.
Oh I see. Thank you for the reply!
Experimental support for multiple gguf files has been added to the dev branch.
@sgsdxzy Thank you for the update. I tried to test the dev branch, but while the documentation says "The dev branch extends support for GGUF to all available model architectures besides LLAMA, and sharded (multiple-file) GGUF", the code contains a check that contradicts what the documentation says ("Only support llama so far"). Thus, when I try to run the model 'dranger003/c4ai-command-r-plus-iMat.GGUF', it raises that error. With a Llama 3 model, it raises a different error:
(/home/lhs1012/.conda/aphrodite-runtime) lhs1012@ubuntu:/mnt3/lhs1012/laboratory/aphrodite-engine$ python -m aphrodite.endpoints.openai.api_server --model /mnt3/.cache/huggingface/hub/models--QuantFactory--Meta-Llama-3-70B-Instruct-GGUF-v2/snapshots/7549d4063b18c5b0eb91e547a633245ee8fc4cdd/Meta-Llama-3-70B-Instruct-v2.Q5_1-00001-of-00002.gguf --enforce-eager true --tensor-parallel-size 2 --gpu-memory-utilization 0.95 --quantization gguf
Support for sharded ggufs (you are using 00001-of-00002) and other architectures requires pre-conversion. You also need to point […]
Llama3 doesn't use […]
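(Aside: to see why a single shard cannot be loaded on its own, the `gguf` Python package can list what one shard actually contains. A minimal sketch, assuming `pip install gguf` and the first shard file from the command above:)

```python
# Minimal sketch: list the tensors stored in one GGUF shard, using the
# `gguf` Python package (pip install gguf). The filename is the first
# shard from the command above.
from gguf import GGUFReader

reader = GGUFReader("Meta-Llama-3-70B-Instruct-v2.Q5_1-00001-of-00002.gguf")

# Each *-0000N-of-0000M shard holds only a subset of the model's tensors,
# which is why pointing the engine at a single shard cannot work without
# a pre-conversion step that merges them.
for tensor in reader.tensors:
    print(tensor.name, tensor.shape, tensor.tensor_type)
```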
It succeeded in converting, but I got this error when running the model: aphrodite run /mnt3/.cache/huggingface/hub/models--command-r-plus-gguf -tp 2
Can you test with the latest release, v0.5.3, and see if the issue still persists?
Still the same error with v0.5.3, and also on the current main branch.
I am unable to reproduce this issue on main with ggml-c4ai-command-r-plus-iq2_xxs.gguf.
Hi mate,
This code is also still present in the 0.5.3 release and is halting conversion of the gguf to torch in the official docker container. As mentioned above, this contradicts what is stated in the documentation. See lines 49-52 of aphrodite/transformers_utils/config.py#L49;
the same lines are present in the dev branch. Cheers!
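(For context, the guard being referred to amounts to something like the sketch below; this is a paraphrase, not the verbatim aphrodite/transformers_utils/config.py source:)

```python
# Paraphrased sketch of the architecture guard discussed above; NOT the
# verbatim aphrodite/transformers_utils/config.py source. The loader reads
# the GGUF's architecture metadata and refuses anything that isn't llama.
architecture = "command-r"  # e.g. what a c4ai-command-r-plus GGUF might report

if architecture != "llama":
    raise RuntimeError("Only support llama so far")
```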
@JJordanCCurnow you need to pass […]
On the other hand, I don't think phi3 is supported in Aphrodite.
Your current environment
🐛 Describe the bug
I might be missing something, but at the beginning of converting the gguf to a PyTorch state_dict, it fails to find the layer 'blk.0.ffn_gate_exps' in the 'mapping' dictionary.
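(To make the failure concrete: the converter walks every GGUF tensor name and translates it through a name-mapping dict, so any tensor the table doesn't know about raises a KeyError. A minimal sketch of that pattern, with a made-up, llama-only mapping:)

```python
# Minimal sketch of the rename step, with a made-up llama-only `mapping`
# (the real table is built in aphrodite/modeling/hf_downloader.py).
mapping = {
    "blk.0.ffn_gate": ("model.layers.0.mlp.gate_proj", 0),
    "blk.0.ffn_up": ("model.layers.0.mlp.up_proj", 0),
    # ...no entry for fused MoE expert tensors such as "blk.0.ffn_gate_exps"
}

for layer in ("blk.0.ffn_gate", "blk.0.ffn_gate_exps"):
    # WizardLM-2-8x22B is a Mixtral-style MoE, so its expert tensor
    # "blk.0.ffn_gate_exps" is missing from the table -> KeyError,
    # exactly as in the log below.
    new_key, output_dim = mapping[layer]
```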
I have no name!@a535e478c460:/tmp/hub/models--MaziyarPanahi--WizardLM-2-8x22B-GGUF$ python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 7860 --download-dir /tmp/hub --model /tmp/hub/models--MaziyarPanahi--WizardLM-2-8x22B-GGUF/snapshots/e382348c70b7cbadc126025a60c2c9f7445fcddc/WizardLM-2-8x22B.IQ3_XS-00001-of-00005.gguf --dtype auto --max-model-len 32768 --tensor-parallel-size 2 --gpu-memory-utilization 0.95 --quantization gguf --enforce-eager --trust-remote-code
WARNING: gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-04-16 01:16:53,309 INFO worker.py:1724 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO: Model =
'/tmp/hub/models--MaziyarPanahi--WizardLM-2-8x22B-GGUF/snapshots/e382348c70b7cbadc126025a60c2c9f7445fcddc/WizardLM-2-8x22B.IQ3_XS-00001-of-00005.gguf'
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = gguf
INFO: Context Length = 32768
INFO: Enforce Eager Mode = True
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
Converting GGUF tensors to PyTorch... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1% 1/128 -:--:--
(RayWorkerAphrodite pid=1148) Converting GGUF tensors to PyTorch... 1% 1/128 -:--:--
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 599, in
engine = AsyncAphrodite.from_engine_args(engine_args)
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in init
self.engine = self._init_engine(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
return engine_class(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 113, in init
self._init_workers_ray(placement_group)
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 283, in _init_workers_ray
self._run_workers(
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 1028, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 112, in load_model
self.model_runner.load_model()
File "/app/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 121, in load_model
self.model = get_model(self.model_config, self.device_config,
File "/app/aphrodite-engine/aphrodite/modeling/loader.py", line 91, in get_model
model.load_weights(model_config.model, model_config.download_dir,
File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 450, in load_weights
for name, loaded_weight in hf_model_weights_iterator(
File "/app/aphrodite-engine/aphrodite/modeling/hf_downloader.py", line 293, in hf_model_weights_iterator
for name, param in convert_gguf_to_state_dict(model_name_or_path,
File "/app/aphrodite-engine/aphrodite/modeling/hf_downloader.py", line 271, in convert_gguf_to_state_dict
new_key, output_dim = mapping[layer]
KeyError: 'blk.0.ffn_gate_exps'
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
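(A quick way to check a file up front for this problem is to diff its tensor names against whatever mapping the converter builds. A hedged sketch, again using the `gguf` package, with `mapping` standing in for the converter's real table:)

```python
# Sketch: report GGUF tensors that the converter has no mapping entry for.
# `mapping` is a stand-in for the dict convert_gguf_to_state_dict builds;
# populate it with the table for your architecture.
from gguf import GGUFReader

reader = GGUFReader("WizardLM-2-8x22B.IQ3_XS-00001-of-00005.gguf")
mapping = {}  # placeholder: the converter's gguf-name -> hf-name table

for tensor in reader.tensors:
    # Strip the ".weight"/".bias" suffix, matching the suffix-less key
    # ('blk.0.ffn_gate_exps') seen in the KeyError above.
    base = tensor.name.rsplit(".", 1)[0]
    if base not in mapping:
        print("no mapping for:", tensor.name)
```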