
[Bug]: Cannot load llama-3 gguf based models #473

Closed
EugeoSynthesisThirtyTwo opened this issue May 18, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@EugeoSynthesisThirtyTwo

Your current environment

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU
Nvidia driver version: 546.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i9-12900HK
CPU family: 6
Model: 154
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 3
BogoMIPS: 5836.79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves umip gfni vaes vpclmulqdq rdpid fsrm md_clear flush_l1d arch_capabilities
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 480 KiB (10 instances)
L1i cache: 320 KiB (10 instances)
L2 cache: 12.5 MiB (10 instances)
L3 cache: 24 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: 0.5.3
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

🐛 Describe the bug

Upon entering the following command: python -m aphrodite.endpoints.openai.api_server --model Llama-3-8B-Instruct-abliterated-v2_q8.gguf

I get the following error:

INFO:     Extracting config from GGUF...
WARNING:  gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO:     Initializing the Aphrodite Engine (v0.5.3) with the following config:
INFO:     Model = 'Llama-3-8B-Instruct-abliterated-v2_q8.gguf'
INFO:     Speculative Config = None
INFO:     DataType = torch.float16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = gguf
INFO:     Context Length = 8192
INFO:     Enforce Eager Mode = True
INFO:     KV Cache Data Type = auto
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
INFO:     Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
INFO:     Converting tokenizer from GGUF...
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 562, in <module>
    run_server(args)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 519, in run_server
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 358, in from_engine_args
    engine = cls(engine_config.parallel_config.worker_use_ray,
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 323, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 429, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 125, in __init__
    self._init_tokenizer()
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 246, in _init_tokenizer
    self.tokenizer: BaseTokenizerGroup = get_tokenizer_group(
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer_group/__init__.py", line 20, in get_tokenizer_group
    return TokenizerGroup(**init_kwargs)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer_group/tokenizer_group.py", line 23, in __init__
    self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer.py", line 136, in get_tokenizer
    return convert_gguf_to_tokenizer(tokenizer_name)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer.py", line 44, in convert_gguf_to_tokenizer
    scores = result.fields['tokenizer.ggml.scores']
KeyError: 'tokenizer.ggml.scores'

I get the same error for every Llama-3-based model, whether it's 8B or 70B.
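
For what it's worth, the missing metadata can be confirmed directly from the file. A minimal diagnostic sketch, assuming the gguf Python package from the llama.cpp project is installed and using the same model file as above:

# Check which tokenizer fields the GGUF file actually contains.
from gguf import GGUFReader

reader = GGUFReader("Llama-3-8B-Instruct-abliterated-v2_q8.gguf")
# Llama-3 GGUFs carry a BPE ("gpt2"-style) tokenizer with no per-token scores,
# so the field Aphrodite's converter indexes is simply absent from the metadata.
print("tokenizer.ggml.model" in reader.fields)   # present
print("tokenizer.ggml.scores" in reader.fields)  # missing -> KeyError in convert_gguf_to_tokenizer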

@EugeoSynthesisThirtyTwo added the bug (Something isn't working) label on May 18, 2024
@EugeoSynthesisThirtyTwo changed the title from "[Bug]: Cannot load gguf Lumimaid" to "[Bug]: Cannot load llama-3 gguf based models" on May 18, 2024
@sgsdxzy (Collaborator) commented May 18, 2024

Llama 3 doesn't use LlamaTokenizer; you need to supply the original tokenizer with --tokenizer original_repo
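
For example, something like the following should work (the Hugging Face repo name here is only illustrative; point --tokenizer at whatever repo the GGUF was quantized from):

python -m aphrodite.endpoints.openai.api_server \
    --model Llama-3-8B-Instruct-abliterated-v2_q8.gguf \
    --tokenizer meta-llama/Meta-Llama-3-8B-Instruct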
