
LLamaSharp.Backend.OpenCL 0.11.2: GGML_ASSERT llama.cpp:14093: hparams.n_embd_head_v % ggml_blck_size(type_v) == 0 #655

Open · david-j-smith opened this issue on Apr 7, 2024 · 4 comments
Labels: bug, Upstream

@david-j-smith

Greetings!
I am trying to use LLamaSharp.Backend.OpenCL version 0.11.2 under Windows 11, but after loading any GGUF model, inference fails with the following assertion:
GGML_ASSERT: D:\a\LLamaSharp\LLamaSharp\llama.cpp:14093: hparams.n_embd_head_v % ggml_blck_size(type_v) == 0

I tried several models (mistral-7b-instruct-v0.2.Q6_K.gguf, tiny-llama, phi-2, gemma-it), both with and without GPU offloading, but the error remains the same. The CPU backend works fine.
As far as I could see, there is an issue in llama.cpp (#5928) that at least sounds similar, but I can't tell if it is the same error.
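
For reference, a rough sketch of how the model is being loaded (paths and parameter values are placeholders, and the exact NativeLibraryConfig overload may differ between LLamaSharp versions, so treat this as an illustration of the setup rather than a verified repro):

```csharp
using LLama;
using LLama.Common;
using LLama.Native;

// Point LLamaSharp at the CLBlast build of llama.dll; this is what produces the
// "specified by user" line in the log below. (Overload may vary by version.)
NativeLibraryConfig.Instance.WithLibrary(@"runtimes\win-x64\native\clblast\llama.dll");

var parameters = new ModelParams(@"mistral-7b-instruct-v0.2.Q6_K.gguf")
{
    ContextSize = 4096,
    GpuLayerCount = 33 // the assertion also fires with 0 (no offloading)
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters); // GGML_ASSERT fires around here
```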

The full log (mistral, with GPU offloading):

Successfully loaded the library [runtimes\win-x64\native\clblast\llama.dll] specified by user
ggml_opencl: selecting platform: 'Intel(R) OpenCL Graphics'
ggml_opencl: selecting device: 'Intel(R) UHD Graphics 730'
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from mistral-7b-instruct-v0.2.Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 18
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q6_K:  226 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 5.53 GiB (6.56 BPW)
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'

llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    64.09 MiB
llm_load_tensors:     OpenCL buffer size =  1803.82 MiB
................................................................................................

llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 32
llama_new_context_with_model: n_ubatch   = 4
llama_new_context_with_model: freq_base  = 1.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   768.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  256.00 MiB, V (f32):  512.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     1.10 MiB
llama_new_context_with_model:        CPU compute buffer size =     2.31 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 32
llama_new_context_with_model: n_ubatch   = 4
llama_new_context_with_model: freq_base  = 1.0
llama_new_context_with_model: freq_scale = 1

GGML_ASSERT: D:\a\LLamaSharp\LLamaSharp\llama.cpp:14093: hparams.n_embd_head_v % ggml_blck_size(type_v) == 0
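
For what it's worth, the failing assertion only checks that the per-head V dimension is a multiple of the ggml block size of the V-cache tensor type (type_v). With n_embd_head_v = 128 (see the metadata above), an f16 or f32 V cache passes trivially (block size 1), so the failure suggests type_v is resolving to a quantized type whose block size does not divide 128. A minimal sketch of the invariant, with block sizes taken from ggml's type table (the q6_K case is purely illustrative):

```csharp
// Sketch of the invariant behind the GGML_ASSERT above; not LLamaSharp API.
// ggml block sizes: f32/f16 = 1, q4_0/q8_0 = 32, K-quants such as q6_K = 256.
static bool KvShapeOk(int nEmbdHeadV, int blckSizeTypeV)
    => nEmbdHeadV % blckSizeTypeV == 0;

// With this model's n_embd_head_v = 128:
KvShapeOk(128, 1);   // true  -> an f16/f32 V cache passes the check
KvShapeOk(128, 256); // false -> a 256-element block type would trip the assert
```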
@AsakusaRinne (Collaborator)

It seems to be the same as llama.cpp#5928, which you mentioned. Have you tried running the model directly with the llama.cpp examples compiled with OpenCL?

@AsakusaRinne added the bug and Upstream labels on Apr 7, 2024
@david-j-smith (Author)

Yes, I tested with llama.cpp release b2303 (commit 3ab8b3a) and the latest (binary) release b2589 (commit 1ff4d9f). Both work as expected with all GGUF models.
However, I have noticed that in OpenCL mode the RAM consumption is considerably higher, and in particular the inference speed drops to less than half. 😞 So I think I'll stick with CPU-only mode.

@AsakusaRinne (Collaborator)

The UHD Graphics 730 is not as efficient for computation as GPUs like NVIDIA RTX or AMD RX cards. However, if everything works well with llama.cpp but not with LLamaSharp, it can be confirmed as a bug. Could you please give links to the models you used so we can reproduce it?

@david-j-smith (Author)

I had hoped the iGPU would be at roughly the same performance level as the CPU, but more efficient (which on my PC really means: much quieter 🤣). But the CUDA backend is definitely an order of magnitude faster, I agree.

Here are the links to the models I used for the native llama.cpp test (unfortunately, mistral-7b Q6 works on CPU but runs out of memory under OpenCL):

  1. mistral-7b
  2. phi-2
  3. qwen-0.5b
