-
Notifications
You must be signed in to change notification settings - Fork 811
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Illegal Instruction when running a llamafile #413
Comments
Last stdout lines with --ftrace flag: $ ./TinyLlama-1.1B-Chat-v1.0.F16.llamafile --ftrace |
OK you have a sandybridge CPU. Five years EOL but still supported by us. Could you run |
Hi, Here are the information: Note: I had to download APE / APE-jart and register them. |
Same thing here |
It seems to be a regression between version 0.7.0 and version 0.8.0
|
I see what the issue is here. I've confirmed a fix is incoming. |
Please be warned that once this fix goes live, using f16 weights on a sandybridge cpu that doesn't have the f16c isa, while it will no longer crash, it will almost certainly go very slow. You'll most likely be better served using the q4 weights on an older cpu. |
Hi,
Issue:
I tried to run llava-v1.5-7b-q4.llamafile or TinyLlama-1.1B-Chat-v1.0.F16.llamafile on my system:
Linux Ubuntu 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
But I encountered the same error at the same step for both:
stdout:
$ ./TinyLlama-1.1B-Chat-v1.0.F16.llamafile
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2856,"msg":"build info","tid":"11165056","timestamp":1715465433}
{"function":"server_cli","level":"INFO","line":2859,"msg":"system info","n_threads":4,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"11165056","timestamp":1715465433,"total_threads":4}
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from TinyLlama-1.1B-Chat-v1.0.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: llama.block_count u32 = 22
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 5: llama.attention.head_count u32 = 32
llama_model_loader: - kv 6: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 7: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 8: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 9: general.file_type u32 = 1
llama_model_loader: - kv 10: llama.vocab_size u32 = 32000
llama_model_loader: - kv 11: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.pre str = default
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,32000] = ["", "
", "", "<0x00>", "<...llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type f16: 156 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 2.05 GiB (16.00 BPW)
llm_load_print_meta: general.name = n/a
llm_load_print_meta: BOS token = 1 '
''llm_load_print_meta: EOS token = 2 '
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 2 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.10 MiB
llm_load_tensors: CPU buffer size = 2098.35 MiB
..........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 11.00 MiB
llama_new_context_with_model: KV self size = 11.00 MiB, K (f16): 5.50 MiB, V (f16): 5.50 MiB
llama_new_context_with_model: CPU output buffer size = 0.13 MiB
llama_new_context_with_model: CPU compute buffer size = 66.50 MiB
llama_new_context_with_model: graph nodes = 710
llama_new_context_with_model: graph splits = 1
Instruction non permise (core dumped)
llama.log content:
$ cat llama.log
warming up the model with an empty run
lscpu
It seems to be CPU related, so here is my lscpu:
$ lscpu
Architecture : x86_64
Mode(s) opératoire(s) des processeurs : 32-bit, 64-bit
Address sizes: 36 bits physical, 48 bits virtual
Boutisme : Little Endian
Processeur(s) : 4
Liste de processeur(s) en ligne : 0-3
Identifiant constructeur : GenuineIntel
Nom de modèle : Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
Famille de processeur : 6
Modèle : 42
Thread(s) par cœur : 1
Cœur(s) par socket : 4
Socket(s) : 1
Révision : 7
Vitesse maximale du processeur en MHz : 3700,0000
Vitesse minimale du processeur en MHz : 1600,0000
BogoMIPS : 6619.18
Drapaux : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pb
e syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni
pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes x
save avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid xsaveopt dtherm ida arat pln pts vnm
i md_clear flush_l1d
Virtualization features:
Virtualisation : VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 6 MiB (1 instance)
NUMA:
Nœud(s) NUMA : 1
Nœud NUMA 0 de processeur(s) : 0-3
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Mds: Mitigation; Clear CPU buffers; SMT disabled
Meltdown: Mitigation; PTI
Mmio stale data: Unknown: No mitigations
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
I saw a similar issue with a similar CPU: Support broken on old Intel/Amd CPUs #25. But as it does not crash at the same step, I was wondering if it could be related.
The text was updated successfully, but these errors were encountered: