getting error when building from source #1142

Open
RachelShalom opened this issue Jan 15, 2024 · 7 comments
@RachelShalom

Hi, I was trying to run a model with the printed (verbose) output in the following way, and I keep getting what(): unexpectedly reached end of file.
Any idea how to solve this?

run command:
EURAL_SPEED_VERBOSE=1 ./build/bin/run_llama -m /home/ny_user_name/runtime_outs/ne_llama_q_int8_jblas_cbf16_g32.bin -p "once upon a time, a little girl" -n 10
output
Welcome to use the llama on the ITREX! 
main: seed  = 1705323521
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model.cpp: loading model from /home/rachels.dell/runtime_outs/ne_llama_q_int8_jblas_cbf16_g32.bin
terminate called after throwing an instance of 'std::runtime_error'
  what():  unexpectedly reached end of file
Aborted (core dumped)
@kevinintel
Contributor

Can you share your quantization steps?

@RachelShalom
Author

RachelShalom commented Jan 16, 2024

Yes, I was running the basic example with Llama 2 7B Chat, and the quantization was done as part of the model loading.
The following script worked flawlessly:

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-chat-hf"
prompt = "once upon a time, a little girl"  # prompt was undefined in the original snippet; reusing the run command's prompt

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True)
# load_in_8bit=True quantizes the model to int8 while loading (this produces the
# runtime_outs/ne_llama_q_int8_jblas_cbf16_g32.bin used in the run command above)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True)

inputs = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=1000, ctx_size=2048, ignore_prompt=True)

@zhentaoyu
Collaborator

Hi @RachelShalom, I have tried your Python script and got the int8 bin, and I could run the run_llama binary with your command successfully. Here are some of my thoughts.

Hi I was trying to run a model with the printing output in the following way and I keep getting what(): unexpectedly reached end of file any idea on how to solve this?

run command:
EURAL_SPEED_VERBOSE=1 ./build/bin/run_llama -m /home/ny_user_name/runtime_outs/ne_llama_q_int8_jblas_cbf16_g32.bin -p "once upon a time, a little girl" -n 10
output
Welcome to use the llama on the ITREX! 
main: seed  = 1705323521
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model.cpp: loading model from /home/rachels.dell/runtime_outs/ne_llama_q_int8_jblas_cbf16_g32.bin
terminate called after throwing an instance of 'std::runtime_error'
  what():  unexpectedly reached end of file
Aborted (core dumped)
  1. I do not have the path ./build/bin/run_llama when intel_extension_for_transformers is installed from source rather than from the pip wheel. It seems you built the graph folder from source? If you build intel_extension_for_transformers itself from source, run_llama will be in the intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph folder (a rough sketch of the expected layout follows this list).
  2. In my experience, this error usually happens when the quantized bin and the inference binary come from different commits. For example: you pip install intel_extension_for_transformers, convert and quantize the bin, then build from source and run that older bin with the newly built binary.
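For reference, a rough sketch of the layout I would expect when the graph runtime is built from source. The cmake/make steps here are an assumption, so please follow the repo's build instructions; only the graph folder path comes from point 1 above:

cd intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph
mkdir build && cd build
cmake .. && make -j        # assumed build steps for the graph runtime
cd ..
# the binary then lives under graph/build/bin, not under the repo root
ls build/bin/run_llama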

Thanks.

@RachelShalom
Author

RachelShalom commented Jan 16, 2024

Ohh, I understand. You are right, I did both. I pip installed and used the script above, which created the bin file.
Then I built from source in order to run the model and get the printed output (since this is not part of the latest release).
So I guess I should run the code that converts the model to int8 with the code that I built from source?
Btw, what was your output when you ran it?

@zhentaoyu
Collaborator

So I guess I should run the code that converts the model to int8 with the code that I built from source?

Yes. Build intel_extension_for_transformers from source, then re-convert and re-run inference.
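A minimal sketch of that end-to-end flow, assuming the paths from the comments above (the install command and the script name below are placeholders, not exact commands from the repo docs):

# 1. Install the Python package from source so convert/quantize use this commit
git clone https://github.com/intel/intel-extension-for-transformers.git
cd intel-extension-for-transformers
pip install .                     # assumed source install; see the repo README

# 2. Re-run the quantization script shown earlier; it regenerates
#    runtime_outs/ne_llama_q_int8_jblas_cbf16_g32.bin in this commit's format
python your_quant_script.py       # placeholder name for the script above

# 3. Run inference with the run_llama built from the same source tree
cd intel_extension_for_transformers/llm/runtime/graph
./build/bin/run_llama -m <path-to>/runtime_outs/ne_llama_q_int8_jblas_cbf16_g32.bin -p "once upon a time, a little girl" -n 10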

Btw, what was your output when you ran it?

My machine is an Intel(R) Xeon(R) Platinum 8480+ and I set --seed 12; the output looks like this:

main: seed  = 12
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model.cpp: loading model from xxx/ne_llama_q_int8_jblas_cbf16_g32.bin
init: n_vocab    = 32000
init: n_embd     = 4096
init: n_mult     = 256
init: n_head     = 32
init: n_head_kv  = 32
init: n_layer    = 32
init: n_rot      = 128
init: n_ff       = 11008
init: n_parts    = 1
load: ne ctx size = 7199.26 MB
load: mem required  = 9249.26 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_jblas_kv = 1
model_init_from_file: kv self size =  276.00 MB

system_info: n_threads = 112 / 224 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 10, n_keep = 0


 once upon a time, a little girl named Ella was born with a rare genetic
model_print_timings:        load time =  4617.68 ms
model_print_timings:      sample time =     8.27 ms /    10 runs   (    0.83 ms per token)
model_print_timings: prompt eval time =   359.86 ms /     9 tokens (   39.98 ms per token)
model_print_timings:        eval time =   521.95 ms /     9 runs   (   57.99 ms per token)
model_print_timings:       total time =  5152.31 ms
========== eval time log of each prediction ==========
prediction   0, time: 359.86ms
prediction   1, time: 70.54ms
prediction   2, time: 62.73ms
prediction   3, time: 56.78ms
prediction   4, time: 45.01ms
prediction   5, time: 75.16ms
prediction   6, time: 46.52ms
prediction   7, time: 44.98ms
prediction   8, time: 62.22ms
prediction   9, time: 58.00ms

Please ignore the time logs since they may be inaccurate. And the generated results may be different on your machine (the model dispatches to different kernels depending on the available instruction sets).

@RachelShalom
Author

Thank you @zhentaoyu, will try that!

@zhentaoyu
Collaborator

Hi @RachelShalom, did you run it successfully? Can we close this issue?
