
i wish for simpler way to run the model #230

Open
kolinfluence opened this issue Apr 23, 2024 · 4 comments
kolinfluence commented Apr 23, 2024

I'm not well versed in Python. Where do I put the downloaded llama-2-7b-chat.Q4_0.gguf file?

I can make llama.cpp work really easily on my laptop, but I can't seem to get this to work.

I did git clone neural-speed, did the pip install ..., and saved the following as run_model.py, which I run with python run_model.py:

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on the Hugging Face Hub
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face.
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
(base) root@ubuntu:/usr/local/src/neural-speed# python run_model.py 
Traceback (most recent call last):
  File "/usr/local/src/neural-speed/run_model.py", line 2, in <module>
    from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig
ImportError: cannot import name 'WeightOnlyQuantConfig' from 'intel_extension_for_transformers.transformers' (/root/miniconda3/lib/python3.11/site-packages/intel_extension_for_transformers/transformers/__init__.py)
(base) root@ubuntu:/usr/local/src/neural-speed# 

Zhenzhong1 (Contributor) commented Apr 24, 2024

@kolinfluence Sorry about this.

I have checked your script. It's correct.

The reason may be an outdated ITREX version.

I can get the correct result using your script.

[screenshot: script output showing ITREX version 1.4.1]

As you can see, the ITREX version is 1.4.1.

Please reinstall ITREX and Neural Speed, then re-run the script:

pip install intel-extension-for-transformers; pip install neural_speed
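
If you want to confirm which version ended up installed (the working run above used ITREX 1.4.1), a quick check like the following should work; if __version__ is not exposed in your install, pip show intel-extension-for-transformers reports the version as well.

# Print the installed ITREX version to confirm the upgrade took effect.
import intel_extension_for_transformers
print(intel_extension_for_transformers.__version__)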

Zhenzhong1 (Contributor) commented

@kolinfluence

Where do I put the downloaded llama-2-7b-chat.Q4_0.gguf file?

The script will download the file directly from Hugging Face and automatically place it in the local HF cache (by default under ~/.cache/huggingface/hub). The path looks like this:
[screenshot: cached GGUF file path inside the local HF cache]
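
If you prefer to fetch the GGUF file yourself and see exactly where it lands, a minimal sketch using huggingface_hub (pulled in as a dependency of transformers) is:

# Downloads the GGUF file into the local HF cache (or reuses it) and prints its path.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_0.gguf",
)
print(gguf_path)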

kolinfluence (Author) commented

@Zhenzhong1 I used the same script but I get this.

Is it possible for me to manually download it? I actually have too many things on my laptop and would rather not set up Hugging Face access, etc.

So how do I manually download it and try?

P.S.: May I know what direction this Neural Speed project is taking? Are you going to keep improving it, or are you looking to merge it into llama.cpp or something?

python run_model.py 
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
    response.raise_for_status()
  File "/root/miniconda3/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/tokenizer_config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/utils/hub.py", line 385, in cached_file
    resolved_file = hf_hub_download(
                    ^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1374, in hf_hub_download
    raise head_call_error
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1247, in hf_hub_download
    metadata = get_hf_file_metadata(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1624, in get_hf_file_metadata
    r = _request_wrapper(
        ^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 402, in _request_wrapper
    response = _request_wrapper(
               ^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 426, in _request_wrapper
    hf_raise_for_status(response)
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 286, in hf_raise_for_status
    raise GatedRepoError(message, response) from e
huggingface_hub.utils._errors.GatedRepoError: 403 Client Error. (Request ID: Root=1-6628b4c7-577559ad44f1431409bac9bc;f4fc8e13-0f16-4ad1-bb6c-e3af7b61a3a1)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/tokenizer_config.json.
Access to model meta-llama/Llama-2-7b-chat-hf is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Llama-2-7b-chat-hf to ask for access.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/src/neural-speed/run_model.py", line 12, in <module>
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 758, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 590, in get_tokenizer_config
    resolved_config_file = cached_file(
                           ^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/utils/hub.py", line 400, in cached_file
    raise EnvironmentError(
OSError: You are trying to access a gated repo.
Make sure to request access at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf and pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`.

Zhenzhong1 (Contributor) commented

@kolinfluence OK. You can also run inference offline.

Make sure you have the local file llama-2-7b-chat.Q4_0.gguf and the model meta-llama/Llama-2-7b-chat-hf.

Please try this script: https://github.com/intel/neural-speed/blob/main/scripts/python_api_example_for_gguf.py

For example:
python scripts/python_api_example_for_gguf.py --model_name llama --model_path /your_model_path/meta-llama/Llama-2-7b-chat-hf -m /your_gguf_file_path/llama-2-7b-chat.Q4_0.gguf
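
To get those two local paths without Hugging Face access at runtime, one option is to download them once with huggingface_hub; this is a sketch under the assumption that only the tokenizer/config files of the gated repo are needed, and the directories and token below are placeholders:

# One-time manual download of the GGUF weights and the (gated) tokenizer files.
from huggingface_hub import hf_hub_download, snapshot_download

# The GGUF repo is public, so no token is needed here.
gguf_file = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_0.gguf",
    local_dir="/your_gguf_file_path",  # placeholder directory
)

# meta-llama/Llama-2-7b-chat-hf is gated; this works only after access is granted.
model_dir = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="/your_model_path/meta-llama/Llama-2-7b-chat-hf",  # placeholder
    token="hf_your_token_here",  # placeholder access token
    allow_patterns=["*.json", "*.model"],  # assumption: tokenizer/config files are enough
)
print(gguf_file, model_dir)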

OSError: You are trying to access a gated repo.
Make sure to request access at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf and pass a token having permission to this repo either by logging in with huggingface-cli login or by passing token=<your_token>.

This means you don't have access rights to the llama-2-7b-chat model on Hugging Face. You have to request access to the model and get an access token first on HF.
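
Once access is granted on the model page, the original run_model.py only needs authentication; a minimal sketch follows (the token string is a placeholder, and on older transformers versions the argument is use_auth_token instead of token):

# Either log in once from the shell:
#   huggingface-cli login
# or pass the token explicitly in run_model.py:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    trust_remote_code=True,
    token="hf_your_token_here",  # placeholder; use your own HF access token
)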

May I know what direction this Neural Speed project is taking? Are you going to keep improving it, or are you looking to merge it into llama.cpp or something?

Neural Speed will not be merged into llama.cpp currently. Neural Speed aims to provide efficient LLM inference on Intel platforms. For example, Neural Speed provides highly optimized low-precision kernels on CPUs, which means it can get better performance than llama.cpp. Please see https://medium.com/@NeuralCompressor/llm-performance-of-intel-extension-for-transformers-f7d061556176.

a32543254 self-assigned this May 8, 2024