Issue about using ipex on cpu #197

Open

KepingYan (Contributor) opened this issue Apr 19, 2024 · 0 comments
When ipex is set to true on CPU, the value here will be trust_remote_code=False use_auth_token='' load_in_4bit=False torch_dtype=torch.float16 revision=None.
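(For reference, a hypothetical sketch of what those effective keyword arguments look like when forwarded to from_pretrained; the exact call site is the linked line, and the model id below is only illustrative.)

```python
# Hypothetical illustration of the effective from_pretrained kwargs above
# (not the exact llm-on-ray call site; the model id is illustrative).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    trust_remote_code=False,
    use_auth_token="",
    load_in_4bit=False,
    torch_dtype=torch.float16,  # the fp16 dtype that conflicts with ipex on CPU
    revision=None,
)
```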

But when llm_on_ray-serve is executed, a warning appears: lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/optimize.py:948: UserWarning: fail to apply ipex.llm.optimize due to: Unsupported input type, fallback to the origin model. And after sending a request, the server reports the following error:

(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) 2024-04-19 02:06:37,419 - llm_on_ray.inference.predictor_deployment - INFO - Handling dynamic batch (size=1) ...
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) ERROR 2024-04-19 02:06:37,427 llama-2-7b-chat-hf_PredictorDeployment c5fp5i4e 3e35b1cb-a52d-4ebc-aa5c-dce9703fc4b4 /llama-2-7b-chat-hf/llama-2-7b-chat-hf replica.py:352 - Request failed:
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) ray::ServeReplica:llama-2-7b-chat-hf:PredictorDeployment.handle_request_with_rejection() (pid=2799681, ip=10.0.11.2)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/utils.py", line 164, in wrap_to_ray_error
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     raise exception
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 1102, in call_user_method
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     await self._call_func_or_gen(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 828, in _call_func_or_gen
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     result = await result
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 403, in __call__
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return await self.handle_non_streaming(prompts, config)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 220, in handle_non_streaming
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return await self.handle_dynamic_batch((prompts, config))
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/batching.py", line 591, in batch_wrapper
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return await enqueue_request(args, kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/batching.py", line 243, in _assign_func_results
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     results = await func_future
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 249, in handle_dynamic_batch
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     batch_results = self.predictor.generate(prompts, **config)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/transformer_predictor.py", line 113, in generate
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     gen_tokens = self.model.generate(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return func(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/generation/utils.py", line 1719, in generate
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return self.sample(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/generation/utils.py", line 2801, in sample
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     outputs = self(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/models.py", line 108, in LlamaForCausalLM_forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     outputs = self.model(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 922, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     layer_outputs = decoder_layer(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/modules/decoder.py", line 874, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return LlamaDecoderLayer_forward(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/modules/decoder.py", line 26, in LlamaDecoderLayer_forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     hidden_states = self.input_layernorm(hidden_states)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/cpu/fusions/mha_fusion.py", line 137, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return torch.ops.torch_ipex.rmsnorm(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/_ops.py", line 755, in __call__
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return self._op(*args, **(kwargs or {}))
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) RuntimeError: Unsupported input type

If I remove the parameter torch_dtype=torch.float16 from model_config, it works fine.
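In case it helps triage, here is a minimal standalone reduction (my own sketch, not the llm-on-ray code path) that should exercise the same combination on CPU. The model id, prompt, and generation settings are just illustrative; the point is float16 weights plus ipex.llm.optimize on CPU versus dropping torch_dtype.

```python
# Standalone reduction of the failing combination (illustrative, not the
# llm-on-ray code path). Uses the same torch/ipex 2.2 CPU builds and transformers 4.35.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # illustrative; a local Llama-2 path also works
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Failing case: float16 weights on CPU, then ipex.llm.optimize.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.eval()
model = ipex.llm.optimize(model, dtype=torch.float16, inplace=True)

# Working case (matches the workaround above): drop torch_dtype, or use bfloat16.
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
# model = ipex.llm.optimize(model, dtype=torch.bfloat16, inplace=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```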

conda env

intel-extension-for-pytorch 2.2.0+cpu
torch                       2.2.2+cpu
transformers                4.35.0

model
Llama-2-7b-hf
