
Getting error while executing query_openai_sdk.py to test the inference #66

Open
dkiran1 opened this issue Jan 18, 2024 · 9 comments


dkiran1 commented Jan 18, 2024

I ran inference of the Falcon-7b and neural-chat-7b-v3-1 models on the Ray server with the commands below:
python inference/serve.py --config_file inference/models/neural-chat-7b-v3-1.yaml --simple
python inference/serve.py --config_file inference/models/falcon-7b.yaml --simple
I could run the test inference with python examples/inference/api_server_simple/query_single.py --model_endpoint http://172.17.0.2:8000/neural-chat-7b-v3-1.
I then exported
export OPENAI_API_BASE=http://172.17.0.2:8000/falcon-7b
export OPENAI_API_KEY=
and tried to run python examples/inference/api_server_openai/query_openai_sdk.py, but I am getting the error below:

File "/root/llm-ray/examples/inference/api_server_openai/query_openai_sdk.py", line 45, in <module>
models = openai.Model.list()
File "/usr/local/lib/python3.10/dist-packages/openai/api_resources/abstract/listable_api_resource.py", line 60, in list
response, _, api_key = requestor.request(
File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 298, in request
resp, got_stream = self._interpret_response(result, stream)
File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 700, in _interpret_response
self._interpret_response_line(
File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 757, in _interpret_response_line
raise error.APIError(
openai.error.APIError: HTTP code 500 from API (Unexpected error, traceback: ray::ServeReplica:falcon-7b:PredictorDeployment.handle_request_streaming() (pid=15684, ip=172.17.0.2)
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/utils.py", line 165, in wrap_to_ray_error
raise exception
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 994, in call_user_method
await self._call_func_or_gen(
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 750, in _call_func_or_gen
result = await result
File "/root/llm-ray/inference/predictor_deployment.py", line 84, in __call__
json_request: Dict[str, Any] = await http_request.json()
File "/usr/local/lib/python3.10/dist-packages/starlette/requests.py", line 244, in json
self._json = json.loads(body)
File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0).)

I installed openai 0.28.0. Please let me know what the issue could be; am I missing any installations?
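
For reference, here is a minimal sketch of what the failing script boils down to with openai 0.28 (the endpoint URL is the one from this report; the placeholder key is an assumption, not something from the repo):

# minimal sketch, assuming openai==0.28 as used in this report
import openai

openai.api_base = "http://172.17.0.2:8000/falcon-7b"  # the --simple endpoint, not an OpenAI-compatible /v1 route
openai.api_key = "EMPTY"  # placeholder; no real key is validated here

models = openai.Model.list()  # sends GET {api_base}/models and expects a JSON body
print(models)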

xwu99 commented Jan 18, 2024

@yutianchen666 Could you help reproduce this issue? I am not sure whether the OpenAI version is causing an API break.


dkiran1 commented Jan 18, 2024

I used openai==0.28, since the latest version gave an error and recommended using this version.

yutianchen666 (Collaborator) commented:

> @yutianchen666 Could you help reproduce this issue? I am not sure whether the OpenAI version is causing an API break.

OK, I'll reproduce it soon.

KepingYan (Contributor) commented:

@dkiran1 Thank you for reporting this. If you want to use the OpenAI-compatible SDK, please remove the --simple parameter. After serving, set ENDPOINT_URL=http://localhost:8000/v1 when running query_http_requests.py, or set OPENAI_API_BASE=http://localhost:8000/v1 when running query_openai_sdk.py. See serve.md for more details.
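
In case it helps, a sketch of that workflow with the same openai 0.28 client (host, port, and model name are taken from earlier messages in this thread; treat it as an illustration, not the exact script):

# hedged sketch: talk to the OpenAI-compatible route exposed when serving without --simple
import openai

openai.api_base = "http://localhost:8000/v1"  # note the /v1 suffix
openai.api_key = "not_used"  # placeholder; no real key is required locally

resp = openai.ChatCompletion.create(
    model="neural-chat-7b-v3-1",  # model name from the YAML config above
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp["choices"][0]["message"]["content"])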


dkiran1 commented Jan 19, 2024

Hi Yan, thanks for the details. I tried the steps above and could run the inference server with the falcon model, but when running
python examples/inference/api_server_openai/query_openai_sdk.py --model_name="falcon-7b"
it waits a long time with no response. I also tried the neural-chat model; yesterday it was working after upgrading the transformers library, but now it gives this error:

(ServeController pid=11891) ERROR 2024-01-19 05:35:26,615 controller 11891 deployment_state.py:672 - Exception in replica 'neural-chat-7b-v3-1#PredictorDeployment#3jmxrf36', the replica will be stopped.
(ServeController pid=11891) Traceback (most recent call last):
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/deployment_state.py", line 670, in check_ready
(ServeController pid=11891) _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeController pid=11891) return fn(*args, **kwargs)
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=11891) return func(*args, **kwargs)
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2656, in get
(ServeController pid=11891) values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 869, in get_objects
(ServeController pid=11891) raise value.as_instanceof_cause()
(ServeController pid=11891) ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:neural-chat-7b-v3-1:PredictorDeployment.initialize_and_get_metadata() (pid=18013, ip=172.17.0.2, actor_id=685216a503325bcc4e3c3c7701000000, repr=<ray.serve._private.replica.ServeReplica:neural-chat-7b-v3-1:PredictorDeployment object at 0x7fabd93efd00>)
(ServeController pid=11891) File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
(ServeController pid=11891) return self.__get_result()
(ServeController pid=11891) File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=11891) raise self._exception
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 570, in initialize_and_get_metadata
(ServeController pid=11891) raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=11891) RuntimeError: Traceback (most recent call last):
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 554, in initialize_and_get_metadata
(ServeController pid=11891) await self._user_callable_wrapper.initialize_callable()
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 778, in initialize_callable
(ServeController pid=11891) await self._call_func_or_gen(
(ServeController pid=11891) result = callable(*args, **kwargs)
(ServeController pid=11891) File "/root/llm-ray/inference/predictor_deployment.py", line 64, in __init__
(ServeController pid=11891) self.predictor = TransformerPredictor(infer_conf)
(ServeController pid=11891) File "/root/llm-ray/inference/transformer_predictor.py", line 22, in __init__
(ServeController pid=11891) from optimum.habana.transformers.modeling_utils import (
(ServeController pid=11891) File "/root/optimum-habana/optimum/habana/transformers/modeling_utils.py", line 19, in <module>
(ServeController pid=11891) from .models import (
(ServeController pid=11891) File "/root/optimum-habana/optimum/habana/transformers/models/__init__.py", line 59, in <module>
(ServeController pid=11891) from .mpt import (
(ServeController pid=11891) File "/root/optimum-habana/optimum/habana/transformers/models/mpt/__init__.py", line 1, in <module>
(ServeController pid=11891) from .modeling_mpt import (
(ServeController pid=11891) File "/root/optimum-habana/optimum/habana/transformers/models/mpt/modeling_mpt.py", line 24, in <module>
(ServeController pid=11891) from transformers.models.mpt.modeling_mpt import MptForCausalLM, MptModel, _expand_mask, _make_causal_mask
(ServeController pid=11891) ImportError: cannot import name '_expand_mask' from 'transformers.models.mpt.modeling_mpt' (/usr/local/lib/python3.10/dist-packages/transformers/models/mpt/modeling_mpt.py)
(ServeController pid=11891) INFO 2024-01-19 05:35:27,338 controller 11891 deployment_state.py:2188 - Replica neural-chat-7b-v3-1#PredictorDeployment#3jmxrf36 is stopped.
(ServeController pid=11891) INFO 2024-01-19 05:35:27,339 controller 11891 deployment_state.py:1850 - Adding 1 replica to deployment PredictorDeployment in application 'neural-chat-7b-v3-1'.
(ServeReplica:router:PredictorDeployment pid=18206) /usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
(ServeReplica:router:PredictorDeployment pid=18206) warnings.warn(
(ServeReplica:neural-chat-7b-v3-1:PredictorDeployment pid=18013) [WARNING|utils.py:190] 2024-01-19 05:35:26,443 >> optimum-habana v1.8.0.dev0 has been validated for SynapseAI v1.11.0 but the driver version is v1.13.0, this could lead to undefined behavior!

kira-lin (Contributor) commented:

Hi @dkiran1, we currently have limited bandwidth and hardware to test on Gaudi, and the Gaudi-related part is not up to date. I just tested in Docker, in the vault.habana.ai/gaudi-docker/1.13.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.0 container; you only need to:

# install llm-on-ray, assume mounted
pip install -e .
# install latest optimum[habana]
pip install optimum[habana]

Make sure the transformers version is 4.34.1, which is required by optimum[habana]; the version mismatch is what caused your error. In addition, inference on Gaudi does not require IPEX.
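
A quick sanity check for that version pin (plain Python, nothing project-specific):

# confirm the environment has the transformers version optimum[habana] expects (4.34.1 per the comment above)
import transformers

print(transformers.__version__)
assert transformers.__version__ == "4.34.1", "unexpected transformers version; reinstall optimum[habana]"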


dkiran1 commented Jan 19, 2024

Hi Lin, thanks a lot. After running pip install optimum[habana], the neural-chat model works fine with query_openai_sdk.py. I will test the other models and post the status.


dkiran1 commented Jan 19, 2024

I tested the falcon-7b, mpt-7b, mistral-7b, and neural-chat models. I could run the inference server for all of them, and I get responses for neural-chat and mistral-7b with query_openai_sdk.py, but it keeps waiting for a response with the mpt-7b and falcon models.

kira-lin (Contributor) commented:

Hi @dkiran1,
When you use OpenAI serving, try adding the --max_new_tokens config. It seems optimum-habana requires this config; I'll look into why and how to fix it later.
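
If the cap is easier to pass per request than on the server side, the openai 0.28 SDK exposes it as max_tokens; a hypothetical sketch, assuming the router forwards it to the model's max_new_tokens (not verified here):

# hedged sketch: cap generation length from the client (assumes max_tokens maps to max_new_tokens)
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "not_used"  # placeholder

resp = openai.ChatCompletion.create(
    model="falcon-7b",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    max_tokens=128,  # hypothetical cap; adjust as needed
)
print(resp["choices"][0]["message"]["content"])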

zhangjian94cn pushed a commit to zhangjian94cn/llm-on-ray that referenced this issue Feb 4, 2024
* support more models in finetune

* modify dockerfile

* fix bug caused by accelerate upgrade

* add llama2

* fix error

* fix error

* test

* fix error

* update