Llama2 chat model times out #376

Open
jcntrl opened this issue Mar 20, 2024 · 1 comment
jcntrl commented Mar 20, 2024

Llama2 (and other Llama-based models) time out. Other chat models (Mistral and Mixtral tested) respond fine. Below is a snippet of the Docker container log covering the moment the request is sent from the Refact extension (VS Code) until the timeout is reported back in the extension.

This was installed using :latest (note to self: never again use :latest). My attempt to find out which version this actually is:

ubuntu@REDACTED:~$ docker images
REPOSITORY                       TAG       IMAGE ID       CREATED       SIZE
smallcloud/refact_self_hosting   latest    5e8a87f811b8   2 weeks ago   20.8GB
ubuntu@REDACTED:~$ IMAGE_ID=5e8a87f811b8
ubuntu@REDACTED:~$ docker image inspect --format '{{json .}}' "$IMAGE_ID" | jq -r '. | {Id: .Id, Digest: .Digest, RepoDigests: .RepoDigests, Labels: .Config.Labels}'
{
  "Id": "sha256:5e8a87f811b8257cfb24e6b0606ac8090e7ee8e5947105e7982a5d06a2e049e3",
  "Digest": null,
  "RepoDigests": [
    "smallcloud/refact_self_hosting@sha256:ebe5962002a47e92db987a2903e0c2f7426f39852dada10620412c4699a91d7e"
  ],
  "Labels": {
    "com.nvidia.cudnn.version": "8.9.0.131",
    "maintainer": "NVIDIA CORPORATION <cudatools@nvidia.com>",
    "org.opencontainers.image.ref.name": "ubuntu",
    "org.opencontainers.image.version": "22.04"
  }
}
-- 1089 -- 20240320 15:58:00 MODEL 10002.1ms http://127.0.0.1:8008/infengine-v1/completions-wait-batch WAIT
-- 316840 -- 20240320 15:58:00 WEBUI 127.0.0.1:52586 - "POST /infengine-v1/completions-wait-batch HTTP/1.1" 200
-- 316840 -- 20240320 15:58:01 WEBUI 137.65.195.181:57190 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:02 WEBUI 15.122.93.82:57206 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:03 WEBUI 137.65.195.181:57190 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:04 WEBUI comp-3eb55b8302c6 model resolve "llama2/7b" -> "llama2/7b" from user
-- 316840 -- 20240320 15:58:04 WEBUI wait_batch batch 1/1 => llama2_7b_ed839e52dcb2
-- 316840 -- 20240320 15:58:04 WEBUI 137.65.195.181:59460 - "POST /v1/completions HTTP/1.1" 200
-- 316840 -- 20240320 15:58:04 WEBUI 127.0.0.1:41752 - "POST /infengine-v1/completions-wait-batch HTTP/1.1" 200
-- 316840 -- 20240320 15:58:04 WEBUI 15.122.93.82:57206 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 318086 -- 20240320 15:58:04 MODEL 7455.4ms http://127.0.0.1:8008/infengine-v1/completions-wait-batch OK
-- 318086 -- 20240320 15:58:04 MODEL Model llama2/7b does not support finetune
-- 318086 -- 20240320 15:58:04 MODEL LlamaRotaryEmbedding.forward() missing 1 required positional argument: 'position_ids'
-- 318086 -- 20240320 15:58:04 MODEL Traceback (most recent call last):
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/self_hosting_machinery/inference/inference_hf.py", line 284, in infer
-- 318086 --     self._model.generate(**generation_kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py", line 447, in generate
-- 318086 --     return self.model.generate(**kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
-- 318086 --     return func(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1593, in generate
-- 318086 --     return self.sample(
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2697, in sample
-- 318086 --     outputs = self(
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
-- 318086 --     return self._call_impl(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
-- 318086 --     return forward_call(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1148, in forward
-- 318086 --     outputs = self.model(
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
-- 318086 --     return self._call_impl(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
-- 318086 --     return forward_call(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 990, in forward
-- 318086 --     layer_outputs = decoder_layer(
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
-- 318086 --     return self._call_impl(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
-- 318086 --     return forward_call(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 716, in forward
-- 318086 --     hidden_states, self_attn_weights, present_key_value = self.self_attn(
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
-- 318086 --     return self._call_impl(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
-- 318086 --     return forward_call(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/fused_llama_attn.py", line 72, in forward
-- 318086 --     cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
-- 318086 --     return self._call_impl(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
-- 318086 --     return forward_call(*args, **kwargs)
-- 318086 -- TypeError: LlamaRotaryEmbedding.forward() missing 1 required positional argument: 'position_ids'
-- 318086 -- 
-- 316840 -- 20240320 15:58:04 WEBUI 127.0.0.1:42516 - "POST /infengine-v1/completions-wait-batch HTTP/1.1" 200
-- 316282 -- 20240320 15:58:04 MODEL 10002.2ms http://127.0.0.1:8008/infengine-v1/completions-wait-batch WAIT
-- 316840 -- 20240320 15:58:05 WEBUI 137.65.195.181:57190 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:06 WEBUI 15.122.93.82:57206 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:07 WEBUI 137.65.195.181:57190 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:08 WEBUI 15.122.93.82:57206 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:09 WEBUI 137.65.195.181:57190 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:10 WEBUI 15.122.93.82:57206 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 1089 -- 20240320 15:58:10 MODEL 10003.0ms http://127.0.0.1:8008/infengine-v1/completions-wait-batch WAIT
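
From the traceback it looks like auto_gptq's fused attention (auto_gptq/nn_modules/fused_llama_attn.py) still calls the rotary embedding with the old seq_len-only signature, while the transformers version in the image expects position_ids as a positional argument. A minimal sketch of a possible workaround, assuming the model is loaded through AutoGPTQForCausalLM.from_quantized (I have not verified how the Refact server actually loads it, and the model path below is a placeholder), would be to skip the fused-attention injection so transformers' own Llama attention handles the rotary call:

# Hypothetical workaround sketch -- model path and loading flow are assumptions,
# not taken from the Refact server code.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = "TheBloke/Llama-2-7B-Chat-GPTQ"  # placeholder; actual weights path unknown

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    device="cuda:0",
    use_safetensors=True,
    inject_fused_attention=False,  # avoid auto_gptq/nn_modules/fused_llama_attn.py
)

prompt = "Hello"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Alternatively, pinning transformers to a release that still accepts the seq_len-only rotary call would likely sidestep it, but I have not tested either change inside this image.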
@olegklimov (Contributor) commented:
whoops, that's clearly a problem
