illegal memory access when input tokens < 8 #170

Open
casper-hansen opened this issue Apr 2, 2024 · 0 comments

@casper-hansen
Contributor

Hi @ys-2020, thanks for your engineering work on the new kernels. I was recently made aware of a bug after importing the new GEMV/GEMM kernels into AutoAWQ. The issue specifically occurs in the GEMV kernel.

Conditions to trigger the bug:

  • disable fused modules
  • use the huggingface/vllm implementation
  • pass in fewer than 8 input tokens (a minimal repro sketch follows the list)
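
For reference, here is a minimal repro sketch. The loading call follows AutoAWQ's from_quantized API as I understand it; the model path is a placeholder, and the exact prompt only matters insofar as it tokenizes to fewer than 8 tokens.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "path/to/awq-quantized-model"  # placeholder path

# fuse_layers=False disables the fused modules (first trigger condition)
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=False)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# a short prompt, so the prefill has fewer than 8 input tokens (third condition)
input_ids = tokenizer("Hello", return_tensors="pt").input_ids.cuda()

# raises: RuntimeError: CUDA error: an illegal memory access was encountered
output = model.generate(input_ids, max_new_tokens=16)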

Import into vLLM

The same illegal memory access occurs when importing the kernels into vLLM.

Workaround

I found a workaround that seems to let us use the GEMV kernel without triggering the illegal memory access. However, I wanted to post here to get your thoughts on this fix and to see whether you can identify the underlying issue with the kernels imported from TensorRT.

Replace this:

if inputs.numel() / inputs.shape[-1] < 8:

With:

batch_size, n_tokens, _ = inputs.shape
if batch_size < 8 and n_tokens == 1:
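
My reading of the difference, for context: inputs has shape (batch_size, n_tokens, hidden_dim), so inputs.numel() / inputs.shape[-1] equals batch_size * n_tokens, the total token count. The original check therefore also routes short prefills (e.g. a 3-token prompt) to GEMV, whereas the replacement only takes the GEMV path on single-token decode steps with a small batch. A sketch of the surrounding dispatch as I understand it; the kernel function names awq_gemv_forward and awq_gemm_forward are placeholders, not the actual symbols:

def forward(inputs, qweight, scales, qzeros):
    # inputs: (batch_size, n_tokens, hidden_dim)
    batch_size, n_tokens, _ = inputs.shape
    # GEMV appears only safe for decode steps: exactly one token per sequence.
    # Short prefills (n_tokens > 1 but batch_size * n_tokens < 8) must still
    # take the GEMM path; routing them to GEMV triggers the illegal access.
    if batch_size < 8 and n_tokens == 1:
        return awq_gemv_forward(inputs, qweight, scales, qzeros)  # placeholder name
    return awq_gemm_forward(inputs, qweight, scales, qzeros)  # placeholder name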

Traceback

Traceback (most recent call last):
  File "/workspace/AutoAWQ/examples/generate.py", line 25, in <module>
    generation_output = model.generate(
  File "/workspace/AutoAWQ/awq/models/base.py", line 111, in generate
    return self.model.generate(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1544, in generate
    return self.greedy_search(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2404, in greedy_search
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 1157, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 1042, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 757, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 666, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 160, in apply_rotary_pos_emb
    cos = cos[position_ids].unsqueeze(unsqueeze_dim)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

CC @robertgshaw2-neuralmagic
