Qwen1.5-72B-chat-AWQ with longbench and infinibench benchmark OOM with A100 80G #38

Open · ehuaa opened this issue Apr 30, 2024 · 0 comments


ehuaa commented Apr 30, 2024

When I test Qwen1.5-72B-chat-AWQ with `bash scripts/longbench.sh`, it OOMs on an A100 80G.

My config:

```yaml
model:
  type: inf-llm
  path: /root/czh/quant_models/Qwen2-geogpt-72b-0412-awq-dde-12000
  block_size: 128
  n_init: 128
  n_local: 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 32
  exc_block_size: 512
  fattn: false
  base: 1000000
  distance_scale: 1.0

max_len: 2147483647
chunk_size: 2048
conv_type: qwen
```
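For reference, these are the knobs in this config that I'd guess drive peak memory. The values below are untested assumptions on my side, not settings I've confirmed work:

```yaml
model:
  fattn: true            # my guess: switch from the torch attention implementation
                         # (where the OOM below happens) to the fused attention path
  max_cached_block: 16   # fewer cached context blocks held on GPU at once
chunk_size: 512          # smaller input chunks per forward pass
```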

The traceback is as follows:

```
Traceback (most recent call last):
  File "/root/czh/InfLLM/benchmark/pred.py", line 321, in <module>
    preds = get_pred(
  File "/root/czh/InfLLM/benchmark/pred.py", line 256, in get_pred
    output = searcher.generate(
  File "/root/czh/InfLLM/inf_llm/utils/greedy_search.py", line 32, in generate
    result = self._decode(input_ids, **kwargs)
  File "/root/czh/InfLLM/inf_llm/utils/greedy_search.py", line 54, in _decode
    out = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 1169, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/czh/InfLLM/inf_llm/utils/patch.py", line 100, in model_forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 768, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/czh/InfLLM/inf_llm/utils/patch.py", line 16, in hf_forward
    ret = forward(
  File "/root/czh/InfLLM/inf_llm/attention/inf_llm.py", line 64, in forward
    o = past_key_value.append(
  File "/root/czh/InfLLM/inf_llm/attention/context_manager.py", line 774, in append
    chunk_o, local_score = self._append(
  File "/root/czh/InfLLM/inf_llm/attention/context_manager.py", line 526, in _append
    attn.append(
  File "/root/czh/InfLLM/inf_llm/attention/dot_production_attention/torch_impl.py", line 96, in append
    self.finalize()
  File "/root/czh/InfLLM/inf_llm/attention/dot_production_attention/torch_impl.py", line 22, in finalize
    tmp = torch.masked_fill(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 190.19 MiB is free. Process 3985934 has 78.95 GiB memory in use. Of the allocated memory 75.61 GiB is allocated by PyTorch, and 2.82 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
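If I'm reading the last frame right, `torch.masked_fill` is out-of-place, so `finalize` transiently needs a second copy of the attention-score tensor on top of the original. A minimal sketch of that allocation pattern (hypothetical shapes, not the actual InfLLM tensors):

```python
import torch

# Hypothetical score tensor for one chunk: (n_heads, chunk_len, n_keys).
# These shapes are assumptions for illustration, not InfLLM's real ones.
scores = torch.randn(64, 512, 4096, device="cuda", dtype=torch.float16)  # ~256 MiB
mask = torch.zeros(64, 512, 4096, device="cuda", dtype=torch.bool)       # ~128 MiB

# masked_fill is out-of-place: it returns a new tensor, so this line
# needs another ~256 MiB on top of `scores` before `scores` can be freed.
tmp = torch.masked_fill(scores, mask, float("-inf"))
```

So even a modest 224 MiB request can fail once the rest of the 80 GiB is occupied by the AWQ weights and cached blocks.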
After the crash, the evaluation step still runs and reports an empty result:

```
/usr/local/lib/python3.10/dist-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Evaluating on: ['result.json']
{}
```
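The OOM message itself suggests trying expandable segments; presumably that just means exporting the variable before launching the script:

```bash
# Suggested by the OOM message above: reduce fragmentation from the
# 2.82 GiB that PyTorch has reserved but not allocated.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
bash scripts/longbench.sh
```

That said, with 75+ GiB already allocated by PyTorch, I suspect the real problem is peak usage rather than fragmentation.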
Can someone help with this issue? Thanks!

ehuaa changed the title from "Qwen1.5-72B-chat-AWQ infinibench benchmark OOM with A100 80G" to "Qwen1.5-72B-chat-AWQ with longbench and infinibench benchmark OOM with A100 80G" on Apr 30, 2024