
Fail to build int4_awq on Mixtral 8x7b #1580

Open
2 of 4 tasks
gloritygithub11 opened this issue May 12, 2024 · 3 comments
Assignees: byshiue
Labels: not a bug (Some known limitation, but not a bug.), triaged (Issue has been triaged by maintainers)

Comments


gloritygithub11 commented May 12, 2024

System Info

ubuntu 20.04
tensorrt 10.0.1
tensorrt-cu12 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.10.0.dev2024050700

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction


set -e

export MODEL_DIR=/mnt/memory
export MODEL_NAME=Mixtral-8x7B-Instruct-v0.1
export LD_LIBRARY_PATH=/usr/local/tensorrt/lib:$LD_LIBRARY_PATH
export PATH=/usr/local/tensorrt/bin:$PATH
export QUANTIZE=int4_awq
export DTYPE=bfloat16
export PYTHONPATH=/app/tensorrt-llm:$PYTHONPATH


python ../quantization/quantize.py \
     --model_dir $MODEL_DIR/${MODEL_NAME} \
     --output_dir $MODEL_DIR/tmp/trt_models/${MODEL_NAME}/$QUANTIZE/1-gpu \
     --dtype $DTYPE \
     --qformat $QUANTIZE \
     --calib_size 256 \
     --batch_size 8

# export CUDA_VISIBLE_DEVICES=0

trtllm-build \
    --checkpoint_dir $MODEL_DIR/tmp/trt_models/${MODEL_NAME}/$QUANTIZE/1-gpu \
    --output_dir $MODEL_DIR/tmp/trt_engines/${MODEL_NAME}/$QUANTIZE/1-gpu \
    --gemm_plugin $DTYPE \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 1024
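
Not part of the original repro, but a hedged sanity check between the two steps: inspect the config.json that quantize.py writes into the checkpoint directory and confirm the recorded quantization algorithm before building. The field names "quantization" and "quant_algo" are assumptions based on the TensorRT-LLM 0.10 checkpoint layout and may differ between versions.

# Hedged sanity check: confirm what quantize.py actually recorded before
# running trtllm-build. Field names are assumptions for the 0.10 layout.
import json
import os

ckpt_dir = os.path.expandvars(
    "$MODEL_DIR/tmp/trt_models/$MODEL_NAME/$QUANTIZE/1-gpu")
with open(os.path.join(ckpt_dir, "config.json")) as f:
    cfg = json.load(f)

print("architecture:", cfg.get("architecture"))
print("dtype:", cfg.get("dtype"))
print("quant_algo:", cfg.get("quantization", {}).get("quant_algo"))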

Expected behavior

trtllm-build should complete successfully.

Actual behavior

trtllm-build failed with the following error:
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
[05/12/2024-03:05:39] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set nccl_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set lookup_plugin to None.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set lora_plugin to None.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set moe_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set context_fmha to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set remove_input_padding to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set multi_block_mode to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set enable_xqa to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set tokens_per_block to 128.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set multiple_profiles to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set paged_state to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set streamingllm to False.
[05/12/2024-03:05:39] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[05/12/2024-03:05:39] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py:964: UserWarning: The use of x.T on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider x.mT to transpose batches of matrices or x.permute(*torch.arange(x.ndim - 1, -1, -1)) to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3637.)
weights[name] = preprocessor(param.T.contiguous(),
Traceback (most recent call last):
File "/app/venv_dev/bin/trtllm-build", line 8, in
sys.exit(main())
File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 486, in main
parallel_build(source, build_config, args.output_dir, workers,
File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 370, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 329, in build_and_save
engine = build_model(build_config,
File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 305, in build_model
model = load_model(rank_config, ckpt_dir, model_cls)
File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 1100, in load_model
preprocess_weights(weights, model_config)
File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 964, in preprocess_weights
weights[name] = preprocessor(param.T.contiguous(),
File "/app/venv_dev/lib/python3.10/site-packages/torch/_ops.py", line 755, in call
return self.op(args, (kwargs or {}))
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 7168 and num_col_bytes = 8. (/app/tensorrt-llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:278)
1 0x7f597e9b665a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7f5ba6d945dd void tensorrt_llm::kernels::cutlass_kernels::subbyte_transpose_impl<(tensorrt_llm::kernels::cutlass_kernels::QuantType)1>(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&) + 1085
3 0x7f5ba6d93735 tensorrt_llm::kernels::cutlass_kernels::subbyte_transpose(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType) + 101
4 0x7f5ba6d93a4a tensorrt_llm::kernels::cutlass_kernels::preprocess_weights_for_mixed_gemm(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, bool) + 714
5 0x7f5ba6d6d7f4 torch_ext::preprocess_weights_for_mixed_gemm(at::Tensor, c10::ScalarType, c10::ScalarType) + 596
6 0x7f5ba6d7940a c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor<at::Tensor (*)(at::Tensor, c10::ScalarType, c10::ScalarType), at::Tensor, c10::guts::typelist::typelist<at::Tensor, c10::ScalarType, c10::ScalarType> >, true>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 138
7 0x7f5b08bcb818 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const + 568
8 0x7f5b0895c4f3 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args, pybind11::kwargs const&, std::optional<c10::DispatchKey>) + 451
9 0x7f5b0895cd41 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) + 1329
10 0x7f5b08840833 /app/venv_dev/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0x848833) [0x7f5b08840833]
11 0x7f5b0840bea4 /app/venv_dev/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0x413ea4) [0x7f5b0840bea4]
12 0x53bd79 /app/venv_dev/bin/python3() [0x53bd79]
13 0x628a7b PyObject_Call + 491
14 0x5afa8e _PyEval_EvalFrameDefault + 24958
15 0x628d60 _PyFunction_Vectorcall + 592
16 0x62b899 _PyObject_FastCallDictTstate + 89
17 0x62b9ca _PyObject_Call_Prepend + 90
18 0x6e8da7 /app/venv_dev/bin/python3() [0x6e8da7]
19 0x629d24 _PyObject_MakeTpCall + 356
20 0x5ae9e9 _PyEval_EvalFrameDefault + 20697
21 0x628d60 _PyFunction_Vectorcall + 592
22 0x5a9c1b _PyEval_EvalFrameDefault + 779
23 0x628d60 _PyFunction_Vectorcall + 592
24 0x5a9c1b _PyEval_EvalFrameDefault + 779
25 0x628d60 _PyFunction_Vectorcall + 592
26 0x62893c PyObject_Call + 172
27 0x5ac51b _PyEval_EvalFrameDefault + 11275
28 0x628d60 _PyFunction_Vectorcall + 592
29 0x62893c PyObject_Call + 172
30 0x5ac51b _PyEval_EvalFrameDefault + 11275
31 0x628d60 _PyFunction_Vectorcall + 592
32 0x62893c PyObject_Call + 172
33 0x5ac51b _PyEval_EvalFrameDefault + 11275
34 0x628d60 _PyFunction_Vectorcall + 592
35 0x5a9c1b _PyEval_EvalFrameDefault + 779
36 0x5a8bf1 /app/venv_dev/bin/python3() [0x5a8bf1]
37 0x6d77cf PyEval_EvalCode + 127
38 0x6bb91b /app/venv_dev/bin/python3() [0x6bb91b]
39 0x6bb9a4 /app/venv_dev/bin/python3() [0x6bb9a4]
40 0x6bbde6 /app/venv_dev/bin/python3() [0x6bbde6]
41 0x6c0c84 _PyRun_SimpleFileObject + 404
42 0x6c0d57 _PyRun_AnyFileObject + 71
43 0x7042dd Py_RunMain + 877
44 0x7044bd Py_BytesMain + 45
45 0x7f5bab4e4083 __libc_start_main + 243
46 0x62ff4e _start + 46
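
For context, a minimal sketch (my own illustration, not TensorRT-LLM code) of the check that fails in cutlass_preprocessors.cpp: with int4 packing each element occupies half a byte, and subbyte_transpose requires both the row and column byte counts to be multiples of 32. The shapes below are only inferred from the logged byte counts (7168 and 8) and are an assumption about which tensor is being preprocessed.

# Illustrative only: mirrors the assertion arithmetic from the error above,
# not the actual TensorRT-LLM implementation.
def check_subbyte_transpose_shape(num_rows, num_cols, bits=4):
    num_rows_bytes = num_rows * bits // 8  # int4 -> 0.5 byte per element
    num_col_bytes = num_cols * bits // 8
    if num_rows_bytes % 32 or num_col_bytes % 32:
        raise RuntimeError(
            "Number of bytes for rows and cols must be a multiple of 32. "
            f"However, num_rows_bytes = {num_rows_bytes} and "
            f"num_col_bytes = {num_col_bytes}.")

# 14336 * 0.5 = 7168 row bytes (ok), 16 * 0.5 = 8 column bytes (fails).
check_subbyte_transpose_shape(num_rows=14336, num_cols=16)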

Additional notes

N/A

gloritygithub11 added the "bug (Something isn't working)" label May 12, 2024

byshiue (Collaborator) commented May 15, 2024

Thank you for the report. INT4 AWQ is not supported on MoE models.

byshiue self-assigned this May 15, 2024
byshiue added the "triaged (Issue has been triaged by maintainers)" and "not a bug (Some known limitation, but not a bug.)" labels and removed the "bug (Something isn't working)" label May 15, 2024
gloritygithub11 (Author) commented

Thanks @byshiue for the response. Will it be supported at some point in the future?


byshiue (Collaborator) commented May 17, 2024

We are working on the feature. We will update here once it is supported.
