System Info
Ubuntu 20.04
tensorrt 10.0.1
tensorrt-cu12 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.10.0.dev2024050700
Who can help?
@Tracin
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
trtllm-build should complete successfully.
Actual behavior
trtllm-build failed with the following error:
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
[05/12/2024-03:05:39] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set nccl_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set lookup_plugin to None.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set lora_plugin to None.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set moe_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set context_fmha to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set remove_input_padding to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set multi_block_mode to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set enable_xqa to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set tokens_per_block to 128.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set multiple_profiles to False.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set paged_state to True.
[05/12/2024-03:05:39] [TRT-LLM] [I] Set streamingllm to False.
[05/12/2024-03:05:39] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[05/12/2024-03:05:39] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py:964: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3637.)
  weights[name] = preprocessor(param.T.contiguous(),
Traceback (most recent call last):
File "/app/venv_dev/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 486, in main
parallel_build(source, build_config, args.output_dir, workers,
File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 370, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 329, in build_and_save
engine = build_model(build_config,
File "/app/tensorrt-llm/tensorrt_llm/commands/build.py", line 305, in build_model
model = load_model(rank_config, ckpt_dir, model_cls)
File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 1100, in load_model
preprocess_weights(weights, model_config)
File "/app/tensorrt-llm/tensorrt_llm/models/modeling_utils.py", line 964, in preprocess_weights
weights[name] = preprocessor(param.T.contiguous(),
File "/app/venv_dev/lib/python3.10/site-packages/torch/_ops.py", line 755, in __call__
return self._op(*args, **(kwargs or {}))
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 7168 and num_col_bytes = 8. (/app/tensorrt-llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:278)
1 0x7f597e9b665a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7f5ba6d945dd void tensorrt_llm::kernels::cutlass_kernels::subbyte_transpose_impl<(tensorrt_llm::kernels::cutlass_kernels::QuantType)1>(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&) + 1085
3 0x7f5ba6d93735 tensorrt_llm::kernels::cutlass_kernels::subbyte_transpose(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType) + 101
4 0x7f5ba6d93a4a tensorrt_llm::kernels::cutlass_kernels::preprocess_weights_for_mixed_gemm(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType, bool) + 714
5 0x7f5ba6d6d7f4 torch_ext::preprocess_weights_for_mixed_gemm(at::Tensor, c10::ScalarType, c10::ScalarType) + 596
6 0x7f5ba6d7940a c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor<at::Tensor (*)(at::Tensor, c10::ScalarType, c10::ScalarType), at::Tensor, c10::guts::typelist::typelist<at::Tensor, c10::ScalarType, c10::ScalarType> >, true>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 138
7 0x7f5b08bcb818 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const + 568
8 0x7f5b0895c4f3 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args, pybind11::kwargs const&, std::optional<c10::DispatchKey>) + 451
9 0x7f5b0895cd41 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) + 1329
10 0x7f5b08840833 /app/venv_dev/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0x848833) [0x7f5b08840833]
11 0x7f5b0840bea4 /app/venv_dev/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0x413ea4) [0x7f5b0840bea4]
12 0x53bd79 /app/venv_dev/bin/python3() [0x53bd79]
13 0x628a7b PyObject_Call + 491
14 0x5afa8e _PyEval_EvalFrameDefault + 24958
15 0x628d60 _PyFunction_Vectorcall + 592
16 0x62b899 _PyObject_FastCallDictTstate + 89
17 0x62b9ca _PyObject_Call_Prepend + 90
18 0x6e8da7 /app/venv_dev/bin/python3() [0x6e8da7]
19 0x629d24 _PyObject_MakeTpCall + 356
20 0x5ae9e9 _PyEval_EvalFrameDefault + 20697
21 0x628d60 _PyFunction_Vectorcall + 592
22 0x5a9c1b _PyEval_EvalFrameDefault + 779
23 0x628d60 _PyFunction_Vectorcall + 592
24 0x5a9c1b _PyEval_EvalFrameDefault + 779
25 0x628d60 _PyFunction_Vectorcall + 592
26 0x62893c PyObject_Call + 172
27 0x5ac51b _PyEval_EvalFrameDefault + 11275
28 0x628d60 _PyFunction_Vectorcall + 592
29 0x62893c PyObject_Call + 172
30 0x5ac51b _PyEval_EvalFrameDefault + 11275
31 0x628d60 _PyFunction_Vectorcall + 592
32 0x62893c PyObject_Call + 172
33 0x5ac51b _PyEval_EvalFrameDefault + 11275
34 0x628d60 _PyFunction_Vectorcall + 592
35 0x5a9c1b _PyEval_EvalFrameDefault + 779
36 0x5a8bf1 /app/venv_dev/bin/python3() [0x5a8bf1]
37 0x6d77cf PyEval_EvalCode + 127
38 0x6bb91b /app/venv_dev/bin/python3() [0x6bb91b]
39 0x6bb9a4 /app/venv_dev/bin/python3() [0x6bb9a4]
40 0x6bbde6 /app/venv_dev/bin/python3() [0x6bbde6]
41 0x6c0c84 _PyRun_SimpleFileObject + 404
42 0x6c0d57 _PyRun_AnyFileObject + 71
43 0x7042dd Py_RunMain + 877
44 0x7044bd Py_BytesMain + 45
45 0x7f5bab4e4083 __libc_start_main + 243
46 0x62ff4e _start + 46
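For context, the assertion fires inside the CUTLASS weight preprocessor (frame 2, subbyte_transpose_impl), which requires both packed byte dimensions of a quantized weight to be 32-byte aligned. The Python sketch below is a hypothetical reconstruction of that check, not the actual kernel code; the function and constant names are made up. With the values from the log, num_rows_bytes = 7168 passes (7168 = 224 * 32) but num_col_bytes = 8 does not, so the build aborts. (If the weights are packed int4, 8 bytes would correspond to a 16-column weight slice, which is an assumption about the checkpoint, not something the log states.)

```python
# Illustrative reconstruction of the alignment check in
# cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp.
ALIGNMENT_BYTES = 32  # required by the CUTLASS subbyte transpose


def check_transpose_alignment(num_rows_bytes: int, num_col_bytes: int) -> None:
    """Raise, as the preprocessor does, when either packed dimension
    of the weight matrix is not a multiple of 32 bytes."""
    if num_rows_bytes % ALIGNMENT_BYTES != 0 or num_col_bytes % ALIGNMENT_BYTES != 0:
        raise RuntimeError(
            "Number of bytes for rows and cols must be a multiple of 32. "
            f"However, num_rows_bytes = {num_rows_bytes} and "
            f"num_col_bytes = {num_col_bytes}."
        )


# The failing weight from the log: row bytes are aligned, column bytes are not.
try:
    check_transpose_alignment(7168, 8)
except RuntimeError as err:
    print(err)  # same message as in the build log
```

This suggests the failure is shape-driven: some weight in the checkpoint packs to fewer than 32 bytes along one dimension, independent of the build flags shown above.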
Additional notes
N/A