error running simple example #7118

geraldstanje · 2024-04-15T13:17:06Z

Description
A clear and concise description of what the bug is.

triton_python_backend_stub /tensorrt/triton-repos/trtibf-Trendyol-LLM-7b-chat-v1.0/preprocessing/1/model.py triton_python_backend_shm_region_6 1048576 1048576 2796 /opt/tritonserver/backends/python 336 preprocessing_0_0 DEFAULT
I0415 13:01:09.290891 2796 python_be.cc:2384] TRITONBACKEND_ModelInstanceInitialize: instance initialization successful postprocessing_0_0 (device 0)
I0415 13:01:09.291048 2796 backend_model_instance.cc:772] Starting backend thread for postprocessing_0_0 at nice 0 on device 0...
I0415 13:01:09.291151 2796 backend_model.cc:618] Created model instance named 'postprocessing_0_0' with device id '0'
I0415 13:01:09.291257 2796 model_lifecycle.cc:675] OnLoadComplete() 'postprocessing' version 1
I0415 13:01:09.291277 2796 model_lifecycle.cc:713] OnLoadFinal() 'postprocessing' for all version(s)
I0415 13:01:09.291284 2796 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I0415 13:01:09.291368 2796 model_lifecycle.cc:286] VersionStates() 'postprocessing'
I0415 13:01:09.414078 2796 python_be.cc:2384] TRITONBACKEND_ModelInstanceInitialize: instance initialization successful preprocessing_0_0 (device 0)
I0415 13:01:09.414210 2796 backend_model_instance.cc:772] Starting backend thread for preprocessing_0_0 at nice 0 on device 0...
I0415 13:01:09.414317 2796 backend_model.cc:618] Created model instance named 'preprocessing_0_0' with device id '0'
I0415 13:01:09.414441 2796 model_lifecycle.cc:675] OnLoadComplete() 'preprocessing' version 1
I0415 13:01:09.414469 2796 model_lifecycle.cc:713] OnLoadFinal() 'preprocessing' for all version(s)
I0415 13:01:09.414492 2796 model_lifecycle.cc:818] successfully loaded 'preprocessing'
I0415 13:01:09.414545 2796 model_lifecycle.cc:286] VersionStates() 'preprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 14008 MiB
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gemmPlugin/gemmPlugin.cpp:156)
1       0x7fd1a11c612f /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x4512f) [0x7fd1a11c612f]
2       0x7fd1a1256ac8 tensorrt_llm::plugins::GemmPlugin::GemmPlugin(void const*, unsigned long, std::shared_ptr<tensorrt_llm::plugins::CublasLtGemmPluginProfiler> const&) + 440
3       0x7fd1a1256bdf tensorrt_llm::plugins::GemmPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 191
4       0x7fd1e8b118a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7fd1e8b118a6]
5       0x7fd1e8b0966e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7fd1e8b0966e]
6       0x7fd1e8aa4217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7fd1e8aa4217]
7       0x7fd1e8aa219e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7fd1e8aa219e]
8       0x7fd1e8ab9c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7fd1e8ab9c2b]
9       0x7fd1e8abce32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7fd1e8abce32]
10      0x7fd1e8abd20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7fd1e8abd20c]
11      0x7fd1e8af09b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7fd1e8af09b1]
12      0x7fd1e8af1777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7fd1e8af1777]
13      0x7fd1f7579022 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x127022) [0x7fd1f7579022]
14      0x7fd1f74d29cf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x809cf) [0x7fd1f74d29cf]
15      0x7fd1f74c299f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7099f) [0x7fd1f74c299f]
16      0x7fd1f74b9222 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x67222) [0x7fd1f74b9222]
17      0x7fd1f749acbc /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x48cbc) [0x7fd1f749acbc]
18      0x7fd1f749bd62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x49d62) [0x7fd1f749bd62]
19      0x7fd1f748bbb5 TRITONBACKEND_ModelInstanceInitialize + 101
20      0x7fd2b1589226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7fd2b1589226]
21      0x7fd2b158a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7fd2b158a466]
22      0x7fd2b156d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7fd2b156d165]
23      0x7fd2b156d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7fd2b156d7a6]
24      0x7fd2b1579a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7fd2b1579a1d]
25      0x7fd2b0be4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fd2b0be4ee8]
26      0x7fd2b1563feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7fd2b1563feb]
27      0x7fd2b1573dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7fd2b1573dc5]
28      0x7fd2b1578d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7fd2b1578d36]
29      0x7fd2b1669330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7fd2b1669330]
30      0x7fd2b166ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7fd2b166ca23]
31      0x7fd2b17c0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7fd2b17c0d82]
32      0x7fd2b0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fd2b0e4f253]
33      0x7fd2b0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fd2b0bdfac3]
34      0x7fd2b0c70814 clone + 68
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:418)
1       0x7fd1a11c612f /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x4512f) [0x7fd1a11c612f]
2       0x7fd1a1232846 tensorrt_llm::plugins::GPTAttentionPluginCommon::GPTAttentionPluginCommon(void const*, unsigned long) + 870
3       0x7fd1a12498b3 tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(void const*, unsigned long) + 19
4       0x7fd1a1249932 tensorrt_llm::plugins::GPTAttentionPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 50
5       0x7fd1e8b118a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7fd1e8b118a6]
6       0x7fd1e8b0966e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7fd1e8b0966e]
7       0x7fd1e8aa4217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7fd1e8aa4217]
8       0x7fd1e8aa219e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7fd1e8aa219e]
9       0x7fd1e8ab9c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7fd1e8ab9c2b]
10      0x7fd1e8abce32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7fd1e8abce32]
11      0x7fd1e8abd20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7fd1e8abd20c]
12      0x7fd1e8af09b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7fd1e8af09b1]
13      0x7fd1e8af1777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7fd1e8af1777]
14      0x7fd1f7579022 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x127022) [0x7fd1f7579022]
15      0x7fd1f74d29cf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x809cf) [0x7fd1f74d29cf]
16      0x7fd1f74c299f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7099f) [0x7fd1f74c299f]
17      0x7fd1f74b9222 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x67222) [0x7fd1f74b9222]
18      0x7fd1f749acbc /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x48cbc) [0x7fd1f749acbc]
19      0x7fd1f749bd62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x49d62) [0x7fd1f749bd62]
20      0x7fd1f748bbb5 TRITONBACKEND_ModelInstanceInitialize + 101
21      0x7fd2b1589226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7fd2b1589226]
22      0x7fd2b158a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7fd2b158a466]
23      0x7fd2b156d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7fd2b156d165]
24      0x7fd2b156d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7fd2b156d7a6]
25      0x7fd2b1579a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7fd2b1579a1d]
26      0x7fd2b0be4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fd2b0be4ee8]
27      0x7fd2b1563feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7fd2b1563feb]
28      0x7fd2b1573dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7fd2b1573dc5]
29      0x7fd2b1578d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7fd2b1578d36]
30      0x7fd2b1669330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7fd2b1669330]
31      0x7fd2b166ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7fd2b166ca23]
32      0x7fd2b17c0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7fd2b17c0d82]
33      0x7fd2b0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fd2b0e4f253]
34      0x7fd2b0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fd2b0bdfac3]
35      0x7fd2b0c70814 clone + 68
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gemmPlugin/gemmPlugin.cpp:156)
1       0x7fd1a11c612f /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x4512f) [0x7fd1a11c612f]
2       0x7fd1a1256ac8 tensorrt_llm::plugins::GemmPlugin::GemmPlugin(void const*, unsigned long, std::shared_ptr<tensorrt_llm::plugins::CublasLtGemmPluginProfiler> const&) + 440
3       0x7fd1a1256bdf tensorrt_llm::plugins::GemmPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 191
4       0x7fd1e8b118a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7fd1e8b118a6]
5       0x7fd1e8b0966e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7fd1e8b0966e]
6       0x7fd1e8aa4217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7fd1e8aa4217]
7       0x7fd1e8aa219e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7fd1e8aa219e]
8       0x7fd1e8ab9c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7fd1e8ab9c2b]
9       0x7fd1e8abce32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7fd1e8abce32]
10      0x7fd1e8abd20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7fd1e8abd20c]
11      0x7fd1e8af09b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7fd1e8af09b1]
12      0x7fd1e8af1777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7fd1e8af1777]
13      0x7fd1f7579022 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x127022) [0x7fd1f7579022]
14      0x7fd1f74d29cf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x809cf) [0x7fd1f74d29cf]
15      0x7fd1f74c299f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7099f) [0x7fd1f74c299f]
16      0x7fd1f74b9222 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x67222) [0x7fd1f74b9222]
17      0x7fd1f749acbc /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x48cbc) [0x7fd1f749acbc]
18      0x7fd1f749bd62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x49d62) [0x7fd1f749bd62]
19      0x7fd1f748bbb5 TRITONBACKEND_ModelInstanceInitialize + 101
20      0x7fd2b1589226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7fd2b1589226]
21      0x7fd2b158a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7fd2b158a466]
22      0x7fd2b156d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7fd2b156d165]
23      0x7fd2b156d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7fd2b156d7a6]
24      0x7fd2b1579a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7fd2b1579a1d]
25      0x7fd2b0be4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fd2b0be4ee8]
26      0x7fd2b1563feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7fd2b1563feb]
27      0x7fd2b1573dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7fd2b1573dc5]
28      0x7fd2b1578d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7fd2b1578d36]
29      0x7fd2b1669330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7fd2b1669330]
30      0x7fd2b166ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7fd2b166ca23]
31      0x7fd2b17c0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7fd2b17c0d82]
32      0x7fd2b0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fd2b0e4f253]
33      0x7fd2b0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fd2b0bdfac3]
34      0x7fd2b0c70814 clone + 68
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gemmPlugin/gemmPlugin.cpp:156)
1       0x7fd1a11c612f /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x4512f) [0x7fd1a11c612f]
2       0x7fd1a1256ac8 tensorrt_llm::plugins::GemmPlugin::GemmPlugin(void const*, unsigned long, std::shared_ptr<tensorrt_llm::plugins::CublasLtGemmPluginProfiler> const&) + 440
3       0x7fd1a1256bdf tensorrt_llm::plugins::GemmPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 191
4       0x7fd1e8b118a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7fd1e8b118a6]
5       0x7fd1e8b0966e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7fd1e8b0966e]
6       0x7fd1e8aa4217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7fd1e8aa4217]
7       0x7fd1e8aa219e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7fd1e8aa219e]
8       0x7fd1e8ab9c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7fd1e8ab9c2b]
9       0x7fd1e8abce32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7fd1e8abce32]
10      0x7fd1e8abd20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7fd1e8abd20c]
11      0x7fd1e8af09b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7fd1e8af09b1]
12      0x7fd1e8af1777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7fd1e8af1777]
13      0x7fd1f7579022 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x127022) [0x7fd1f7579022]
14      0x7fd1f74d29cf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x809cf) [0x7fd1f74d29cf]
15      0x7fd1f74c299f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7099f) [0x7fd1f74c299f]
16      0x7fd1f74b9222 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x67222) [0x7fd1f74b9222]
17      0x7fd1f749acbc /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x48cbc) [0x7fd1f749acbc]
18      0x7fd1f749bd62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x49d62) [0x7fd1f749bd62]
19      0x7fd1f748bbb5 TRITONBACKEND_ModelInstanceInitialize + 101
20      0x7fd2b1589226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7fd2b1589226]
21      0x7fd2b158a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7fd2b158a466]
22      0x7fd2b156d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7fd2b156d165]
23      0x7fd2b156d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7fd2b156d7a6]
24      0x7fd2b1579a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7fd2b1579a1d]
25      0x7fd2b0be4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fd2b0be4ee8]
26      0x7fd2b1563feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7fd2b1563feb]
27      0x7fd2b1573dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7fd2b1573dc5]
28      0x7fd2b1578d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7fd2b1578d36]
29      0x7fd2b1669330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7fd2b1669330]
30      0x7fd2b166ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7fd2b166ca23]
31      0x7fd2b17c0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7fd2b17c0d82]
32      0x7fd2b0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fd2b0e4f253]
33      0x7fd2b0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fd2b0bdfac3]
34      0x7fd2b0c70814 clone + 68
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gemmPlugin/gemmPlugin.cpp:156)
1       0x7fd1a11c612f /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x4512f) [0x7fd1a11c612f]
2       0x7fd1a1256ac8 tensorrt_llm::plugins::GemmPlugin::GemmPlugin(void const*, unsigned long, std::shared_ptr<tensorrt_llm::plugins::CublasLtGemmPluginProfiler> const&) + 440
3       0x7fd1a1256bdf tensorrt_llm::plugins::GemmPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 191
4       0x7fd1e8b118a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7fd1e8b118a6]
5       0x7fd1e8b0966e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7fd1e8b0966e]
6       0x7fd1e8aa4217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7fd1e8aa4217]
7       0x7fd1e8aa219e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7fd1e8aa219e]
8       0x7fd1e8ab9c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7fd1e8ab9c2b]
9       0x7fd1e8abce32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7fd1e8abce32]
10      0x7fd1e8abd20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7fd1e8abd20c]
11      0x7fd1e8af09b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7fd1e8af09b1]
12      0x7fd1e8af1777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7fd1e8af1777]
13      0x7fd1f7579022 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x127022) [0x7fd1f7579022]
14      0x7fd1f74d29cf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x809cf) [0x7fd1f74d29cf]
15      0x7fd1f74c299f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7099f) [0x7fd1f74c299f]
16      0x7fd1f74b9222 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x67222) [0x7fd1f74b9222]
17      0x7fd1f749acbc /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x48cbc) [0x7fd1f749acbc]
18      0x7fd1f749bd62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x49d62) [0x7fd1f749bd62]
19      0x7fd1f748bbb5 TRITONBACKEND_ModelInstanceInitialize + 101
20      0x7fd2b1589226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7fd2b1589226]
21      0x7fd2b158a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7fd2b158a466]
22      0x7fd2b156d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7fd2b156d165]
23      0x7fd2b156d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7fd2b156d7a6]
24      0x7fd2b1579a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7fd2b1579a1d]
25      0x7fd2b0be4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fd2b0be4ee8]
26      0x7fd2b1563feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7fd2b1563feb]
27      0x7fd2b1573dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7fd2b1573dc5]
28      0x7fd2b1578d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7fd2b1578d36]
29      0x7fd2b1669330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7fd2b1669330]
30      0x7fd2b166ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7fd2b166ca23]
31      0x7fd2b17c0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7fd2b17c0d82]
32      0x7fd2b0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fd2b0e4f253]
33      0x7fd2b0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fd2b0bdfac3]
34      0x7fd2b0c70814 clone + 68
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gemmPlugin/gemmPlugin.cpp:156)
1       0x7fd1a11c612f /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x4512f) [0x7fd1a11c612f]
2       0x7fd1a1256ac8 tensorrt_llm::plugins::GemmPlugin::GemmPlugin(void const*, unsigned long, std::shared_ptr<tensorrt_llm::plugins::CublasLtGemmPluginProfiler> const&) + 440
3       0x7fd1a1256bdf tensorrt_llm::plugins::GemmPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 191
4       0x7fd1e8b118a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7fd1e8b118a6]
5       0x7fd1e8b0966e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7fd1e8b0966e]
6       0x7fd1e8aa4217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7fd1e8aa4217]
7       0x7fd1e8aa219e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7fd1e8aa219e]
8       0x7fd1e8ab9c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7fd1e8ab9c2b]
9       0x7fd1e8abce32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7fd1e8abce32]
10      0x7fd1e8abd20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7fd1e8abd20c]
11      0x7fd1e8af09b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7fd1e8af09b1]
12      0x7fd1e8af1777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7fd1e8af1777]
13      0x7fd1f7579022 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x127022) [0x7fd1f7579022]
14      0x7fd1f74d29cf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x809cf) [0x7fd1f74d29cf]
15      0x7fd1f74c299f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7099f) [0x7fd1f74c299f]
16      0x7fd1f74b9222 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x67222) [0x7fd1f74b9222]
17      0x7fd1f749acbc /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x48cbc) [0x7fd1f749acbc]
18      0x7fd1f749bd62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x49d62) [0x7fd1f749bd62]
19      0x7fd1f748bbb5 TRITONBACKEND_ModelInstanceInitialize + 101
20      0x7fd2b1589226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7fd2b1589226]
21      0x7fd2b158a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7fd2b158a466]
22      0x7fd2b156d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7fd2b156d165]
23      0x7fd2b156d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7fd2b156d7a6]
24      0x7fd2b1579a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7fd2b1579a1d]
25      0x7fd2b0be4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fd2b0be4ee8]
26      0x7fd2b1563feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7fd2b1563feb]
27      0x7fd2b1573dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7fd2b1573dc5]
28      0x7fd2b1578d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7fd2b1578d36]
29      0x7fd2b1669330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7fd2b1669330]
30      0x7fd2b166ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7fd2b166ca23]
31      0x7fd2b17c0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7fd2b17c0d82]
32      0x7fd2b0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fd2b0e4f253]
33      0x7fd2b0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fd2b0bdfac3]
34      0x7fd2b0c70814 clone + 68
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gemmPlugin/gemmPlugin.cpp:156)
1       0x7fd1a11c612f /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x4512f) [0x7fd1a11c612f]
2       0x7fd1a1256ac8 tensorrt_llm::plugins::GemmPlugin::GemmPlugin(void const*, unsigned long, std::shared_ptr<tensorrt_llm::plugins::CublasLtGemmPluginProfiler> const&) + 440
3       0x7fd1a1256bdf tensorrt_llm::plugins::GemmPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 191
4       0x7fd1e8b118a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7fd1e8b118a6]
5       0x7fd1e8b0966e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7fd1e8b0966e]
6       0x7fd1e8aa4217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7fd1e8aa4217]
7       0x7fd1e8aa219e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7fd1e8aa219e]
8       0x7fd1e8ab9c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7fd1e8ab9c2b]
9       0x7fd1e8abce32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7fd1e8abce32]
10      0x7fd1e8abd20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7fd1e8abd20c]
11      0x7fd1e8af09b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7fd1e8af09b1]
12      0x7fd1e8af1777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7fd1e8af1777]
13      0x7fd1f7579022 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x127022) [0x7fd1f7579022]
14      0x7fd1f74d29cf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x809cf) [0x7fd1f74d29cf]
15      0x7fd1f74c299f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7099f) [0x7fd1f74c299f]
16      0x7fd1f74b9222 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x67222) [0x7fd1f74b9222]
17      0x7fd1f749acbc /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x48cbc) [0x7fd1f749acbc]
18      0x7fd1f749bd62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x49d62) [0x7fd1f749bd62]
19      0x7fd1f748bbb5 TRITONBACKEND_ModelInstanceInitialize + 101
20      0x7fd2b1589226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7fd2b1589226]
21      0x7fd2b158a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7fd2b158a466]
22      0x7fd2b156d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7fd2b156d165]
23      0x7fd2b156d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7fd2b156d7a6]
24      0x7fd2b1579a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7fd2b1579a1d]
25      0x7fd2b0be4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fd2b0be4ee8]
26      0x7fd2b1563feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7fd2b1563feb]
27      0x7fd2b1573dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7fd2b1573dc5]
28      0x7fd2b1578d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7fd2b1578d36]
29      0x7fd2b1669330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7fd2b1669330]
30      0x7fd2b166ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7fd2b166ca23]
31      0x7fd2b17c0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7fd2b17c0d82]
32      0x7fd2b0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fd2b0e4f253]
33      0x7fd2b0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fd2b0bdfac3]
34      0x7fd2b0c70814 clone + 68
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:418)
1       0x7fd1a11c612f /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x4512f) [0x7fd1a11c612f]
2       0x7fd1a1232846 tensorrt_llm::plugins::GPTAttentionPluginCommon::GPTAttentionPluginCommon(void const*, unsigned long) + 870
3       0x7fd1a12498b3 tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(void const*, unsigned long) + 19
4       0x7fd1a1249932 tensorrt_llm::plugins::GPTAttentionPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 50
5       0x7fd1e8b118a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7fd1e8b118a6]
6       0x7fd1e8b0966e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7fd1e8b0966e]
7       0x7fd1e8aa4217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7fd1e8aa4217]
8       0x7fd1e8aa219e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7fd1e8aa219e]
9       0x7fd1e8ab9c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7fd1e8ab9c2b]
10      0x7fd1e8abce32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7fd1e8abce32]
11      0x7fd1e8abd20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7fd1e8abd20c]
12      0x7fd1e8af09b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7fd1e8af09b1]
13      0x7fd1e8af1777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7fd1e8af1777]
14      0x7fd1f7579022 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x127022) [0x7fd1f7579022]
15      0x7fd1f74d29cf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x809cf) [0x7fd1f74d29cf]
16      0x7fd1f74c299f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7099f) [0x7fd1f74c299f]
17      0x7fd1f74b9222 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x67222) [0x7fd1f74b9222]
18      0x7fd1f749acbc /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x48cbc) [0x7fd1f749acbc]
19      0x7fd1f749bd62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x49d62) [0x7fd1f749bd62]
20      0x7fd1f748bbb5 TRITONBACKEND_ModelInstanceInitialize + 101
21      0x7fd2b1589226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7fd2b1589226]
22      0x7fd2b158a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7fd2b158a466]
23      0x7fd2b156d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7fd2b156d165]
24      0x7fd2b156d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7fd2b156d7a6]
25      0x7fd2b1579a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7fd2b1579a1d]
26      0x7fd2b0be4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fd2b0be4ee8]
27      0x7fd2b1563feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7fd2b1563feb]
28      0x7fd2b1573dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7fd2b1573dc5]
29      0x7fd2b1578d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7fd2b1578d36]
30      0x7fd2b1669330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7fd2b1669330]
31      0x7fd2b166ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7fd2b166ca23]
32      0x7fd2b17c0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7fd2b17c0d82]
33      0x7fd2b0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fd2b0e4f253]
34      0x7fd2b0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fd2b0bdfac3]
35      0x7fd2b0c70814 clone + 68
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gemmPlugin/gemmPlugin.cpp:156)
1       0x7fd1a11c612f /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x4512f) [0x7fd1a11c612f]
2       0x7fd1a1256ac8 tensorrt_llm::plugins::GemmPlugin::GemmPlugin(void const*, unsigned long, std::shared_ptr<tensorrt_llm::plugins::CublasLtGemmPluginProfiler> const&) + 440
3       0x7fd1a1256bdf tensorrt_llm::plugins::GemmPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 191
4       0x7fd1e8b118a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7fd1e8b118a6]
5       0x7fd1e8b0966e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7fd1e8b0966e]
6       0x7fd1e8aa4217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7fd1e8aa4217]
7       0x7fd1e8aa219e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7fd1e8aa219e]
8       0x7fd1e8ab9c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7fd1e8ab9c2b]
9       0x7fd1e8abce32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7fd1e8abce32]
10      0x7fd1e8abd20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7fd1e8abd20c]
11      0x7fd1e8af09b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7fd1e8af09b1]
12      0x7fd1e8af1777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7fd1e8af1777]
13      0x7fd1f7579022 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x127022) [0x7fd1f7579022]
14      0x7fd1f74d29cf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x809cf) [0x7fd1f74d29cf]
15      0x7fd1f74c299f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7099f) [0x7fd1f74c299f]
16      0x7fd1f74b9222 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x67222) [0x7fd1f74b9222]
17      0x7fd1f749acbc /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x48cbc) [0x7fd1f749acbc]
18      0x7fd1f749bd62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x49d62) [0x7fd1f749bd62]
19      0x7fd1f748bbb5 TRITONBACKEND_ModelInstanceInitialize + 101
20      0x7fd2b1589226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7fd2b1589226]
21      0x7fd2b158a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7fd2b158a466]
22      0x7fd2b156d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7fd2b156d165]
23      0x7fd2b156d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7fd2b156d7a6]
24      0x7fd2b1579a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7fd2b1579a1d]
25      0x7fd2b0be4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fd2b0be4ee8]
26      0x7fd2b1563feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7fd2b1563feb]
27      0x7fd2b1573dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7fd2b1573dc5]
28      0x7fd2b1578d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7fd2b1578d36]
29      0x7fd2b1669330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7fd2b1669330]
30      0x7fd2b166ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7fd2b166ca23]
31      0x7fd2b17c0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7fd2b17c0d82]
32      0x7fd2b0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fd2b0e4f253]
33      0x7fd2b0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fd2b0bdfac3]
34      0x7fd2b0c70814 clone + 68
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gemmPlugin/gemmPlugin.cpp:156)
1       0x7fd1a11c612f /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x4512f) [0x7fd1a11c612f]
2       0x7fd1a1256ac8 tensorrt_llm::plugins::GemmPlugin::GemmPlugin(void const*, unsigned long, std::shared_ptr<tensorrt_llm::plugins::CublasLtGemmPluginProfiler> const&) + 440
3       0x7fd1a1256bdf tensorrt_llm::plugins::GemmPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 191
4       0x7fd1e8b118a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7fd1e8b118a6]
5       0x7fd1e8b0966e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7fd1e8b0966e]
6       0x7fd1e8aa4217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7fd1e8aa4217]
7       0x7fd1e8aa219e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7fd1e8aa219e]
8       0x7fd1e8ab9c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7fd1e8ab9c2b]
9       0x7fd1e8abce32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7fd1e8abce32]
10      0x7fd1e8abd20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7fd1e8abd20c]
11      0x7fd1e8af09b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7fd1e8af09b1]
12      0x7fd1e8af1777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7fd1e8af1777]
13      0x7fd1f7579022 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x127022) [0x7fd1f7579022]
14      0x7fd1f74d29cf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x809cf) [0x7fd1f74d29cf]
15      0x7fd1f74c299f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7099f) [0x7fd1f74c299f]
16      0x7fd1f74b9222 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x67222) [0x7fd1f74b9222]
17      0x7fd1f749acbc /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x48cbc) [0x7fd1f749acbc]
18      0x7fd1f749bd62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x49d62) [0x7fd1f749bd62]
19      0x7fd1f748bbb5 TRITONBACKEND_ModelInstanceInitialize + 101
20      0x7fd2b1589226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7fd2b1589226]
21      0x7fd2b158a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7fd2b158a466]
22      0x7fd2b156d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7fd2b156d165]
23      0x7fd2b156d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7fd2b156d7a6]
24      0x7fd2b1579a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7fd2b1579a1d]
25      0x7fd2b0be4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fd2b0be4ee8]
26      0x7fd2b1563feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7fd2b1563feb]
27      0x7fd2b1573dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7fd2b1573dc5]
28      0x7fd2b1578d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7fd2b1578d36]
29      0x7fd2b1669330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7fd2b1669330]
30      0x7fd2b166ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7fd2b166ca23]
31      0x7fd2b17c0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7fd2b17c0d82]
32      0x7fd2b0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fd2b0e4f253]
33      0x7fd2b0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fd2b0bdfac3]
34      0x7fd2b0c70814 clone + 68
...

Triton Information
What version of Triton are you using?
Triton: 2.41
tensorrtllm_backend: 0.8.0

Are you using the Triton container or did you build it yourself?
i used docker: nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3 - running on ubuntu 22.04

To Reproduce
all the steps to reproduce are described here: https://github.com/mtezgider/triton-tensorrt-llm-model-preparation-and-deployment
than i started the server: tritonserver --model-repository=/tensorrt/triton-repos/trtibf-Trendyol-LLM-7b-chat-v1.0 --model-control-mode=explicit --load-model=preprocessing --load-model=postprocessing --load-model=tensorrt_llm --load-model=tensorrt_llm_bls --load-model=ensemble --log-verbose=2 --log-info=1 --log-warning=1 --log-error=1

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
see here: https://github.com/mtezgider/triton-tensorrt-llm-model-preparation-and-deployment

Expected behavior
A clear and concise description of what you expected to happen.
no error running the model

the entire logs:
logs.txt

The text was updated successfully, but these errors were encountered:

rmccorm4 · 2024-04-15T21:37:50Z

Hi @geraldstanje,

Thanks for raising this issue.

I believe this error generally indicates a version mismatch issue:

[TensorRT-LLM][ERROR] Assertion failed: d == a + length

You mentioned the following environment:

Triton: 2.41
tensorrtllm_backend: 0.8.0

However, Triton v2.41 (23.12) is built for TRT-LLM backend v0.7.0 per the release notes: https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-23-12.html#rel-23-12

If you'd like to use TRT-LLM v0.8.0, I recommend using Triton 24.03 or 24.02 which were built and tested for TRT-LLM version v0.8.0.

Please let us know if this fixes your issue.

geraldstanje · 2024-04-17T03:31:34Z

@rmccorm4 thanks for your reply - can i use the following on ubuntu 20.04 host?

gpu (nvidia-smi on my ubuntu 20.04 host):
docker image: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
- sudo docker run -it --ipc=host --gpus all --ulimit memlock=-1 --shm-size="2g" nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 /bin/bash
tensorrtllm_backend: 0.8.0
models:
- Llama-2-7b-chat (https://huggingface.co/meta-llama/Llama-2-7b-chat)
- Llama-2-7b-chat-hf (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
- Llama-2-7B-Chat-fp16 (https://huggingface.co/TheBloke/Llama-2-7B-Chat-fp16)
  will compile model according to: https://medium.com/trendyol-tech/deploying-a-large-language-model-llm-with-tensorrt-llm-on-triton-inference-server-a-step-by-step-d53fccc856fa - with tf_size=4

i will rerun after you confirm it.

rmccorm4 · 2024-04-17T17:29:00Z

Hi @geraldstanje, Triton 24.02 + TRTLLM v0.8.0 should work. The 7b models should likely fit on a single GPU with 24GB memory, but you can use tensor parallelism to split across gpus based on your use case.

geraldstanje · 2024-04-17T17:58:17Z

@rmccorm4 any issues regarding the ubuntu 20.04 host or cuda version 12.2 on the host? i plan to run the docker image: sudo docker run -it --ipc=host --gpus all --ulimit memlock=-1 --shm-size="2g" nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 /bin/bash

can i run any of the models above?

rmccorm4 · 2024-04-17T18:26:42Z

I don't believe the Ubuntu 20.04 host should be an issue, as the container will have the required Ubuntu 22.04 inside.

As for the CUDA/driver version, see this note from the tritonserver release notes:

Driver Requirements
Release 24.02 is based on CUDA 12.3.2, which requires NVIDIA Driver release 545 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 470.57 (or later R470), 525.85 (or later R525), 535.86 (or later R535), or 545.23 (or later R545).

The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, R450, R460, R510, and R520 drivers, which are not forward-compatible with CUDA 12.3. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Since you have a datacenter GPU (A10G), and driver R535.161* on the host from your screenshot, it should be compatible based on However, if you are running on a data center GPU ... you can use NVIDIA driver release ... 535.86 (or later R535). If not compatible for some reason, the container should print a banner with a descriptive error when you start the container. Please try it out and let us know if it doesn't work for some reason.

geraldstanje · 2024-04-22T17:06:44Z

Hi @rmccorm4 @Tabrizian

i still see the problem using nvcr.io/nvidia/tritionserver:24.02-trtllm-python-py3:

[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.

more infos from inside the docker container:

lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.3 LTS
Release:	22.04
Codename:	jammy

nvidia-smi
Mon Apr 22 17:00:40 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1B.0 Off |                    0 |
|  0%   17C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                    On  | 00000000:00:1C.0 Off |                    0 |
|  0%   16C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G                    On  | 00000000:00:1D.0 Off |                    0 |
|  0%   16C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   16C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

model building:

./llama2_llm_tensorrt_engine_build_and_test.sh 
[TensorRT-LLM] TensorRT-LLM version: 0.8.00.8.0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.36s/it]
Weights loaded. Total time: 00:00:10
Total time of converting checkpoints: 00:02:05
[TensorRT-LLM] TensorRT-LLM version: 0.8.0[04/22/2024-16:40:34] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set lookup_plugin to None.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set lora_plugin to None.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set context_fmha to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set remove_input_padding to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set multi_block_mode to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set enable_xqa to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/22/2024-16:40:34] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/22/2024-16:40:34] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 183, GPU 256 (MiB)
[04/22/2024-16:41:24] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1798, GPU +312, now: CPU 2117, GPU 568 (MiB)
[04/22/2024-16:41:24] [TRT-LLM] [I] Set nccl_plugin to None.
[04/22/2024-16:41:24] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-16:41:25] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/22/2024-16:41:25] [TRT] [W] Unused Input: position_ids
[04/22/2024-16:41:25] [TRT] [W] Detected layernorm nodes in FP16.
[04/22/2024-16:41:25] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[04/22/2024-16:41:25] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/22/2024-16:41:25] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2153, GPU 594 (MiB)
[04/22/2024-16:41:25] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 2155, GPU 604 (MiB)
[04/22/2024-16:41:25] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/22/2024-16:41:25] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/22/2024-16:41:35] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[04/22/2024-16:41:35] [TRT] [I] Detected 106 inputs and 1 output network tensors.
[04/22/2024-16:41:40] [TRT] [I] Total Host Persistent Memory: 82640
[04/22/2024-16:41:40] [TRT] [I] Total Device Persistent Memory: 0
[04/22/2024-16:41:40] [TRT] [I] Total Scratch Memory: 537001984
[04/22/2024-16:41:40] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 619 steps to complete.
[04/22/2024-16:41:40] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 24.3962ms to assign 12 blocks to 619 nodes requiring 3238006272 bytes.
[04/22/2024-16:41:40] [TRT] [I] Total Activation Memory: 3238006272
[04/22/2024-16:41:40] [TRT] [I] Total Weights Memory: 13476831232
[04/22/2024-16:41:40] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2192, GPU 13474 (MiB)
[04/22/2024-16:41:40] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2193, GPU 13484 (MiB)
[04/22/2024-16:41:40] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/22/2024-16:41:40] [TRT] [I] Engine generation completed in 15.4387 seconds.
[04/22/2024-16:41:40] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 12853 MiB
[04/22/2024-16:41:40] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +12853, now: CPU 0, GPU 12853 (MiB)
[04/22/2024-16:41:47] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 28514 MiB
[04/22/2024-16:41:47] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:22
[04/22/2024-16:41:48] [TRT-LLM] [I] Serializing engine to /tensorrt/tensorrt-models/Llama-2-7b-chat-hf/v0.8.0/trt-engines/fp16/1-gpu/rank0.engine...
[04/22/2024-16:42:09] [TRT-LLM] [I] Engine serialized. Total time: 00:00:21
[04/22/2024-16:42:10] [TRT-LLM] [I] Total time of building all engines: 00:01:36
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 12855 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13001, GPU 13130 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 13002, GPU 13140 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12852, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][WARNING] The value of maxAttentionWindow cannot exceed maxSequenceLength. Therefore, it has been adjusted to match the value of maxSequenceLength.
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13035, GPU 16242 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 13035, GPU 16250 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] Allocate 5972688896 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 11392 tokens in paged KV cache.
[TensorRT-LLM] TensorRT-LLM version: 0.8.0Input [Text 0]: "<s> [INST] What is deep learning? [/INST]"
Output [Text 0 Beam 0]: " Deep learning is a subfield of machine learning that involves the use of artificial neural networks to model and solve complex problems. Here are some key things to know about deep learning:

1. Artificial Neural Networks (ANNs): Deep learning algorithms are based on artificial neural networks, which are modeled after the structure and function of the human brain. ANNs consist of interconnected nodes or neurons that process inputs and produce outputs.
2. Multi-Layer Perceptron (MLP): The most common type of deep learning algorithm is the multi-layer perceptron (MLP), which consists of multiple layers of neurons with nonlinear activation functions. Each layer processes the output from the previous layer, allowing the network to learn increasingly complex patterns in the data.
3. Convolutional Neural Networks (CNNs): CNNs are a type of deep learning algorithm specifically designed for image recognition tasks. They use convolutional and pooling layers to extract features from images, followed by fully connected layers to make predictions.
4. Recurrent Neural Networks (RNNs): RNNs are a type of deep learning algorithm used for sequential data, such as"

llama2_llm_tensorrt_engine_build_and_test.sh looks like this:

#!/bin/bash

HF_MODEL_NAME="Llama-2-7b-chat-hf"
HF_MODEL_PATH="meta-llama/Llama-2-7b-chat-h"
# Clone the Hugging Face model repository
# ...
# Convert the model checkpoint to TensorRT format
python /tensorrt/v0.8.0/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /tensorrt/models/$HF_MODEL_NAME \
    --output_dir /tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-checkpoints/fp16/1-gpu/ \
    --dtype float16
# Build TensorRT engine
trtllm-build --checkpoint_dir /tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-checkpoints/fp16/1-gpu/ \
    --output_dir /tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-engines/fp16/1-gpu/ \
    --remove_input_padding enable \
    --context_fmha enable \
    --gemm_plugin float16 \
    --max_input_len 32768 \
    --strongly_typed
# Run inference with the TensorRT engine
python3 /tensorrt/v0.8.0/tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --max_output_len=250 \
    --tokenizer_dir /tensorrt/models/$HF_MODEL_NAME \
    --engine_dir=/tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-engines/fp16/1-gpu/ \
    --max_attention_window_size=4096 \
    --temperature=0.3 \
    --top_k=50 \
    --top_p=0.9 \
    --repetition_penalty=1.2 \
    --input_text="[INST] What is deep learning? [/INST]"

also what i notices is when i measure the latency of of the run.py - it takes 21 seconds to run it - why is that so slow?

time python3 /tensorrt/v0.8.0/tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --max_output_len=250 \
    --tokenizer_dir /tensorrt/models/$HF_MODEL_NAME \
    --engine_dir=/tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-engines/fp16/1-gpu/ \
    --max_attention_window_size=4096 \
    --temperature=0.3 \
    --top_k=50 \
    --top_p=0.9 \
    --repetition_penalty=1.2 \
    --input_text="[INST] What is deep learning? [/INST]"

...

real   0m21.735s
user  0m11.898s
sys    0m14.218s

Thanks,
Gerald

rmccorm4 · 2024-04-22T18:12:06Z

Hi @geraldstanje, for questions about running the engine directly (outside of Triton) via run.py and specific details of the standalone engine performance, I would reach out in the TRT-LLM github channel: https://github.com/NVIDIA/TensorRT-LLM/issues

geraldstanje · 2024-04-22T19:40:37Z

@rmccorm4 what about these warnings here? if see these warnings - compiling the model with tp_size = 4 would not work than...

[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.

rmccorm4 · 2024-04-22T19:50:45Z

@fpetrini15 @krishung5 do you know anything about these multi-gpu engine build warnings?

My assumption is that this is saying multi-gpu performance may be degraded without direct p2p access like NVLink, but may otherwise be functional? But will let others who know more comment. Otherwise this is a question for the TRT-LLM team as well.

krishung5 · 2024-04-22T19:59:16Z

It looks like your GPU doesn't support peer-to-peer access. Could you run nvidia-smi topo -m to see if that's the case? I did have a similar issue before where my GPUs don't support peer access:

root@g242-p33-0002:/opt/tritonserver/tensorrtllm_backend/ci/L0_backend_trtllm# nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     0-79    0               N/A
GPU1    SYS      X      0-79    0               N/A

The way to resolve the runtime issue for me was just to add this flag --use_custom_all_reduce=disable when building the engines. For more detailed info, I would suggest asking TRT-LLM team.

geraldstanje · 2024-04-22T20:05:08Z

@krishung5 here is my gpu topo - it looks like they have p2p access via PHB?

nvidia-smi topo -m

       GPU0   GPU1   GPU2   GPU3   CPU Affinity  NUMA Affinity GPU NUMA ID
GPU0   X     PHB    PHB    PHB    0-47   0             N/A
GPU1   PHB    X     PHB    PHB    0-47   0             N/A
GPU2   PHB    PHB    X     PHB    0-47   0             N/A
GPU3   PHB    PHB    PHB    X     0-47   0             N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

can i still use tp_size = 4 and use all gpus?

krishung5 · 2024-04-22T22:43:58Z

@geraldstanje I think it might also require nvlinks for p2p access - not sure about this part, should have more clarification from the TRT-LLM GitHub channel.

From my experience, I was able to specify tp_size and use all gpus by using this flag --use_custom_all_reduce=disable when building the engines.

geraldstanje · 2024-04-22T23:23:04Z

@krishung5 sure lets way for trt-llm people to look at it - can you show me what you used exactly in the meantime?

krishung5 · 2024-04-22T23:37:29Z

@geraldstanje Sure thing! I'm using the command in the README as example. Basically just adding the last line when building engines:

# Build TensorRT engines
trtllm-build --checkpoint_dir ./c-model/gpt2/fp16/4-gpu \
        --gpt_attention_plugin float16 \
        --remove_input_padding enable \
        --paged_kv_cache enable \
        --gemm_plugin float16 \
        --output_dir engines/fp16/4-gpu \
       --use_custom_all_reduce=disable  # Add this line

For the question for TRT-LLM team, can you file a separate GitHub issue for this topic in the TRT-LLM channel? I believe this will be faster to get a respond from them.

geraldstanje · 2024-04-27T21:02:13Z

@krishung5 thanks for quick reply. i created an issue for the TRT-LLM team: NVIDIA/TensorRT-LLM#1487 - they said its only a warning and it should work still for 1 or 4 gpus?

geraldstanje changed the title ~~error running example~~ error running simple example Apr 15, 2024

rmccorm4 added the module: backends Issues related to the backends label Apr 15, 2024

Tabrizian assigned Tabrizian and unassigned Tabrizian Apr 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

error running simple example #7118

error running simple example #7118

geraldstanje commented Apr 15, 2024 •

edited

rmccorm4 commented Apr 15, 2024

geraldstanje commented Apr 17, 2024 •

edited

rmccorm4 commented Apr 17, 2024

geraldstanje commented Apr 17, 2024 •

edited

rmccorm4 commented Apr 17, 2024

geraldstanje commented Apr 22, 2024 •

edited

rmccorm4 commented Apr 22, 2024

geraldstanje commented Apr 22, 2024 •

edited

rmccorm4 commented Apr 22, 2024

krishung5 commented Apr 22, 2024

geraldstanje commented Apr 22, 2024 •

edited

krishung5 commented Apr 22, 2024

geraldstanje commented Apr 22, 2024

krishung5 commented Apr 22, 2024

geraldstanje commented Apr 27, 2024 •

edited

error running simple example #7118

error running simple example #7118

Comments

geraldstanje commented Apr 15, 2024 • edited

rmccorm4 commented Apr 15, 2024

geraldstanje commented Apr 17, 2024 • edited

rmccorm4 commented Apr 17, 2024

geraldstanje commented Apr 17, 2024 • edited

rmccorm4 commented Apr 17, 2024

geraldstanje commented Apr 22, 2024 • edited

rmccorm4 commented Apr 22, 2024

geraldstanje commented Apr 22, 2024 • edited

rmccorm4 commented Apr 22, 2024

krishung5 commented Apr 22, 2024

geraldstanje commented Apr 22, 2024 • edited

krishung5 commented Apr 22, 2024

geraldstanje commented Apr 22, 2024

krishung5 commented Apr 22, 2024

geraldstanje commented Apr 27, 2024 • edited

geraldstanje commented Apr 15, 2024 •

edited

geraldstanje commented Apr 17, 2024 •

edited

geraldstanje commented Apr 17, 2024 •

edited

geraldstanje commented Apr 22, 2024 •

edited

geraldstanje commented Apr 22, 2024 •

edited

geraldstanje commented Apr 22, 2024 •

edited

geraldstanje commented Apr 27, 2024 •

edited