
Fail to run Mixtral 8x7b with tp size 4 on w4a16 #1596

Closed
gloritygithub11 opened this issue May 14, 2024 · 1 comment
Labels: bug (Something isn't working)

gloritygithub11 commented May 14, 2024

System Info

tensorrt 10.0.1
tensorrt-cu12 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.10.0.dev2024050700

A100 * 4

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The build succeeds with the following script for Mixtral 8x7B:

set -ex

export MODEL_DIR=/mnt/memory
export MODEL_NAME=Mixtral-8x7B-Instruct-v0.1
export LD_LIBRARY_PATH=/usr/local/tensorrt/lib:$LD_LIBRARY_PATH
export PATH=/usr/local/tensorrt/bin:$PATH
export PRECISION=W4A16
export DTYPE=bfloat16
export PYTHONPATH=/app/tensorrt-llm:$PYTHONPATH
export TP_SIZE=4

export CUDA_VISIBLE_DEVICES=0,1,2,3
python ../llama/convert_checkpoint.py \
    --model_dir $MODEL_DIR/${MODEL_NAME} \
    --output_dir $MODEL_DIR/tmp/trt_models/${MODEL_NAME}/$PRECISION/${TP_SIZE}-gpu-tp \
    --dtype $DTYPE \
    --use_weight_only \
    --tp_size ${TP_SIZE} \
    --weight_only_precision int4


trtllm-build \
    --checkpoint_dir $MODEL_DIR/tmp/trt_models/${MODEL_NAME}/$PRECISION/${TP_SIZE}-gpu-tp \
    --output_dir $MODEL_DIR/tmp/trt_engines/${MODEL_NAME}/$PRECISION/${TP_SIZE}-gpu-tp \
    --gemm_plugin $DTYPE \
    --gpt_attention_plugin $DTYPE \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 1024 \
    --max_multimodal_len 576


Run with the following command:

mpirun --allow-run-as-root -n 4 python3 /app/tensorrt-llm/examples/run.py --engine_dir /mnt/memory/tmp/trt_engines/Mixtral-8x7B-Instruct-v0.1/W4A16/4-gpu-tp --tokenizer_dir /mnt/memory/Mixtral-8x7B-Instruct-v0.1 --max_output_len 1024 --input_text "I love french quiche" --run_profiling
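
Note that --max_multimodal_len enables the multimodal / prompt-embedding-table input of the engine, which is presumably what introduces the prompt_embedding_table tensor seen in the error below. A quick way to check what the build recorded is to dump the config file in the engine directory (the same file run.py reads the engine version from); a minimal sketch, assuming python3 is on PATH and keeping in mind the exact config keys may differ across TensorRT-LLM versions:

# Dump the build configuration recorded alongside the engine (first 60 lines).
python3 -m json.tool /mnt/memory/tmp/trt_engines/Mixtral-8x7B-Instruct-v0.1/W4A16/4-gpu-tp/config.json | head -n 60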

Expected behavior

The run succeeds.

Actual behavior

The following error is raised:

[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024050700 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 3
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024050700 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 1
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024050700 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024050700 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 2
[TensorRT-LLM][INFO] MPI size: 4, rank: 0
[TensorRT-LLM][INFO] Loaded engine size: 5878 MiB
[TensorRT-LLM][INFO] Loaded engine size: 5878 MiB
[TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[TensorRT-LLM][INFO] Loaded engine size: 5878 MiB
[TensorRT-LLM][INFO] Loaded engine size: 5878 MiB
[TensorRT-LLM][INFO] Allocated 244.15 MiB for execution context memory.
[TensorRT-LLM][INFO] Allocated 244.15 MiB for execution context memory.
[TensorRT-LLM][INFO] Allocated 244.15 MiB for execution context memory.
[TensorRT-LLM][INFO] Allocated 244.15 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5874 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5874 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5874 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5874 (MiB)
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 9
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 9
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 9
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 9
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 930560. Allocating 30492590080 bytes.
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 930560. Allocating 30492590080 bytes.
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 930560. Allocating 30492590080 bytes.
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 930560. Allocating 30492590080 bytes.
[TensorRT-LLM][WARNING] prompt_embedding_table: expected dim[1] = 4096, provided dim[1] = 1024
[TensorRT-LLM][WARNING] prompt_embedding_table: expected dim[1] = 4096, provided dim[1] = 1024
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2037] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2037, condition: engineDims.d[i] == dims.d[i] Static dimension mismatch while setting input shape.)
Traceback (most recent call last):
  File "/app/tensorrt-llm/examples/run.py", line 595, in <module>
    main(args)
  File "/app/tensorrt-llm/examples/run.py", line 426, in main
    outputs = runner.generate(
  File "/app/tensorrt-llm/tensorrt_llm/runtime/model_runner_cpp.py", line 368, in generate
    self.session.generate(generation_output, generation_input,
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Tensor 'prompt_embedding_table' has invalid shape (1, 1024), expected (-1, 4096) (/app/tensorrt-llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:150)
1       0x7f5e3eccf65a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7f5e407bf224 tensorrt_llm::runtime::TllmRuntime::setInputTensors(int, std::unordered_map<std::string, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&) + 1380
3       0x7f5e4076a1bb tensorrt_llm::runtime::GptSession::executeContextStep(std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, std::vector<int, std::allocator<int> > const&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const*) + 891
4       0x7f5e4076b044 tensorrt_llm::runtime::GptSession::generateBatched(std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, tensorrt_llm::runtime::SamplingConfig const&, std::function<void (int, bool)> const&, std::shared_ptr<tensorrt_llm::runtime::GptSession::GenerationProfiler>) + 2148
5       0x7f5e4076ca35 tensorrt_llm::runtime::GptSession::generate(tensorrt_llm::runtime::GenerationOutput&, tensorrt_llm::runtime::GenerationInput const&, tensorrt_llm::runtime::SamplingConfig const&, std::shared_ptr<tensorrt_llm::runtime::GptSession::GenerationProfiler>) + 2261
6       0x7f5eb4375d58 /app/tensorrt-llm/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb5d58) [0x7f5eb4375d58]
7       0x7f5eb43927ea /app/tensorrt-llm/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xd27ea) [0x7f5eb43927ea]
8             0x53bd79 python3() [0x53bd79]
9             0x629d24 _PyObject_MakeTpCall + 356
10            0x549c2e python3() [0x549c2e]
11            0x5ae603 _PyEval_EvalFrameDefault + 19699
12            0x548efa python3() [0x548efa]
13            0x62893c PyObject_Call + 172
14            0x5ac51b _PyEval_EvalFrameDefault + 11275
15            0x628d60 _PyFunction_Vectorcall + 592
16            0x5a9c1b _PyEval_EvalFrameDefault + 779
17            0x5a8bf1 python3() [0x5a8bf1]
18            0x6d77cf PyEval_EvalCode + 127
19            0x6bb91b python3() [0x6bb91b]
20            0x6bb9a4 python3() [0x6bb9a4]
21            0x6bbde6 python3() [0x6bbde6]
22            0x6c0c84 _PyRun_SimpleFileObject + 404
23            0x6c0d57 _PyRun_AnyFileObject + 71
24            0x7042dd Py_RunMain + 877
25            0x7044bd Py_BytesMain + 45
26      0x7f606c86e083 __libc_start_main + 243
27            0x62ff4e _start + 46
(The remaining three ranks print the same prompt_embedding_table warning, the same API Usage Error, and an identical stack trace.)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[57329,1],1]
  Exit code:    1
--------------------------------------------------------------------------

Additional notes

Running with tp size 8 gives a similar error:

RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Tensor 'prompt_embedding_table' has invalid shape (1, 512), expected (-1, 4096)

In both cases the provided dimension is hidden_size / tp_size (4096 / 4 = 1024, 4096 / 8 = 512), which suggests the prompt embedding table input is being split across the tensor-parallel ranks while the engine expects the full hidden size.
gloritygithub11 added the bug (Something isn't working) label May 14, 2024
gloritygithub11 (Author) commented:

Removing the --max_multimodal_len 576 option resolves the problem.
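
For reference, the working build invocation is the original one with only the multimodal option dropped (paths and environment variables as in the script above):

trtllm-build \
    --checkpoint_dir $MODEL_DIR/tmp/trt_models/${MODEL_NAME}/$PRECISION/${TP_SIZE}-gpu-tp \
    --output_dir $MODEL_DIR/tmp/trt_engines/${MODEL_NAME}/$PRECISION/${TP_SIZE}-gpu-tp \
    --gemm_plugin $DTYPE \
    --gpt_attention_plugin $DTYPE \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 1024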
