Error occurred when running medusa inference. #1575

Open · littletomatodonkey opened this issue May 10, 2024 · 1 comment

@littletomatodonkey
Hi, when I use medusa decoding on trtllm-0.9.0 with profiling, the error below occurs. Could you please help take a look? Thanks!

If I do not use --run_profiling, the inference process is normal.

  File "/opt/tiger/miniconda3/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py", line 2431, in handle_per_step
    self.accept_lengths).item()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[05/10/2024-21:45:33] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:345] Error 700 destroying stream '0x56052ab0a0f0'.)
[05/10/2024-21:45:33] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:345] Error 700 destroying stream '0x56052ab69e70'.)
[05/10/2024-21:45:33] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:345] Error 700 destroying stream '0x56052abb95f0'.)
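
As the PyTorch message says, the stack trace above may point at the wrong frame because CUDA errors are reported asynchronously; re-running with synchronous kernel launches should localize the faulting call. A sketch (same flags as the inference command below, with ${medusa_choices} standing in for the full choices list):

CUDA_LAUNCH_BLOCKING=1 python ../run.py \
--engine_dir ${trt_model_dir} \
--tokenizer_dir ${trt_model_dir} \
--max_output_len=100 \
--medusa_choices="${medusa_choices}" \
--use_py_session \
--temperature 1.0 \
--input_text "Once upon" \
--run_profiling
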
  • conversion
model_dir="/mnt/bn/multimodel/models/medusa/vicuna-7b-v1.3"
medusa_model_dir="/mnt/bn/multimodel/models/medusa/medusa-vicuna-7b-v1.3"

tmp_dir=$(mktemp -d)

trt_model_dir="./output/medusa"

python convert_checkpoint.py \
--model_dir "${model_dir}" \
--medusa_model_dir "${medusa_model_dir}" \
--output_dir "${tmp_dir}" \
--dtype float16 \
--fixed_num_medusa_heads 4
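
To sanity-check the conversion before building, the checkpoint directory can be inspected (a quick sketch; the config.json layout and the num_medusa_heads key are from tensorrt_llm 0.9 checkpoints and may differ in other versions):

ls ${tmp_dir}                                                  # expect config.json plus rank*.safetensors
grep -o '"num_medusa_heads": *[0-9]*' ${tmp_dir}/config.json   # should report 4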

trtllm-build \
--checkpoint_dir ${tmp_dir} \
--output_dir ${trt_model_dir} \
--gemm_plugin float16 \
--gpt_attention_plugin float16 \
--remove_input_padding enable \
--context_fmha enable \
--max_batch_size 16 \
--max_input_len 4096 \
--max_output_len 1024 \
--paged_kv_cache enable \
--use_paged_context_fmha enable

# copy tokenizer files next to the built engine so run.py can load them from trt_model_dir
cp -r ${model_dir}/*token* ${trt_model_dir}/
  • inference
trt_model_dir="./output/medusa"

python ../run.py \
--engine_dir ${trt_model_dir} \
--tokenizer_dir ${trt_model_dir} \
--max_output_len=100 \
--medusa_choices="[[0], [0, 0], [1], [0, 1], [2], [0, 0, 0], [1, 0], [0, 2], [3], [0, 3], [4], [0, 4], [2, 0], [0, 5], [0, 0, 1], [5], [0, 6], [6], [0, 7], [0, 1, 0], [1, 1], [7], [0, 8], [0, 0, 2], [3, 0], [0, 9], [8], [9], [1, 0, 0], [0, 2, 0], [1, 2], [0, 0, 3], [4, 0], [2, 1], [0, 0, 4], [0, 0, 5], [0, 0, 0, 0], [0, 1, 1], [0, 0, 6], [0, 3, 0], [5, 0], [1, 3], [0, 0, 7], [0, 0, 8], [0, 0, 9], [6, 0], [0, 4, 0], [1, 4], [7, 0], [0, 1, 2], [2, 0, 0], [3, 1], [2, 2], [8, 0], [0, 5, 0], [1, 5], [1, 0, 1], [0, 2, 1], [9, 0], [0, 6, 0], [0, 0, 0, 1], [1, 6], [0, 7, 0]]" \
--use_py_session \
--temperature 1.0 \
--input_text "Once upon" \
--run_profiling
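
For context, each inner list in --medusa_choices is a path in the speculation tree: position i in a path indexes into the top-k candidates proposed by the i-th medusa head (following the tree construction in the Medusa paper). A quick sketch over an abbreviated prefix of the list above; running it on the full list reports the whole tree's size and depth:

python -c 'import ast; c = ast.literal_eval("[[0], [0, 0], [1], [0, 1], [2]]"); print("nodes:", len(c), "max depth:", max(map(len, c)))'
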
@dongxuy04 (Collaborator)
Hi, I tried with the latest main and it seems OK; could you please try that? Thanks!
BTW, with the latest main, the C++ runtime can also be used by removing --use_py_session.
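
For example (a sketch of the same command with --use_py_session dropped; ${medusa_choices} again stands in for the full choices list):

python ../run.py \
--engine_dir ${trt_model_dir} \
--tokenizer_dir ${trt_model_dir} \
--max_output_len=100 \
--medusa_choices="${medusa_choices}" \
--temperature 1.0 \
--input_text "Once upon" \
--run_profiling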
