error running simple example #7118
Comments
Hi @geraldstanje, thanks for raising this issue. I believe this error generally indicates a version mismatch:
You mentioned the following environment:
However, Triton v2.41 (23.12) is built for TRT-LLM backend v0.7.0 per the release notes: https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-23-12.html#rel-23-12. If you'd like to use TRT-LLM v0.8.0, I recommend using Triton 24.03 or 24.02, which were built and tested for TRT-LLM v0.8.0. Please let us know if this fixes your issue.
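For reference, pulling one of the matching containers would look like this (a minimal sketch; pick the tag that matches the TRT-LLM version you want):

```bash
# 24.02 and 24.03 ship with TRT-LLM backend v0.8.0 per the release notes above.
docker pull nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
```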
@rmccorm4 thanks for your reply - can I use the following on an Ubuntu 20.04 host?
I will rerun after you confirm it.
Hi @geraldstanje, Triton 24.02 + TRT-LLM v0.8.0 should work. The 7B models should likely fit on a single GPU with 24GB of memory, but you can use tensor parallelism to split across GPUs based on your use case.
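To illustrate the tensor-parallel build (my sketch, not commands taken from this thread; the model and output paths are placeholders, and the script location assumes the TensorRT-LLM v0.8.0 examples layout):

```bash
# Convert the HF checkpoint into a 4-way tensor-parallel TRT-LLM checkpoint
# (convert_checkpoint.py lives under examples/llama in the TRT-LLM repo).
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./llama-2-7b-hf \
    --output_dir ./ckpt_tp4 \
    --dtype float16 \
    --tp_size 4

# Build one engine per rank from the converted checkpoint.
trtllm-build \
    --checkpoint_dir ./ckpt_tp4 \
    --output_dir ./engines_tp4 \
    --gemm_plugin float16
```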
@rmccorm4 any issues regarding the Ubuntu 20.04 host or CUDA version 12.2 on the host? I plan to run the Docker image: can I run any of the models above?
I don't believe the Ubuntu 20.04 host should be an issue, as the container will have the required Ubuntu 22.04 inside. As for the CUDA/driver version, see this note from the tritonserver release notes:
Since you have a datacenter GPU (A10G), and driver R535.161* on the host from your screenshot, it should be compatible based on that note.
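As a quick sanity check (a generic command, not something from this thread), you can confirm the host driver version before starting the container:

```bash
# R535+ datacenter drivers are forward-compatible with the CUDA 12.x
# toolkit inside the container.
nvidia-smi --query-gpu=driver_version,name --format=csv
```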
I still see the problem using nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3:
More info from inside the Docker container:
model building:
llama2_llm_tensorrt_engine_build_and_test.sh looks like this:
Also, what I noticed is that when I measure the latency of run.py, it takes 21 seconds to run - why is that so slow?
Thanks,
Hi @geraldstanje, for questions about running the engine directly (outside of Triton) via run.py, I'd recommend asking the TRT-LLM team, as that's outside the scope of this backend.
@rmccorm4 what about these warnings here? If I see these warnings, compiling the model with tp_size = 4 would not work then...
@fpetrini15 @krishung5 do you know anything about these multi-GPU engine build warnings? My assumption is that this is saying multi-GPU performance may be degraded without direct P2P access like NVLink, but may otherwise be functional? But I will let others who know more comment. Otherwise this is a question for the TRT-LLM team as well.
It looks like your GPU doesn't support peer-to-peer access. Could you run nvidia-smi topo -m and share the output?
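(For completeness - this is a general nvidia-smi capability, not a command quoted in the thread - the P2P status can also be queried directly on recent drivers:)

```bash
# Interconnect topology between GPUs.
nvidia-smi topo -m
# Peer-to-peer read-capability matrix.
nvidia-smi topo -p2p r
```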
The way to resolve the runtime issue for me was just to add this flag |
@krishung5 here is my GPU topology - it looks like they have P2P access via PHB?
Can I still use tp_size = 4 and use all GPUs?
@geraldstanje I think it might also require NVLink for P2P access - I'm not sure about this part, so we should get more clarification from the TRT-LLM GitHub channel. From my experience, I was able to specify tp_size and use all GPUs by using this flag
@krishung5 sure, let's wait for the TRT-LLM people to look at it - can you show me what you used exactly in the meantime?
@geraldstanje Sure thing! I'm using the command in the README as an example. Basically just adding the last line when building engines:
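(The exact command was not captured in this thread. A plausible reconstruction - assuming the flag being discussed is trtllm-build's --use_custom_all_reduce option, which is commonly set to disable on GPUs without direct P2P access - would be:)

```bash
# Build with the custom all-reduce kernel disabled so the engine does not
# rely on direct P2P access; checkpoint/output paths are placeholders.
trtllm-build \
    --checkpoint_dir ./ckpt_tp4 \
    --output_dir ./engines_tp4 \
    --gemm_plugin float16 \
    --use_custom_all_reduce disable
```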
As for the question for the TRT-LLM team, can you file a separate GitHub issue for this topic in the TRT-LLM channel? I believe that will be the faster way to get a response from them.
@krishung5 thanks for the quick reply. I created an issue for the TRT-LLM team: NVIDIA/TensorRT-LLM#1487 - they said it's only a warning and it should still work for 1 or 4 GPUs?
Description
Running the simple TensorRT-LLM example fails with an error when starting the Triton server (full logs attached below).
Triton Information
What version of Triton are you using?
Triton: 2.41
tensorrtllm_backend: 0.8.0
Are you using the Triton container or did you build it yourself?
I used Docker: nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3 - running on Ubuntu 22.04
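A typical way to start that container (my sketch, not the reporter's exact command; the host mount path is assumed from the model-repository path used below):

```bash
# Expose all GPUs and mount the host directory that holds the model repo.
docker run --rm -it --gpus all \
    -v /tensorrt:/tensorrt \
    nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3
```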
To Reproduce
all the steps to reproduce are described here: https://github.com/mtezgider/triton-tensorrt-llm-model-preparation-and-deployment
Then I started the server:
tritonserver --model-repository=/tensorrt/triton-repos/trtibf-Trendyol-LLM-7b-chat-v1.0 \
    --model-control-mode=explicit \
    --load-model=preprocessing \
    --load-model=postprocessing \
    --load-model=tensorrt_llm \
    --load-model=tensorrt_llm_bls \
    --load-model=ensemble \
    --log-verbose=2 --log-info=1 --log-warning=1 --log-error=1
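Once the server starts, readiness can be verified with Triton's standard HTTP health endpoint (a generic check, not part of the original report):

```bash
# Returns HTTP 200 once all requested models are loaded.
curl -sf localhost:8000/v2/health/ready && echo "server ready"
```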
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
see here: https://github.com/mtezgider/triton-tensorrt-llm-model-preparation-and-deployment
Expected behavior
No error running the model.
The full logs:
logs.txt