Questions about serving PyTorch LLM in Python backend with token streaming using "Decoupled Mode" #7210
Comments

Thanks for your questions!
Let us know if you have any additional questions!

Hi @Tabrizian, thank you very much for your detailed reply and insights! I have a few follow-up questions:
Thanks a lot for your time and help again!

I see, thank you very much for your answers and insights!
I am planning to use Triton's Python backend to serve an LLM model in PyTorch; more specifically, I want to implement token streaming, and based on the suggestions I read in #5913 I think I am going to use "Decoupled Mode" to achieve this. I went over the square_model.py example (https://github.com/triton-inference-server/python_backend/blob/main/examples/decoupled/square_model.py) and have some questions about "Decoupled Mode" and the Python backend in general, and I would really appreciate your help and insights! Thanks for your time in advance!

Questions:

1. Regarding max_batch_size: 8 and count: 4 in the instance group config, does it mean the max batch size for each one of the model instances is 8 (e.g. batch size of 8 for instance 0, batch size of 8 for instance 1, batch size of 8 for instance 2, batch size of 8 for instance 3), or does it mean the max batch size for all 4 model instances in combination is 8 (e.g. batch size of 2 for instance 0, batch size of 3 for instance 1, batch size of 1 for instance 2, batch size of 2 for instance 3)? (See the config sketch after this list.)

2. Regarding the square_model.py example (https://github.com/triton-inference-server/python_backend/blob/main/examples/decoupled/square_model.py) in particular, I see that we are spawning a new thread to handle each request in the batch of requests. My question is: is threading in decoupled mode required, or is it meant for demonstration purposes only? In other words, is it valid to handle each request in the batch sequentially in a loop, without using threading? I ask because I am not sure whether the LLM PyTorch model I am using is thread-safe during inference (I could put a lock around each thread's token generation, but I am not sure whether that would impact server performance). (See the sequential sketch after this list.)

3. Regarding the TritonPythonModel class, and assuming I only use 1 model instance: if I use the PyTorch model to perform inference on each thread, the model's weights are only loaded once on the GPU, right? In other words, if my model is 4 GB and I have 2 threads serving a batch of 2 requests, the model is still only loaded once and hence only takes up 4 GB of GPU memory, right?

4. Regarding the square_model.py example in particular, I see that we set the spawned thread to be a daemon thread with the line thread.daemon = True. Is this line always necessary/required? I ask because I also read in that file's comments that "In real-world models, the developer should be mindful of when to return from execute and be willing to accept next request batch." So if I only want the server to accept the next request batch after all requests in the current batch have been completed, should I use join() to wait for the spawned threads to complete before calling return None in execute()? (See the join() sketch after this list.)
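For reference, the kind of configuration Question 1 describes would look roughly like the sketch below: a config.pbtxt that combines max_batch_size: 8 with an instance_group of count: 4 and the decoupled transaction policy. Only the two values quoted in the question come from this issue; the model name, backend, and KIND_GPU setting are assumptions added for illustration.

```
# Hypothetical config.pbtxt sketch; only max_batch_size and count are taken
# from the question above, everything else is an assumed placeholder.
name: "llm_python_model"
backend: "python"
max_batch_size: 8

model_transaction_policy {
  decoupled: true
}

instance_group [
  {
    count: 4
    kind: KIND_GPU
  }
]
```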
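To make Question 2 concrete, here is a minimal sketch of a sequential, thread-free decoupled execute(): each request in the batch is handled in a plain loop and its responses are pushed through that request's response sender. This is not the square_model.py example itself; the tensor names IN and OUT and the _generate_tokens helper are hypothetical placeholders for the real LLM decoding loop.

```python
# Minimal sketch, assuming a decoupled Python backend model with string
# inputs/outputs named "IN"/"OUT" (placeholder names, not from the issue).
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # Handle each request in the batch sequentially, without threads.
        for request in requests:
            sender = request.get_response_sender()
            prompt = pb_utils.get_input_tensor_by_name(request, "IN").as_numpy()

            # Placeholder for token-by-token generation with the PyTorch model.
            for token in self._generate_tokens(prompt):
                out = pb_utils.Tensor("OUT", np.array([token], dtype=np.object_))
                sender.send(pb_utils.InferenceResponse(output_tensors=[out]))

            # Signal that this request will receive no further responses.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

        # Decoupled models return None from execute(); responses are delivered
        # through the response senders instead.
        return None

    def _generate_tokens(self, prompt):
        # Hypothetical stand-in for the actual LLM decoding loop.
        yield "hello"
        yield "world"
```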
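And here is a minimal sketch of the pattern Question 4 asks about: spawning one daemon thread per request, as square_model.py does, but join()-ing the threads before execute() returns so that the next batch is only accepted once the current one has finished. The _handle_request helper is a hypothetical stand-in for the per-request streaming logic; whether this blocking behaviour is the right trade-off is exactly what the question raises.

```python
# Minimal sketch of "spawn per-request threads, then join() before returning".
import threading


class TritonPythonModel:
    def execute(self, requests):
        threads = []
        for request in requests:
            t = threading.Thread(target=self._handle_request, args=(request,))
            t.daemon = True
            t.start()
            threads.append(t)

        # Block until every request in this batch has been completed before
        # returning and letting Triton send the next batch to this instance.
        for t in threads:
            t.join()
        return None

    def _handle_request(self, request):
        # Hypothetical placeholder: stream responses through
        # request.get_response_sender() here, then send the FINAL flag.
        ...
```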