Is your feature request related to a problem? Please describe.
At present, text generation is supported only in http_server.cc and not in sagemaker_server.cc. This was verified using the vLLM backend with Triton server. http_server.cc supports it by implementing HandleGenerate, which allows the use of decoupled models (which vLLM backend models are).
Describe the solution you'd like
Implement the equivalent of HandleGenerate for sagemaker_server.cc.
Describe alternatives you've considered
Using alternative servers (like DJLServing) with vLLM/TensorRT-LLM, or different stacks altogether (e.g. HuggingFace TGI).
Elaborating on this further:
Certain backends (e.g. vLLM) currently run only with the decoupled model transaction policy. The inference function in sagemaker_server.cc checks for this and fails any call to a model that runs with the decoupled transaction policy.
http_server.cc, on the other hand, has a few functions for inference. HandleInfer performs the same check and fails if the model runs with the decoupled transaction policy. HandleGenerate does not perform that check, and is designed for text generation purposes. Hence, seeking advice/assistance to implement a HandleGenerate equivalent for sagemaker_server.cc.