triton model breaks serving instance #60
Comments
Hi @stephanbertl, thanks for this report. We will look into it 🙂
Any update? The serving module seems totally unstable: a single model that fails to load breaks the whole serving server. How is that supposed to work in prod?
Hi @stephanbertl, I have not managed to reproduce this. Can you perhaps provide some more information? Specifically, I assume you're using the serving helm chart, is that correct? Can you share how you configured it?
@jkhenning sorry for not getting back to you earlier. I would say the culprit is the tritonserver default value of --exit-on-error=true. I quickly checked the code and could not find a way to set this in clearml-serving.
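To illustrate the flag in question, here is a minimal sketch of the command line for launching tritonserver directly with --exit-on-error=false. The helper name `build_triton_command` and the `/models` repository path are hypothetical, and the sketch does not show how clearml-serving itself starts Triton; it only shows the shape of the arguments.

```python
import shlex


def build_triton_command(model_repository, exit_on_error=False):
    """Return a tritonserver argv list (hypothetical helper, for illustration)."""
    return [
        "tritonserver",
        # Assumed model repository location; adjust to the actual mount path.
        f"--model-repository={model_repository}",
        # With "false", a model that fails to load should not stop the server.
        f"--exit-on-error={'true' if exit_on_error else 'false'}",
    ]


command = build_triton_command("/models")
# Print the command as a single shell-quoted line for inspection.
print(shlex.join(command))
```

The command is only built and printed here, not executed, so the sketch can be checked without a Triton installation.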
We have set up clearml-serving on Kubernetes, including Triton support. Our Triton instance has no GPU, so deploying a model leads to the following error in the Triton instance:
E0718 07:41:21.083440 30 model_lifecycle.cc:596] failed to load 'distilbert-test2' version 1: Invalid argument: unable to load model 'distilbert-test2', TensorRT backend supports only GPU device
Removing the model afterwards is not possible:
clearml-serving --id 5097f44fe9cb45f7be2a917c6fe8cad9 model remove --endpoint distilbert-test2
yields the following:
`clearml-serving - CLI for launching ClearML serving engine
2023-07-18 09:47:59,260 - clearml.Task - ERROR - Failed reloading task 5097f44fe9cb45f7be2a917c6fe8cad9
2023-07-18 09:47:59,290 - clearml.Task - ERROR - Failed reloading task 5097f44fe9cb45f7be2a917c6fe8cad9
Error: Task ID "5097f44fe9cb45f7be2a917c6fe8cad9" could not be found
`
In general, our observation is that the serving is not resilient against this kind of problem. A broken model should not break the instance.
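One way to tell a single broken model apart from a broken instance is to query Triton's HTTP readiness endpoints for the server and for the model separately. The sketch below assumes Triton's HTTP endpoint is reachable at `http://localhost:8000` (an assumption about this deployment) and uses two hypothetical helper names, `model_ready_url` and `is_ready`.

```python
import urllib.error
import urllib.request


def model_ready_url(base_url, model_name):
    """Build the per-model readiness URL (KServe v2 HTTP protocol)."""
    return f"{base_url.rstrip('/')}/v2/models/{model_name}/ready"


def server_ready_url(base_url):
    """Build the server-wide readiness URL."""
    return f"{base_url.rstrip('/')}/v2/health/ready"


def is_ready(url, timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection errors and non-2xx responses both count as "not ready".
        return False


# Assumed address of the Triton HTTP endpoint; adjust for the actual service.
BASE_URL = "http://localhost:8000"
MODEL_URL = model_ready_url(BASE_URL, "distilbert-test2")
SERVER_URL = server_ready_url(BASE_URL)
```

Calling `is_ready(SERVER_URL)` and `is_ready(MODEL_URL)` against a running instance would show whether only the model is unavailable or the whole server is down.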