
triton model breaks serving instance #60

Open
stephanbertl opened this issue Jul 18, 2023 · 4 comments · May be fixed by #76

@stephanbertl

We have set up ClearML Serving on Kubernetes, including Triton support. Our Triton instance has no GPU, so deploying a model leads to the following error in the Triton instance:

```
E0718 07:41:21.083440 30 model_lifecycle.cc:596] failed to load 'distilbert-test2' version 1: Invalid argument: unable to load model 'distilbert-test2', TensorRT backend supports only GPU device
```
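For context, Triton hits this error because the model is stored for the TensorRT backend, which requires a GPU; a CPU-capable backend avoids it. A minimal `config.pbtxt` sketch, assuming an ONNX export of the model (the name and batch size are illustrative, and the input/output tensor definitions are omitted):

```
# Hypothetical CPU-only model config (names illustrative, not from this issue)
name: "distilbert-test2"
backend: "onnxruntime"    # CPU-capable backend; the TensorRT backend is GPU-only
max_batch_size: 8
instance_group [
  {
    count: 1
    kind: KIND_CPU        # run model instances on CPU instead of GPU
  }
]
```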

Trying to remove the model again is not possible:
```
clearml-serving --id 5097f44fe9cb45f7be2a917c6fe8cad9 model remove --endpoint distilbert-test2
```

yields the following:

```
clearml-serving - CLI for launching ClearML serving engine
2023-07-18 09:47:59,260 - clearml.Task - ERROR - Failed reloading task 5097f44fe9cb45f7be2a917c6fe8cad9
2023-07-18 09:47:59,290 - clearml.Task - ERROR - Failed reloading task 5097f44fe9cb45f7be2a917c6fe8cad9

Error: Task ID "5097f44fe9cb45f7be2a917c6fe8cad9" could not be found
```
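A minimal recovery sketch, assuming the serving task ID has simply gone stale (the placeholder ID is illustrative): list the existing serving service controllers first, then retry the removal against a valid ID:

```
# List running clearml-serving service controllers to find a valid task ID
clearml-serving list

# Retry the endpoint removal against that ID (placeholder shown)
clearml-serving --id <valid-serving-task-id> model remove --endpoint distilbert-test2
```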

In general, our observation is that serving is not resilient against these kinds of problems. A broken model should not break the whole instance.

@jkhenning
Member

Hi @stephanbertl, thanks for this report. We will look into it 🙂

@stephanbertl
Author

Any update? The serving module seems very unstable: a single model that is not working brings down the whole serving server. How is that supposed to work in production?

@jkhenning
Member

Hi @stephanbertl, I have not managed to reproduce this. Can you perhaps provide some more information? Specifically, I assume you're using the serving Helm chart, is that correct? Can you share how you configured it?

@stephanbertl
Author

@jkhenning sorry for not coming back to you earlier.

I would say the culprit is the tritonserver default value of `--exit-on-error=true`.

I quickly checked the code and could not find a way to set this in clearml-serving.
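For illustration, this is the behavior in question on a bare tritonserver invocation (a sketch only; clearml-serving constructs the tritonserver command internally, so this is not a supported configuration path in the chart):

```
# With --exit-on-error=false, tritonserver keeps running and serves the
# models that loaded successfully, instead of terminating the whole
# instance when one model (e.g. a CPU-only TensorRT model) fails to load.
tritonserver --model-repository=/models --exit-on-error=false
```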

stephanbertl pushed a commit to stephanbertl/clearml-serving that referenced this issue May 3, 2024
@stephanbertl stephanbertl linked a pull request May 3, 2024 that will close this issue