
triton model breaks serving instance #60

Open
stephanbertl opened this issue Jul 18, 2023 · 4 comments · May be fixed by #76

@stephanbertl

We have set up ClearML Serving on Kubernetes, including Triton support. Our Triton instance has no GPU, so deploying a model leads to the following error in the Triton instance:

```
E0718 07:41:21.083440 30 model_lifecycle.cc:596] failed to load 'distilbert-test2' version 1: Invalid argument: unable to load model 'distilbert-test2', TensorRT backend supports only GPU device
```
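For context, Triton hits this error because the model is stored for the TensorRT backend, which requires a GPU; a CPU-capable backend avoids it. A minimal `config.pbtxt` sketch, assuming an ONNX export of the model (the name and batch size are illustrative, and the input/output tensor definitions are omitted):

```
# Hypothetical CPU-only model config (names illustrative, not from this issue)
name: "distilbert-test2"
backend: "onnxruntime"    # CPU-capable backend; the TensorRT backend is GPU-only
max_batch_size: 8
instance_group [
  {
    count: 1
    kind: KIND_CPU        # run model instances on CPU instead of GPU
  }
]
```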

Trying to remove the model again is not possible:
```
clearml-serving --id 5097f44fe9cb45f7be2a917c6fe8cad9 model remove --endpoint distilbert-test2
```

yields the following:

```
clearml-serving - CLI for launching ClearML serving engine
2023-07-18 09:47:59,260 - clearml.Task - ERROR - Failed reloading task 5097f44fe9cb45f7be2a917c6fe8cad9
2023-07-18 09:47:59,290 - clearml.Task - ERROR - Failed reloading task 5097f44fe9cb45f7be2a917c6fe8cad9

Error: Task ID "5097f44fe9cb45f7be2a917c6fe8cad9" could not be found
```
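A minimal recovery sketch, assuming the serving task ID has simply gone stale (the placeholder ID is illustrative): list the existing serving service controllers first, then retry the removal against a valid ID:

```
# List running clearml-serving service controllers to find a valid task ID
clearml-serving list

# Retry the endpoint removal against that ID (placeholder shown)
clearml-serving --id <valid-serving-task-id> model remove --endpoint distilbert-test2
```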

In general, our observation is that serving is not resilient against these kinds of problems. A broken model should not break the whole instance.

@jkhenning
Member

Hi @stephanbertl, thanks for this report. We will look into it 🙂

@stephanbertl
Author

Any update? The serving module seems very unstable: a single model that is not working brings down the whole serving server. How is that supposed to work in production?

@jkhenning
Member

Hi @stephanbertl, I have not managed to reproduce this. Can you perhaps provide some more information? Specifically, I assume you're using the serving Helm chart, is that correct? Can you share how you configured it?

@stephanbertl
Author

@jkhenning sorry for not coming back to you earlier.

I would say the culprit is the tritonserver default value of `--exit-on-error=true`.

I quickly checked the code and could not find a way to set this in clearml-serving.
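For illustration, this is the behavior in question on a bare tritonserver invocation (a sketch only; clearml-serving constructs the tritonserver command internally, so this is not a supported configuration path in the chart):

```
# With --exit-on-error=false, tritonserver keeps running and serves the
# models that loaded successfully, instead of terminating the whole
# instance when one model (e.g. a CPU-only TensorRT model) fails to load.
tritonserver --model-repository=/models --exit-on-error=false
```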

stephanbertl pushed a commit to stephanbertl/clearml-serving that referenced this issue May 3, 2024
@stephanbertl stephanbertl linked a pull request May 3, 2024 that will close this issue