I am using Triton in production on H100 GPUs, and I am running into issues where certain requests trigger CUDA errors. These errors usually break the GPU for the lifetime of the process; restarting the container usually resolves the issue.
Hence, I was wondering whether there is a way to restart the container from within when we detect those errors.
Best,
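One way to sketch the "restart from within" idea (a minimal sketch with hypothetical helper names and an illustrative, non-exhaustive list of error strings): classify the error message, and if it looks like a fatal CUDA failure, exit the process with a non-zero code so the container supervisor, assuming a restart policy such as Docker `--restart=always` or a Kubernetes liveness/restart setup, brings up a fresh container.

```python
import sys

# Substrings that typically indicate the CUDA context is unusable for the
# rest of the process lifetime (illustrative list, not exhaustive).
_FATAL_CUDA_PATTERNS = (
    "an illegal memory access was encountered",
    "unspecified launch failure",
    "CUDA_ERROR_ILLEGAL_ADDRESS",
    "device-side assert triggered",
)

def is_unrecoverable_cuda_error(message: str) -> bool:
    """Return True if the error message looks like a fatal CUDA failure."""
    return any(pattern in message for pattern in _FATAL_CUDA_PATTERNS)

def exit_if_fatal(message: str) -> None:
    """Exit non-zero so the container's restart policy replaces the process."""
    if is_unrecoverable_cuda_error(message):
        sys.exit(70)  # any non-zero exit code triggers the restart policy
```

The key design point is that the process does not try to recover the broken CUDA context in place; it deliberately dies and lets the orchestrator handle the restart.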
If we enable strict readiness with model control mode set to `none`, would `/v2/health/ready` return false if one of the models is unhealthy? For instance, when a Python model stub is considered unhealthy.
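As a sketch of wiring that endpoint into a health check (the URL and default port are assumptions based on Triton's standard HTTP setup): Triton answers `GET /v2/health/ready` with HTTP 200 when it considers itself ready, so a probe can treat anything else, including connection failures, as unhealthy.

```python
import urllib.error
import urllib.request

def triton_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True iff Triton's readiness endpoint answers with HTTP 200.

    Connection errors and timeouts are treated as "not ready", which is the
    behavior a liveness/readiness probe usually wants.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

A Kubernetes liveness probe pointed at the same path would then restart the pod automatically whenever readiness stays false, which also addresses the original "restart from within" question without custom code inside the container.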
Thanks for the tremendous work here.