Hi, I recently set my inference service up with a liveness check and noticed the pod being restarted. On closer inspection I realise this is because when an inference call is made it blocks the server, which I think I tracked down to this call: kserve/python/kserve/kserve/model.py, lines 141 to 145 (at c9570d6), where the predict function is just called directly if it is not a coroutine. I have a custom predictor that follows the example here: https://kserve.github.io/website/master/modelserving/v1beta1/custom/custom_model/ so predict is sync as it stands. I can see this by simply calling the live endpoint myself while an inference is running, and indeed the request hangs until completion.

If the server is under load, it can be busy getting through all the inference requests that have come in, and the liveness check cannot get a response in time, so the pod is restarted. These are not large inference requests, just a < 6 MB PNG and an inference that takes maybe a second. I was also seeing strange drops and odd behaviour in the queue-proxy.

I wanted to ask: what is the intended design for keeping the server responsive? I can probably think of a few ways I could go.
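For concreteness, the predictor is roughly of this shape, following the linked custom model example. This is a simplified sketch, not my actual code: the class name, placeholder model, and request handling are made up, and the exact predict signature can differ slightly between kserve versions:

```python
from typing import Dict

from kserve import Model, ModelServer


class MyModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.ready = False
        self.load()

    def load(self):
        # Placeholder for real model loading.
        self.model = lambda png_bytes: {"label": "cat"}
        self.ready = True

    # A plain `def`, so KServe calls it directly on the event loop:
    # while this runs (~1 s per image), nothing else is served,
    # including the liveness endpoint.
    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        png_bytes = payload["instances"][0]
        return {"predictions": [self.model(png_bytes)]}


if __name__ == "__main__":
    ModelServer().start([MyModel("my-model")])
```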
Any advice on this topic would be greatly appreciated! Thanks!
-
I will add notes here in case someone else ends up in the same confusion as me; maybe this conversation with myself will help. I sat down again, read through all the KServe and Knative docs to understand it better, and now I see my errors.

I've managed to get a performant server that is able to cope with bursts with no errors, for which I have done:

- Set containerConcurrency. The nature of the sync model call blocking the event loop means that I should have containerConcurrency=1: the container can only deal with one request at a time, and this is a hard limit (see the spec sketch below).

I also experimented with making the inference async-friendly out of interest, to keep the server responsive, but as I expected the performance was not great and 5XX responses were returned. This really needs a separate worker process, but I couldn't get, for example, Ray Serve to work at the moment. The probes are not ideal, being kind of squeezed in between model calls, but it seems performant enough for my needs. I do see there is a proposal for async inferencing; I will follow that to see how it goes!
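For reference, the containerConcurrency setting mentioned above sits in the InferenceService spec roughly like this. This is a minimal sketch rather than my actual manifest; the service name and image are placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model                      # placeholder name
spec:
  predictor:
    # Hard cap on in-flight requests per pod; with a sync predict that
    # blocks the event loop, one at a time is all the container can handle.
    containerConcurrency: 1
    containers:
      - name: kserve-container
        image: registry.example.com/my-model:latest   # placeholder image
```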
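And the async-friendly idea, sketched in generic form: declare predict as a coroutine and push the blocking model call onto a thread pool so the event loop (and the probes) stays responsive. This is one possible way to do it rather than the exact code I ran; the class name and placeholder model are made up, and for a CPU-bound model the GIL still limits throughput, which is why a separate worker process (e.g. Ray Serve) is the more thorough fix:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

from kserve import Model, ModelServer


class AsyncFriendlyModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self._executor = ThreadPoolExecutor(max_workers=1)
        self.model = None
        self.ready = False
        self.load()

    def load(self):
        # Placeholder for real model loading.
        self.model = lambda payload: {"label": "cat"}
        self.ready = True

    def _predict_sync(self, payload: Dict) -> Dict:
        # The blocking ~1 s inference now runs on the worker thread,
        # not on the event loop.
        return {"predictions": [self.model(payload)]}

    async def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, self._predict_sync, payload)


if __name__ == "__main__":
    ModelServer().start([AsyncFriendlyModel("my-model")])
```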