Hi, I recently set my inference service up with a liveness check and noticed the pod being restarted. On closer inspection I realise this is because when an inference call is made it blocks the server, which I think I tracked down to this call: kserve/python/kserve/kserve/model.py, lines 141 to 145 (at c9570d6), where the predict function is just called directly if it is not a coroutine. I have a custom predictor that follows the example here: https://kserve.github.io/website/master/modelserving/v1beta1/custom/custom_model/ so predict is sync as it stands. I can see this by simply calling the live endpoint myself while an inference is running, and indeed the request hangs until completion.

If the server is under load, it can be busy getting through all the inference requests that have come in, and the liveness check cannot get a response in time, so the pod is restarted. These are not large inference requests, just a < 6 MB PNG and an inference that takes maybe a second. I was also seeing strange drops and odd behaviour in the queue-proxy.

I wanted to ask: what is the intended design for keeping the server responsive? I can probably think of a few ways I could go.
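For concreteness, the predictor is roughly of this shape, following the linked custom model example. This is a simplified sketch, not my actual code: the class name, placeholder model, and request handling are made up, and the exact predict signature can differ slightly between kserve versions:

```python
from typing import Dict

from kserve import Model, ModelServer


class MyModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.ready = False
        self.load()

    def load(self):
        # Placeholder for real model loading.
        self.model = lambda png_bytes: {"label": "cat"}
        self.ready = True

    # A plain `def`, so KServe calls it directly on the event loop:
    # while this runs (~1 s per image), nothing else is served,
    # including the liveness endpoint.
    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        png_bytes = payload["instances"][0]
        return {"predictions": [self.model(png_bytes)]}


if __name__ == "__main__":
    ModelServer().start([MyModel("my-model")])
```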
Any advice on this topic would be greatly appreciated! Thanks!
-
I will add notes here in case someone else ends up in the same confusion as me; maybe this conversation with myself will help. I sat down again, read through all the KServe and Knative docs to understand it better, and now I see my errors.

I've managed to get a performant server that is able to cope with bursts with no errors, for which I have done:

- Set containerConcurrency. The nature of the sync model call blocking the event loop means that I should have containerConcurrency=1: the container can only deal with one request at a time, and this is a hard limit (see the spec sketch below).

I also experimented with making the inference async-friendly out of interest, to keep the server responsive, but as I expected the performance was not great and 5XX responses were returned. This really needs a separate worker process, but I couldn't get, for example, Ray Serve to work at the moment. The probes are not ideal, being kind of squeezed in between model calls, but it seems performant enough for my needs. I do see there is a proposal for async inferencing; I will follow that to see how it goes!
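For reference, the containerConcurrency setting mentioned above sits in the InferenceService spec roughly like this. This is a minimal sketch rather than my actual manifest; the service name and image are placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model                      # placeholder name
spec:
  predictor:
    # Hard cap on in-flight requests per pod; with a sync predict that
    # blocks the event loop, one at a time is all the container can handle.
    containerConcurrency: 1
    containers:
      - name: kserve-container
        image: registry.example.com/my-model:latest   # placeholder image
```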
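And the async-friendly idea, sketched in generic form: declare predict as a coroutine and push the blocking model call onto a thread pool so the event loop (and the probes) stays responsive. This is one possible way to do it rather than the exact code I ran; the class name and placeholder model are made up, and for a CPU-bound model the GIL still limits throughput, which is why a separate worker process (e.g. Ray Serve) is the more thorough fix:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

from kserve import Model, ModelServer


class AsyncFriendlyModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self._executor = ThreadPoolExecutor(max_workers=1)
        self.model = None
        self.ready = False
        self.load()

    def load(self):
        # Placeholder for real model loading.
        self.model = lambda payload: {"label": "cat"}
        self.ready = True

    def _predict_sync(self, payload: Dict) -> Dict:
        # The blocking ~1 s inference now runs on the worker thread,
        # not on the event loop.
        return {"predictions": [self.model(payload)]}

    async def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, self._predict_sync, payload)


if __name__ == "__main__":
    ModelServer().start([AsyncFriendlyModel("my-model")])
```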