Replies: 1 comment
-
Tracked down the memory problem in my inference, which I guess was heightened by the probes! So I'm closing this particular discussion.
-
Hi, I have a strange issue that I cannot get to the bottom of, and I wondered if anyone could give any pointers.
Our model was being served happily, but every now and then we get 502 Bad Gateways, especially under load.
The model is just a custom predictor that runs a PyTorch segmentation model. Normal RAM usage is around 600 MB and inference takes about a second. It also has a resizing pre-processing step. It is using the v2 protocol.
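For context, a custom predictor like this is typically deployed as a KServe InferenceService whose predictor section points at the custom container. The sketch below is illustrative only; the name, image, and resource figures are assumptions, not our actual spec:

```yaml
# Illustrative sketch of a custom-predictor InferenceService.
# Name, image, and resource figures are assumptions, not the real spec.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: segmentation-model            # hypothetical name
spec:
  predictor:
    containers:
      - name: kserve-container
        image: registry.example.com/segmentation-predictor:latest  # hypothetical image
        resources:
          requests:
            memory: "1Gi"
          limits:
            memory: "2Gi"             # pod is OOM-killed if usage exceeds this limit
```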
While investigating this (I'm still not sure of the answer), I realised our container doesn't define any readiness or liveness checks. My hypothesis was that requests were being sent to the container by the queue-proxy before the model was ready, since the model needs to be downloaded from a cloud provider first.
So I added checks in the form of:
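Roughly along these lines (a sketch rather than the exact snippet: it assumes exec-based probes that curl the v2 health endpoints on port 8080 and check the JSON reply with jq):

```yaml
# Sketch only: exec probes that poll the v2 health endpoints with curl and
# parse the JSON reply with jq. The port, the endpoints' JSON shape, and the
# timing values are assumptions, not taken from the original snippet.
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - curl -sf http://localhost:8080/v2/health/ready | jq -e '.ready == true'
  initialDelaySeconds: 30     # leave time for the model download from cloud storage
  periodSeconds: 10
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - curl -sf http://localhost:8080/v2/health/live | jq -e '.live == true'
  initialDelaySeconds: 60
  periodSeconds: 15
```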
This also involved making sure `jq` was installed in the container image. Now, adding these checks causes the memory of the pod to slowly go up and up, or at least it spikes and at some point doesn't return to the normal level it sits at without the probe checks. Eventually a spike is too big for the memory limit of the pod and it is killed with an OOM.
If I remove the probe checks the pod runs ok.
So I wanted to ask: am I doing something silly here? Is this the best way to define the readiness and liveness checks? Has anyone else ever experienced something like this?
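For comparison, an httpGet probe would avoid spawning a curl and jq process on every check; whether the v2 health endpoints return a plain HTTP status that works this way depends on the serving runtime, so this too is only a sketch (path and port assumed):

```yaml
# Sketch of an httpGet alternative; path, port, and timings are assumptions.
readinessProbe:
  httpGet:
    path: /v2/health/ready
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /v2/health/live
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 15
```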