psutil 5.9.6 seems to be throwing ZombieProcess when retrieving the mms process #132

Open
charlietruong-wk opened this issue Oct 23, 2023 · 5 comments

Comments

@charlietruong-wk

Describe the bug
We use a custom image for our SageMaker endpoint. On Friday, Oct 20, 2023, the endpoint became unstable after a re-deploy. It seems that the latest version of psutil, 5.9.6, raises ZombieProcess more frequently, which causes the server to restart. As a result, the endpoint occasionally returns non-200 responses when predictions are requested.

The cause may be this psutil fix, which changes what psutil recognizes as a ZombieProcess:
giampaolo/psutil#2288

We resolved our issue by rolling back to psutil 5.9.5. I'm unsure whether sagemaker-inference should pin the psutil version in its package or whether the fix needs to be made here:

https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276
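
One possible shape for a fix on the sagemaker-inference side is to skip any process whose command line cannot be read, instead of letting the exception escape the lookup. The following is a minimal sketch, not the toolkit's actual code: the helper names `find_processes_by_cmdline` and `find_mms_processes` are hypothetical, and the namespace string is supplied by the caller. It relies only on psutil's `process_iter()`, `Process.cmdline()`, and the `ZombieProcess`, `NoSuchProcess`, and `AccessDenied` exceptions.

```python
def find_processes_by_cmdline(processes, needle, ignored_errors):
    """Return the processes whose cmdline() list contains `needle`.

    `processes` is any iterable of objects exposing a cmdline() method.
    `ignored_errors` is a tuple of exception types; a process that raises
    one of them is skipped rather than aborting the whole scan.
    """
    matches = []
    for process in processes:
        try:
            cmdline = process.cmdline()
        except ignored_errors:
            # e.g. a zombie: it cannot be the running model server, so skip it.
            continue
        if needle in cmdline:
            matches.append(process)
    return matches


def find_mms_processes(namespace):
    """Hypothetical wrapper that applies the scan to live processes via psutil."""
    import psutil  # imported lazily so the helper above has no hard dependency

    return find_processes_by_cmdline(
        psutil.process_iter(),
        namespace,
        (psutil.ZombieProcess, psutil.NoSuchProcess, psutil.AccessDenied),
    )
```

With this approach an unrelated zombie process no longer fails the lookup; the caller would still decide what to do when zero or several matches are returned.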

To reproduce
Create a custom sagemaker endpoint image with psutil 5.9.6 and deploy it.

Expected behavior
The model endpoint is stable and consistently returns successful predictions and the ZombieProcess exception is not being raised frequently.

Screenshots or logs
Here is a traceback we are seeing:

  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 99, in start_model_server
    mms_process = _retry_retrieve_mms_server_process(env.startup_timeout)
  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 199, in _retry_retrieve_mms_server_process
    return retrieve_mms_server_process()
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 206, in _retrieve_mms_server_process
    if MMS_NAMESPACE in process.cmdline():
  File "/usr/local/lib64/python3.8/site-packages/psutil/__init__.py", line 702, in cmdline
    return self._proc.cmdline()
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1650, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1788, in cmdline
    self._raise_if_zombie()
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1693, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)

System information

  • sagemaker inference version 1.5.11
  • custom docker image based on amazon linux 2
    • framework name: scikit-learn
    • framework version: 1.0.2
    • Python version: 3.8
    • processing unit type: cpu

Additional context
n/a

@parthvadhadiya

I am having the same issue with sagemaker-inference 1.10.1 and multi-model-server 1.1.11.

@andre-marcos-perez

Same problem with sagemaker-inference 1.7.1 and multi-model-server 1.1.8.

@parthvadhadiya

Try updating the Python version as well. I also updated the Ubuntu version of my Docker image. @andre-marcos-perez

@andre-marcos-perez

Hey, installing psutil version 5.9.5 first worked.

RUN pip3 install --upgrade pip && \
    pip3 install multi-model-server==1.1.8 && \
    pip3 install psutil==5.9.5 && \
    pip3 install sagemaker-inference==1.7.1
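
To confirm that the pin actually took effect inside the image, a startup check along these lines could be used. It is only a sketch under assumptions: the helper names are hypothetical, and the expected version is simply the 5.9.5 pin from the snippet above. It uses `importlib.metadata` from the standard library (Python 3.8+).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned):
    """True only when a version was found and equals the pinned one exactly."""
    return installed is not None and installed == pinned


def check_psutil_pin(pinned="5.9.5"):
    # "5.9.5" is the version this thread reports as working; adjust as needed.
    found = installed_version("psutil")
    if not matches_pin(found, pinned):
        print(f"warning: psutil {found!r} installed, expected {pinned!r}")
    return found
```

Calling `check_psutil_pin()` before starting the model server would surface a later dependency resolution that silently replaced the pinned version.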

@andre-marcos-perez

Likely solved by #133
