Describe the bug
We use a custom image for our SageMaker endpoint, and on Friday, Oct 20, 2023, we experienced instability in our endpoint after re-deploying. It seems that the latest version of psutil, 5.9.6, throws ZombieProcess more frequently, causing the server to restart. As a result, the endpoint occasionally returns non-200 responses when predictions are requested.
The cause may be this change on the psutil side to what it recognizes as a ZombieProcess: giampaolo/psutil#2288
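We were able to resolve our issue by rolling back to psutil 5.9.5. I'm unsure whether sagemaker-inference should pin the version of psutil in your package or whether the fix needs to be made here: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276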
To reproduce
Create a custom SageMaker endpoint image with psutil 5.9.6 and deploy it.
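For a quicker local check that does not need a deployed endpoint, something like the sketch below might help. It assumes a POSIX system (it uses os.fork) and that the zombie case seen in the traceback is what triggers the problem; probe_zombie_cmdline is a hypothetical helper name, not part of psutil or sagemaker-inference. It forks a child that exits immediately, leaves it unreaped so it becomes a zombie, and reports whether cmdline() returns or raises ZombieProcess under the installed psutil version.

```python
import os
import time

import psutil


def probe_zombie_cmdline(timeout=5.0):
    """Create a short-lived zombie child and report how cmdline() behaves on it.

    Hypothetical helper for local experiments only (POSIX, relies on os.fork).
    """
    pid = os.fork()
    if pid == 0:
        # Child: exit at once. Until the parent reaps it, it stays a zombie.
        os._exit(0)

    try:
        proc = psutil.Process(pid)

        # Wait (bounded) until the kernel reports the child as a zombie.
        deadline = time.monotonic() + timeout
        while proc.status() != psutil.STATUS_ZOMBIE and time.monotonic() < deadline:
            time.sleep(0.01)
        status = proc.status()

        try:
            proc.cmdline()
            outcome = "returned"
        except psutil.ZombieProcess:
            outcome = "ZombieProcess"

        return {"psutil": psutil.__version__, "status": status, "outcome": outcome}
    finally:
        # Reap the child so the probe does not leave a zombie behind.
        os.waitpid(pid, 0)


if __name__ == "__main__":
    print(probe_zombie_cmdline())
```

Running this once with psutil 5.9.5 and once with 5.9.6 installed should show whether the two versions treat the same zombie process differently.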
Expected behavior
The model endpoint is stable and consistently returns successful predictions, and the ZombieProcess exception is not raised frequently.
Screenshots or logs
Here is a traceback we are seeing:
  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 99, in start_model_server
    mms_process = _retry_retrieve_mms_server_process(env.startup_timeout)
  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 199, in _retry_retrieve_mms_server_process
    return retrieve_mms_server_process()
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 206, in _retrieve_mms_server_process
    if MMS_NAMESPACE in process.cmdline():
  File "/usr/local/lib64/python3.8/site-packages/psutil/__init__.py", line 702, in cmdline
    return self._proc.cmdline()
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1650, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1788, in cmdline
    self._raise_if_zombie()
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1693, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)
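The traceback shows the exception escaping from the process.cmdline() call inside the loop that searches for the model server process. One possible fix on the toolkit side, sketched here under assumptions, is to skip any process whose command line cannot be read rather than letting the exception abort the search. find_mms_processes is a hypothetical name and this is not the toolkit's actual code; the MMS_NAMESPACE value is also an assumption about what the toolkit matches on.

```python
import psutil

# Assumed value; the real constant lives in sagemaker_inference.model_server.
MMS_NAMESPACE = "com.amazonaws.ml.mms.ModelServer"


def find_mms_processes(processes=None, namespace=MMS_NAMESPACE):
    """Return the processes whose command line contains `namespace`.

    Processes that are zombies, have already exited, or cannot be inspected
    are skipped instead of aborting the whole scan.
    """
    if processes is None:
        processes = psutil.process_iter()

    matches = []
    for process in processes:
        try:
            if namespace in process.cmdline():
                matches.append(process)
        except (psutil.ZombieProcess, psutil.NoSuchProcess, psutil.AccessDenied):
            # An unrelated zombie or vanished process should not break the lookup.
            continue
    return matches
```

With a guard like this, an unrelated zombie process on the host would no longer cause the lookup to fail, regardless of which psutil version is installed. Whether skipping is the right behavior when the model server process itself is the zombie is a separate question for the maintainers.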
System information
sagemaker inference version 1.5.11
custom docker image based on amazon linux 2
framework name: scikit-learn
framework version: 1.0.2
Python version: 3.8
processing unit type: cpu
Additional context
n/a