RuntimeError: can't start new thread #882

Open
Harald-koeln opened this issue Oct 6, 2023 · 9 comments

@Harald-koeln

Hello, in just one cluster (out of ~20) GPM is not starting (CrashLoopBackOff) with the log output below.
We are using version 0.7.0 and deploy the Helm chart with Argo CD. The Kubernetes version is 1.24.13.
Please let me know if other info is needed.
Any help is appreciated. Thank you!

...
[2023-10-06 07:16:41 +0000] [8] [INFO] In cluster configuration loaded successfully.
[2023-10-06 07:16:41 +0000] [8] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/gunicorn/arbiter.py", line 609, in spawn_worker
worker.init_process()
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/gthread.py", line 95, in init_process
super().init_process()
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/base.py", line 142, in init_process
self.run()
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/gthread.py", line 214, in run
callback(key.fileobj)
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/gthread.py", line 150, in on_client_socket_readable
self.enqueue_req(conn)
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/gthread.py", line 117, in enqueue_req
fs = self.tpool.submit(self.handle, conn)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 176, in submit
self._adjust_thread_count()
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 199, in _adjust_thread_count
t.start()
File "/usr/local/lib/python3.11/threading.py", line 957, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

@ralgozino
Member

Hi @Harald-koeln

This does seem strange; let me ask you some questions to better understand the issue:

  1. Are the rest of the applications running without issues on the cluster?
  2. Are the other clusters that you have running the same version of Kubernetes and GPM?
  3. Do you have any resource limits set? Can you share the values that you are using to install the chart? Please make sure to mask all sensitive info like secrets and URLs.
  4. Is the node where GPM is running healthy?
  5. Are there any other security tools or measures running in the cluster that could prevent GPM from creating threads?

Thanks!

@Harald-koeln
Author

Hi @ralgozino ,

thanks for your reply. I will answer your questions, but I actually no longer need to deploy GPM on that cluster. I found out that trivy-operator had additionally been deployed there by another team. We do not need both, and maybe that is also the reason for the problems.
Concerning your questions:

  1. yes, gatekeeper (and trivy and other applications) are running without issues
  2. yes, they are all running the same versions
  3. no resource limits set by me, only default helm chart values (see below)
  4. node is healthy
  5. trivy (probably responsible for GPM problems)

source:
  chart: gatekeeper-policy-manager
  repoURL: https://sighupio.github.io/gatekeeper-policy-manager
  targetRevision: 0.7.0
  helm:
    releaseName: gatekeeper-policy-manager
    values: |-
      config:
        secretKey: "gatekeeper-policy-manager"
      ingress:
        enabled: true
        hosts:
          - host: "gpm.{{ metadata.annotations.cluster_wildcard_domain }}"
            paths:
              - "/"

@ralgozino
Member

I'm glad you sorted it out. I'll probably run some tests with trivy-operator (I hadn't heard of it before) anyway, to see if there's something we can do to make them compatible.

thanks!

@Harald-koeln
Author

Unfortunately, the same error occurs again on another cluster (without trivy-operator deployed). The Kubernetes version there is older: 1.21.14.
The OS of the nodes is Ubuntu 20.04 and the Docker version is 19.03.6.
Gatekeeper version 0.7.0 is deployed and running without problems.
The error messages are the same as above.

@Harald-koeln reopened this Oct 9, 2023
@ralgozino
Member

Sorry to hear that, @Harald-koeln. Can you please check the pod events for any additional details that could be useful for debugging?

@Harald-koeln
Author

Hi @ralgozino, here are the pod events:

Type     Reason     Age                    From               Message
----     ------     ---                    ----               -------
Normal   Scheduled  7m21s                  default-scheduler  Successfully assigned gatekeeper-system/gatekeeper-policy-manager-75ff6b956-sdsxw to vache-3
Warning  Unhealthy  7m10s                  kubelet            Liveness probe failed: Get "http://10.42.3.168:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  7m1s                   kubelet            Liveness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59484->10.42.3.168:8080: read: connection reset by peer
Warning  Unhealthy  7m1s                   kubelet            Readiness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59486->10.42.3.168:8080: read: connection reset by peer
Warning  Unhealthy  6m51s                  kubelet            Liveness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59514->10.42.3.168:8080: read: connection reset by peer
Warning  Unhealthy  6m51s                  kubelet            Readiness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59512->10.42.3.168:8080: read: connection reset by peer
Warning  Unhealthy  6m46s (x3 over 7m12s)  kubelet            Readiness probe failed: Get "http://10.42.3.168:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  6m41s                  kubelet            Readiness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59542->10.42.3.168:8080: read: connection reset by peer
Warning  Unhealthy  6m41s                  kubelet            Liveness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59544->10.42.3.168:8080: read: connection reset by peer
Warning  Unhealthy  6m31s                  kubelet            Readiness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59566->10.42.3.168:8080: read: connection reset by peer
Normal   Killing    6m21s (x2 over 6m51s)  kubelet            Container gatekeeper-policy-manager failed liveness probe, will be restarted
Warning  Unhealthy  6m21s (x3 over 6m31s)  kubelet            (combined from similar events): Liveness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59590->10.42.3.168:8080: read: connection reset by peer
Normal   Started    6m17s (x3 over 7m14s)  kubelet            Started container gatekeeper-policy-manager
Normal   Created    6m17s (x3 over 7m14s)  kubelet            Created container gatekeeper-policy-manager
Normal   Pulled     2m7s (x7 over 7m17s)   kubelet            Container image "quay.io/sighup/gatekeeper-policy-manager:v1.0.8" already present on machine

@ralgozino
Member

ralgozino commented Oct 16, 2023

hey @Harald-koeln

I tried reproducing the error with some load testing but I can't trigger it.

Do you have some limit set on the number of processes that a container can run? Or are the used inodes on the node close to the limit, maybe?
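
For context, threads count against the pod's PID cgroup, so a low per-pod PID limit can surface exactly as "can't start new thread" even when CPU and memory look fine. A minimal, illustrative sketch of where such a limit is typically configured on the kubelet side (the value shown is hypothetical, not a recommendation); a container runtime pids-limit can have the same effect:

# KubeletConfiguration (node-level); podPidsLimit caps the number of
# processes/threads each pod may create. -1 (the default) means no limit.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 1024   # hypothetical value for illustration only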

Is there anything particular about your setup that we should know in order to replicate the issue?

I wonder if the same or a similar issue happens with the new Go backend that is in development; would you mind testing it? You just need to change the image tag to go, i.e. v1.0.8 -> go. Please let me know if it happens there as well. Note that the Go backend does not support OIDC auth yet, in case you are using it.
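
For reference, a minimal sketch of the values change for testing the Go backend, assuming the chart exposes the image tag under image.tag (adjust to the chart's actual values structure):

image:
  tag: "go"   # switch back to a pinned tag, e.g. "v1.0.8", to return to the Python backend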

@Harald-koeln
Author

Hi @ralgozino, thank you very much. The Go version is working on all 5 clusters where I observed problems with the Python version.
As far as I know, there are no special limits on the number of processes or inodes.
We do not use OIDC here, so using the Go version is fine for us. Thanks!

@ralgozino
Member

Glad to hear that! Any feedback on the Go backend version is very welcome :-)
