RuntimeError: can't start new thread #882

Open
Harald-koeln opened this issue Oct 6, 2023 · 9 comments

@Harald-koeln

Hello, in just one cluster (out of ~20) GPM is not starting (CrashLoopBackOff) with the log output below.
We are using version 0.7.0 and deploy the Helm chart with Argo CD. The Kubernetes version is 1.24.13.
Please let me know if other info is needed.
Any help is appreciated. Thank you!

...
[2023-10-06 07:16:41 +0000] [8] [INFO] In cluster configuration loaded successfully.
[2023-10-06 07:16:41 +0000] [8] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/gunicorn/arbiter.py", line 609, in spawn_worker
worker.init_process()
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/gthread.py", line 95, in init_process
super().init_process()
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/base.py", line 142, in init_process
self.run()
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/gthread.py", line 214, in run
callback(key.fileobj)
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/gthread.py", line 150, in on_client_socket_readable
self.enqueue_req(conn)
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/gthread.py", line 117, in enqueue_req
fs = self.tpool.submit(self.handle, conn)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 176, in submit
self._adjust_thread_count()
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 199, in _adjust_thread_count
t.start()
File "/usr/local/lib/python3.11/threading.py", line 957, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

@ralgozino
Member

Hi @Harald-koeln

This does seem strange; let me ask you some questions to better understand the issue:

  1. Are the rest of the applications running without issues on the cluster?
  2. Are the other clusters that you have running the same version of Kubernetes and GPM?
  3. Do you have any resource limits set? Can you share the values that you are using to install the chart? Please make sure to mask all sensitive info like secrets and URLs.
  4. Is the node where GPM is running healthy?
  5. Are there any other security tools or measures running in the cluster that could prevent GPM from creating threads?

Thanks!

@Harald-koeln
Author

Hi @ralgozino ,

thanks for your reply. I will answer your questions, but I actually no longer need to deploy GPM on that cluster. I found out that trivy-operator had additionally been deployed there by another team. We do not need both, and maybe that is also the reason for the problems.
Concerning your questions:

  1. yes, gatekeeper (and trivy and other applications) are running without issues
  2. yes, they are all running the same versions
  3. no resource limits set by me, only default helm chart values (see below)
  4. node is healthy
  5. trivy (probably responsible for GPM problems)

source:
  chart: gatekeeper-policy-manager
  repoURL: https://sighupio.github.io/gatekeeper-policy-manager
  targetRevision: 0.7.0
  helm:
    releaseName: gatekeeper-policy-manager
    values: |-
      config:
        secretKey: "gatekeeper-policy-manager"
      ingress:
        enabled: true
        hosts:
          - host: "gpm.{{ metadata.annotations.cluster_wildcard_domain }}"
            paths:
              - "/"

@ralgozino
Member

I'm glad you sorted it out. I'll probably run some tests with trivy-operator (I hadn't heard of it before) anyway, to see if there's something we can do to make them compatible.

thanks!

@Harald-koeln
Author

Unfortunately, the same error occurs again on another cluster (without trivy-operator deployed). The Kubernetes version there is older: 1.21.14.
The OS of the nodes is Ubuntu 20.04 and the Docker version is 19.03.6.
Gatekeeper version 0.7.0 is deployed and running without problems.
The error messages are the same as above.

@Harald-koeln reopened this Oct 9, 2023
@ralgozino
Member

Sorry to hear that, @Harald-koeln. Can you please check the pod events for any additional details that could be useful for debugging?

@Harald-koeln
Author

Hi @ralgozino, here are the pod events:

Type     Reason     Age                    From               Message
----     ------     ---                    ----               -------
Normal   Scheduled  7m21s                  default-scheduler  Successfully assigned gatekeeper-system/gatekeeper-policy-manager-75ff6b956-sdsxw to vache-3
Warning  Unhealthy  7m10s                  kubelet            Liveness probe failed: Get "http://10.42.3.168:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  7m1s                   kubelet            Liveness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59484->10.42.3.168:8080: read: connection reset by peer
Warning  Unhealthy  7m1s                   kubelet            Readiness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59486->10.42.3.168:8080: read: connection reset by peer
Warning  Unhealthy  6m51s                  kubelet            Liveness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59514->10.42.3.168:8080: read: connection reset by peer
Warning  Unhealthy  6m51s                  kubelet            Readiness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59512->10.42.3.168:8080: read: connection reset by peer
Warning  Unhealthy  6m46s (x3 over 7m12s)  kubelet            Readiness probe failed: Get "http://10.42.3.168:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  6m41s                  kubelet            Readiness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59542->10.42.3.168:8080: read: connection reset by peer
Warning  Unhealthy  6m41s                  kubelet            Liveness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59544->10.42.3.168:8080: read: connection reset by peer
Warning  Unhealthy  6m31s                  kubelet            Readiness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59566->10.42.3.168:8080: read: connection reset by peer
Normal   Killing    6m21s (x2 over 6m51s)  kubelet            Container gatekeeper-policy-manager failed liveness probe, will be restarted
Warning  Unhealthy  6m21s (x3 over 6m31s)  kubelet            (combined from similar events): Liveness probe failed: Get "http://10.42.3.168:8080/health": read tcp 10.42.3.1:59590->10.42.3.168:8080: read: connection reset by peer
Normal   Started    6m17s (x3 over 7m14s)  kubelet            Started container gatekeeper-policy-manager
Normal   Created    6m17s (x3 over 7m14s)  kubelet            Created container gatekeeper-policy-manager
Normal   Pulled     2m7s (x7 over 7m17s)   kubelet            Container image "quay.io/sighup/gatekeeper-policy-manager:v1.0.8" already present on machine

@ralgozino
Member

ralgozino commented Oct 16, 2023

hey @Harald-koeln

I tried reproducing the error with some load testing but I can't trigger it.

Do you have some limit set on the number of processes that a container can run? Or are the used inodes on the node close to the limit, maybe?
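
For context, threads count against the pod's PID cgroup, so a low per-pod PID limit can surface exactly as "can't start new thread" even when CPU and memory look fine. A minimal, illustrative sketch of where such a limit is typically configured on the kubelet side (the value shown is hypothetical, not a recommendation); a container runtime pids-limit can have the same effect:

# KubeletConfiguration (node-level); podPidsLimit caps the number of
# processes/threads each pod may create. -1 (the default) means no limit.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 1024   # hypothetical value for illustration only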

Is there anything particular about your setup that we should know in order to replicate the issue?

I wonder if the same or a similar issue happens with the new Go backend that is in development; would you mind testing it? You just need to change the image tag to go, i.e. v1.0.8 -> go. Please let me know if it happens there as well. Note that the Go backend does not support OIDC auth yet, in case you are using it.
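
For reference, a minimal sketch of the values change for testing the Go backend, assuming the chart exposes the image tag under image.tag (adjust to the chart's actual values structure):

image:
  tag: "go"   # switch back to a pinned tag, e.g. "v1.0.8", to return to the Python backend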

@Harald-koeln
Author

Hi @ralgozino, thank you very much. The Go version is working on all 5 clusters where I observed problems with the Python version.
As far as I know, there are no special limits on the number of processes or inodes.
We do not use OIDC here, so using the Go version is fine for us. Thanks!

@ralgozino
Member

Glad to hear that! Any feedback on the Go backend version is very welcome :-)
