Self-hosted runners now always hang on jobs? No errors, no changes... #122999
-
I am looking after two self-hosted Ubuntu runners which had been working fine for months until a few hours ago. Now they get stuck on build jobs that never go anywhere, with no errors in the worker logs. I restarted both runners and they pick up jobs fine, but they never finish them. I verified we're on the latest agent (v2.316.1; it self-upgraded). I saw one failed web call in the runner log, but when I tried the same URL with curl I could reach it just fine. Is there anything else I can look at to troubleshoot further?

The Worker log just keeps showing this over and over:

```
[2024-05-08 16:47:40Z INFO HostContext] Well known directory 'Work': '/home/ubuntu/runner/_work'
```

The Runner log just shows the job being renewed every minute:

```
[2024-05-08 16:48:23Z INFO JobDispatcher] Successfully renew job request 206793, job is valid till 05/08/2024 16:58:22
```

We haven't made any changes to either the runners or the Actions configuration in the repo, so I'm not exactly sure what is going on here.
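(For anyone else landing here: a generic next step, not from the original post, is to raise log verbosity and watch the runner's diagnostic logs while a job is stuck. `ACTIONS_RUNNER_DEBUG` and `ACTIONS_STEP_DEBUG` are standard GitHub Actions debug switches; the `_diag` path below assumes a default install under `/home/ubuntu/runner`.)

```bash
# Tail the runner's diagnostic logs while the job is stuck
# (default install location assumed: /home/ubuntu/runner)
tail -f /home/ubuntu/runner/_diag/Worker_*.log

# For more verbose step logs, set these as repository secrets/variables
# and re-run the job:
#   ACTIONS_RUNNER_DEBUG=true
#   ACTIONS_STEP_DEBUG=true
```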
-
I forgot to mention that I checked the usual system things; disk space is fine, for example. No containers listed in docker, even though docker is running for our build job... The last successful job was about 18 hours ago, and to my knowledge nothing changed that would cause a job to just hang in the middle. I'm not having much luck searching for similar issues either; they usually describe a runner hung waiting for a job to be picked up. This is stuck inside the workflow and it just refuses to ever error out. It's also happening at the exact same place on both runners, which are two different architectures (ARM and x86).
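(These are roughly the host-level checks described above, for anyone following along; the systemd service name pattern is an assumption and may differ on your install.)

```bash
# Basic host health checks for a stuck self-hosted runner
df -h            # disk space
free -h          # memory pressure
docker ps -a     # any containers left over from the build job?
docker info      # is the docker daemon itself healthy?

# Runner service status (service name pattern is an assumption)
sudo systemctl status 'actions.runner.*' --no-pager
```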
-
It seems to hang running actions/checkout@v4. No errors, and I think it runs everything properly; at the end of that step it just spins and spins and never finishes. We eventually have to cancel the jobs, as they seemingly never time out or exit! Nothing in the error logs indicates a problem that I can tell. I tried actions/checkout@v3 but it gives the same result, unfortunately. We're using just the bare checkout, no options at all...
-
The last thing the log shows before the hang:

```
[2024-05-09 01:13:18Z INFO ProcessInvokerWrapper] Starting process: $ ps aux | grep [n]ode
```

It just hangs... I see no open issues for checkout@v4...
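(A generic way to see what a hung node process is actually blocked on, not from the original reply; the `pgrep` pattern is an assumption about your process names, and `strace` must be installed.)

```bash
# Find the hung node process spawned by the runner
# (the pgrep pattern is an assumption about your process names)
PID=$(pgrep -f 'node.*_actions' | head -n1)

# Sample the syscalls it is blocked in for a few seconds
sudo timeout 5 strace -f -p "$PID" 2>&1 | tail -n 50

# Kernel-side view of what the process is waiting on
cat "/proc/$PID/wchan"; echo
ls -l "/proc/$PID/fd"   # open files/sockets it may be blocked on
```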
-
We are also experiencing this issue. During checkout it just hangs at the last step, whether running checkout@v3 or checkout@v4. All we see is the node16 or node20 process running the dist/index file from the checkout action, and that process hangs.
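(One quick way to tell whether that node process is busy or has finished its work and is merely refusing to exit — a suggestion, not from the thread; the `pgrep` pattern is an assumption.)

```bash
# Is the node process busy, or idle with a stuck exit path?
PID=$(pgrep -f 'node.*dist/index' | head -n1)
top -b -n 1 -p "$PID"          # ~0% CPU suggests it is idle, not working
ls "/proc/$PID/task" | wc -l   # thread count of the hung process
```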
-
Incidentally, the post-checkout step also runs the checkout action's node entrypoint and gets stuck... 😿
-
@jefflightweb do you have a firewall or egress allow/deny listing where you host your runners? We certainly do, and we're wondering if we're missing an allowlisted IP/domain, or if something new was introduced recently that could be getting "stuck" without a retry or error handler.
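(A quick way to sanity-check egress from the runner host: GitHub publishes required endpoints via its `meta` API, though the exact JSON shape is worth verifying yourself; the `jq` filter and host list below are assumptions.)

```bash
# GitHub publishes the domains/IPs Actions needs via the meta API
curl -s https://api.github.com/meta | jq '.domains'

# Spot-check reachability of core endpoints from the runner host
for host in github.com api.github.com codeload.github.com \
            objects.githubusercontent.com; do
  echo -n "$host: "
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 10 "https://$host"
done
```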
-
Another day and this is still broken. :-( |
-
We have 5-6 of our senior engineers working on this and nothing has presented itself. GitHub support has been engaged, and their theory and comments seem to imply it's something we did, but all we did was start runners using Terraform the same way we always have, on the same AMI and runner libraries.

These runners were set to auto-update, and we can see that working runners which started on 2.314 are now on 2.316 through that auto-update process. However, even starting new runners on older versions with auto-update turned off still does not work, and new runners on 2.316.0 or .1 on freshly started machines don't work either.

To be clear: the process starts and reports home. The runners pick up jobs, but checkout, or any step running node, still hangs indefinitely.
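(For reference, this is roughly how one pins a runner version with auto-update disabled — `--disableupdate` is a documented config.sh flag; the version number, org/repo URL, and token below are placeholders.)

```bash
# Install a pinned runner version and disable self-update
RUNNER_VERSION=2.314.1   # example version, adjust as needed
curl -fsSL -o runner.tar.gz \
  "https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz"
tar xzf runner.tar.gz
./config.sh --url https://github.com/ORG/REPO \
            --token YOUR_REGISTRATION_TOKEN \
            --disableupdate
./run.sh
```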
-
It isn't even checkout that is broken. Any action that runs via node is breaking. We have tried the node16 default and used the flag to force node20. Here is a ps from a jfrog action that uses node, which also gets stuck; we didn't even try a checkout. We KNOW it is node, we just don't know why yet. This is the tree for the stuck process that is just hanging... ⬇️
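(Given the "any node process" symptom, a minimal repro outside the runner would be, entirely as a suggestion: run a trivial node one-liner on the runner host and see whether it ever exits.)

```bash
# If even a no-op node process never exits, the problem is host-level
# (e.g. something injected into every node process), not the action itself
timeout 30 node -e 'console.log("hello")' \
  && echo "node exits normally" \
  || echo "node hung (killed by timeout after 30s)"
```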
-
So we finally traced this down to an issue with datadog-apm-library-js:5.12.0-1, which on Ubuntu 22 and Ubuntu 20 injects itself into node in such a way that no JS process will exit unless we send it SIGTERM. Downgrading this library to the last stable version we had working (4.19.0-1) is what solved the issue for us.
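(For anyone else hitting this, a downgrade-and-hold along these lines should work. The package name and versions are taken from the post above; whether it was installed via Datadog's apt repo on your hosts, and whether the injection goes through `ld.so.preload` as it commonly does for Datadog's single-step instrumentation, are assumptions to verify.)

```bash
# Check whether instrumentation is being injected into every process
# (Datadog single-step APM commonly uses the ld.so preload mechanism)
cat /etc/ld.so.preload 2>/dev/null

# Downgrade to the last known-good version and pin it
sudo apt-get install -y --allow-downgrades datadog-apm-library-js=4.19.0-1
sudo apt-mark hold datadog-apm-library-js
```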