Self-hosted runners now always hang on jobs? No errors, no changes... #122999

jefflightweb · 2024-05-08T16:50:18Z

jefflightweb
May 8, 2024

Select Topic Area

Question

Body

I am looking after two self-hosted Ubuntu runners which have been working fine for months until a few hours ago. Now they are seemingly stuck on build jobs that never go anywhere. No errors in the worker logs. I restarted both runners and they pick up jobs fine, but they will never finish them.

I verified we're using the latest agent, it self-upgraded. v2.316.1

I saw one failed web call in the runner log but when I tried to curl it I was able to reach just fine.

Is there anything else I can look at to troubleshoot further? The Worker log just keeps showing this over and over:

[2024-05-08 16:47:40Z INFO HostContext] Well known directory 'Work': '/home/ubuntu/runner/_work'
[2024-05-08 16:47:50Z INFO HostContext] Well known directory 'Bin': '/home/ubuntu/runner/bin.2.316.1'
[2024-05-08 16:47:50Z INFO HostContext] Well known directory 'Root': '/home/ubuntu/runner'
[2024-05-08 16:47:50Z INFO HostContext] Well known directory 'Work': '/home/ubuntu/runner/_work'
[2024-05-08 16:48:00Z INFO HostContext] Well known directory 'Bin': '/home/ubuntu/runner/bin.2.316.1'
[2024-05-08 16:48:00Z INFO HostContext] Well known directory 'Root': '/home/ubuntu/runner'
[2024-05-08 16:48:00Z INFO HostContext] Well known directory 'Work': '/home/ubuntu/runner/_work'
[2024-05-08 16:48:10Z INFO HostContext] Well known directory 'Bin': '/home/ubuntu/runner/bin.2.316.1'
[2024-05-08 16:48:10Z INFO HostContext] Well known directory 'Root': '/home/ubuntu/runner'
[2024-05-08 16:48:10Z INFO HostContext] Well known directory 'Work': '/home/ubuntu/runner/_work'

The runner log just shows the job keeps renewing every minute.

[2024-05-08 16:48:23Z INFO JobDispatcher] Successfully renew job request 206793, job is valid till 05/08/2024 16:58:22

We haven't made any changes to either the runners or the actions configurations in the repo so I'm not exactly sure what is going on here.

Answered by jeredp

May 10, 2024

So we finally traced this down to an issue with datadog-apm-library-js:5.12.0-1 which on Ubunutu 22 and Ubuntu 20 injects itself in such a way that no js process will exit unless we SIGTERM. Downgrading this library to the last stable we had working (4.19.0-1) is what solved the issue for us.

View full answer

jefflightweb · 2024-05-08T17:16:49Z

jefflightweb
May 8, 2024
Author

I forgot to write that I looked at the usual system things; disk space is fine, for example. No containers listed in docker, but docker is running for our build job...

The last successful job was about 18 hours ago and to my knowledge, nothing changed that would cause a job to just hang in the middle. I'm not having much luck in searching for similar issues either, it's usually hung waiting for a job to be picked up. This is stuck inside the workflow and it just refuses to ever error out. It's also happening at the same exact place on both runners which are two different architectures (ARM and X86).

0 replies

jefflightweb · 2024-05-08T19:41:19Z

jefflightweb
May 8, 2024
Author

It seems to hang running actions/checkout@V4. No errors, and I think it runs everything properly. At the end of that step it just spins and spins and never finishes. We have to eventually cancel the jobs as they seemingly never timeout or exit! Nothing in the error logs to indicate a problem that I can tell.

I tried using actions/checkout@V3 but it gives the same result, unfortunately. We're using just the bare checkout, no options at all...

0 replies

jefflightweb · 2024-05-09T01:22:41Z

jefflightweb
May 9, 2024
Author

[2024-05-09 01:13:18Z INFO ProcessInvokerWrapper] Starting process:
[2024-05-09 01:13:18Z INFO ProcessInvokerWrapper] File name: '/home/ubuntu/runner/externals/node20/bin/node'
[2024-05-09 01:13:18Z INFO ProcessInvokerWrapper] Arguments: '"/home/ubuntu/runner/_work/_actions/actions/checkout/v4/dist/index.js"'
[2024-05-09 01:13:18Z INFO ProcessInvokerWrapper] Working directory: '/home/ubuntu/runner/_work/ubsv/ubsv'
[2024-05-09 01:13:18Z INFO ProcessInvokerWrapper] Require exit code zero: 'False'
[2024-05-09 01:13:18Z INFO ProcessInvokerWrapper] Encoding web name: ; code page: ''
[2024-05-09 01:13:18Z INFO ProcessInvokerWrapper] Force kill process on cancellation: 'False'
[2024-05-09 01:13:18Z INFO ProcessInvokerWrapper] Redirected STDIN: 'False'
[2024-05-09 01:13:18Z INFO ProcessInvokerWrapper] Persist current code page: 'False'
[2024-05-09 01:13:18Z INFO ProcessInvokerWrapper] Keep redirected STDIN open: 'False'
[2024-05-09 01:13:18Z INFO ProcessInvokerWrapper] High priority process: 'False'
[2024-05-09 01:13:18Z INFO JobServerQueue] Try to append 1 batches web console lines for record '60056f82-3532-49f5-9f99-8e3416e8a772', success rate: 1/1.
[2024-05-09 01:13:18Z INFO JobServerQueue] Try to append 1 batches web console lines for record '58876758-6103-5463-d7fd-ca81ef53d5c8', success rate: 1/1.
[2024-05-09 01:13:18Z INFO ProcessInvokerWrapper] Updated oom_score_adj to 500 for PID: 156829.
[2024-05-09 01:13:18Z INFO ProcessInvokerWrapper] Process started with process id 156829, waiting for process exit.
[2024-05-09 01:13:18Z INFO JobServerQueue] Got a step log file to send to results service.
[2024-05-09 01:13:18Z INFO JobServerQueue] Starting upload of step log file to results service ResultsLog, /home/ubuntu/runner/_diag/blocks/52c8d28e-ab58-4b4f-8974-7ea2cab9d3be_60056f82-3532-49f5-9f99-8e3416e8a772.1
[2024-05-09 01:13:18Z INFO AddMaskCommandExtension] Add new secret mask with length of 76
[2024-05-09 01:13:18Z INFO JobServerQueue] Try to append 1 batches web console lines for record '58876758-6103-5463-d7fd-ca81ef53d5c8', success rate: 1/1.
[2024-05-09 01:13:18Z INFO JobServerQueue] Try to upload 1 log files or attachments, success rate: 1/1.
[2024-05-09 01:13:18Z INFO JobServerQueue] Try to append 1 batches web console lines for record '58876758-6103-5463-d7fd-ca81ef53d5c8', success rate: 1/1.
[2024-05-09 01:13:19Z INFO JobServerQueue] Try to append 1 batches web console lines for record '58876758-6103-5463-d7fd-ca81ef53d5c8', success rate: 1/1.
[2024-05-09 01:13:19Z INFO JobServerQueue] Tried to upload 1 file(s) to results, success rate: 1/1.
[2024-05-09 01:13:19Z INFO JobServerQueue] Try to append 1 batches web console lines for record '58876758-6103-5463-d7fd-ca81ef53d5c8', success rate: 1/1.
[2024-05-09 01:13:28Z INFO HostContext] Well known directory 'Bin': '/home/ubuntu/runner/bin.2.316.1'
[2024-05-09 01:13:28Z INFO HostContext] Well known directory 'Root': '/home/ubuntu/runner'
[2024-05-09 01:13:28Z INFO HostContext] Well known directory 'Work': '/home/ubuntu/runner/_work'
[2024-05-09 01:13:38Z INFO HostContext] Well known directory 'Bin': '/home/ubuntu/runner/bin.2.316.1'
[2024-05-09 01:13:38Z INFO HostContext] Well known directory 'Root': '/home/ubuntu/runner'
[2024-05-09 01:13:38Z INFO HostContext] Well known directory 'Work': '/home/ubuntu/runner/_work'
[2024-05-09 01:13:48Z INFO HostContext] Well known directory 'Bin': '/home/ubuntu/runner/bin.2.316.1'
[2024-05-09 01:13:48Z INFO HostContext] Well known directory 'Root': '/home/ubuntu/runner'
[2024-05-09 01:13:48Z INFO HostContext] Well known directory 'Work': '/home/ubuntu/runner/_work'
[2024-05-09 01:13:58Z INFO HostContext] Well known directory 'Bin': '/home/ubuntu/runner/bin.2.316.1'

$ ps aux | grep [n]ode
ubuntu 77909 0.0 0.0 723452 66644 pts/0 Sl+ May08 0:03 /home/ubuntu/runner/externals/node16/bin/node /home/ubuntu/runner/_work/_actions/actions/checkout/v3/dist/index.js
ubuntu 145228 0.0 0.0 5042040 70288 pts/0 Sl+ 00:24 0:01 /home/ubuntu/runner/externals/node20/bin/node /home/ubuntu/runner/_work/_actions/actions/checkout/v4/dist/index.js
ubuntu 156829 0.1 0.0 5038920 65640 pts/0 Sl+ 01:13 0:00 /home/ubuntu/runner/externals/node20/bin/node /home/ubuntu/runner/_work/_actions/actions/checkout/v4/dist/index.js

It just hangs... I see no open issues for checkoutV4...

0 replies

jefflightweb · 2024-05-09T02:48:23Z

jefflightweb May 9, 2024
Author

Thank you, I have completed steps 1 through 6...

ctoestreich · 2024-05-09T17:36:53Z

ctoestreich
May 9, 2024

We are also experiencing this issue. During the checkout it just hangs at the last step when running checkout@v3 or checkout@v4. All we see is the node16 or node20 process running the dist/index file in the checkout which gets hung.

10 replies

jeredp May 9, 2024

Unfortunately, 2.315.0 does not seem to work either.
When we execute the run.sh --check, it still hangs on the "Check: Node.js Certificate/Proxy Validation" indefinitely

jefflightweb May 9, 2024
Author

I am starting to suspect an issue with GHA infrastructure combined with insufficient timeout function in the runner itself...

ctoestreich May 9, 2024

Our most recent attempt was to replace the node16 in the externals dir with node20 and that also hung on the ./run.sh --check at the same spot. Screen shot to verify that we replaced the pathing

ctoestreich May 9, 2024

Perhaps, but anything that is running node is causing the issue. The weird part is that we have dozens and dozens of runners still working that have been running prior to this week (approx prior to May 3). Every runner we have spun up since then using terraform using the same AMI, github runner version and node versions are now suffering from the forever pause when they try do to anything in node16 or node20

ctoestreich May 9, 2024

We have now also switched out the externals node for v22.1.0 and the checkout action is still stuck. We are going to try setting the github runner process to force it to node20 vs node16 as it seems the runner process itself is still using 16 so might be something with the process node version?!?!

ctoestreich · 2024-05-09T18:46:44Z

ctoestreich
May 9, 2024

Incidentally the post checkout step also runs the checkout node step and gets stuck... 😿

0 replies

ctoestreich · 2024-05-09T19:24:01Z

ctoestreich
May 9, 2024

@jefflightweb do you have a firewall or egress white/black listing where you host your runners? We certainly do and are wondering if we are missing a whitelist IP/DNS or something new that was introduced recently that could be getting "stuck" without a retry or error handler.

1 reply

jefflightweb May 9, 2024
Author

No firewall issues here, our runners are in a separate VPC with proper security group and no outbound ACLs that would cause interference here... And they've basically been the same for months now. No changes.

jefflightweb · 2024-05-10T12:43:09Z

jefflightweb
May 10, 2024
Author

Another day and this is still broken. :-(

0 replies

ctoestreich · 2024-05-10T12:43:51Z

ctoestreich
May 10, 2024

We have 5-6 of our senior engineers working on this and nothing has presented itself and GitHub support has been engaged and their theory and comments seem to imply its something we did, but all we did was start runners using terraform the same way we always have on the same AMI and runner libraries. These were set to autoupdate the runners and we can see working runners started on 2.314 and are now on 2.316 through that auto-updating process. However even starting new runners on older versions with auto-update turned off still do not work. New versions of the runner 2.316.0 or 1 are also still not working on newly started machines.

To be clear the process starts and reports home. They pick up jobs but checkout or any jobs running node still hang indefinitely.

5 replies

jefflightweb May 10, 2024
Author

Thanks for sending an update. I'm glad you are engaged with GH Support, we have been tied up with other unrelated fires and haven't been able to go that route yet, but will if necessary.

This is a strange situation, indeed.

jefflightweb May 10, 2024
Author

FWIW, one of our engineers just verified that we can issue a new PAT from GH console and install on a fresh instance and then the runners work. I have a feeling that it won't last for very long, but we shall see.

ctoestreich May 10, 2024

We are trying the same thing from a clean ec2 on our end as well. Our prevailing theory is that the process callbacks to the DLLs that live in github runner are not working on Ubuntu 22.04. Sadly the issue prevailed when we reverted to an ubuntu 20 instance.

ctoestreich May 10, 2024

what is your uname -a?

Linux ip-10-99-173-126 6.5.0-1018-aws #18~22.04.1-Ubuntu SMP Fri Apr 5 17:44:33 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

jefflightweb May 10, 2024
Author

The exact same kernel (Ubuntu 22.04) on both X86 and aarch64.

ctoestreich · 2024-05-10T13:11:36Z

ctoestreich
May 10, 2024

It isn't even checkout that is broken. Any action that runs via node is breaking. We have tried node16 default and using the flag to force node20.

Here is a ps from doing a jfrog action that uses node that also gets stuck. We didnt even try a checkout. We KNOW it is node just now why yet.

This is the tree for the stuck process that is just hanging... ⬇️

  22936   22928 00:00:00 -bash
   5452    5443 00:00:00 bash -
  25376    5452 00:00:00  \_ ps -u github -o pid,ppid,time,cmd --sort=start_time --no-headers --forest
  22064       1 00:00:00 /bin/bash /actions-runner/runsvc.sh
  22081   22064 00:00:00  \_ ./externals/node16/bin/node ./bin/RunnerService.js
  22170   22081 00:00:03      \_ /actions-runner/bin/Runner.Listener run --startuptype service
  23935   22170 00:00:12          \_ /actions-runner/bin/Runner.Worker spawnclient 104 109
  24872   23935 00:00:02              \_ /actions-runner/externals/node20/bin/node /actions-runner/_work/_actions/jfrog/setup-jfrog-cli/v4/lib/main.js

5 replies

ctoestreich May 10, 2024

It looks about the same for checkout action just replace the action path

jeredp May 10, 2024

So we finally traced this down to an issue with datadog-apm-library-js:5.12.0-1 which on Ubunutu 22 and Ubuntu 20 injects itself in such a way that no js process will exit unless we SIGTERM. Downgrading this library to the last stable we had working (4.19.0-1) is what solved the issue for us.

Answer selected by jefflightweb

jefflightweb May 10, 2024
Author

Wow, we are using Datadog here also... I would not have expected that.

jefflightweb May 10, 2024
Author

Hrmm, I stopped and disabled all datadog agents for now, and restarted the runner processes, but this hasn't fixed it yet here.

jefflightweb May 10, 2024
Author

OK. I had to remove all the injection libraries and rebooted and everything works again. I wasn't able to find an older deb package for that library, only the source code from GitHub releases, so I think we'll just do without tracing for now (we weren't using it anyway).

Self-hosted runners now always hang on jobs? No errors, no changes... #122999

Select Topic Area

Body

Replies: 10 comments · 22 replies

jefflightweb May 8, 2024 Author

jefflightweb May 8, 2024 Author

jefflightweb May 9, 2024 Author

This comment was marked as spam.

jefflightweb May 9, 2024 Author

jefflightweb May 9, 2024 Author

jefflightweb May 9, 2024 Author

jefflightweb May 10, 2024 Author

jefflightweb May 10, 2024 Author

jefflightweb May 10, 2024 Author

jefflightweb May 10, 2024 Author

jefflightweb May 10, 2024 Author

jefflightweb May 10, 2024 Author

jefflightweb May 10, 2024 Author

Replies: 10 comments 22 replies

jefflightweb
May 8, 2024
Author

jefflightweb
May 8, 2024
Author

jefflightweb
May 9, 2024
Author

jefflightweb May 9, 2024
Author

jefflightweb May 9, 2024
Author

jefflightweb May 9, 2024
Author

jefflightweb
May 10, 2024
Author

jefflightweb May 10, 2024
Author

jefflightweb May 10, 2024
Author

jefflightweb May 10, 2024
Author

jefflightweb May 10, 2024
Author

jefflightweb May 10, 2024
Author

jefflightweb May 10, 2024
Author