Ansible hangs forever while executing playbook with no information on what is going on #30411
Comments
You can enable debug mode to get more information. We would probably need some more info to try and help you with this, which hopefully the debug logs will provide. That many hosts does seem like a lot to connect to concurrently; are you running in parallel or in blocks of hosts? needs_info |
Hmm, what are you setting forks to? Hitting lots of hosts at once might be taking up a lot of resources - worth checking /var/log/messages too. You might like to try wait_for_connection as a way of testing that all your hosts are available. It's nice as it's aware of the connection type, so it works even if you have a mixed inventory of, say, Windows and Linux hosts. |
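As a concrete illustration of the wait_for_connection suggestion above, a minimal sketch (the play layout and the 10-minute limit are assumptions for illustration, not taken from this thread):

- hosts: all
  gather_facts: false
  tasks:
    - name: Wait until each target is actually reachable before doing real work
      wait_for_connection:
        timeout: 600   # give up on an unreachable host after 10 minutes instead of hanging the play

Because wait_for_connection is connection-aware, the same task works across a mixed inventory of Windows and Linux hosts, and an unreachable host fails on its own instead of stalling everything.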
Hi, it's connecting to a max of 10 hosts at the same time; this is the config:
I managed to find a "block of hosts" that contains the one which is probably causing this hang, so I am now removing more and more hosts from that temporary file in the hope of identifying the one that causes this. My rough guess is that it maybe hangs on a DNS query? It's possible that one of these hosts doesn't exist and a misconfigured DNS server makes the query take forever? I have no idea, but I am fairly sure that no detailed information is available even in the debug log; I will try to provide that one soonish. I see some debug lines, then I see a green ok: [hostname], and then nothing forever, not a single debug or verbose line, it just hangs. |
After a very long stretch of debugging and testing servers one by one, I figured out which one was causing this. Its DNS is OK, it responds and I can ssh there, but there is some problem:
Here it hangs forever. I replaced this company's domain with "domain.tld"; the actual domain name is different. |
This is still a bug in Ansible, at least in the sense that Ansible should produce some verbose ERROR message that lets the user know which host is causing the hang-up, so that the host can be removed from the inventory file without a complex investigation of what went wrong. |
Also running a large inventory (2400+), seeing similar errors, and going through the same lengthy troubleshooting in hopes of finding log verbosity that points me in a better direction.
|
Seeing it get to the end of a playbook and not getting the futex and closing calls. From the strace of a job that shows the problem:
From a successful job:
|
There are multiple reasons why this can happen. I never said I found a solution. You can in theory run the playbook by hand in a loop on each host separately to figure out which one is causing problems, and then you need to investigate this host. A good example of a problem that can cause this is a disconnected NFS mount, which would hang even the "df" command. |
Sorry, I don't remember finding a solution. I think ours was hanging on a device we were trying to connect to, but honestly I can't say. We're not hitting the problem anymore though. |
That was exactly my problem. No ansible.cfg parameter (neither timeout nor fact_caching_timeout) helped me to interrupt the process. Thank you so much! |
I had a similar issue, I set Edit: @benapetr: Thanks! This was actually the underlying issue. A folder is mounted over SSHFS through a reverse SSH tunnel (for pushing from ansible control machine). After manually connecting over SSH to target machine and unmounting |
Looks like I have a similar problem with ansible and an nfs mount on the target. lsof, df, ansible.. all hanging. |
@jeroenflvr: In my case the sshfs mount hangs (also when invoking the mount command manually). I switched to CIFS/samba mount over the SSH reverse tunnel and now it works without hanging. This seems to be partly fuse- but not ansible-related. |
It doesn't matter whether the hang on the target host is caused by FUSE, NFS or anything else; it's still an Ansible bug in the sense that one target host should not be able to lock up the whole play for all other hosts. There should be some timeout or internal watchdog that kills the playbook for misbehaving and broken machines. One broken machine out of 5000 shouldn't break Ansible for all 5000 machines. |
@benapetr yes, this has to be tackled some day, not as a bug but as a design issue; such hangs are a big plague and no single "fix" will resolve them for real. I have used it occasionally since 2011, and IMO it will always be like this until someone finally accepts that something has to be done at a higher level. |
Same issue for me.
strace:
How can I fix it? What is wrong with my tcmacagent5?
|
If it is a mount and it is not usable (touch a file or something), remount it. In my particular situation I always reverted a VM to a snapshot for testing purposes, and the NFS handle doesn't match within this VM, so it is in a stale state. |
We have also been experiencing Ansible hanging without any error message, when running playbooks against 23 hosts, with |
I have this problem when the target server has an enormous load. So if I am running a playbook against 5000 servers and there is one server with a load > 100, the playbook hangs indefinitely, mostly on gathering facts, and I haven't found any solution better than the dumb "Remove that server from the inventory and try again". |
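For the fact-gathering hang specifically, one possible mitigation is to disable the implicit fact gathering and run the setup module as an explicit, time-bounded task. A minimal sketch, assuming Ansible 2.10+ for the task-level timeout keyword; the 60-second limits are arbitrary choices:

- hosts: all
  gather_facts: false          # skip the implicit fact gathering that hangs on overloaded hosts
  tasks:
    - name: Gather facts explicitly so the step can be bounded
      setup:
        gather_timeout: 60     # per-fact timeout inside the setup module
      timeout: 60              # task-level timeout; fails this host instead of hanging the play
      ignore_errors: true      # let the remaining hosts continue

This does not fix the underlying overload, but it turns an indefinite hang into a per-host failure.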
Could this be caused by an input prompt (e.g. a password request)? |
Hello All, I have an issue; any help/suggestions would be really helpful. I will just try to summarize the issue I have in hand; please let me know if anything is not clear or more details are needed. I have a playbook which scans the entire server (using the UAC binary), so the tasks involved are as follows.
The troubleshooting that I have done on this is as below.
PS: The entire scan takes around 30 minutes to 2 hours for the servers which we have tested, so we have not set any sort of timeout. Any suggestions would be of great help. |
I am also chasing a variant of the hanging problem using a with_nested lineinfile edit of /etc/group that includes a complex regexp expression with back references. The task in question has been in use for about four years now across a wide range of OS but has recently been hanging on every host. Clearly, something had changed in Ansible, Python, or another supporting module.

Being an old Unix developer, my first thought after "test simplification" was to chase the actual hang on the target host. As all of you likely know, when ansible-playbook runs it copies a bundle of Python code to ~/.ansible/tmp/ansible-tmp-<UUID string> on the target host. In my case the bundle is named AnsiballZ_lineinfile.py. As per https://docs.ansible.com/ansible/latest/dev_guide/debugging.html, it is possible to unpack this bundle (which contains some compressed Python code) using the command:

python AnsiballZ_lineinfile.py explode

The unpacked code is placed into a "debug_dir" directory alongside AnsiballZ_lineinfile.py. The unpacked code can be modified and then run using the command:

python AnsiballZ_lineinfile.py execute

Obviously this execution hangs as well, but at least we have recourse to strace:

strace -o /tmp/output.txt -f -r -s 128 python AnsiballZ_lineinfile.py execute

(Note: additional strace arguments like "-r -s 128" increase the amount of useful output.) This at least provides some system-call insight into what the process is doing just before it hangs. The following is a selected bit from one blocked-process debug session:

4902 0.000023 openat(AT_FDCWD, "/etc/group", O_RDONLY|O_CLOEXEC) = 4
4902 0.000019 newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=1236, ...}, AT_EMPTY_PATH) = 0
4902 0.000020 ioctl(4, TCGETS, 0x7ffcb9818ac0) = -1 ENOTTY (Inappropriate ioctl for device)
4902 0.000017 lseek(4, 0, SEEK_CUR) = 0
4902 0.000017 read(4, "root:x:0:\ndaemon:x:1:\nbin:x:2:\nsys:x:3:\nadm:x:4:root,jeff,bill\ntty:x:5:\ndisk:x:6:\nlp:x"..., 4096) = 1236
4902 0.000021 read(4, "", 4096) = 0
4902 0.000016 close(4) = 0

After the close the process appears to be hanging but in my particular instance, given sufficient time it returns to complete execution:

4902 **456.133294** newfstatat(AT_FDCWD, "/etc/group", {st_mode=S_IFREG|0644, st_size=1236, ...}, 0) = 0
4902 0.001059 write(1, "\n{\"changed\": false, \"msg\": \"\", \"backup\": \"\", \"diff\": [{\"before\": \"\", \"after\": \"\", \"before_header\": \"/etc/group (content)\", \"afte"..., 783) = 783
4902 0.000201 write(1, "\n", 1) = 1
4902 0.000148 newfstatat(AT_FDCWD, "/tmp/ansible_ansible.builtin.lineinfile_payload_vy8b7kp2", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
4902 0.000032 openat(AT_FDCWD, "/tmp/ansible_ansible.builtin.lineinfile_payload_vy8b7kp2", O_RDONLY|O_CLOEXEC) = 4
...
4902 0.000011 close(5) = 0
4902 0.000020 unlinkat(4, "ansible_ansible.builtin.lineinfile_payload.zip", 0) = 0
4902 0.000045 close(4) = 0
4902 0.000012 rmdir("/tmp/ansible_ansible.builtin.lineinfile_payload_vy8b7kp2") = 0
...
4902 0.000199 exit_group(0) = ?
4902 0.000525 +++ exited with 0 +++

Note the comparatively huge relative time between when the process finishes reading /etc/group and when the regular expression determines that no changes are required. This points at potential inefficiencies in the regexp being used with ansible.builtin.lineinfile. It is not a substitute for the task timeout support that Ansible badly needs, but at least someone facing what appears to be a block can obtain some clues as to possible causes. |
@BillKanawyer like the existing timeout keyword? Which I had commented about here previously: #30411 (comment) |
Thank you; I respectfully withdraw that observation. |
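For reference, the task-level timeout keyword mentioned above has been available since Ansible 2.10. A minimal sketch; the df example and the 60-second limit are illustrative assumptions, chosen because stale NFS mounts hanging df came up earlier in this thread:

    - name: Check disk usage, but never wait more than a minute on a stale mount
      command: df -hP
      timeout: 60          # seconds; the task fails on this host instead of hanging the whole play

When the limit is hit the task fails for that host only, so combining it with ignore_errors or a rescue block keeps the rest of the play moving.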
@BillKanawyer you have reminded me of an idea, something we have sometimes done internally: appending 'strace' to the remote commands to try to give the user a way to narrow down the issue. This might be worth adding to an 'execution mode' that will do this without making you jump through 8-10 steps. |
Having an 'execution mode' option that adds strace functionality would indeed be useful. I have also been looking into using python trace:

# python3 -m trace --ignore-dir=/usr/lib --trace AnsiballZ_lineinfile.py execute > /tmp/info.txt 2> /tmp/error.txt

This approach suggests my problem may not be the fault of the regexp but rather a missing package:

# egrep -n base64 /tmp/info.txt /tmp/error.txt
/tmp/info.txt:24:AnsiballZ_lineinfile.py(17): import base64
/tmp/error.txt:15: import base64
/tmp/error.txt:16:ModuleNotFoundError: No module named 'base64'
/tmp/error.txt:38: import base64
/tmp/error.txt:39:ModuleNotFoundError: No module named 'base64'

Note that "import base64" is the last line in the stdout file "info.txt". Using locate seems to confirm this. So this approach is generating concrete debugging results. Why AnsiballZ_lineinfile.py isn't bailing out on the import failure and why the base64 module is missing are open questions. |
It does not import it directly; this happens in the common module_utils code, which does not guard on base64 since it is part of core Python and 'assumed' to be present. Any import that is not part of core Python should be guarded, but core libraries are not required to be. In any case, that should result in a quick error, not hanging. |
I thought it might be useful to some to leave a few concluding remarks now that I have reached a more complete understanding of my issue. As mentioned before, it is possible to "explode" the AnsiballZ_lineinfile.py bundle, instrument the code, and then "execute" the result. To verify it was my regular expression that was causing the problem, I bracketed the re.compile & re.match calls with print statements. Verifying that my real issue was with the underlying python3 re code and not Ansible lineinfile allowed me to redirect my efforts to understanding why the supplied regular expression match time was increasing exponentially with each character added to the string being matched. Thank you @bcoca for providing useful feedback. |
I have a Changing back and forth between network modes yields the same outcome. Why would the apt module in particular be problematic? Edit: |
Interesting to read this... I keep having variations of this issue with different target hosts, different Ansible controllers, different playbooks, at different tasks, and with different targets in the same playbook when I interrupt the play and restart it. This happens whether I am working with remote targets or targets on the local network. There is no pattern that I could discern, other than that more complex playbooks tend to hang more often and more simultaneous targets create more hangs, consistently. This happens just as readily with stock Ansible from Ubuntu packages, with Ansible installed from the Ansible repo, or with Ansible installed via pip. Same in the office (completely different network setup) or at home. On the other end of the connection I see no running processes, no pending connections. It looks very much like the last action terminated cleanly. And again, this happens with any kind of command. I have not the slightest idea what the cause of all this could be, because the only constant is the observable effect. There are days when I have no such problems, but I just concluded that this is how Ansible works. Windows 3.1 style problem solving: rinse, reboot and repeat until it's working or until the day is over and it's somebody else's problem. I wish this had happened before I invested months learning Ansible and writing scripts for everything. This is really frustrating. |
I have also tried several iterations of -T and connect_timeout and I still can't force a timeout for a particular part of my playbook when a host has an issue with load or hanging. It's very frustrating. |
The project that I work on seems to be having this problem, or a problem like it, and when I see it "live" I check the
so I tend to try to execute the same (or similar) command. Every time I do these manual commands they stall, so this makes me wonder how it ever works for Ansible. That said, the When I do the Further, when I modify the command to be like:
Then I never get this "read from stdin" mode and the behavior that I expect is the behavior that I get. I can also get the desired behavior if I do commands like:
Though this produces a blank line in the output, so not the best solution. Has anyone else noticed this behavior? |
I see these processes for tasks that are still alive. After some time, they are gone and the controller keeps hanging. If these processes keep running, I guess you have a different problem, at least from what I observe. When your scripts consistently hang for specific tasks (which is not what I see on my side), then it looks more like an issue with the task itself. Did you try to
I tend to believe that this is another problem. What happens if you interrupt the playbook and restart it? Will it hang at the same task consistently? For me, this is definitely not the case; I see random-ish behavior. A slow network definitely increases the probability of such hangs, as do more target hosts and more complex scripts. Then I have periods of days or even weeks without problems. |
I am also facing the same issue on RHEL 8 when I am executing the command through Jenkins, but it is working from the VM. It's getting hung on AnsiballZ_command.py:
44699 1686567241.27096: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python /home//.ansible/tmp/ansible-tmp-1686567240.16-44699-236344823441760/AnsiballZ_command.py && sleep 0'
I stopped using Ansible push several years ago, for these types of reasons. |
I didn't find this until now; I'm seeing the same issues on Windows servers: ansible-collections/ansible.windows#564. I'm also finding servers that still have the Ansible process running even after the job has completed without hangs. I regret ever considering Ansible, and I am currently looking at what Azure will offer for patching. |
Has anyone here tried async and poll to mitigate hung tasks? https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_async.html |
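A minimal sketch of the async/poll pattern from the linked docs (the command path and the time limits are placeholders, not anything from this thread):

    - name: Run a long job in the background so a hung host cannot block the play forever
      command: /usr/local/bin/long_running_scan.sh    # hypothetical long-running command
      async: 7200        # allow up to 2 hours for the background job
      poll: 30           # check back every 30 seconds; give up when the async limit expires

With poll: 0 the task becomes fire-and-forget, and a later async_status task can collect the result.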
So far async has worked for us. We had a lot of hanging during AMI builds with packer and ansible. It worked fine for years and recently started hanging this summer. We had another task hang today so going to set it to async as well. |
Ok, just tried it and I got a hang on the one server, which is good because I could dig into the issue, bad because the async didn't help, or I configured it wrong. |
This is what I am using
|
thanks for sharing, will give it a try today :-) |
"msg": "No job id was returned by the async task",
is this a limitation of async not working across reboots? |
The Windows update module doesn't support async.
TASK [Apply Security, Critical updates, Update Rollups log to C:\ansible_wu.txt] ***
So I found one cause of this: if you use set_fact to set a connection variable such as ansible_become, then, due to variable precedence order, the set_fact value overrides the settings on later tasks. This manifests as an infinite freeze when you wind up waiting on a sudo prompt you don't expect (I see this when switching to connection: local). Basically, the following playbook won't do what you think:

- hosts: all
  tasks:
    - name: Set fact here
      set_fact:
        ansible_become: true

    - name: Now do a local thing
      copy:
        dest: somefile.txt
        content: |
          somedata
      vars:
        ansible_become: false
      connection: local

    - name: This also won't work
      copy:
        dest: somefile.txt
        content: |
          somedata
      become: false
      connection: local

This is documented, in the sense that it's out there, but it seems absolutely bonkers to me: an event which might have happened hundreds of playbook tasks ago can apparently override explicit variable specifications on the task in front of you. |
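As a hedged illustration of one pattern that avoids the trap described above (assuming the goal is just privilege escalation, not a dynamic connection variable): use the become keyword, which follows normal play/block/task inheritance, instead of setting the ansible_become connection variable with set_fact; a task-level become: false is then honored.

- hosts: all
  become: true               # keyword inheritance: tasks escalate by default...
  tasks:
    - name: Do a local thing without sudo
      copy:
        dest: somefile.txt
        content: |
          somedata
      become: false          # ...but the task keyword wins here, so no unexpected sudo prompt
      connection: local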
Looks like a possible dupe of #18305 |
ISSUE TYPE
COMPONENT NAME
ansible-playbook
ANSIBLE VERSION
CONFIGURATION
OS / ENVIRONMENT
SUMMARY
When running an Ansible playbook for all servers (1964 servers) it hangs somewhere in the middle of execution. I used -vvvvvvvvvv to trace the problem, but it doesn't print ANYTHING. The previous host finishes, the finish message is the last thing I see in the console, and then there is NOTHING. It just hangs. It doesn't even print the FQDN or name of the server that is next in the queue, and it doesn't say what it is waiting for. Just nothing.
In ps I see 2 processes. Running strace shows:
parent:
Process 22170 attached
select(0, NULL, NULL, NULL, {0, 599}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
...
repeats forever
child:
Process 28917 attached
epoll_wait(40,
STEPS TO REPRODUCE
Looks like a bug to me; it should at least tell me what it is trying to do that causes the hang.