
Ansible hangs forever while executing playbook with no information on what is going on #30411

Open · benapetr opened this issue Sep 15, 2017 · 151 comments
Labels: affects_2.3, bug, hang, has_pr, support:core

@benapetr

ISSUE TYPE
  • Bug Report
COMPONENT NAME

ansible-playbook

ANSIBLE VERSION
ansible 2.3.1.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, Aug  2 2016, 04:20:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]
CONFIGURATION
OS / ENVIRONMENT
SUMMARY

When running an ansible playbook for all servers (1964 servers) it hangs somewhere in the middle of execution. I used -vvvvvvvvvv to trace the problem, but it doesn't print ANYTHING. The previous host finishes and the message about the finish is the last thing I see in the console, then there is NOTHING. It just hangs. It doesn't even print the FQDN or name of the server that is next in the queue, it doesn't say what it is waiting for. Just nothing.

In ps I see 2 processes. Running strace on them shows:

parent:
Process 22170 attached
select(0, NULL, NULL, NULL, {0, 599}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0

...
repeats forever

child:
Process 28917 attached
epoll_wait(40,

STEPS TO REPRODUCE
# This playbook is used to test if server works or not

---
- hosts: run-tools.homecredit.net all:!ignored
  tasks:
  - command: echo works

Looks like a bug to me; it should at least tell me what it's trying to do that causes the hang.

@jborean93 (Contributor)

You can enable debug mode to get more information than -v* gives you; just export ANSIBLE_DEBUG=1 to enable it.

We would probably need some more info to try and help you with this, which hopefully the debug logs will provide. That many hosts does seem like a lot to connect to concurrently; are you running them all in parallel or in blocks of hosts?

needs_info

@jhawkesworth (Contributor)

Hmm, what are you setting forks to? Hitting lots of hosts at once might be taking up a lot of resources - worth checking /var/log/messages too.

You might like to try wait_for_connection as a way of testing that all your hosts are available. It's nice as it's aware of the connection type, so it works even if you have a mixed inventory of, say, Windows and Linux hosts.
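
A minimal sketch of that suggestion (the play and the 30-second limit are illustrative; wait_for_connection and its timeout option are documented Ansible built-ins):

- hosts: all:!ignored
  gather_facts: false
  tasks:
    - name: Fail unreachable hosts early instead of letting them stall the play
      wait_for_connection:
        timeout: 30   # seconds to keep retrying the connection before failing this host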

@benapetr (Author)

Hi, it's connecting to a maximum of 10 hosts at the same time; this is the config:

[defaults]
host_key_checking = False
timeout = 20
retry_files_save_path = /home/ansible/retry/
# Do not put more jobs here, or ssh will fail
# Anything more than 10 kills this poor server
forks = 10
remote_user = root
log_path=/var/log/ansible.log

[ssh_connection]
retries=3
pipelining=True

I managed to find a "block of hosts" that contains the one which is probably causing this hang, so I am now removing more and more hosts from that temporary file in the hope of identifying the one that causes it.

My rough guess is that it maybe hangs on a DNS query? It's possible that one of these hosts doesn't exist and a misconfigured DNS server makes the query take forever. I have no idea, but I am fairly sure that no detailed information is available even in the debug log; I will try to provide that soonish. I see some debug lines, then I see a green ok: [hostname], and then nothing forever, not a single debug or verbose line. It just hangs.

@benapetr (Author)

After a very long time of debugging and testing servers one by one, I figured out which one was causing this. Its DNS is OK, it responds, and I can ssh there, but there is some problem:

 12502 1506511086.24143: starting run
 12502 1506511086.42742: Loading CacheModule 'memory' from /usr/lib/python2.7/site-packages/ansible/plugins/cache/memory.py
 12502 1506511086.63981: Loading CallbackModule 'default' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/default.py
 12502 1506511086.64073: Loading CallbackModule 'actionable' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/actionable.py (found_in_cache=False, class_only=True)
 12502 1506511086.64109: Loading CallbackModule 'context_demo' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/context_demo.py (found_in_cache=False, class_only=True)
 12502 1506511086.64146: Loading CallbackModule 'debug' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/debug.py (found_in_cache=False, class_only=True)
 12502 1506511086.64166: Loading CallbackModule 'default' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/default.py (found_in_cache=False, class_only=True)
 12502 1506511086.64226: Loading CallbackModule 'dense' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/dense.py (found_in_cache=False, class_only=True)
 12502 1506511086.67297: Loading CallbackModule 'foreman' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/foreman.py (found_in_cache=False, class_only=True)
 12502 1506511086.68313: Loading CallbackModule 'hipchat' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/hipchat.py (found_in_cache=False, class_only=True)
 12502 1506511086.68376: Loading CallbackModule 'jabber' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/jabber.py (found_in_cache=False, class_only=True)
 12502 1506511086.68409: Loading CallbackModule 'json' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/json.py (found_in_cache=False, class_only=True)
 12502 1506511086.68476: Loading CallbackModule 'junit' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/junit.py (found_in_cache=False, class_only=True)
 12502 1506511086.68511: Loading CallbackModule 'log_plays' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/log_plays.py (found_in_cache=False, class_only=True)
 12502 1506511086.68598: Loading CallbackModule 'logentries' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/logentries.py (found_in_cache=False, class_only=True)
 12502 1506511086.68675: Loading CallbackModule 'logstash' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/logstash.py (found_in_cache=False, class_only=True)
 12502 1506511086.68789: Loading CallbackModule 'mail' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/mail.py (found_in_cache=False, class_only=True)
 12502 1506511086.68823: Loading CallbackModule 'minimal' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/minimal.py (found_in_cache=False, class_only=True)
 12502 1506511086.68855: Loading CallbackModule 'oneline' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/oneline.py (found_in_cache=False, class_only=True)
 12502 1506511086.68892: Loading CallbackModule 'osx_say' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/osx_say.py (found_in_cache=False, class_only=True)
 12502 1506511086.68932: Loading CallbackModule 'profile_tasks' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/profile_tasks.py (found_in_cache=False, class_only=True)
 12502 1506511086.68973: Loading CallbackModule 'selective' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/selective.py (found_in_cache=False, class_only=True)
 12502 1506511086.69006: Loading CallbackModule 'skippy' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/skippy.py (found_in_cache=False, class_only=True)
 12502 1506511086.69063: Loading CallbackModule 'slack' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/slack.py (found_in_cache=False, class_only=True)
 12502 1506511086.69186: Loading CallbackModule 'syslog_json' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/syslog_json.py (found_in_cache=False, class_only=True)
 12502 1506511086.69221: Loading CallbackModule 'timer' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/timer.py (found_in_cache=False, class_only=True)
 12502 1506511086.69256: Loading CallbackModule 'tree' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/tree.py (found_in_cache=False, class_only=True)
 12502 1506511086.69296: in VariableManager get_vars()
 12502 1506511086.71899: Loading FilterModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/core.py
 12502 1506511086.71980: Loading FilterModule 'ipaddr' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/ipaddr.py
 12502 1506511086.72396: Loading FilterModule 'json_query' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/json_query.py
 12502 1506511086.72433: Loading FilterModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/mathstuff.py
 12502 1506511086.72655: Loading TestModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/test/core.py
 12502 1506511086.72845: Loading TestModule 'files' from /usr/lib/python2.7/site-packages/ansible/plugins/test/files.py
 12502 1506511086.72880: Loading TestModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/test/mathstuff.py
 12502 1506511086.73542: done with get_vars()
 12502 1506511086.73603: in VariableManager get_vars()
 12502 1506511086.73665: Loading FilterModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/core.py (found_in_cache=True, class_only=False)
 12502 1506511086.73686: Loading FilterModule 'ipaddr' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/ipaddr.py (found_in_cache=True, class_only=False)
 12502 1506511086.73702: Loading FilterModule 'json_query' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/json_query.py (found_in_cache=True, class_only=False)
 12502 1506511086.73717: Loading FilterModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/mathstuff.py (found_in_cache=True, class_only=False)
 12502 1506511086.73765: Loading TestModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/test/core.py (found_in_cache=True, class_only=False)
 12502 1506511086.73783: Loading TestModule 'files' from /usr/lib/python2.7/site-packages/ansible/plugins/test/files.py (found_in_cache=True, class_only=False)
 12502 1506511086.73799: Loading TestModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/test/mathstuff.py (found_in_cache=True, class_only=False)
 12502 1506511086.73921: done with get_vars()

PLAY [all:!ignored] **********************************************************************************************************************************************************************************************************************************
 12502 1506511086.81512: Loading StrategyModule 'linear' from /usr/lib/python2.7/site-packages/ansible/plugins/strategy/linear.py
 12502 1506511086.82241: getting the remaining hosts for this loop
 12502 1506511086.82263: done getting the remaining hosts for this loop
 12502 1506511086.82282: building list of next tasks for hosts
 12502 1506511086.82297: getting the next task for host in-terminal01.prod.domain.tld
 12502 1506511086.82315: done getting next task for host in-terminal01.prod.domain.tld
 12502 1506511086.82330:  ^ task is: TASK: Gathering Facts
 12502 1506511086.82345:  ^ state is: HOST STATE: block=0, task=0, rescue=0, always=0, run_state=ITERATING_SETUP, fail_state=FAILED_NONE, pending_setup=True, tasks child state? (None), rescue child state? (None), always child state? (None), did rescue? False, did start at task? False
 12502 1506511086.82360: done building task lists
 12502 1506511086.82376: counting tasks in each state of execution
 12502 1506511086.82390: done counting tasks in each state of execution:
        num_setups: 1
        num_tasks: 0
        num_rescue: 0
        num_always: 0
 12502 1506511086.82405: advancing hosts in ITERATING_SETUP
 12502 1506511086.82418: starting to advance hosts
 12502 1506511086.82432: getting the next task for host in-terminal01.prod.domain.tld
 12502 1506511086.82448: done getting next task for host in-terminal01.prod.domain.tld
 12502 1506511086.82462:  ^ task is: TASK: Gathering Facts
 12502 1506511086.82478:  ^ state is: HOST STATE: block=0, task=0, rescue=0, always=0, run_state=ITERATING_SETUP, fail_state=FAILED_NONE, pending_setup=True, tasks child state? (None), rescue child state? (None), always child state? (None), did rescue? False, did start at task? False
 12502 1506511086.82492: done advancing hosts to next task
 12502 1506511086.83078: getting variables
 12502 1506511086.83098: in VariableManager get_vars()
 12502 1506511086.83157: Loading FilterModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/core.py (found_in_cache=True, class_only=False)
 12502 1506511086.83176: Loading FilterModule 'ipaddr' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/ipaddr.py (found_in_cache=True, class_only=False)
 12502 1506511086.83193: Loading FilterModule 'json_query' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/json_query.py (found_in_cache=True, class_only=False)
 12502 1506511086.83209: Loading FilterModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/mathstuff.py (found_in_cache=True, class_only=False)
 12502 1506511086.83247: Loading TestModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/test/core.py (found_in_cache=True, class_only=False)
 12502 1506511086.83263: Loading TestModule 'files' from /usr/lib/python2.7/site-packages/ansible/plugins/test/files.py (found_in_cache=True, class_only=False)
 12502 1506511086.83281: Loading TestModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/test/mathstuff.py (found_in_cache=True, class_only=False)
 12502 1506511086.83394: done with get_vars()
 12502 1506511086.83421: done getting variables
 12502 1506511086.83438: sending task start callback, copying the task so we can template it temporarily
 12502 1506511086.83453: done copying, going to template now
 12502 1506511086.83470: done templating
 12502 1506511086.83484: here goes the callback...

TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************************
 12502 1506511086.83515: sending task start callback
 12502 1506511086.83530: entering _queue_task() for in-terminal01.prod.domain.tld/setup
 12502 1506511086.83694: worker is 1 (out of 1 available)
 12502 1506511086.83772: exiting _queue_task() for in-terminal01.prod.domain.tld/setup
 12502 1506511086.83840: done queuing things up, now waiting for results queue to drain
 12502 1506511086.83859: waiting for pending results...
 12511 1506511086.84236: running TaskExecutor() for in-terminal01.prod.domain.tld/TASK: Gathering Facts
 12511 1506511086.84316: in run()
 12511 1506511086.84379: calling self._execute()
 12511 1506511086.85119: Loading Connection 'ssh' from /usr/lib/python2.7/site-packages/ansible/plugins/connection/ssh.py
 12511 1506511086.85207: Loading ShellModule 'csh' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/csh.py
 12511 1506511086.85264: Loading ShellModule 'fish' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/fish.py
 12511 1506511086.85346: Loading ShellModule 'powershell' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/powershell.py
 12511 1506511086.85381: Loading ShellModule 'sh' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/sh.py
 12511 1506511086.85423: Loading ShellModule 'sh' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/sh.py (found_in_cache=True, class_only=False)
 12511 1506511086.85479: Loading ActionModule 'normal' from /usr/lib/python2.7/site-packages/ansible/plugins/action/normal.py
 12511 1506511086.85500: starting attempt loop
 12511 1506511086.85516: running the handler
 12511 1506511086.85572: ANSIBALLZ: Using lock for setup
 12511 1506511086.85589: ANSIBALLZ: Acquiring lock
 12511 1506511086.85606: ANSIBALLZ: Lock acquired: 29697296
 12511 1506511086.85624: ANSIBALLZ: Creating module
 12511 1506511087.15592: ANSIBALLZ: Writing module
 12511 1506511087.15653: ANSIBALLZ: Renaming module
 12511 1506511087.15678: ANSIBALLZ: Done creating module
 12511 1506511087.15765: _low_level_execute_command(): starting
 12511 1506511087.15786: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python && sleep 0'
 12511 1506511087.16363: Sending initial data
 12511 1506511089.69186: Sent initial data (103646 bytes)
 12511 1506511089.69246: stderr chunk (state=3):
>>>mux_client_request_session: session request failed: Session open refused by peer
ControlSocket /home/ansible/.ansible/cp/170b9dc5f6 already exists, disabling multiplexing
<<<

Here it hangs forever. I replaced this company's domain with "domain.tld"; the actual domain name is different.

@benapetr (Author)

This is still a bug in Ansible, at least in the sense that Ansible should produce some verbose ERROR message letting the user know which host is causing the hang-up, so that the host can be removed from the inventory file without a complex investigation of what went wrong.

@harrytruman commented Nov 14, 2017

Also running a large inventory (2400+), seeing similar errors, and going through the same lengthy troubleshooting in hopes of finding log verbosity that points me in a better direction.

ansible 2.4.1.0
python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

@jmighion (Member)

Seeing it get to the end of a playbook and not getting the futex and closing calls. From the strace of a job that shows the problem:

21:39:26 stat("/var/lib/awx/projects/<job>/host_vars", 0x7ffd4ce0dfd0) = -1 ENOENT (No such file or directory)
21:39:26 stat("/var/lib/awx/.cache/facts/<host>", 0x7ffd4ce0dae0) = -1 ENOENT (No such file or directory)
21:39:26 stat("/var/lib/awx/.cache/facts/<host>", 0x7ffd4ce0dcf0) = -1 ENOENT (No such file or directory)
21:39:26 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
21:39:26 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
... infinite Timeout

From a successful job :

21:39:39 stat("/var/lib/awx/projects/<job>/host_vars", 0x7ffd4ce0dfd0) = -1 ENOENT (No such file or directory)
21:39:39 stat("/var/lib/awx/.cache/facts/<host>", 0x7ffd4ce0dae0) = -1 ENOENT (No such file or directory)
21:39:39 stat("/var/lib/awx/.cache/facts/<host>", 0x7ffd4ce0dcf0) = -1 ENOENT (No such file or directory)
21:39:39 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
21:39:39 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
21:39:39 futex(0x7fb178001390, FUTEX_WAKE_PRIVATE, 1) = 1
21:39:39 futex(0x1e9d090, FUTEX_WAKE_PRIVATE, 1) = 1
21:39:39 futex(0x1e9d090, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
21:39:39 futex(0x1e9d090, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
21:39:39 futex(0x1e9d090, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
21:39:39 futex(0x1e9d090, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
21:39:39 futex(0x7f15690, FUTEX_WAKE_PRIVATE, 1) = 1
21:39:39 futex(0x1e9d090, FUTEX_WAKE_PRIVATE, 1) = 1
21:39:39 futex(0x711f9e0, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
21:39:39 close(7)                       = 0
... continues to close and wrap up

@EdwardIII

@benapetr / @jmighion how did you guys solve the underlying problem? Trying to figure out a workaround when issues like this arise

@benapetr (Author)

There are multiple reasons why this can happen. I never said I found a solution. You can in theory run the playbook by hand in a loop on each host separately to figure out which one is causing problems, and then you need to investigate this host.

A good example of a problem that can cause this is a disconnected NFS mount, which would hang even the "df" command.

@jmighion (Member)

Sorry, I don't remember finding a solution. I think ours was hanging on a device we were trying to connect to, but honestly I can't say. We're not hitting the problem anymore though.

@leupo commented Jan 30, 2018

#30411 (comment)

Good example of problem that can cause this is disconnected NFS mount, which would hang even "df" command.

That was exactly my problem. No ansible.cfg parameter (neither timeout nor fact_caching_timeout) helped me interrupt the process. Thank you so much!
IMO this should result in an error/timeout on the control machine.

@strarsis commented Mar 2, 2018

I had a similar issue: I set pipelining from True to False, ran the playbook (successfully after this change), and then set it back to True, after which subsequent playbook runs also worked correctly.

Edit: @benapetr: Thanks! This was actually the underlying issue. A folder is mounted over SSHFS through a reverse SSH tunnel (for pushing from the ansible control machine). After manually connecting over SSH to the target machine and unmounting with umount -l ... the problem went away.

@jeroenflvr

Looks like I have a similar problem with ansible and an nfs mount on the target. lsof, df, ansible.. all hanging.
Setting the pipelining to false in the ansible.cfg did not help.

@strarsis commented Apr 23, 2018

@jeroenflvr: In my case the sshfs mount hangs (also when invoking the mount command manually). I switched to CIFS/samba mount over the SSH reverse tunnel and now it works without hanging. This seems to be partly fuse- but not ansible-related.

@benapetr (Author)

It doesn't matter if the hang on the target host is caused by fuse, NFS or anything else; it's still an Ansible bug in the sense that one target host should not be able to lock up the whole play for all other hosts.

There must be some timeout or internal watchdog implemented that would kill the playbook for misbehaving and broken machines. One broken machine out of 5000 shouldn't break Ansible for all 5000 machines.

@FlorianHeigl

@benapetr yes, this has to be tackled some day; not as a bug, but as a design issue. Such hangs are the big plague, and no single "fix" will resolve them for real. I have used it occasionally since 2011, and IMO it'll always be like this until someone finally accepts that something has to be done at a higher level.

@byumov commented Apr 26, 2018

Same issue for me.
I have six macOS hosts and I run:
ansible-playbook tcmacagents.yml -f 6
It hangs on the first step, "Gathering Facts", always on the same host (tcmacagent5):

ok: [tcmacagent1]
ok: [tcmacagent2]
ok: [tcmacagent6]
ok: [tcmacagent3]
ok: [tcmacagent4]


strace:

[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)

How can I fix it? What is wrong with my tcmacagent5?

# ansible --version
ansible 2.5.0

@leupo commented Apr 26, 2018

If it is a mount and it is not usable (touch a file or something), remount it. In my particular situation I always reverted a VM to a snapshot for testing purposes and the NFS handle doesn't match within this VM, so it is in a stale state.

@dentarg commented May 23, 2018

We have also been experiencing Ansible hanging without any error message when running playbooks against 23 hosts with forks=25. We use Ansible 2.2.3.0, but have the same problem with Ansible 2.5.3. We run our playbooks from macOS 10.12.6, using Python 2.7.15. With forks=5 I don't see the problem. We also don't see the problem when using transport=paramiko (and forks=25).

@kompjuteras

I have this problem when the target server has an enormous load: if I am running a playbook against 5000 servers and there is one server with a load > 100, the playbook will hang indefinitely, mostly on gathering facts, and I haven't found any solution better than the dumb "remove that server from the inventory and try again".
Is there any solution for this?
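
One hedged workaround sketch (mine, not an official recommendation from this thread): skip automatic fact gathering and run setup as an ordinary task, so the task-level timeout keyword (available since Ansible 2.10) can fail a badly loaded host instead of stalling the whole play. The 120-second limit is arbitrary.

- hosts: all
  gather_facts: false
  tasks:
    - name: Gather facts with a per-host time limit
      setup:
      timeout: 120   # task keyword: the slow host fails after 120s and the play continues for the rest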

@bcoca (Member) commented Jul 13, 2022

@Joe-Heffer-Shef

Could this be caused by an input prompt (e.g. a password request)?

@Rakesh2904

Hello All,

I have an issue; any help/suggestions would be really helpful. I will try to summarize the issue I have in hand; please let me know if anything is not clear or more details are needed.

I have a playbook which scans the entire server (using the UAC binary); the tasks involved are as follows.

  1. Create a workspace - based on the maximum available space on the drive (Windows) / filesystem (Linux)
  2. Download the binary from a centralized repo - no issues in this step on any of the servers that were tested
  3. Unzip the binary - no issues in this step on any of the servers that were tested
  4. Execute the binary - this is where I have an issue on a few of the Linux servers (approx. 15-20% of them): the binary execution completes on the server, but Ansible does not move forward to the next task. When we manually execute the same binary it works like a charm (so I claim the binary itself is not the problem), and since the playbook works on most of the servers (so I claim the playbook is not the problem either), I suspect this is something at the server level which I am unable to figure out. On the problematic servers it does not move to the next step even though the execution has actually completed.
  5. Download the azcopy binary from the same centralized repo - no issues on any of the servers that were tested, except for the problematic ones, since Ansible never reaches this task.
  6. Unzip the azcopy binary - no issues, except for the problematic ones, since Ansible never reaches this task.
  7. Upload the collected logs/artefacts from the UAC binary scan to BLOB storage - no issues, except for the problematic ones, since Ansible never reaches this task.
  8. Clean up the workspace - no issues, except for the problematic ones, since Ansible never reaches this task.

The troubleshooting I have done on this is as follows.

  1. Tried to increase the disk space - it worked (this was done since most of the problem servers had less than 50G of available space on the FS used as the workspace), but we can't conclude this is the cause, because the issue still occurs when I use NAS storage of 2T size.
  2. Since the binary scans the entire server, there is one particular scan (rpm -a -V) that is the one which takes time, so we disabled it and scanned the problematic server again; it worked like a charm. Comparing the problematic server with a working server, there is not much difference in the rpm packages. Also, this scan is important and has to be included in the binary. (In my opinion the problem lies here, but I am unsure how to fix it.)
  3. There are a few other small things that I've tested which I don't think are worth mentioning here.

PS: The entire scan takes around 30 minutes to 2 hours on the servers we have tested, so we have not set any sort of timeout.

Any suggestions would be of real great help.

@BillKanawyer commented Sep 14, 2022

I am also chasing a variant of the hanging problem using a with_nested lineinfile edit of /etc/group that includes a complex regexp expression with back references.

The task in question has been in use for about four years now across a wide range of OSes but has recently been hanging on every host. Clearly, something has changed in Ansible, Python, or another supporting module.

Being an old Unix developer, my first thought after "test simplification" was to chase the actual hang on the target host.

As all of you likely know, when ansible-playbook runs it copies a bundle of Python code to ~/.ansible/tmp/ansible-tmp-<UUID string> on the target host. In my case the bundle is named AnsiballZ_lineinfile.py.

As per https://docs.ansible.com/ansible/latest/dev_guide/debugging.html, it is possible to unpack this bundle (which contains some compressed python code) using the command:

python AnsiballZ_lineinfile.py explode

The unpacked code is placed into a "debug_dir" directory alongside AnsiballZ_lineinfile.py. The unpacked code can be modified and then run using the command:

python AnsiballZ_lineinfile.py execute

Obviously this execution hangs as well, but at least we have recourse to strace:

strace -o /tmp/output.txt -f -r -s 128 python AnsiballZ_lineinfile.py execute

(Note: additional strace arguments like "-r -s 128" increase the amount of useful output.)

This at least provides some system call insights into what the process is doing just before it hangs.

The following is a selected bit from one blocked-process debug session:

4902       0.000023 openat(AT_FDCWD, "/etc/group", O_RDONLY|O_CLOEXEC) = 4
4902       0.000019 newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=1236, ...}, AT_EMPTY_PATH) = 0
4902       0.000020 ioctl(4, TCGETS, 0x7ffcb9818ac0) = -1 ENOTTY (Inappropriate ioctl for device)
4902       0.000017 lseek(4, 0, SEEK_CUR) = 0
4902       0.000017 read(4, "root:x:0:\ndaemon:x:1:\nbin:x:2:\nsys:x:3:\nadm:x:4:root,jeff,bill\ntty:x:5:\ndisk:x:6:\nlp:x"..., 4096) = 1236
4902       0.000021 read(4, "", 4096)   = 0
4902       0.000016 close(4)            = 0

After the close the process appears to be hanging but in my particular instance, given sufficient time it returns to complete execution:

4902     **456.133294** newfstatat(AT_FDCWD, "/etc/group", {st_mode=S_IFREG|0644, st_size=1236, ...}, 0) = 0
4902       0.001059 write(1, "\n{\"changed\": false, \"msg\": \"\", \"backup\": \"\", \"diff\": [{\"before\": \"\", \"after\": \"\", \"before_header\": \"/etc/group (content)\", \"afte"..., 783) = 783
4902       0.000201 write(1, "\n", 1)   = 1
4902       0.000148 newfstatat(AT_FDCWD, "/tmp/ansible_ansible.builtin.lineinfile_payload_vy8b7kp2", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
4902       0.000032 openat(AT_FDCWD, "/tmp/ansible_ansible.builtin.lineinfile_payload_vy8b7kp2", O_RDONLY|O_CLOEXEC) = 4
...
4902       0.000011 close(5)            = 0
4902       0.000020 unlinkat(4, "ansible_ansible.builtin.lineinfile_payload.zip", 0) = 0
4902       0.000045 close(4)            = 0
4902       0.000012 rmdir("/tmp/ansible_ansible.builtin.lineinfile_payload_vy8b7kp2") = 0
...
4902       0.000199 exit_group(0)       = ?
4902       0.000525 +++ exited with 0 +++

Note the comparatively huge relative time between when the process is finished reading /etc/group and when it decides the regular expression has determined that no changes are required. This points at potential inefficiencies in the regexp being used in the ansible.builtin.lineinfile.

This is not a substitute for Ansible failing to support a task timeout value, something badly needed.

But at least someone facing what appears to be a block can obtain some clues as to possible causes.

@bcoca (Member) commented Sep 14, 2022

@BillKanawyer
|This is not a substitute for Ansible failing to support a task timeout value, something badly needed.

like the existing timeout keyword?
https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html#task

Which I had commented about here previously: #30411 (comment)
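
For reference, a minimal sketch of that keyword applied to the kind of task this issue started with (the play and the 60-second limit are illustrative; timeout is the documented task keyword, available since Ansible 2.10):

- hosts: all:!ignored
  tasks:
    - name: Probe the host, but never wait on it indefinitely
      command: echo works
      timeout: 60   # the task fails on this host after 60s instead of hanging the whole play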

@BillKanawyer

@BillKanawyer |This is not a substitute for Ansible failing to support a task timeout value, something badly needed.

like the existing timeout keyword? https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html#task

Which i had commented about here previously #30411 (comment)

Thank you; I respectfully withdraw that observation.

@bcoca (Member) commented Sep 14, 2022

@BillKanawyer you have reminded me of an idea, something we have sometimes done internally: appending 'strace' to the remote commands to try to give the user a way to narrow down the issue. This might be worth adding as an 'execution mode' that does this without making you jump through 8-10 steps.

@BillKanawyer

Having an 'execution mode' option that adds strace functionality would indeed be useful.

I have also been looking into using python trace:

# python3 -m trace --ignore-dir=/usr/lib --trace AnsiballZ_lineinfile.py execute > /tmp/info.txt 2> /tmp/error.txt

This approach suggests my problem may not be the fault of regexp but rather a missing package:

# egrep -n base64 /tmp/info.txt /tmp/error.txt
/tmp/info.txt:24:AnsiballZ_lineinfile.py(17):     import base64
/tmp/error.txt:15:    import base64
/tmp/error.txt:16:ModuleNotFoundError: No module named 'base64'
/tmp/error.txt:38:    import base64
/tmp/error.txt:39:ModuleNotFoundError: No module named 'base64'

Note that "import base64" is the last line in the stdout file "info.txt"

Using locate seems to confirm this. So this approach is generating concrete debugging results.

Why AnsiballZ_lineinfile.py isn't bailing out on import failure and why the base64 module is missing are open questions.

@bcoca (Member) commented Sep 14, 2022

It does not import it directly; this happens in the common module_utils code, which does not guard on base64 since it is part of core Python and 'assumed' to be present.

Any import that is not part of core Python should be guarded, but core libraries are not required to be. In any case, that should result in a quick error, not a hang.

@BillKanawyer

I thought it might be useful to some to leave a few concluding remarks now that I have reached a more complete understanding of my issue.

As mentioned before, it is possible to "explode" the AnsiballZ_lineinfile.py bundle, instrument the code, and then "execute" the result. To verify it was my regular expression that was causing the problem, I bracketed the re.compile & re.match calls with print statements.

Verifying that my real issue was with the underlying python3 re code and not Ansible lineinfile allowed me to redirect my efforts to understanding why the supplied regular expression match time was increasing exponentially with each character added to the string being matched.

Thank you @bcoca for providing useful feedback.

@Wehzie commented Dec 23, 2022

I have a virt-manager guest running Ubuntu 22.04 LTS. Initially, the network configuration was set to NAT mode and my playbook worked fine. After changing the network configuration to Open mode, the playbook hangs on the apt module. When commenting out the apt module, while keeping the Open network mode, the playbook runs fine.

Changing back and forth between network modes yields the same outcome. Why would the apt module in particular be problematic?

Edit:
Host OS is Ubuntu 22.10, ansible installed via APT. Had the same hang in another context. Was able to fix the issue by removing the package with apt purge and reinstalling.

@mutech commented Feb 28, 2023

Interesting to read this... I keep having variations of this issue with different target hosts, different ansible controllers, different playbooks and at different tasks, and with different targets in the same playbook when I interrupt the play and restart it. This happens whether I am working with remote targets or targets on the local network. There is no pattern that I could discern, other than that more complex playbooks tend to hang more often and more simultaneous targets create more hangs, consistently. This happens just as likely with stock ansible from Ubuntu packages, ansible installed from the ansible repo, or ansible installed via pip. Same in the office (a completely different network setup) or at home. On the other end of the connection I see no running processes, no pending connections. It looks very much like the last action terminated cleanly. And again, this happens with any kind of command. I have not the slightest idea what the cause of all this could be, because the only constant is the observable effect.

There are days when I have no such problems, but I have just concluded that this is how ansible works. Windows 3.1 style problem solving: rinse, reboot and repeat until it's working or until the day is over and it's somebody else's problem.

I wish this had happened before I invested months learning ansible and writing scripts for everything. This is really frustrating.

@kharnden-amd

I have also tried several iterations of -T and connect_timeout, and I still can't force a timeout of a particular part of my playbook when a host has an issue with load or hanging. It's very frustrating.

@esmacgil

The project that I work on seems to be having this problem, or a problem like it, and when I see it "live" I check ps -ef to find the stalled command and then try to execute a similar command. The command is usually in one of these forms:

ssh <ssh options> <target IP> /bin/sh -c 'cat <some file path> && sleep 0'
ssh <ssh options> <target IP> /bin/sh -c 'python3 <some AnsiballZ path> && sleep 0'

so I tend to try to execute the same (or similar) command. Every time I do these manual commands they stall, so this makes me wonder how it ever works for Ansible.

That said, the ssh part does connect, and if I start typing and follow with a \n, then most of the time I get an "echo" of what I typed, at least for the cat command. This makes me think that for whatever reason the "file path" that was meant to be passed to cat is not being passed, and instead cat is in "read from stdin" mode.

When I do the python3 command, I actually get the interactive mode of Python3 and I can type python statements and they get executed. So again, the "command" appears to not be getting the argument and therefore going into "read from stdin" mode.

Further, when I modify the command to be like:

ssh <ssh options> <target IP> "/bin/sh -c 'cat <some file path> && sleep 0'"
ssh <ssh options> <target IP> "/bin/sh -c 'python3 <some AnsiballZ path> && sleep 0'"

Then I never get this "read from stdin" mode and the behavior that I expect is the behavior that I get.

I can also get the desired behavior if I do commands like:

ssh <ssh options> <target IP> /bin/sh -c 'echo && cat <some file path> && sleep 0'
ssh <ssh options> <target IP> /bin/sh -c 'echo && python3 <some AnsiballZ path> && sleep 0'

Though this produces a blank line in the output, so not the best solution.

Has anyone else noticed this behavior?
Am I observing some other problem?

@mutech commented Apr 11, 2023

The command is usually in the forms:

ssh <ssh options> <target IP> /bin/sh -c 'cat <some file path> && sleep 0'
ssh <ssh options> <target IP> /bin/sh -c 'python3 <some AnsiballZ path> && sleep 0'

I see these processes for tasks that are still alive. After some time, they are gone and the controller keeps hanging.

If these processes keep running, I guess you have a different problem, at least from what I observe. When your scripts consistently hang for specific tasks (which is not what I see on my side), then it looks more like an issue with the task itself. Did you try to strace the python process to see what it's doing?

Has anyone else noticed this behavior? Am I observing some other problem?

I tend to believe that this is another problem. What happens if you interrupt the playbook and restart it? Will it hang at the same task consistently? For me, this is definitely not the case, I see randomish behavior. A slow network definitely increases the probability of such hangers, as well as more target hosts and more complex scripts. Then I have periods of days or even weeks without problems.

@priyankaprabhune26

I am also facing the same issue on RHEL 8 when I am executing the command through Jenkins, but it works from the VM.

It's getting hung on AnsiballZ_command.py:
44699 1686567241.27096: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python /home//.ansible/tmp/ansible-tmp-1686567240.16-44699-236344823441760/AnsiballZ_command.py && sleep 0'

@techie879 commented Jun 12, 2023 via email

@GDMTT commented Oct 20, 2023

I didn't find this until now, I'm seeing the same issues on Windows servers

ansible-collections/ansible.windows#564

I'm also finding servers that still have the Ansible process running even after the job has completed without hangs.

I regret ever considering Ansible, currently looking at what Azure will offer for patching.

@GDMTT commented Oct 20, 2023

Has anyone here tried async and poll to mitigate hung tasks? https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_async.html

@FriedCircuits (Contributor)

Hs anyone here tried async and poll to mitigate hung tasks ? https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_async.html

So far async has worked for us. We had a lot of hanging during AMI builds with packer and ansible. It worked fine for years and recently started hanging this summer. We had another task hang today so going to set it to async as well.

@GDMTT commented Oct 20, 2023

o far async has worked for us. We had a lot of hanging during AMI builds with pack

Ok, just tried it and I got a hang on the one server, which is good because I could dig into the issue, bad because the async didn't help, or I configured it wrong.

@FriedCircuits (Contributor)

o far async has worked for us. We had a lot of hanging during AMI builds with pack

Ok, just tried it and I got a hang on the one server, which is good because I could dig into the issue, bad because the async didnt help, or I configured it wrong.

This is what I am using

   async: 1800 
   poll: 5 
   timeout: 1800 
   retries: 1 
   delay: 5 
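
For context, a hedged sketch of how keywords like these might attach to a single long-running task (the task name and command are hypothetical; async, poll and timeout are documented Ansible task keywords):

- name: Run the long step in the background and poll it
  command: /usr/local/bin/long-running-step.sh   # hypothetical command
  async: 1800    # allow up to 30 minutes on the remote side
  poll: 5        # the controller checks back every 5 seconds instead of blocking on the connection
  timeout: 1800  # hard per-task limit as a second safety net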

@GDMTT commented Oct 24, 2023

o far async has worked for us. We had a lot of hanging during AMI builds with pack

Ok, just tried it and I got a hang on the one server, which is good because I could dig into the issue, bad because the async didnt help, or I configured it wrong.

This is what I am using

   async: 1800 
   poll: 5 
   timeout: 1800 
   retries: 1 
   delay: 5 

thanks for sharing, will give it a try today :-)

@GDMTT commented Oct 24, 2023

"msg": "No job id was returned by the async task",

- name: Reboot - 3600s timeout
  ansible.windows.win_reboot:
    reboot_timeout: 3600
  async: 1800
  poll: 5
  timeout: 1800
  retries: 5
  delay: 120

Is this a limitation of async not working across reboots?

@GDMTT commented Oct 24, 2023

The windows update module doesn't support async.

TASK [Apply Security, Critical updates, Update Rollups log to C:\ansible_wu.txt] ***
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: NoneType: None
fatal: [server1.example.LOCAL]: FAILED! => {"changed": false, "msg": "async is not supported for this task when reboot=yes"}

@wrouesnel (Contributor)

So I found one cause of this: if you use set_fact to configure ansible_become at any point in your playbook, roles or anything else.

Due to variable precedence orders, set_fact overrides, amongst other things, task variables.

This manifests as an infinite freeze when you wind up waiting on a sudo prompt you don't expect (I see this when switching to connection: local to do some controller specific thing and it tries to sudo).

Basically the following playbook won't do what you think:

- hosts: all
  tasks:
  - name: Set fact here
    set_fact:
      ansible_become: true
  - name: Now do a local thing
    copy:
      dest: somefile.txt
      content: |
        somedata
    vars:
      ansible_become: false
    connection: local
  - name: This also won't work
    copy:
      dest: somefile.txt
      content: |
        somedata
    become: false
    connection: local

This is documented, in the sense that it's out there, but it seems absolutely bonkers to me: an event which might have happened hundreds of playbook tasks earlier can apparently override explicit variable specifications on the task in front of you.
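
A hedged workaround sketch (mine, not from this thread): keep privilege escalation in the become keyword or in inventory variables rather than in set_fact, so that a task-level become: false still wins:

- hosts: all
  become: true           # play keyword instead of set_fact: ansible_become: true
  tasks:
    - name: Local task that must not sudo
      copy:
        dest: somefile.txt
        content: |
          somedata
      become: false        # task keyword overrides the play keyword as expected
      connection: local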

@C0rn3j (Contributor) commented Feb 8, 2024

Looks like a possible dupe of #18305
