
regression since Jan 2021 breaks ansible #346

Open
ajf8 opened this issue Feb 14, 2022 · 8 comments

Comments

@ajf8
Contributor

ajf8 commented Feb 14, 2022

Hi!

I've been running the 12 release for a while on many hosts without issue. I recently upgraded to master, and running ansible against a host with tlog enabled became quite unstable, with tasks hanging. There are defunct processes such as sleep under ansible. Downgrading to the 12 release fixes the issue.

OS is CentOS 8.4. Pipelining is enabled on the SSH connection in ansible. It doesn't seem to happen on all ansible tasks, but soon after the upgrade it was pretty frequent.

Looking at the changes since the 12 release, I believe this commit to be the most likely cause of the regression. Could the void packet be missing, and therefore the recording loop not breaking out?

f03ff12

I will try to get a reproducer together.

@justin-stephenson
Collaborator

Hi, thank you for reporting this issue. The https://github.com/Scribery/tlog/releases/tag/v12 release should already include commit f03ff12, so I'm not exactly sure what is happening. The only differences between v12 and master are CI-related things (nothing that would affect tlog functionality).

Reproducer steps with exact package versions would be great.

@ajf8
Contributor Author

ajf8 commented Feb 14, 2022

Sorry, I guess I had the wrong idea of what the 12 release is :)

I mean this commit from 20 Jan 2021: 31f198c

@ajf8 changed the title from "regression since 12 release breaks ansible" to "regression since Jan 2021 breaks ansible" on Feb 14, 2022
@justin-stephenson
Collaborator

Okay then yes f03ff12 is most likely the culprit.

If you can provide a minimal ansible reproducer, that would be great.

@ajf8
Contributor Author

ajf8 commented Feb 14, 2022

Reproduced at home on Fedora 34. I compiled and installed master (c23a145) and set my shell to /usr/bin/tlog-rec-session. The Ansible version below is from the Fedora 34 RPM, but I also hit it with earlier 2.8.

[alan@fedora tlog]$ ansible --version
ansible 2.9.27
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/alan/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.9/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 3.9.4 (default, Apr  6 2021, 00:00:00) [GCC 11.0.1 20210324 (Red Hat 11.0.1-0)]
[alan@fedora tlog]$ 

Pipelining is enabled (ansible.cfg in the directory the playbook is run from):

[alan@fedora ~]$ cat ansible.cfg 
[ssh_connection]
pipelining=True

A simple playbook that runs a command and copies a file:

[alan@fedora ~]$ cat test.yaml 
---
  - hosts: all
    gather_facts: false
    tasks:
      - name: run a command
        command: dmesg
      - name: copy a file
        copy:
          src: test.yaml
          dest: /tmp/test_file.yaml

Run the playbook.

[alan@fedora ~]$ ansible-playbook -l 127.0.0.1, -i 127.0.0.1, test.yaml 

PLAY [all] ***********************************************************************************************************************************************************************************************

TASK [run a command] *************************************************************************************************************************************************************************************
[DEPRECATION WARNING]: Distribution fedora 34 on host 127.0.0.1 should use /usr/bin/python3, but is using /usr/bin/python for backward compatibility with prior Ansible releases. A future Ansible 
release will default to using the discovered platform python for this host. See https://docs.ansible.com/ansible/2.9/reference_appendices/interpreter_discovery.html for more information. This feature 
will be removed in version 2.12. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
changed: [127.0.0.1]

TASK [copy a file] ***************************************************************************************************************************************************************************************

And it's stuck. Here we can see the ansible controller process running, and the connection over local SSH.

root       81444  0.0  0.2  45148 10972 ?        Ss   21:11   0:00  \_ sshd: alan [priv]
alan       81447  0.0  0.1  45148  5920 ?        S    21:11   0:00  |   \_ sshd: alan@pts/1
alan       81448  0.0  0.2  30104  8204 pts/1    Ss+  21:11   0:00  |       \_ -tlog-rec-session
alan       81449  0.0  0.1  17412  5548 pts/2    Ss   21:11   0:00  |           \_ -bash
alan       84225 14.8  1.2 132548 51224 pts/2    Sl+  21:39   0:11  |               \_ /usr/bin/python3 /usr/bin/ansible-playbook -l 127.0.0.1, -i 127.0.0.1, test.yaml
alan       84287  0.0  1.0 131508 41668 pts/2    S+   21:39   0:00  |                   \_ /usr/bin/python3 /usr/bin/ansible-playbook -l 127.0.0.1, -i 127.0.0.1, test.yaml
alan       84288  0.0  0.1  18740  6392 pts/2    S+   21:39   0:00  |                       \_ ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentic
root       84230  0.0  0.2  45148 10856 ?        Ss   21:39   0:00  \_ sshd: alan [priv]
alan       84236  0.0  0.1  45416  6488 ?        S    21:39   0:00      \_ sshd: alan@notty
alan       84289  0.0  0.1  29976  7844 ?        Ss   21:39   0:00          \_ tlog-rec-session -c /bin/sh -c 'echo ~ && sleep 0'
alan       84290  0.0  0.0      0     0 ?        Z    21:39   0:00              \_ [sleep] <defunct>

This reproduces what I saw earlier, when it was getting stuck specifically on a copy task while some other tasks ran OK. SSH pipelining was enabled. Interestingly, when I remove SSH pipelining, the first command doesn't work either.

[alan@fedora ~]$ rm ansible.cfg 
[alan@fedora ~]$ ansible-playbook -l 127.0.0.1, -i 127.0.0.1, test.yaml 


PLAY [all] ***********************************************************************************************************************************************************************************************

TASK [run a command] *************************************************************************************************************************************************************************************
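It hangs on the first task again. Since the login shell on the target is tlog-rec-session, the same path can presumably also be hit without ansible at all, just by running a short-lived command over ssh. This is a guess on my part that mirrors what the ps output above shows ansible running; I haven't verified that it isolates the bug:

[alan@fedora ~]$ ssh 127.0.0.1 'echo ~ && sleep 0'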

@justin-stephenson
Collaborator

This should be fixed in the latest release, v12.1. Can you please check it, @ajf8?

@ajf8
Contributor Author

ajf8 commented Jul 7, 2022

Hi @justin-stephenson, I've just encountered the same or a similar issue on an Ubuntu 20.04 host, using a package built from master a few weeks ago. I'll try to get another reproducer together. To test, I was running "dmesg" via ansible in a loop; it got stuck on the 9th iteration. Pipelining was enabled.
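Roughly the loop I was using (from memory, so the exact form may have differed; the inventory host is a placeholder):

for i in $(seq 1 20); do
    echo "iteration $i"
    ansible all -i target.example.com, -m command -a dmesg || break
done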

@justin-stephenson
Collaborator

@ajf8 Okay, thanks for letting me know. I was hoping this was taken care of once and for all. Do you see it fail frequently? I'll try to reproduce on my end with Fedora/RHEL.

@Manawyrm

I also ran into this issue on a number of machines. After taking a closer look, this seems to be some kind of race condition, as it appears to be somewhat time-dependent.

Running:

[ansible]$ ssh root@host  'echo hello world'
hello world

ATTENTION! Your session is being recorded!

(hangs indefinitely, or until Return is pressed)

By adding a sleep, it starts to work:

[ansible]$ ssh root@host  'sleep 1 && echo hello world'

ATTENTION! Your session is being recorded!

hello world
[ansible]$

Exits after a second, as expected.

This only seems to happen on very fast machines (64+ modern physical cores), which might explain why I couldn't reproduce it in a VM.
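A rough way to probe the timing threshold (just a sketch, not something I have run exhaustively; root@host is the same placeholder as above):

for d in 0 0.01 0.05 0.1 0.5 1; do
    echo "delay=$d"
    timeout 10 ssh root@host "sleep $d && echo hello world" || echo "  hung, killed by timeout"
done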

I have attached GDB to the process and taken a backtrace in the hung state, which looks as follows:

(gdb) bt full
#0  0x0000ffff90c5bc98 in __GI___poll (fds=0xaaaac9e5d768, nfds=nfds@entry=2, 
    timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:41
        _x3tmp = 0
        _x0tmp = 187650508445544
        _x0 = 187650508445544
        _x3 = 0
        _x4tmp = 0
        _x1tmp = 2
        _x1 = 2
        _x4 = 0
        _x2tmp = <optimized out>
        _x2 = 0
        _x8 = 73
        _sys_result = <optimized out>
        sc_ret = <optimized out>
        sc_ret = <optimized out>
        timeout_ts = {tv_sec = 187650508474048, tv_nsec = 281473929059056}
        timeout_ts_p = <optimized out>
#1  0x0000ffff90d7c718 in poll (__timeout=-1, __nfds=2, __fds=<optimized out>)
    at /build/tlog-VBYllq/tlog-12/lib/tlog/tty_source.c:212
No locals.
#2  tlog_tty_source_poll_if_active (nfds=2, timeout=-1, fds=<optimized out>)
    at /build/tlog-VBYllq/tlog-12/lib/tlog/tty_source.c:217
        i = <optimized out>
        i = <optimized out>
#3  tlog_tty_source_read (source=0xaaaac9e5d6b0, pkt=0xffffc18e1c18)
    at /build/tlog-VBYllq/tlog-12/lib/tlog/tty_source.c:281
        tty_source = <optimized out>
        ts = {tv_sec = 281473111763260, tv_nsec = 187650508445360}
        real_ts = {tv_sec = 48, tv_nsec = 281473929059056}
#4  0x0000ffff90d72958 in tlog_rec_transfer (psignal=<synthetic pointer>, in_fd=0, 
    item_mask=<optimized out>, latency=<optimized out>, tty_sink=<optimized out>, 
    log_sink=0xaaaac9e61c30, tty_source=<optimized out>, perrs=0xffffc18e1e90)
    at /build/tlog-VBYllq/tlog-12/lib/tlog/rec.c:961
        j = <optimized out>
        last_alarm_caught = 1
        pkt = {timestamp = {tv_sec = 0, tv_nsec = 0}, real_ts = {tv_sec = 0, tv_nsec = 0}, type = TLOG_PKT_TYPE_VOID, data = {window = {width = 0, height = 0}, io = {output = false, buf = 0x0, buf_owned = false, len = 0}}}
        tty_pos = {type = TLOG_PKT_TYPE_VOID, val = 0}
        return_grc = 0
        sa = {__sigaction_handler = {sa_handler = 0xffff90d679b0 <tlog_rec_sigchld_handler>, sa_sigaction = 0xffff90d679b0 <tlog_rec_sigchld_handler>}, sa_mask = {__val = {0, 281473929058960, 281473110153140, 0, 0, 281473929058992, 281473111517172, 281473929059352, 0, 281473929059984, 281473929059948, 187650508456944, 0, 281473929059056, 281473111762404, 281473929059440}}, sa_flags = 0, sa_restorer = 0x0}
        log_pending = false
        log_pos = {type = TLOG_PKT_TYPE_VOID, val = 0}
        exit_sig = {32, 15, 1}
        grc = 0
        i = <optimized out>
        new_alarm_caught = 1
        exit_sig = <optimized out>
        return_grc = <optimized out>
        grc = <optimized out>
        i = <optimized out>
        j = <optimized out>
        sa = <optimized out>
        log_pending = <optimized out>
        last_alarm_caught = <optimized out>
        new_alarm_caught = <optimized out>
        pkt = <optimized out>
        tty_pos = <optimized out>
        log_pos = <optimized out>
        cleanup = <optimized out>
#5  tlog_rec (perrs=perrs@entry=0xffffc18e1e90, euid=euid@entry=0, egid=egid@entry=0, cmd_help=<optimized out>, conf=<optimized out>, opts=opts@entry=6, path=<optimized out>, argv=<optimized out>, in_fd=in_fd@entry=0, out_fd=out_fd@entry=1, err_fd=err_fd@entry=2, pstatus=pstatus@entry=0xffffc18e1e6c, psignal=psignal@entry=0xffffc18e1e68) at /build/tlog-VBYllq/tlog-12/lib/tlog/rec.c:1273
        grc = <optimized out>
        clock_id = <optimized out>
        session_id = 167
        lock_acquired = true
        obj = 0xaaaac9e5eac0
        num = <optimized out>
        latency = <optimized out>
        item_mask = <optimized out>
        signal = 0
        log_sink = 0xaaaac9e61c30
        tap = {pid = 561992, source = 0xaaaac9e5d6b0, sink = 0xaaaac9e5ef40, in_fd = 5, out_fd = 6, tty_fd = -1, termios_orig = {c_iflag = 0, c_oflag = 0, c_cflag = 0, c_lflag = 0, c_line = 0 '\000', c_cc = '\000' <repeats 31 times>, c_ispeed = 0, c_ospeed = 0}, termios_set = false}
#6  0x0000aaaab8db11f8 in main (argc=3, argv=0xffffc18e2028) at ./src/tlog/tlog-rec-session.c:138
        grc = 0
        euid = 0
        egid = 0
        errs = 0x0
        perrs = <optimized out>
        conf = 0xaaaac9e603f0
        cmd_help = 0xaaaac9e60430 "Usage: tlog-rec-session [OPTION...] [CMD_FILE [CMD_ARG...]]\n   or: tlog-rec-session -c [OPTION...] CMD_STRING [CMD_NAME [CMD_ARG...]]\nStart a shell and log terminal I/O.\n\nGeneral options:\n    -h, --he"...
        std_fds = {0, 1, 2}
        charset = 0xffff8f48bd8c "UTF-8"
        shell_path = 0xaaaac9e5eaa0 "/bin/bash"
        shell_argv = 0xaaaac9e5bc00
        status = 1
        signal = 0
(gdb) 

Running strace on the hung process looks like this:

# strace -p `pidof tlog-rec-session`
strace: Process 565674 attached
ppoll([{fd=0, events=POLLIN}, {fd=-1}], 2, NULL, NULL, 0    

until after a while this happens:

# strace -p `pidof tlog-rec-session`
strace: Process 565674 attached
ppoll([{fd=0, events=POLLIN}, {fd=-1}], 2, NULL, NULL, 0) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
write(3, "{\"ver\":\"2.3\",\"host\":\""..., 261) = 261
ppoll([{fd=0, events=POLLIN}, {fd=-1}], 2, NULL, NULL, 0

Pressing Return in the hung ssh connection leads to this trace:

# strace -p `pidof tlog-rec-session`
strace: Process 565782 attached
ppoll([{fd=0, events=POLLIN}, {fd=-1}], 2, NULL, NULL, 0^[[A
) = 1 ([{fd=0, revents=POLLIN}])
read(0, "\n", 4096)                     = 1
write(5, "\n", 1)                       = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=565782, si_uid=0} ---
+++ killed by SIGPIPE +++

Any ideas on what could cause this? Thank you!
