Reboot broken on Ubuntu 16.04 hosts #1488

Open
hyperknot opened this issue Jul 18, 2016 · 21 comments

Comments

@hyperknot

hyperknot commented Jul 18, 2016

The built-in reboot() function has been working perfectly on both Ubuntu 14.04 and FreeBSD 10.x hosts, but it is broken on Ubuntu 16.04 hosts.

What is happening on Ubuntu 14.04:
I receive output like this and the system reboots; after the reboot, Fabric reconnects.

[ubuntu] out:
[ubuntu] out:
[ubuntu] out: Broadcast message from root@ubuntu
[ubuntu] out:
[ubuntu] out:   (/dev/pts/0) at 15:02 ...
[ubuntu] out:
[ubuntu] out:
[ubuntu] out:
[ubuntu] out:
[ubuntu] out: The system is going down for reboot NOW!
[ubuntu] out:
[ubuntu] out:

What is happening on Ubuntu 16.04:

  1. There is no output at all from the command.
  2. The system actually starts rebooting (still no output in Fabric)
  3. The system finishes rebooting, but Fabric doesn't realise it: it does not reconnect, and there is still no output.
  4. Fabric just sits there waiting seemingly forever.

If I press the Enter key in this state, Fabric actually continues, but first shows this message:

No handlers could be found for logger "paramiko.transport"
Warning: sudo() received nonzero return code -1 while executing 'reboot'!

I am using this code for reboot:

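import time

from fabric.api import reboot, settings


def reboot_():
    # warn_only keeps Fabric from aborting if the reboot command reports failure
    with settings(warn_only=True):
        print('rebooting')
        start_time = time.time()
        reboot(wait=1200)
        print('reboot took: {} seconds'.format(time.time() - start_time))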
@hyperknot
Author

It is exactly the same with run('reboot')

@bitprophet
Member

bitprophet commented Jul 19, 2016

It being the same with a manual run is unsurprising - clearly something changed regarding Ubuntu's handling of reboot, SSH connections, etc.

Nothing obvious springs to mind, but reboot() (Fab's, not Linux's) is pretty basic - it simply calls sudo('reboot'), and temporarily tweaks Fabric's general reconnection settings so it can handle reconnecting after a nontrivial reboot sequence (versus the default, which would give up pretty quickly).

See its definition:

def reboot(wait=120, command='reboot', use_sudo=True):

You might want to try tweaking that.

Also try enabling Paramiko's logging (see bottom of our troubleshooting page - http://www.fabfile.org/troubleshooting.html) as it might yield a clue.
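
A minimal sketch of turning on that logging with only the standard library, assuming Paramiko logs under the "paramiko" logger hierarchy (the "paramiko.transport" name in the original report suggests it does). The helper name enable_paramiko_debug_logging is made up for this example; calling logging.basicConfig(level=logging.DEBUG) at the top of the fabfile is the simplest alternative.

```python
import logging


def enable_paramiko_debug_logging(stream=None):
    """Attach a DEBUG-level handler to the "paramiko" logger.

    Hypothetical helper for illustration; stream=None means sys.stderr.
    """
    logger = logging.getLogger("paramiko")
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Child loggers such as "paramiko.transport" propagate to this handler, so configuring a handler this way should also silence the "No handlers could be found" message.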

@bitprophet
Member

bitprophet commented Jul 19, 2016

Actually, on second thought, it sounds like Ubuntu's reboot is somehow never exiting or submitting an exit code to Fabric's execution handlers (run/sudo), since you note that sudo is what gets mad when you mash Enter after waiting.

If you look at the reboot() code, it expects the sudo('reboot') call to exit eventually, so that it can A) wait a bit and B) initiate reconnection.

The fact that, on Fabric's end, execution is just hanging out within the sudo means something remotely is violating that expectation. Kind of strange. Maybe a bug in Fabric itself, but feels more like bad behavior on the remote end. (P.S.: which fabric version(s) are you seeing this on?)

Offhand thought - we could perhaps set timeout= on the sudo, then except TimeoutException: pass around it. This would ensure that even in this (strange) situation, we default to trying a reconnect.

Only downside would be the case where reboot is actually hanging and the system is not truly rebooting, but it's not like we'd make things any worse for that case by the above change - the infinite hang would just happen on the connection loop instead of within the sudo.
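
A rough sketch of that idea, written against injected callables rather than Fabric itself so it stays self-contained. Everything named here is an assumption for illustration: run_command stands in for a wrapper around sudo() with a timeout, reconnect for whatever re-establishes the connection, and CommandTimedOut is a placeholder for the timeout exception your Fabric version actually raises.

```python
class CommandTimedOut(Exception):
    """Placeholder for the runner's real timeout exception (assumption)."""


def reboot_then_reconnect(run_command, reconnect, command="reboot", timeout=30):
    """Issue the reboot, swallow a timeout, then always try to reconnect.

    run_command(command, timeout=...) and reconnect() are caller-supplied
    callables; both are hypothetical stand-ins for this sketch.
    """
    try:
        run_command(command, timeout=timeout)
    except CommandTimedOut:
        # The remote side never reported an exit status; carry on anyway.
        pass
    return reconnect()
```

If the host is not actually rebooting, the wait simply moves from the command call to the reconnect step, which matches the trade-off described above.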

@hyperknot
Author

Another really strange change in behaviour in Ubuntu 16.04 is the following: when I run poweroff in an SSH session, the machine does power off, but the SSH session hangs! There is no way to Ctrl + C or Ctrl + D out of it, or to do anything else. All I can do is wait a long time until ssh aborts with:
packet_write_wait: Connection to 192.168.56.11: Broken pipe

I'm not familiar with the internals of SSH connection handling, but this might be exactly the same issue as with reboot.

@fillest

fillest commented Sep 6, 2016

I've just run into broken reboot (fresh up-to-date Ubuntu 16.04 on AWS, Fabric==1.12.0) but in a different way. For me it just throws:

Fatal error: sudo() received nonzero return code -1 while executing!

Requested: reboot
Executed: sudo -S -p 'sudo password:'  /bin/bash -l -c "reboot"

Running sudo reboot in terminal by hand works (host reboots).

@fillest

fillest commented Sep 6, 2016

May be worth noting:

$ readlink /sbin/reboot 
/bin/systemctl
$ readlink /sbin/shutdown
/bin/systemctl

@fillest

fillest commented Sep 6, 2016

And another weird thing. I've changed the rebooting code to use aws-cli and after its call (which takes ~1sec, seems like it's asynchronous) I run sudo('add-apt-repository --yes ppa:nginx/stable'). It has always worked, but now after reboot it returned -1 too:

sudo: add-apt-repository --yes ppa:nginx/stable

Fatal error: sudo() received nonzero return code -1 while executing!

Requested: add-apt-repository --yes ppa:nginx/stable
Executed: sudo -S -p 'sudo password:'  /bin/bash -l -c "add-apt-repository --yes ppa:nginx/stable"

Then I tried to make Fabric reconnect by adding fabric.network.disconnect_all(). It resulted in a password prompt (why??):

[...] sudo: add-apt-repository --yes ppa:nginx/stable
[...] Login password for 'ubuntu': 

And it started to work only after I added e.g. time.sleep(60 * 3) after reboot. Which is obviously a poor band-aid, and now I'm puzzled how to properly handle the password problem. Looks like it's related to this issue.
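
One possible alternative to a fixed sleep, sketched with only the standard library: poll the SSH port until it stops accepting connections, then until it accepts them again. The function names, the default port 22 and the timeouts are assumptions for illustration. An open TCP port does not guarantee that sshd is ready to authenticate, so a short extra delay or a retry on the first command may still be needed.

```python
import socket
import time


def port_is_open(host, port, connect_timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, want_open, timeout=300.0, interval=1.0):
    """Poll until the port reaches the wanted state; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if port_is_open(host, port) == want_open:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def wait_for_reboot(host, port=22, down_timeout=120.0, up_timeout=600.0):
    """Wait for SSH to go away, then come back.

    Returns (went_down, came_back); went_down may be False if the reboot
    was faster than the polling interval.
    """
    went_down = wait_for_port(host, port, want_open=False, timeout=down_timeout)
    came_back = wait_for_port(host, port, want_open=True, timeout=up_timeout)
    return went_down, came_back
```

Waiting for the port to close first avoids the case where the check succeeds against the host before it has actually started shutting down.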

@ploxiln

ploxiln commented Oct 4, 2016

The problem seems to be that "reboot" is now sometimes "too fast": the connection goes down before the exit status of the command gets back over the SSH connection.

(Tip: If you're at a frozen ssh connection as a result: type \n~. aka enter, tilde, period. That's the default ssh escape character, then the disconnect command for ssh. If you just try ctrl-c or ctrl-d, ssh tries to pass that to the process running on the other side.)

One solution is to use shutdown -r +1, which will schedule the reboot for the next minute, and then wait a minute for it to start, and then start trying to re-connect. Admittedly, waiting a minute is not great.

A hacky thing to try: shutdown -r +0 should be equivalent to reboot, but in my limited tests of Ubuntu-16.04 running in VirtualBox, it tends to give a fraction of a second longer, showing the next shell prompt just before disconnecting a manual ssh session.

@ploxiln

ploxiln commented Oct 4, 2016

This is probably a duplicate of #1444.

@palbee

palbee commented Nov 1, 2016

If the init daemon is switched to upstart, reboot works as expected. It looks like systemd is killing sshd immediately.

@alexkiousis

There was a bug in the Debian/Ubuntu systemd package that, on shutdown, killed the network service before the SSH one, so everything hung.
It was fixed in the latest point release. I don't know the status of the Ubuntu package.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=751636

@hyperknot
Author

Reported the bug for Ubuntu:
https://bugs.launchpad.net/ubuntu/+source/openssh/+bug/1645002

@stefan-wegener

I also had issues with reboot() in some of my scripts. I found that when connecting with a password, the reboot worked correctly, but when using keyfile authentication, the connection hung (and the reboot was done).

@ploxiln

ploxiln commented Feb 4, 2017

The Ubuntu bug https://bugs.launchpad.net/ubuntu/+source/openssh/+bug/1645002 is marked as fixed in 16.10, but not yet in 16.04, and it is unclear when it will be.

The current behavior for me is that paramiko/fabric instantly detect that the SSH connection was closed, but this happens before paramiko/fabric sees the reboot command complete. At least it doesn't hang indefinitely as in the original report.

Fatal error: sudo() received nonzero return code -1 while executing!
...
Aborting.

Plain reboot() did that consistently for me in a handful of tests against AWS EC2 and a local virtualbox VM. (I always used keyfile auth.)

I've found a short and elegant workaround, as I suggested without as much detail above:

reboot(command="shutdown -r +0")

That worked as expected for me (in my handful of tests against AWS EC2 and local virtualbox VM, all running up-to-date ubuntu 16.04). Note that "shutdown -r now" behaved like "reboot" and did not seem to work.

I took a quick look at the FreeBSD and OpenBSD man pages, and it looks like they have a shutdown command that supports those parameters. I suspect that the command "shutdown -r +0" would work for pretty much any Unix system on which "reboot" worked. So it could be considered for changing the default command, or updating the documentation. (But I'd be interested to see a report of a test on a BSD system first.)

@ambsw-technology

ambsw-technology commented Jul 5, 2017

shutdown -r +0 isn't enough for us. Since reboot doesn't accept a manual timeout, I've even tried something like:

from time import sleep

from fabric.api import sudo
from fabric.exceptions import NetworkError

try:
    sudo("shutdown -r +0", timeout=300)
except NetworkError:
    pass
# in case the sudo times out during reboot
sleep(15)

Despite all of this hand waving, the next command hangs indefinitely. Is it possible that the connection pool is holding onto (and using) the dead connection? If so, is there a workaround? Can I temporarily reduce the connection-level timeout?

@ploxiln

ploxiln commented Jul 5, 2017

Indeed, you need to replace the existing connection, the way reboot() does:

https://github.com/fabric/fabric/blob/1.13.2/fabric/operations.py#L1289-L1294
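
To illustrate why a cached connection has to be replaced explicitly, here is a small sketch that uses a stand-in cache instead of Fabric's real one. ConnectionCache, replace_connection and the connect_fn callable are all hypothetical names for this example; they only mimic the general shape of a per-host connection cache with a connect() method.

```python
class ConnectionCache(dict):
    """Hypothetical stand-in for a per-host connection cache."""

    def __init__(self, connect_fn):
        super().__init__()
        self._connect_fn = connect_fn  # callable: host_string -> connection

    def connect(self, host_string):
        """Open a fresh connection and store it, replacing any cached one."""
        self[host_string] = self._connect_fn(host_string)
        return self[host_string]


def replace_connection(cache, host_string):
    """Close and drop the (possibly dead) cached connection, then reconnect."""
    stale = cache.pop(host_string, None)
    if stale is not None:
        try:
            stale.close()
        except Exception:
            pass  # the old transport may already be gone
    return cache.connect(host_string)
```

Without a step like this, the next command would be sent over the cached object that still points at the pre-reboot connection, which would be consistent with the indefinite hang described above.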

@ecnepsnai

Apologies for reviving an old issue. I can also confirm that this problem happens when attempting to reboot an LXC container. @ploxiln's suggestion of using command="shutdown -r +0" did work for us.

@tehfink

tehfink commented Feb 7, 2018

Confirming this error on a fresh install of FreeBSD 11.1 with bash installed:

reboot(wait=1) results in:

Fatal error: sudo() received nonzero return code -1 while executing!

Requested: reboot
Executed: sudo -S -p 'sudo password:'  /usr/local/bin/bash -l -c "reboot"

Aborting.
Traceback (most recent call last):
…
    raise env.abort_exception(msg)
hosts.FabricException: sudo() received nonzero return code -1 while executing!

@aggieNick02

aggieNick02 commented Feb 7, 2019

I ended up needing this to get things going after reading @ambsw-technology's and @ploxiln's comments. I'm running against an Ubuntu 16.04 LTS server (from a Windows client).

sudo('shutdown -r +0')
time.sleep(30)
fabric.state.connections.connect(env.host_string)

@aggieNick02

FYI, I still see this against 18.04.2 LTS servers.

@cgd1

cgd1 commented Apr 23, 2020

Is there any fix for this? I'm also getting this issue with 16.04.
