Init scripts frequently fail to start their daemons #395
Comments
Jeff Forcier (bitprophet) posted: Instrumented the init script I'm testing and everything seems to run the same either way (i.e. real success or fake success scenarios), implying the problem is within the Starting to think about what the cause could be on our end:
on 2011-07-23 at 07:45pm EDT |
Jeff Forcier (bitprophet) posted:
Examining the output of So far this isn't going anywhere useful. Time to test the above ideas (pty, ssh) to see what changes there. on 2011-07-23 at 08:46pm EDT |
Jeff Forcier (bitprophet) posted: With Running So this isn't Fabric's fault; it's something deeper where these init scripts misbehave when an SSH style pseudo-tty is in play. Going to dig a bit deeper for curiosity's sake, but it looks like the "solution" here is a new FAQ stating to use on 2011-07-23 at 08:59pm EDT |
Jeff Forcier (bitprophet) posted: Yea, not finding anything that explains this behavior, unfortunately. Given the findings above I think an FAQ is definitely the way to go. on 2011-07-23 at 10:35pm EDT |
Hugo Garza (hiro2k) posted: Ugh, I just ran into this yesterday. I wish I had seen this bug; luckily I tried setting pty=False and it worked as well. Thanks for the explanation, at least it's not Fabric's fault. Now you really have me wondering why this fails. on 2011-08-02 at 01:27pm EDT |
Are you sure this isn't just a bash script issue too? I mean with my mailing list thread. They were just bash scripts that started java and weblogic. |
FWIW, I'm getting this horrible behavior on pretty much every Ubuntu machine I spin up on EC2. It's also reproducible with tasks launched via a detached screen I should mention that usually |
@yuvadm -- in those cases where pty=False does not solve the problem, can the problem still be recreated by using a regular ssh command (as mentioned above)? As far as I've seen it's an SSH problem and not a Fabric one, but it would be good to know if there are any situations where it does not match up. |
That's an interesting angle to check, I'll get back to you on that one... |
I have reproduced this problem. Client is Ubuntu 10.04.3 LTS, server is "Ubuntu 8.04.4 LTS (server)". The issue is there 100% with pty = True, and it disappears with pty = False. Connecting to other servers, the issue is not always there when pty = True. In my case, for testing, I am running a very simple command: "nohup sleep 100 > /tmp/xxx 2>&1 </dev/null &" |
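The test command above can be wrapped in a small helper so the same redirections are applied every time. This is only a sketch: `start_in_background` and its `log_path` parameter are hypothetical names, and `run` is assumed to be any callable shaped like Fabric 1.x's `run(command, pty=...)`, passed in rather than imported.

```python
def start_in_background(run, command, log_path="/tmp/xxx"):
    """Launch `command` detached from the terminal via a Fabric-style `run`.

    `run` is assumed to accept run(command_string, pty=...), as
    fabric.api.run does in Fabric 1.x. It is injected by the caller so this
    sketch does not depend on Fabric being installed.
    """
    # Redirect stdin, stdout and stderr so nothing stays attached to the
    # SSH channel, then background the process, mirroring the test command
    # quoted in the comment above.
    wrapped = "nohup %s > %s 2>&1 </dev/null &" % (command, log_path)
    # pty=False is the setting reported in this thread to avoid the failure.
    return run(wrapped, pty=False)
```

Inside a Fabric 1.x task this might be called as `start_in_background(run, "sleep 100")` after `from fabric.api import run`.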
I've been bitten by this, only on EC2 as it seems (I haven't seen it on my Linode, but I'm not 100% sure). Setting pty=False seems to fix it. |
Just ran into this problem. I solved it by adding a sleep after the command execution line: |
Thanks spodgruskiy, your tip works for me.
But none of them works, nimbus didn't start at all. I don't understand what happened. |
+1 for the sleep trick needed to work on systems with requiretty sudo('start service; sleep .5') and all is well! |
Where you are using 'sudo()' and the remote system has requiretty enabled for sudo access, you can use 'set -m; service start' to prevent the SIGHUP from being sent to the process started by the init script. See http://stackoverflow.com/a/14866774 for a more detailed explanation of bash interactive versus non-interactive modes and how that affects job control. |
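The two workarounds reported above (a trailing sleep, or enabling job control with 'set -m') both amount to rewriting the command string before it is handed to `sudo()`. A minimal sketch, assuming the hypothetical helper names `with_job_control` and `with_settle_delay`; neither is part of Fabric.

```python
def with_job_control(command):
    """Prefix `command` so bash enables job control before running it."""
    # 'set -m' turns on job control (monitor mode) in the remote shell,
    # which is the workaround suggested for hosts that require a tty.
    return "set -m; " + command


def with_settle_delay(command, seconds=0.5):
    """Append a short sleep so the shell lingers after `command` returns."""
    # The delay is a heuristic: it only gives the daemon time to detach,
    # it does not guarantee that it has done so.
    return "%s; sleep %s" % (command, seconds)
```

With Fabric 1.x the result would be passed on unchanged, for example `sudo(with_job_control("service start"))`.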
I'm curious, what's the ssh issue here? pty=false works for me |
It's not really an SSH problem; it's more the subtle behaviour around bash non-interactive/interactive modes and signal propagation to process groups. The following is based on http://stackoverflow.com/questions/14679178/why-does-ssh-wait-for-my-subshells-without-t-and-kill-them-with-t/14866774#14866774 and http://www.itp.uzh.ch/~dpotter/howto/daemonize, with some assumptions not fully validated, but tests of how this works seem to confirm it.

pty/tty = false
The bash shell launched connects to the stdout/stderr/stdin of the started process and is kept running until there is nothing attached to the sockets and its children have exited. A good daemon process will ensure it doesn't wait for its children to exit: it forks a child process and then exits. In this mode no SIGHUP will be sent to the child process by SSH. I believe this will work correctly for most scripts executing a process that handles daemonizing itself and doesn't need to be backgrounded. Where init scripts use '&' to background a process, the main problem is likely to be whether the backgrounded process ever attempts to read from stdin, since that will trigger a SIGHUP if the session has been terminated.

pty/tty = true
If the init script backgrounds the process started, the parent bash shell will return an exit code to the SSH connection, which will in turn look to exit immediately since it isn't waiting on a child process to terminate and isn't blocked on stdout/stderr/stdin. This will cause a SIGHUP to be sent to the parent bash shell's process group, which, since job control is disabled in non-interactive mode in bash, will include the child processes just launched. Where a daemon process explicitly starts a new process session when forking or in the forked process, neither it nor its children will receive the SIGHUP from the bash parent process exiting. Note this is different from suspended jobs, which will see a SIGTERM.
I suspect the reason this only works sometimes is a slight race condition. If you look at the standard approach to daemonizing (http://www.itp.uzh.ch/~dpotter/howto/daemonize), you'll see that in the code the new session is created by the forked process, which may not have run before the parent exits, resulting in the random success/failure behaviour mentioned above. A sleep statement allows enough time for the forked process to create a new session, which is why it works in some cases.

pty/tty = true and job control is explicitly enabled in bash
SSH won't connect to the stdout/stderr/stdin of the bash shell or any launched child processes, which means it will exit as soon as the parent bash shell finishes executing the requested commands. In this case, with job control explicitly enabled, any processes launched by the bash shell with '&' to background them will be placed into a separate session immediately and will not receive the SIGHUP signal when the parent process of the bash session exits (the SSH connection in this case).

What's needed to fix
I think the solutions just need to be explicitly mentioned in the run/sudo operations documentation as a special case when working with background processes/services. Basically either use 'pty=false' or, where that is not possible, explicitly enable job control as the first command, and the behaviour will be correct. |
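The session-creation step discussed in the comment above can be observed locally. The linked howto is written in C; the sketch below swaps in Python's `os.fork` and `os.setsid` purely for illustration, and `spawn_in_new_session` is a hypothetical name. It shows only that a forked child ends up in a different session once `setsid()` has returned; it is not a full daemonizer and does not reproduce the SSH behaviour itself.

```python
import os


def spawn_in_new_session():
    """Fork a child that calls setsid() and reports its new session id.

    Returns a (parent_sid, child_sid) tuple. POSIX only.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child. Until setsid() returns, this process still belongs to the
        # parent's session and process group, which is the window in which
        # a SIGHUP aimed at that group would still reach it.
        os.close(read_fd)
        try:
            os.setsid()
            os.write(write_fd, str(os.getsid(0)).encode())
        finally:
            # Exit without running any of the parent's cleanup handlers.
            os._exit(0)
    # Parent: read the session id the child reported, then reap the child.
    os.close(write_fd)
    data = b""
    while True:
        chunk = os.read(read_fd, 64)
        if not chunk:
            break
        data += chunk
    os.close(read_fd)
    os.waitpid(pid, 0)
    return os.getsid(0), int(data.decode())
```

Calling `spawn_in_new_session()` returns two different session ids, because the child became the leader of a session of its own.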
link update: http://www.ics.uzh.ch/~dpotter/howto/daemonize |
As I mentioned in fabrickit (a wrapper around the Fabric libs), https://github.com/HyukjinKwon/fabrickit/commit/cceb8bfb8f960a3ac41b24c64b8358bd6e7a0366, you can easily start a program as a daemon without any specific configuration or settings. Try this:
I tried this and it works perfectly fine even though it does not implement any additional programming to run as a daemon (even a program that just writes 'Hello' in a while loop works fine). |
Summary: It doesn't work to run init.d scripts with pty, both via Fabric and via native ssh. However, CentOS and some other OSes have requiretty in their /etc/sudoers file, meaning that you get the error "sudo: sorry, you must have a tty to run sudo". The only way to fix this is to remove the requiretty default in a user's /etc/sudoers file, but we don't want to force them to do that. The work-around is to run the init script in job control mode (e.g. set -m), because it avoids the race condition that is the cause of the daemons not starting when executing commands with TTY. See fabric/fabric#395 (comment) for an in-depth treatment of the issue. We also add || true when running tar, because tar can sometimes have a non-zero exit code even when the files correctly un-tarred. Task: SWARM-363 Review Url: @@review_url@@ Test Plan: make clean lint test-all; manual Reviewers: anu, rschlussel Reviewed By: rschlussel Subscribers: an186016, mf186042 Differential Revision: https://phabricator.td.teradata.com/D395
There were cases where Telegraf failed to start when deployed over SSH even though the installation script completed with exit code 0. Running the script by hand in a terminal, after having SSH'ed to a random VM, returns no errors and Telegraf starts successfully. It turns out this weird behavior has to do with jobs and sessions when daemonizing processes. When we run a command over SSH with pty=True and the session leader (shell) exits, a SIGHUP will be sent to the parent process' process group, which includes the daemonized Telegraf bin. On the other hand, when pty=False, the new shell connects to stdin/stdout/stderr, which are then inherited by its child processes, thus allowing them to run in the background after the parent process has exited. For more details, see: fabric/fabric#395
Description
I've gotten multiple reports of this on IRC, as well as a comment on #350, and now a mailing list thread.
No clear cause yet, and while it's been reported multiple times I don't expect that it is a constant problem or we'd be hearing much more about it. In some very limited testing on my end so far, I can recreate the problem maybe 30-50% of the time -- but it is reproducible.
The symptom is simply that init-style scripts, which are responsible for starting daemons and then returning immediately, will return OK (return code of 0 and a "success" status message printed to stdout) but will not actually spin up the daemon in question.
My personal test was done via latest master targeting an Ubuntu 10.04 (Lucid) VM and the stock Apache2 package's init script.
Originally submitted by Jeff Forcier (bitprophet) on 2011-07-23 at 07:25pm EDT