Restarting celery issues and better supervisord config file #102

Closed
kmike opened this issue May 9, 2010 · 16 comments · Fixed by #6942

Comments

@kmike
Contributor

kmike commented May 9, 2010

I use a supervisord config file based on the example in the celery repository and have some issues when celeryd is restarted: sometimes task processing silently stops after restarting celery, without any error messages in the logs. The processes remain visible in the process list.

Finally I figured out that sometimes, when the processes are restarted, celery spawns an additional process that is not managed by supervisord, and this leads to these bugs. So I started to watch the ps output carefully after each restart and kill the extra processes manually via kill. After killing these processes, tasks begin to be executed properly. This is a sort of hack that has solved the problem for a week or so.

And today I think I found the real reason. The default supervisord value for the 'stopwaitsecs' option is 10s. This means that after 10s the celery process will be killed with the KILL signal instead of TERM. It seems that celery doesn't like being killed and tries to spawn an additional process in that case.

So I think it would be good to add something like 'stopwaitsecs=600' to all supervisord example config files (from the FAQ: "You should never stop celeryd with the KILL signal (-9), unless you’ve tried TERM a few times and waited a few minutes to let it get a chance to shut down.") and to investigate celeryd's behaviour on the KILL signal: it is mentioned in the docs that tasks will be lost (which is tolerable in many cases), but the issue with the spawned process is a bit weird.
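For illustration, a minimal sketch of what that change might look like in a supervisord program section (the program name, command and paths here are placeholders, not copied from the actual example config):

 [program:celeryd]
 ; command and paths below are illustrative placeholders
 command=/path/to/python /path/to/manage.py celeryd --loglevel=INFO
 ; seconds to wait after sending TERM before supervisord escalates
 ; to KILL (the default is only 10)
 stopwaitsecs=600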

@ask
Contributor

ask commented May 9, 2010

Processes being spawned on receiving the KILL signal is weird indeed. I don't see that behavior when running outside of supervisord, so maybe this is something caused by it?

If you install the setproctitle module, celery should report the kind of process in ps listings; could you do that to investigate what kind of process is created?

(easy_install setproctitle)

Setting the timeout to 600 is probably good. Is there any setting for infinity (maybe with a warning if it takes too long)? When celeryd is killed via TERM (which is the preferred shutdown signal) it stops receiving messages and waits for the currently executing tasks to finish. And I guess for most applications, termination mid-execution is not acceptable.
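As a rough shell sketch of that shutdown sequence (assuming $CELERYD_PID holds the pid of the celeryd main process; the 600s wait mirrors the proposed stopwaitsecs):

 # warm shutdown: celeryd stops consuming messages and finishes the
 # tasks it is currently executing
 kill -TERM "$CELERYD_PID"

 # only as a last resort, after a generous grace period, escalate to
 # KILL (any still-running tasks will be lost)
 sleep 600
 kill -0 "$CELERYD_PID" 2>/dev/null && kill -KILL "$CELERYD_PID"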

@kmike
Contributor Author

kmike commented May 9, 2010

As for process spawning: setproctitle and watching process ids was helpful. It is not process spawning. Worker processes remain alive when the parent process is killed.
This is a simulation of a supervisord restart with manual killing and a zero timeout:

 4976 ?        Ss     0:00 /usr/bin/python /usr/bin/supervisord --pidfile /var/run/supervisord.pid
 5422 ?        S      0:01  \_ [celerybeat] --schedule=/var/lib/celery/celerybeat-schedule-nadovmeste --loglevel=INFO                                                             
 6101 ?        Sl     0:00  \_ [celeryd.MainProcess] Running... (--loglevel=INFO)                                                           
 6108 ?        S      0:00      \_ [celeryd.PoolWorker-1]                                                                                       
 nadovmeste:~# kill 6101 & kill -9 6101 &

ps -afx:

 4976 ?        Ss     0:00 /usr/bin/python /usr/bin/supervisord --pidfile /var/run/supervisord.pid
 5422 ?        S      0:01  \_ [celerybeat] --schedule=/var/lib/celery/celerybeat-schedule-nadovmeste --loglevel=INFO                                                             
 6867 ?        Sl     0:00  \_ [celeryd.MainProcess] Running... (--loglevel=INFO)                                                           
 6875 ?        S      0:00      \_ [celeryd.PoolWorker-1]                                                                                       
 6108 ?        S      0:00 [celeryd.PoolWorker-1]       

I was able to reproduce this only with such an artificial race between kill and kill -9. Sometimes the worker gets killed properly. The issue seems to be supervisord-specific, because when I start celeryd from the console I have no luck reproducing it.

@kmike
Contributor Author

kmike commented May 9, 2010

I was able to reproduce this with console-started scripts after several attempts:

/home/nadovmeste/envs/nadovmeste/bin/python /home/nadovmeste/src/nadovmeste/manage.py celeryd -B --loglevel=INFO&

and then in another terminal session:

nadovmeste:~# ps -afx

 6450 ?        Ss     0:00  \_ sshd: root@pts/2 
 6452 pts/2    Ss+    0:00      \_ -bash
 9343 pts/2    Sl     0:00          \_ [celeryd.MainProcess] Running... (-B --loglevel=INFO)                                                           
 9350 pts/2    S      0:00              \_ [celeryd.PoolWorker-2]                                                                                          
 9355 pts/2    S      0:00              \_ [celerybeat]     

nadovmeste:~# kill 9343 & kill -9 9343

nadovmeste:~# ps -afx

 4526 ?        Ss     0:00  \_ sshd: root@pts/1 
 4529 pts/1    Ss     0:00  |   \_ -bash
 9366 pts/1    R+     0:00  |       \_ ps -afx
 6450 ?        Ss     0:00  \_ sshd: root@pts/2 
 6452 pts/2    Ss+    0:00      \_ -bash    
 ...
 9350 pts/2    S      0:00 [celeryd.PoolWorker-2]                                                                                          
 9355 pts/2    S      0:00 [celerybeat]

@kmike
Contributor Author

kmike commented May 9, 2010

I haven't found any special option for an infinite timeout with a warning in the supervisord docs. A very large number will probably suffice, if that is what we want.

@kmike
Contributor Author

kmike commented May 9, 2010

Maybe it's something related to celerybeat, because I was able to reproduce the issue for console-started celeryd only after using the -B option.

@digi604

digi604 commented May 13, 2010

If I am testing some celery tasks locally and I use the -B option, sometimes the process is not killed when I use ctrl-c.

@ask
Contributor

ask commented May 14, 2010

I can't reproduce this locally. Btw, are you running the master branch? I just fixed a bug that could hang shutdown. If you could test with this it would be nice.

@kmike
Contributor Author

kmike commented May 14, 2010

Yes, I'm running the latest master branch. I saw your bug-fixing commit and hoped that it would help, but it seems that it doesn't help in my case: the latest celery seems to behave the same. But it is possible that the initial problem is solved - I checked this only with an immediate kill. Can't wrap my head around it right now :) The ctrl-c issue is not reproducible with my setup.

So, the bug report, simplified: http://gist.github.com/401028 . The results are always the same (not just sometimes). I have some periodic tasks and some non-periodic ones. The tasks are simple and don't take much time to finish. Is it a bug that child processes stay alive after the main process is killed? If so, and you can't reproduce it, I'll try to provide a minimal project.

The celerybeat killing behaviour is interesting: when I kill the hanging(?) celerybeat process, the hanging(?) worker process also shuts down.

@ask
Contributor

ask commented May 15, 2010

@kmike I still can't reproduce with the commands above. Maybe because I'm on OS X, or maybe you're running Python 2.5? (I'm running 2.6.1)

Could you run it with --loglevel=DEBUG? It could provide some info about where it stops.

The celerybeat process is started by the main process, so I'm assuming the main process is waiting for celerybeat to exit before it kills the remaining pool processes.

@kmike
Contributor Author

kmike commented May 15, 2010

I thought that the main process was killed: it is not visible in the process list. I don't have much experience with process management, though.

My setup was Debian Lenny + python 2.5.

I'll try to run celeryd with --loglevel=DEBUG and to reproduce it on my macbook.

@ask
Contributor

ask commented May 15, 2010

hmm, you're right of course. It's almost like the beat process takes ownership of the pool processes.

I just tried to reproduce on Debian Lenny with python 2.5, and it works just fine there.
I tried killing with both TERM and INT.

@kmike
Contributor Author

kmike commented May 15, 2010

Ask, thank you for the help.

I think the initial problem was solved by the increased supervisord timeout and your bug-fixing commit. My simulation was incorrect, because I used kill -9 commands, and they send the KILL signal instead of TERM. With the TERM signal, the processes are killed properly.

@kmike
Contributor Author

kmike commented May 15, 2010

Supervisord uses the TERM signal, so all should be fine.

@kmike
Contributor Author

kmike commented May 15, 2010

But the thing that scares me a bit is that the initial bug wasn't investigated. I'll try to reproduce it and let you know.

@ask
Contributor

ask commented May 16, 2010

Ah! I'm so sorry. I didn't read the issue carefully enough. Yes! That's exactly what happens when you kill it with SIGKILL. The 9 signal can't be caught, so there's nothing we can do about this AFAIK.

@rafaelpivato

In case you are still having issues terminating your Celery workers, you might try setting stopasgroup=true before increasing your stopwaitsecs.
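For reference, a sketch of how that might be combined with the timeout suggestion above in the supervisord config (stopasgroup and stopwaitsecs are real supervisord options; the values are only suggestions):

 [program:celeryd]
 ; send the stop signal (TERM) to the whole process group, so pool
 ; worker processes are not left behind if the main process dies first
 stopasgroup=true
 ; still allow plenty of time for a warm shutdown before KILL
 stopwaitsecs=600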
