Tens of thousands of backend_cleanup tasks executed every night since 10:42 pm. All async functions stop working #4335
Comments
I just found that the problem is probably caused by the built-in task, celery.backend_cleanup. I ran the app in the local development environment. At 10:42 pm, a tremendous number of backend_cleanup tasks started and executed successfully. However, after a couple of seconds, an error occurred. Here is the error.
Meanwhile, the queue disappeared from AWS SQS. A few of the backend_cleanup tasks continued to execute, and then a warning showed up: [2017-10-23 22:51:28,967: WARNING/MainProcess] Restoring 10 unacknowledged message(s). Finally, the HTTPS connection to the SQS queue stopped, and the app could not re-connect to SQS unless I re-deployed it to AWS Elastic Beanstalk. Could anyone help me with this issue? Any response will be greatly appreciated.
I think I solved the issue. The problem never happens after I changed CELERY_RESULT_BACKEND from "django-db" to "redis" and changed CELERY_TIMEZONE to "UTC". Although Task Results in the Django Admin stopped recording any new results, my problem was largely solved.
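For reference, a minimal sketch of the settings change described above (a Django settings.py fragment, assuming Celery's CELERY_-prefixed setting names; the Redis URL is a placeholder assumption, not from the original thread):

```python
# Hypothetical settings.py fragment mirroring the fix described above.
# The Redis URL is a placeholder; adjust host/port/db for your deployment.
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"  # was "django-db"
CELERY_TIMEZONE = "UTC"
```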
You can simply override the schedule and avoid using crontab:
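A minimal sketch of that override (setting name assumed for a Django-style CELERY_-prefixed configuration): when a result expiry is configured and beat is running, Celery registers a built-in celery.backend_cleanup entry on a crontab schedule; defining an entry under the same name in the beat schedule replaces it with a plain interval.

```python
from datetime import timedelta

# Hypothetical override: re-registering the built-in cleanup entry under
# its own name replaces the default crontab-based schedule with a plain
# 24-hour interval, so no crontab is involved.
CELERY_BEAT_SCHEDULE = {
    "celery.backend_cleanup": {
        "task": "celery.backend_cleanup",
        "schedule": timedelta(hours=24),  # interval instead of crontab
    },
}
```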
Checklist

- [ ] I have included the output of celery -A proj report in the issue. (If you are not able to do this, then at least specify the Celery version affected.)
- [ ] I have verified that the issue exists against the master branch of Celery.

My app is developed with Django 1.11 and celery4.1[sqs]. It is deployed on AWS through Elastic Beanstalk, with AWS SQS as the broker.
Steps to reproduce
The issue shows up every night around 10:42 pm and lasts for about 1 hour. Tens of thousands of backend_cleanup tasks start to execute. Per the list of task results in the Admin, each task executes successfully. The CPU usage sometimes stays at 100%, and every page of the app becomes hard to open. 1 hour after the issue, the CPU usage is back to normal. However, the queues are wedged: no delay() tasks can be executed. I have to purge the queue in SQS, restart the app, and re-upload the entire code to AWS Elastic Beanstalk before delay() tasks execute normally again. If I don't do the purging, restarting, or re-uploading, no async tasks are executed at all; even the built-in celery.backend_cleanup task cannot start. My app doesn't have any periodic or crontab tasks except the built-in celery.backend_cleanup task. Could anyone help me with this issue?
Expected behavior
I need the entire app to perform normally after the backend_cleanup tasks finish.
Actual behavior
Described in the Steps to reproduce section above.
celery-sqs-worker-report.txt