Task randomly gets stuck (between parent and child) #389

Open
meensu23 opened this issue Jul 10, 2023 · 0 comments

meensu23 commented Jul 10, 2023

We use celery 5.2.3 and billiard 3.6.1.0. Very rarely, the main worker receives a task (we can see the task_received signal being fired), and after that there is no trace of the task in the logs.
The main worker thinks it has sent the task on the pipe, whereas the child is still blocked on read. Because the main worker never receives the ack, the soft and hard time-limit timers don't fire either, so the task is stuck forever. When we kill the child process, the task is sent to another worker. Meanwhile, the main worker keeps directing new tasks to other children successfully.

This happens unpredictably in production, after weeks of running normally.

Stacktrace of the child when it was sent SIGUSR1:

Pool process <celery.concurrency.asynpool.Worker object at 0x7f1d1c1cf850> error: SoftTimeLimitExceeded()
Traceback (most recent call last):
  File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 292, in __call__
    sys.exit(self.workloop(pid=pid))
  File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 351, in workloop
    req = wait_for_job()
  File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 473, in receive
    ready, req = _receive(1.0)
  File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 445, in _recv
    return True, loads(get_payload())
  File ".../python/lib/python3.9/site-packages/billiard/queues.py", line 355, in get_payload
    return self._reader.recv_bytes()
  File ".../python/lib/python3.9/site-packages/billiard/connection.py", line 243, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File ".../python/lib/python3.9/site-packages/billiard/connection.py", line 460, in _recv_bytes
    return self._recv(size)
  File ".../python/lib/python3.9/site-packages/billiard/connection.py", line 422, in _recv
    chunk = read(handle, remaining)
  File ".../python/lib/python3.9/site-packages/billiard/pool.py", line 229, in soft_timeout_sighandler
    raise SoftTimeLimitExceeded()
billiard.exceptions.SoftTimeLimitExceeded: SoftTimeLimitExceeded()
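
For anyone trying to capture a similar trace without raising an exception inside the child, here is a minimal sketch using the standard library's faulthandler module. The function name install_stack_dump, the output path, and the choice of SIGUSR2 are all assumptions for illustration; whether that signal is free in a given Celery or billiard setup would need to be checked first.

```python
import faulthandler
import os
import signal


def install_stack_dump(path, signum=signal.SIGUSR2):
    """Dump the stacks of all threads to `path` whenever `signum` arrives.

    Hypothetical helper: the name, the path, and the default signal are
    illustrative assumptions, not part of celery or billiard.
    """
    # faulthandler writes straight to the file descriptor, so the file
    # object must stay open for as long as the handler is registered.
    out = open(path, "w")
    faulthandler.register(signum, file=out, all_threads=True)
    return out


if __name__ == "__main__":
    dump_file = install_stack_dump("/tmp/child-stacks.txt")
    # Simulate an operator running `kill -USR2 <pid>` against this process.
    os.kill(os.getpid(), signal.SIGUSR2)
    faulthandler.unregister(signal.SIGUSR2)
    dump_file.close()
```

Unlike the soft-timeout handler in the trace above, this only records where each thread is and lets the process continue.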

When we killed this child, the task was resubmitted to another child. That makes me think the parent believed it had sent the task on the pipe and was waiting for the ack, while the receiver was stuck on read forever. Could this be caused by a race in the receiver or the sender?
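
To illustrate the kind of state described above, here is a small sketch using only the standard library's multiprocessing.connection, on the assumption that billiard's connection.py frames messages in a similar length-prefixed way (the recv_bytes and _recv frames in the trace suggest so, but this is not confirmed). It shows that a reader which sees a length header but fewer payload bytes than announced stays blocked in read until the writer end is closed. It does not claim that this is what happened in production; demo_partial_frame is a hypothetical name.

```python
import os
import struct
import threading
from multiprocessing.connection import Connection


def demo_partial_frame(timeout=0.5):
    """Show a reader blocking on an incomplete length-prefixed message.

    Returns (stuck, result): `stuck` is True if the reader was still
    blocked after `timeout` seconds; `result` holds the payload or error
    observed once the writer end was closed.
    """
    r_fd, w_fd = os.pipe()
    reader = Connection(r_fd, writable=False)

    # The header announces 100 payload bytes, but only 10 are written.
    os.write(w_fd, struct.pack("!i", 100) + b"x" * 10)

    result = {}

    def blocking_recv():
        try:
            result["payload"] = reader.recv_bytes()
        except (EOFError, OSError) as exc:
            result["error"] = exc

    t = threading.Thread(target=blocking_recv, daemon=True)
    t.start()
    t.join(timeout)
    stuck = t.is_alive()  # still waiting for the missing 90 bytes

    # Closing the writer is the analogue of the peer going away:
    # the blocked read returns end-of-file and recv_bytes raises.
    os.close(w_fd)
    t.join(timeout)
    reader.close()
    return stuck, result


if __name__ == "__main__":
    stuck, result = demo_partial_frame()
    print("reader blocked:", stuck)
    print("after writer closed:", result)
```

If something like this is in play, the parent would consider the write finished while the child waits indefinitely, which would match the symptoms. It is only one possible explanation.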
