[Bug]: Delayed jobs don't move to waiting state after some days #2534

Open
1 task done
oanguenot opened this issue Apr 21, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@oanguenot

Version

v5.7.3

Platform

NodeJS

What happened?

Hello,
I have a Node.js application that schedules delayed jobs with delays ranging from 2 seconds to 1 hour.
When a job finishes, I remove it from the queue and add a new one (with the same id/name) and a new delay (depending on the result).

Everything works fine for a few days (1 to 3), and then, for no apparent reason, the worker stops running jobs: no more jobs are processed. My Node.js application still answers web requests, so the process is still alive.

I added logs to all event handlers and didn't notice any errors.

However, the "waiting" event from QueueEvents is not fired at the time a job needs to be launched.

What is strange is that if I manually add a new job to the queue some hours later (or at any time), the worker "wakes up" and runs all of these old delayed jobs.

  1. How can I debug this case?
    --> As said, I attached an event listener to every Queue, Worker and QueueEvents event, but I didn't see anything unusual.

  2. What could prevent a job from being moved to the 'waiting' state when it is time to handle it?

Thanks for your help
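
For readers who want to reproduce the logging setup described above, here is a minimal sketch. It is not the author's actual code: `attachEventLoggers` and its parameters are hypothetical names, and it only assumes that the object passed in exposes an `on(eventName, handler)` method, as Node.js EventEmitter instances do.

```javascript
// Sketch: log every listed event of an EventEmitter-like object.
// `attachEventLoggers` is a hypothetical helper, not part of any library.
function attachEventLoggers(emitter, label, eventNames, log = console.log) {
  for (const eventName of eventNames) {
    emitter.on(eventName, (...args) => {
      // Timestamp each entry so a missing "waiting" event after a
      // "delayed" one shows up as a gap in the log.
      log(`${new Date().toISOString()} [${label}] ${eventName}`, ...args);
    });
  }
  return emitter;
}
```

It could be called once per object, for example `attachEventLoggers(queueEvents, "QueueEvents", ["delayed", "waiting", "active", "completed", "failed", "stalled", "error"])`. The list of event names here is illustrative and should be checked against the library version in use.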

How to reproduce.

No response

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@oanguenot oanguenot added the bug Something isn't working label Apr 21, 2024
@roggervalf
Collaborator

Hi @oanguenot, we are tracking this in issue #2466.

@roggervalf
Collaborator

By the way, what would help us is to see which values are passed to the bzpopmin command.

@oanguenot
Author

Thanks @roggervalf for your quick answer!
On the one hand, I'm happy to see that the problem does not seem to be in my code, because I spent days tracking it down without success; on the other hand, it is still a problem in front of us :-)
I added a comment to #2466 and will be happy to help one way or another.

I don't know what bzpopmin is. How or where can I find the values?
Thanks

@roggervalf
Collaborator

Hi @oanguenot, to see your commands you may need to connect to your Redis instance with redis-cli and then use the monitor command.

@oanguenot
Author

oanguenot commented Apr 21, 2024

Is this what you need?

image

Should I leave the monitor open until it blocks, and then check whether I got a timeout of zero?

@roggervalf
Collaborator

Yeah, we would like to know which value is blocking that command, as we're doing some fixes to prevent passing 0.

@roggervalf
Collaborator

Also, the value that is blocking that command could be something other than 0; that's what we want to know.
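
As an illustration of what to look for in the monitor output, here is a hedged sketch that pulls the timeout out of a BZPOPMIN line. It assumes the usual `redis-cli monitor` layout, where each argument is printed as a double-quoted string, and relies on the Redis command syntax `BZPOPMIN key [key ...] timeout`, in which a timeout of 0 blocks indefinitely. `parseBzpopminTimeout` is a hypothetical helper, not part of BullMQ or Redis tooling.

```javascript
// Sketch: extract the timeout argument of BZPOPMIN from one line of
// `redis-cli monitor` output. Hypothetical helper for illustration only.
function parseBzpopminTimeout(monitorLine) {
  // MONITOR prints the command name and each argument in double quotes.
  const args = [...monitorLine.matchAll(/"((?:[^"\\]|\\.)*)"/g)].map((m) => m[1]);
  if (args.length < 3 || args[0].toLowerCase() !== "bzpopmin") {
    return null; // not a BZPOPMIN call
  }
  // BZPOPMIN key [key ...] timeout: the timeout is the last argument.
  const timeout = Number(args[args.length - 1]);
  return {
    keys: args.slice(1, -1),
    timeout,
    // In Redis, a timeout of 0 means "block indefinitely".
    blocksForever: timeout === 0,
  };
}
```

Feeding each captured monitor line through this function and logging the result would show both the 0 case and any other unexpected value.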

@roggervalf
Collaborator

Hey @oanguenot, by the way, what are your queue settings, and which values are you using when adding delayed jobs?

@oanguenot
Author

Hi @roggervalf,

Here are my settings:

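import { Queue } from "bullmq";

// Queue named "services"; connection values come from the app's CONFIG() helper.
const queue = new Queue("services", {
  connection: {
    host: CONFIG().redisDbUrl,
    port: CONFIG().redisDbPort,
  },
});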

I use the following when adding new jobs:

const job = await queue.add(
  `${service}-${instance.id}`, // job name
  {
    userId: instance.userId,
    instanceId: instance.id,
    serviceId: service,
    immediate: false,
    retriedCounter,
  },
  {
    // Same value as the job name, so the job is re-added under the same id.
    jobId: `${service}-${instance.id}`,
    removeOnComplete: true,
    removeOnFail: true,
    delay: delay + randomDelay, // milliseconds
  }
);

I think there is nothing really special.

On my side, after around 60 hours, all jobs have been processed on time (with Redis monitoring active).

@roggervalf
Collaborator

Thank you @oanguenot, please let us know if it happens again. One last question: how frequently did it happen before?

@oanguenot
Author

oanguenot commented Apr 24, 2024

It happened every 2 or 3 days, but I can't remember when it started. It seems to have worked very well a few versions ago, or I didn't notice because of other manual restarts I did myself.

@oanguenot
Author

Everything has been running smoothly for the past 6 days. No problem so far.

@roggervalf
Collaborator

roggervalf commented Apr 27, 2024

Thank you @oanguenot. We also released a new performance change regarding this topic. You can try version 5.7.6. Please let us know how it goes.

@manast
Contributor

manast commented Apr 30, 2024

I would even recommend upgrading to 5.7.7, as it will mitigate a potential issue we have discovered with IORedis in the case of network partitions.
