Some of the in progress jobs cannot be restored after restarting #146

Open
steven-zou opened this issue Dec 28, 2019 · 1 comment

Comments

@steven-zou

A restart happened while a large number of jobs were running. After the restart, some of the jobs sitting in the in-progress queues were not restored. The in-progress queue key depends on the worker pool ID: return fmt.Sprintf("%s:%s:inprogress", redisKeyJobs(namespace, jobName), poolID)
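To make the dependency on the pool ID concrete, here is a minimal sketch of how the key changes across a restart. Only the format string comes from the snippet quoted above; `jobsKey` is a hypothetical stand-in for `redisKeyJobs`, its `<namespace>:jobs:<jobName>` layout is an assumption, and the pool IDs are made-up placeholders for the generated UUIDs.

```go
package main

import "fmt"

// jobsKey is a hypothetical stand-in for redisKeyJobs. The
// "<namespace>:jobs:<jobName>" layout is assumed for illustration only.
func jobsKey(namespace, jobName string) string {
	return namespace + ":jobs:" + jobName
}

// inProgressKey uses the format string quoted in the issue:
// the pool ID is embedded in the key itself.
func inProgressKey(namespace, jobName, poolID string) string {
	return fmt.Sprintf("%s:%s:inprogress", jobsKey(namespace, jobName), poolID)
}

func main() {
	// Placeholder pool IDs standing in for the IDs before and after a restart.
	before := inProgressKey("myapp", "send_email", "pool-before-restart")
	after := inProgressKey("myapp", "send_email", "pool-after-restart")

	fmt.Println(before)
	fmt.Println(after)
	fmt.Println("same key:", before == after)
}
```

Because the two keys differ, a pool created after the restart never reads the list written by the pool that existed before it.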

A restart recreates the worker pools and generates new workers with new UUIDs. It seems that the dead pool reaper only checks the workers of the current worker pool, while the previous pool is discarded. However, some of the in-progress queues still belong to the workers of that previous pool. As a result, some of the in-progress jobs cannot be requeued.

The reap flow appears to be as follows:

First, find the dead pools:

deadPoolIDs, err := r.findDeadPools()

While finding the dead pools:

workerPoolsKey := redisKeyWorkerPools(r.namespace)

workerPoolIDs, err := redis.Strings(conn.Do("SMEMBERS", workerPoolsKey))
if err != nil {
	return nil, err
}

Most of the time, the previous pool is already gone from this set after the restart, but some of its in-progress queues still contain jobs. Those jobs become permanently unavailable.
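One way to check whether this is what happened is to compare the pool IDs embedded in the in-progress keys against the pool IDs returned by the SMEMBERS call. The following is only a diagnostic sketch, not how the library's reaper works. The function names are hypothetical, the example keys assume a `<namespace>:jobs:<jobName>:<poolID>:inprogress` layout, and the parsing assumes pool IDs contain no colon. How the key list is collected from Redis (for example with SCAN and a MATCH pattern) is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// poolIDFromInProgressKey extracts the pool ID from a key shaped like
// "<jobs key>:<poolID>:inprogress". It reports false for any other key.
// Assumption: the pool ID itself contains no ':'.
func poolIDFromInProgressKey(key string) (string, bool) {
	const suffix = ":inprogress"
	if !strings.HasSuffix(key, suffix) {
		return "", false
	}
	rest := strings.TrimSuffix(key, suffix)
	i := strings.LastIndex(rest, ":")
	if i < 0 || i == len(rest)-1 {
		return "", false
	}
	return rest[i+1:], true
}

// orphanedInProgressKeys returns the in-progress keys whose pool ID is not
// among registeredPools (the IDs found in the worker pools set).
func orphanedInProgressKeys(keys, registeredPools []string) []string {
	known := make(map[string]struct{}, len(registeredPools))
	for _, id := range registeredPools {
		known[id] = struct{}{}
	}
	var orphans []string
	for _, k := range keys {
		id, ok := poolIDFromInProgressKey(k)
		if !ok {
			continue // not an in-progress key
		}
		if _, found := known[id]; !found {
			orphans = append(orphans, k)
		}
	}
	return orphans
}

func main() {
	// Made-up keys and pool IDs for illustration.
	keys := []string{
		"myapp:jobs:send_email:pool-old:inprogress",
		"myapp:jobs:send_email:pool-new:inprogress",
		"myapp:jobs:send_email",
	}
	registered := []string{"pool-new"}

	fmt.Println(orphanedInProgressKeys(keys, registered))
}
```

Any key reported here belongs to a pool that is no longer in the worker pools set, so a reaper that starts from that set would not visit it.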


@steven-zou

@gocraft

@hoffoo @mitchrodrigues

Could anyone here take a look at this issue and confirm whether it is really a bug?
