Add new QueuedJobRegistry to catch jobs dropped by workers before marked as started #1568

Draft · wants to merge 6 commits into base: master

Conversation

@joshcoden (Contributor)

Fixes #1553

codecov bot commented Sep 24, 2021

Codecov Report

Merging #1568 (cb9fc61) into master (e71fcb9) will increase coverage by 0.08%.
The diff coverage is 99.28%.


@@            Coverage Diff             @@
##           master    #1568      +/-   ##
==========================================
+ Coverage   95.59%   95.67%   +0.08%     
==========================================
  Files          46       46              
  Lines        7061     7195     +134     
==========================================
+ Hits         6750     6884     +134     
  Misses        311      311              
Impacted Files Coverage Δ
rq/registry.py 97.15% <96.96%> (-0.07%) ⬇️
rq/job.py 98.20% <100.00%> (+0.01%) ⬆️
rq/queue.py 94.26% <100.00%> (+0.35%) ⬆️
rq/worker.py 88.66% <100.00%> (+0.01%) ⬆️
tests/test_job.py 100.00% <100.00%> (ø)
tests/test_queue.py 100.00% <100.00%> (ø)
tests/test_registry.py 100.00% <100.00%> (ø)
tests/test_worker.py 97.52% <100.00%> (+<0.01%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

joshcoden marked this pull request as ready for review on September 24, 2021 at 18:37
@selwin (Collaborator) commented Sep 28, 2021

Thanks for the PR.

I think at any one time, a job can only exist in one registry. When a job is moved to StartedJobRegistry, it should also be removed from QueuedJobRegistry in a single pipeline call. Do you mind making this change?

When a job is canceled, it should also be moved from QueuedJobRegistry to CanceledJobRegistry.

@joshcoden (Contributor, Author):

> I think at any one time, a job can only exist in one registry. When a job is moved to StartedJobRegistry, it should also be removed from QueuedJobRegistry in a single pipeline call. Do you mind making this change?

@selwin It is when you enqueue; it's just up a level in the call stack:

rq/worker.py, line 913 (commit 8fd8de0):

    job.queued_job_registry.remove(job, pipeline=pipeline)

I didn't include it at the same level of the call stack since the job.heartbeat method is also called in other places, like on job callbacks, where we don't want to remove the job from a registry it shouldn't be in.

> When a job is canceled, it should also be moved from QueuedJobRegistry to CanceledJobRegistry.

@selwin I didn't include this here since I also have a bug-fix PR up: canceling currently assumes the job is in the failed registry, so a job sitting in a different registry would also break. I think it makes sense to make this change in this PR after we merge #1564.

@selwin (Collaborator) commented Sep 28, 2021

> It is when you enqueue; it's just up a level in the call stack:

I meant when a job is popped off the queue and moved to StartedJobRegistry (when it's being worked on), it should also be removed from QueuedJobRegistry. As it is now, when a job is being worked on it exists in both QueuedJobRegistry and StartedJobRegistry.

@joshcoden (Contributor, Author) commented Sep 28, 2021

> It is when you enqueue; it's just up a level in the call stack:

> I meant when a job is popped off the queue and moved to StartedJobRegistry (when it's being worked on), it should also be removed from QueuedJobRegistry. As it is now, when a job is being worked on it exists in both QueuedJobRegistry and StartedJobRegistry.

@selwin As mentioned above, I am removing it from QueuedJobRegistry before adding it to StartedJobRegistry within the same pipeline. In prepare_job_execution I remove it from the QueuedJobRegistry before calling heartbeat, which is where the job gets added to StartedJobRegistry; please see the code link in the comment above (#1568 (comment)):

rq/worker.py, line 913 (commit 8fd8de0):

    job.queued_job_registry.remove(job, pipeline=pipeline)

@alella (Contributor) commented Sep 28, 2021

| fn_call                      | queue | QueuedJobRegistry | StartedJobRegistry | parent worker | forked worker |
|------------------------------|-------|-------------------|--------------------|---------------|---------------|
| enqueue                      | job   | job               | -                  | -             | -             |
| dequeue_job_and_maintain_ttl | -     | job               | -                  | active        | doesn't exist |
| perform_job(job, queue)      | -     | job               | -                  | inactive      | active        |
| prepare_job_execution(job)   | -     | -                 | job                | inactive      | active        |

I think what @joshcoden is trying to say is that prepare_job_execution calls job.queued_job_registry.remove and job.heartbeat in the same pipeline. job.heartbeat in turn calls started_job_registry.add using the same pipeline. So the job never exists in both QueuedJobRegistry and StartedJobRegistry at the same time.
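
For reference, a minimal sketch of the flow described above, assuming the names used in this thread (prepare_job_execution, job.queued_job_registry, job.heartbeat); the actual implementation lives in the PR branch and may differ:

    from rq.utils import utcnow

    def prepare_job_execution(self, job, heartbeat_ttl):
        """Sketch only: both registry updates are queued on one Redis pipeline."""
        with self.connection.pipeline() as pipeline:
            # Drop the job from QueuedJobRegistry on the pipeline...
            job.queued_job_registry.remove(job, pipeline=pipeline)
            # ...then heartbeat() adds it to StartedJobRegistry using the same
            # pipeline, so both changes land together when the pipeline executes.
            job.heartbeat(utcnow(), heartbeat_ttl, pipeline=pipeline)
            pipeline.execute()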

Comment on lines +370 to +371:

    if job.enqueued_at < front_timestamp:
        self.requeue(job)

alella (Contributor):

If job.enqueued_at >= front_timestamp, the condition would remain True for the rest of the jobs in the QueuedJobRegistry, so you would no longer need to iterate through them (assuming get_job_ids returns a list in ascending order).
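
A sketch of the suggested short-circuit (assuming get_job_ids() returns ids ordered by enqueue time, as noted above; _get_front_queue_timestamp, requeue and requeue_stuck_jobs are the names used in this PR's diff):

    def requeue_stuck_jobs(self):
        """Sketch: stop scanning once jobs are at or after the front of the queue."""
        front_timestamp = self._get_front_queue_timestamp()
        for job_id in self.get_job_ids():
            job = self.job_class.fetch(job_id, connection=self.connection,
                                       serializer=self.serializer)
            if job.enqueued_at >= front_timestamp:
                # Every remaining job was enqueued at or after the front of the
                # queue, so none of them can have been dropped: stop here.
                break
            self.requeue(job)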

@joshcoden (Contributor, Author) replied Sep 28, 2021:

@alella The user can manually specify the score when calling add on the registry. While no such calls will exist in the library code itself, a user could specify a ttl that does not align with the enqueued_at time:

rq/registry.py, lines 64 to 72 (commit 4711080):

    def add(self, job, ttl=0, pipeline=None, xx=False):
        """Adds a job to a registry with expiry time of now + ttl, unless it's -1 which is set to +inf"""
        score = ttl if ttl < 0 else current_timestamp() + ttl
        if score == -1:
            score = '+inf'
        if pipeline is not None:
            return pipeline.zadd(self.key, {job.id: score}, xx=xx)
        return self.connection.zadd(self.key, {job.id: score}, xx=xx)
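
To illustrate the mismatch being described (hypothetical usage; registry, job and other_job are placeholders):

    # The sorted-set score comes from the ttl argument, not from job.enqueued_at,
    # so a caller can produce scores that say nothing about enqueue order.
    registry.add(job, ttl=3600)        # score = current_timestamp() + 3600
    registry.add(other_job, ttl=-1)    # score = '+inf'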

joshcoden and others added 2 commits September 28, 2021 11:21
Co-authored-by: Ashoka Lella <alella@users.noreply.github.com>
@selwin (Collaborator) commented Sep 29, 2021

@alella @joshcoden thanks for the explanation.

It is indeed working as expected. I misread this because the code around this part is not symmetric. Can you create a job.move_to_started_job_registry() method that moves the job from QueuedJobRegistry into StartedJobRegistry? It will be cleaner this way.

With this, we can potentially also clean up job.heartbeat(xx) because right now it does two jobs:

  1. Adding job to started job registry
  2. Updating the timestamp in started job registry

This has already led to race-condition bugs, which were fixed in PR #1550.
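
A rough sketch of what the suggested helper could look like (purely illustrative: started_job_registry is assumed to be a Job property analogous to queued_job_registry, and the signature is invented here):

    def move_to_started_job_registry(self, ttl, pipeline=None):
        """Sketch: move this job from QueuedJobRegistry to StartedJobRegistry atomically."""
        pipe = pipeline if pipeline is not None else self.connection.pipeline()
        self.queued_job_registry.remove(self, pipeline=pipe)
        self.started_job_registry.add(self, ttl, pipeline=pipe)
        if pipeline is None:
            # Only execute if we created the pipeline ourselves; otherwise the
            # caller decides when the queued commands run.
            pipe.execute()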

Comment on lines +388 to +391:

    else:
        raise InvalidJobOperationError(
            "Queued job {} has no enqueue_at value!".format(front_job.id)
        )

Collaborator:

This could not happen, right? So I think we can skip this check.

    for job_id in job_ids:
        # If job was enqueued AFTER the front of the queue it must have already been dequeued
        # This is faster than seeing if the job is in the queue directly.
        front_timestamp = self._get_front_queue_timestamp()

Collaborator:

We can move this out of the loop so we don't keep fetching a new timestamp for every single job processed.
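
A minimal illustration of the change being asked for (same names as the quoted snippet):

    # Read the front-of-queue timestamp once, before the loop, instead of once per job.
    front_timestamp = self._get_front_queue_timestamp()
    for job_id in job_ids:
        # ... existing per-job comparison against front_timestamp ...
        pass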

    It is not defined in cleanup since we don't want this being called everytime count or get_job_ids is called
    """
    job_ids = self.get_job_ids()
    for job_id in job_ids:

Collaborator:

This could be expensive if we have lots of jobs in QueuedJobRegistry, and sometimes people have millions of jobs enqueued. I wonder if we have a way to only check a subset of the jobs.
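
One possible direction, purely as an illustration and not something this PR does: cap how many entries each maintenance pass examines. The base registry's get_job_ids() accepts a start/end slice of the underlying sorted set, so only the oldest N ids need to be pulled per call:

    MAX_JOBS_PER_PASS = 1000  # arbitrary cap, for illustration only

    # Only fetch the oldest MAX_JOBS_PER_PASS ids instead of the whole registry;
    # the rest of the loop can stay as in the diff.
    job_ids = self.get_job_ids(0, MAX_JOBS_PER_PASS - 1)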

            connection=queue.connection,
            job_class=queue.job_class,
            serializer=queue.serializer)
    registry.requeue_stuck_jobs()

Collaborator:

I'm not sure if we should automatically requeue stuck jobs, because:

  1. This operation could be heavy depending on the number of jobs you have in the registry.
  2. If requeued, time-sensitive jobs could lead to unwanted outcomes.

joshcoden marked this pull request as draft on September 29, 2021 at 15:00
@ccrvlh (Collaborator) commented Jan 17, 2023

@joshcoden any plans to continue working on this?

Successfully merging this pull request may close this issue: rq jobs go into a bad state when redis hits connection limit.
4 participants