-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove_after
no longer respects task dependencies
#129
Comments
The old behaviour already did not work when there were too many jobs in the queue unfortunately. But i don't quite understand your use case here. We only delete |
I use Minion for flows, and use the task tree to render an overview of how far the process is, like this: flowchart LR
task_a["Task A"] --> task_b["Task B"]
task_b --> task_c["Task C"]
task_c --> task_d["Task D"]
If any task in the middle finishes before In essens I only want |
Job |
Not that it matters much, the old query falls apart with a few million jobs in the queue. We couldn't bring it back even if we wanted. |
at my company we are running Minion, with a throughput of roughly ~500k tasks a day. And ~1M tasks at most times in table. for some reason we never did a PR on our fixed back with 10.22. we also noticed problems with large amount of jobs in the queue. Our solution still works. WITH interesting_jobs AS (
SELECT
parents,
state
FROM
minion_jobs
WHERE
cardinality(parents) > 0
AND state != 'finished' )
DELETE FROM minion_jobs AS j
WHERE (finished <= NOW() - INTERVAL '1 second' * 172800
AND state = 'finished'
AND NOT EXISTS ( SELECT 1 FROM interesting_jobs
WHERE parents @> ARRAY[j.id] AND state != 'finished'))
OR (expires <= NOW() AND state = 'inactive'); its fairly fast as well explain analyze WITH interesting_jobs AS (
SELECT
parents,
state
FROM
minion_jobs
WHERE
cardinality(parents) > 0
AND state != 'finished' )
SELECT FROM minion_jobs AS j
WHERE (finished <= NOW() - INTERVAL '1 second' * 172800
AND state = 'finished'
AND NOT EXISTS ( SELECT 1 FROM interesting_jobs
WHERE parents @> ARRAY[j.id] AND state != 'finished'))
OR (expires <= NOW() AND state = 'inactive');
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on minion_jobs j (cost=23066.17..346802608.51 rows=67548 width=0) (actual time=463.859..28426.737 rows=11952 loops=1)
Recheck Cond: (((finished <= (now() - '48:00:00'::interval)) AND (state = 'finished'::minion_state)) OR (state = 'inactive'::minion_state))
Rows Removed by Index Recheck: 46175
Filter: (((finished <= (now() - '48:00:00'::interval)) AND (state = 'finished'::minion_state) AND (NOT (SubPlan 1))) OR ((expires <= now()) AND (state = 'inactive'::minion_state)))
Rows Removed by Filter: 26650
Heap Blocks: exact=33423
-> BitmapOr (cost=23066.17..23066.17 rows=135096 width=0) (actual time=90.466..90.467 rows=0 loops=1)
-> BitmapAnd (cost=23044.85..23044.85 rows=135096 width=0) (actual time=89.573..89.573 rows=0 loops=1)
-> Bitmap Index Scan on minion_jobs_finished (cost=0.00..2833.77 rows=153245 width=0) (actual time=30.673..30.673 rows=152942 loops=1)
Index Cond: (finished <= (now() - '48:00:00'::interval))
-> Bitmap Index Scan on minion_jobs_state_priority_id_idx (cost=0.00..20177.06 rows=879551 width=0) (actual time=56.709..56.709 rows=883147 loops=1)
Index Cond: (state = 'finished'::minion_state)
-> Bitmap Index Scan on minion_jobs_state_priority_id_idx (cost=0.00..4.43 rows=1 width=0) (actual time=0.893..0.893 rows=3 loops=1)
Index Cond: (state = 'inactive'::minion_state)
SubPlan 1
-> Bitmap Heap Scan on minion_jobs (cost=289.45..6250.75 rows=23 width=0) (actual time=0.731..0.731 rows=1 loops=38599)
Recheck Cond: ((parents @> ARRAY[j.id]) AND (cardinality(parents) > 0))
Filter: ((state <> 'finished'::minion_state) AND (state <> 'finished'::minion_state))
Rows Removed by Filter: 0
Heap Blocks: exact=26678
-> BitmapAnd (cost=289.45..289.45 rows=1663 width=0) (actual time=0.730..0.730 rows=0 loops=38599)
-> Bitmap Index Scan on minion_jobs_parents_idx (cost=0.00..41.90 rows=4989 width=0) (actual time=0.001..0.001 rows=1 loops=38599)
Index Cond: (parents @> ARRAY[j.id])
-> Bitmap Index Scan on minion_jobs_cardinality_parents (cost=0.00..247.30 rows=332571 width=0) (actual time=1.046..1.046 rows=28013 loops=26647)
Planning Time: 0.401 ms
Execution Time: 28427.121 ms
(26 rows) we are currently moving to pg16, and found something interesting. The above query, finishes just a tiny bit faster. BUT the most surprising is that the original code in Minion 10.22, actually runs fast. running the query on the same data as above: explain analyze SELECT FROM minion_jobs AS j
WHERE (finished <= NOW() - INTERVAL '1 second' * 172800 AND NOT EXISTS (
SELECT 1 FROM minion_jobs WHERE parents @> ARRAY[j.id] AND state != 'finished'
) AND state = 'finished') OR (expires <= NOW() AND state = 'inactive');
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on minion_jobs j (cost=23079.63..44015254.24 rows=67855 width=0) (actual time=106.526..240.670 rows=12766 loops=1)
Recheck Cond: (((finished <= (now() - '48:00:00'::interval)) AND (state = 'finished'::minion_state)) OR (state = 'inactive'::minion_state))
Rows Removed by Index Recheck: 46175
Filter: (((finished <= (now() - '48:00:00'::interval)) AND (NOT (SubPlan 1)) AND (state = 'finished'::minion_state)) OR ((expires <= now()) AND (state = 'inactive'::minion_state)))
Rows Removed by Filter: 26650
Heap Blocks: exact=33512
-> BitmapOr (cost=23079.63..23079.63 rows=135711 width=0) (actual time=100.451..100.452 rows=0 loops=1)
-> BitmapAnd (cost=23058.24..23058.24 rows=135711 width=0) (actual time=99.631..99.632 rows=0 loops=1)
-> Bitmap Index Scan on minion_jobs_finished (cost=0.00..2847.00 rows=153943 width=0) (actual time=32.276..32.276 rows=153756 loops=1)
Index Cond: (finished <= (now() - '48:00:00'::interval))
-> Bitmap Index Scan on minion_jobs_state_priority_id_idx (cost=0.00..20177.06 rows=879551 width=0) (actual time=64.983..64.983 rows=883147 loops=1)
Index Cond: (state = 'finished'::minion_state)
-> Bitmap Index Scan on minion_jobs_state_priority_id_idx (cost=0.00..4.43 rows=1 width=0) (actual time=0.819..0.819 rows=3 loops=1)
Index Cond: (state = 'inactive'::minion_state)
SubPlan 1
-> Bitmap Heap Scan on minion_jobs (cost=42.04..16158.52 rows=591 width=0) (actual time=0.001..0.001 rows=1 loops=39416)
Recheck Cond: (parents @> ARRAY[j.id])
Filter: (state <> 'finished'::minion_state)
Rows Removed by Filter: 0
Heap Blocks: exact=26678
-> Bitmap Index Scan on minion_jobs_parents_idx (cost=0.00..41.90 rows=4989 width=0) (actual time=0.001..0.001 rows=1 loops=39416)
Index Cond: (parents @> ARRAY[j.id])
Planning Time: 0.322 ms
Execution Time: 241.096 ms
(24 rows) Ok, while collecting this information, and writting. another solution presented itself explain analyze SELECT FROM minion_jobs
WHERE id IN (
(
SELECT id FROM minion_jobs WHERE state = 'finished' AND finished <= NOW() - INTERVAL '1 second' * 172800
EXCEPT
SELECT unnest(parents) AS id FROM minion_jobs WHERE state != 'finished'
)
UNION
SELECT id FROM minion_jobs WHERE state = 'inactive' AND expires <= NOW());
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=256494.17..285054.46 rows=138181 width=0) (actual time=427.742..482.616 rows=15610 loops=1)
Hash Cond: (minion_jobs.id = "*SELECT* 1".id)
-> Index Only Scan using minion_jobs_pkey on minion_jobs (cost=0.42..25941.71 rows=997714 width=8) (actual time=0.043..35.708 rows=999174 loops=1)
Heap Fetches: 11
-> Hash (cost=254766.48..254766.48 rows=138181 width=8) (actual time=417.175..417.176 rows=15610 loops=1)
Buckets: 262144 Batches: 1 Memory Usage: 2658kB
-> HashAggregate (cost=252305.13..254766.48 rows=138181 width=8) (actual time=415.115..415.903 rows=15610 loops=1)
Group Key: "*SELECT* 1".id
Planned Partitions: 4 Batches: 1 Memory Usage: 3857kB
-> Append (cost=0.00..244877.90 rows=138181 width=8) (actual time=411.209..413.750 rows=15610 loops=1)
-> Result (cost=0.00..244178.55 rows=138180 width=8) (actual time=411.208..413.247 rows=15610 loops=1)
-> HashSetOp Except (cost=0.00..242796.75 rows=138180 width=12) (actual time=411.207..412.678 rows=15610 loops=1)
-> Append (cost=0.00..239497.23 rows=1319810 width=12) (actual time=0.210..398.082 rows=69663 loops=1)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..111576.08 rows=138180 width=12) (actual time=0.210..224.810 rows=42257 loops=1)
-> Seq Scan on minion_jobs minion_jobs_1 (cost=0.00..110194.28 rows=138180 width=8) (actual time=0.209..222.820 rows=42257 loops=1)
Filter: ((state = 'finished'::minion_state) AND (finished <= (now() - '48:00:00'::interval)))
Rows Removed by Filter: 956917
-> Subquery Scan on "*SELECT* 2" (cost=0.00..121322.10 rows=1181630 width=12) (actual time=0.046..170.768 rows=27406 loops=1)
-> ProjectSet (cost=0.00..109505.80 rows=1181630 width=8) (actual time=0.046..169.658 rows=27406 loops=1)
-> Seq Scan on minion_jobs minion_jobs_2 (cost=0.00..102711.43 rows=118163 width=13) (actual time=0.040..157.409 rows=116027 loops=1)
Filter: (state <> 'finished'::minion_state)
Rows Removed by Filter: 883147
-> Index Scan using minion_jobs_state_priority_id_idx on minion_jobs minion_jobs_3 (cost=0.42..8.45 rows=1 width=8) (actual time=0.012..0.012 rows=0 loops=1)
Index Cond: (state = 'inactive'::minion_state)
Filter: (expires <= now())
Rows Removed by Filter: 3
Planning Time: 0.520 ms
Execution Time: 484.704 ms
(28 rows) this runs fast on both pg13 and pg16 (above is from pg16) if interested, I will make a proper PR for the latest solution we have |
I'm not opposed to a PR if there really are no performance issues with the solution. But i'm still quite sceptical about the use cases. Do we really need this? If we treat this as a new feature, can we find two unique use cases for it? |
For the first use case i will count debugging the whole chain of jobs that led to a failure somewhere in the middle. Do we have a second? |
I once build a provisioning flow for a telco where we had automated the whole process of moving a cellphone number between providers and becoming a new customer utomated using minion. The whole process could take anywhere from 1 day to several months to complete depending on notice periods. Having the whole flow stored allowed us to render it in the "service agent ui", making it quite easy for everyone to see when the different task were performed and how many tasks there are left before the flow is done. So second use-case would be possibility of "flow overview" (for non-technical people). This together with the debugging case are my two favourite things about Minion. |
At At one client of ours there could also be pretty catastrophic if parts of a flow for long-running flows were deleted, as the later parts of a flow would compile an end-dataset based on data stored on the individual jobs in a flow. So missing parts of a chain would represent incomplete data. Naturally these datasets could be stored elsewhere, but that has not been required since minion has made this storage of job results so easy and convenient. |
One thing to consider: the query in 10.22, the different query in 10.23, and both versions of the query that @HEM42 posted are equivalent to each other if we ignore the performance issues. The code introduced in 10.27 thus breaks POLA, hard. |
I suppose that's good enough for two use cases. Now we just need a solution that 100% won't cause performance issues. |
I was just coming to submit a bug report on this. The "old" routine with the union all causes our entire queue to get in a broken state and locks the entire queue. |
Just as an aside to the OP's request, I had temporarily fixed this by simply creating a loop to delete with a limit. This allowed full deletion in "chunks". The following was a quick hack to get my queue back in working order, but could probably be refined some:
|
After the performance update in v10.27, Minion no longer respects its dependencies as stated in the documentation.
This means that a tree of dependant tasks might be severed if tasks in the "middle" finishes before the top and bottom of the tree.
I guess this is on purpose, but it totally breaks the excellent "flow" use case that Minion used to support.
Would it be possible to add a flag that toggles between the new and the old behaviour?
Expected behavior
This is what was done previously in
Pg.pm
:Actual behavior
This is the current code in
Pg.pm
, that deletes all finished jobsThe text was updated successfully, but these errors were encountered: