`remove_after` no longer respects task dependencies #129

davidolrik · 2024-05-06T09:27:14Z

Minion version: v10.29
Perl version: v5.38.2
Operating system: macOS/Linux

After the performance update in v10.27, Minion no longer respects its dependencies as stated in the documentation.

This means that a tree of dependant tasks might be severed if tasks in the "middle" finishes before the top and bottom of the tree.

I guess this is on purpose, but it totally breaks the excellent "flow" use case that Minion used to support.

Would it be possible to add a flag that toggles between the new and the old behaviour?

Expected behavior

This is what was done previously in Pg.pm:

  # Old jobs with no unresolved dependencies and expired jobs
  $db->query(
    "DELETE FROM minion_jobs WHERE id IN (
       SELECT j.id FROM minion_jobs AS j LEFT JOIN minion_jobs AS children
         ON children.state != 'finished' AND ARRAY_LENGTH(children.parents, 1) > 0 AND j.id = ANY(children.parents)
       WHERE j.state = 'finished' AND j.finished <= NOW() - INTERVAL '1 second' * ? AND children.id IS NULL
       UNION ALL
       SELECT id FROM minion_jobs WHERE state = 'inactive' AND expires <= NOW()
    )", $minion->remove_after
  );

Actual behavior

This is the current code in Pg.pm, that deletes all finished jobs

  # Old jobs
  $db->query("DELETE FROM minion_jobs WHERE state = 'finished' AND finished <= NOW() - INTERVAL '1 second' * ?",
    $minion->remove_after);

  # Expired jobs
  $db->query("DELETE FROM minion_jobs WHERE  state = 'inactive' AND expires <= NOW()");

The text was updated successfully, but these errors were encountered:

kraih · 2024-05-06T10:19:44Z

The old behaviour already did not work when there were too many jobs in the queue unfortunately. But i don't quite understand your use case here. We only delete finished jobs, how could that negatively affect dependencies, where missing job and finished are treated the same?

davidolrik · 2024-05-06T10:43:05Z

I use Minion for flows, and use the task tree to render an overview of how far the process is, like this:

flowchart LR
task_a["Task A"] --> task_b["Task B"]
task_b --> task_c["Task C"]
task_c --> task_d["Task D"]

If any task in the middle finishes before A or D, and are then deleted, and D later fails I end up with no way of getting from D to A, and I have lost the context of who spawned the failing task D.

In essens I only want finished jobs to be deleted when they are not part of an "active" tree.
I.e. when the job that have reached the state finished and have no unresolved dependencies.

kraih · 2024-05-06T10:46:48Z

Job B can't finish before A, that's what dependencies are all about.

kraih · 2024-05-06T10:51:57Z

Not that it matters much, the old query falls apart with a few million jobs in the queue. We couldn't bring it back even if we wanted.

HEM42 · 2024-05-06T13:03:24Z

at my company we are running Minion, with a throughput of roughly ~500k tasks a day. And ~1M tasks at most times in table.

for some reason we never did a PR on our fixed back with 10.22. we also noticed problems with large amount of jobs in the queue.

Our solution still works.

WITH interesting_jobs AS (
       SELECT
         parents,
         state
       FROM
         minion_jobs
       WHERE
         cardinality(parents) > 0
         AND state != 'finished' )
     DELETE FROM minion_jobs AS j
       WHERE (finished <= NOW() - INTERVAL '1 second' * 172800
       AND state = 'finished'
       AND NOT EXISTS ( SELECT 1 FROM interesting_jobs
         WHERE parents @> ARRAY[j.id] AND state != 'finished'))
       OR (expires <= NOW() AND state = 'inactive');

its fairly fast as well

explain analyze WITH interesting_jobs AS (
       SELECT
         parents,
         state
       FROM
         minion_jobs
       WHERE
         cardinality(parents) > 0
         AND state != 'finished' )
     SELECT FROM minion_jobs AS j
       WHERE (finished <= NOW() - INTERVAL '1 second' * 172800
       AND state = 'finished'
       AND NOT EXISTS ( SELECT 1 FROM interesting_jobs
         WHERE parents @> ARRAY[j.id] AND state != 'finished'))
       OR (expires <= NOW() AND state = 'inactive');
                                                                                       QUERY PLAN                                                                                       
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on minion_jobs j  (cost=23066.17..346802608.51 rows=67548 width=0) (actual time=463.859..28426.737 rows=11952 loops=1)
   Recheck Cond: (((finished <= (now() - '48:00:00'::interval)) AND (state = 'finished'::minion_state)) OR (state = 'inactive'::minion_state))
   Rows Removed by Index Recheck: 46175
   Filter: (((finished <= (now() - '48:00:00'::interval)) AND (state = 'finished'::minion_state) AND (NOT (SubPlan 1))) OR ((expires <= now()) AND (state = 'inactive'::minion_state)))
   Rows Removed by Filter: 26650
   Heap Blocks: exact=33423
   ->  BitmapOr  (cost=23066.17..23066.17 rows=135096 width=0) (actual time=90.466..90.467 rows=0 loops=1)
         ->  BitmapAnd  (cost=23044.85..23044.85 rows=135096 width=0) (actual time=89.573..89.573 rows=0 loops=1)
               ->  Bitmap Index Scan on minion_jobs_finished  (cost=0.00..2833.77 rows=153245 width=0) (actual time=30.673..30.673 rows=152942 loops=1)
                     Index Cond: (finished <= (now() - '48:00:00'::interval))
               ->  Bitmap Index Scan on minion_jobs_state_priority_id_idx  (cost=0.00..20177.06 rows=879551 width=0) (actual time=56.709..56.709 rows=883147 loops=1)
                     Index Cond: (state = 'finished'::minion_state)
         ->  Bitmap Index Scan on minion_jobs_state_priority_id_idx  (cost=0.00..4.43 rows=1 width=0) (actual time=0.893..0.893 rows=3 loops=1)
               Index Cond: (state = 'inactive'::minion_state)
   SubPlan 1
     ->  Bitmap Heap Scan on minion_jobs  (cost=289.45..6250.75 rows=23 width=0) (actual time=0.731..0.731 rows=1 loops=38599)
           Recheck Cond: ((parents @> ARRAY[j.id]) AND (cardinality(parents) > 0))
           Filter: ((state <> 'finished'::minion_state) AND (state <> 'finished'::minion_state))
           Rows Removed by Filter: 0
           Heap Blocks: exact=26678
           ->  BitmapAnd  (cost=289.45..289.45 rows=1663 width=0) (actual time=0.730..0.730 rows=0 loops=38599)
                 ->  Bitmap Index Scan on minion_jobs_parents_idx  (cost=0.00..41.90 rows=4989 width=0) (actual time=0.001..0.001 rows=1 loops=38599)
                       Index Cond: (parents @> ARRAY[j.id])
                 ->  Bitmap Index Scan on minion_jobs_cardinality_parents  (cost=0.00..247.30 rows=332571 width=0) (actual time=1.046..1.046 rows=28013 loops=26647)
 Planning Time: 0.401 ms
 Execution Time: 28427.121 ms
(26 rows)

we are currently moving to pg16, and found something interesting. The above query, finishes just a tiny bit faster.

BUT the most surprising is that the original code in Minion 10.22, actually runs fast.

running the query on the same data as above:

explain analyze SELECT FROM minion_jobs AS j
     WHERE (finished <= NOW() - INTERVAL '1 second' * 172800 AND NOT EXISTS (
       SELECT 1 FROM minion_jobs WHERE parents @> ARRAY[j.id] AND state != 'finished'
     ) AND state = 'finished') OR (expires <= NOW() AND state = 'inactive');
                                                                                       QUERY PLAN                                                                                       
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on minion_jobs j  (cost=23079.63..44015254.24 rows=67855 width=0) (actual time=106.526..240.670 rows=12766 loops=1)
   Recheck Cond: (((finished <= (now() - '48:00:00'::interval)) AND (state = 'finished'::minion_state)) OR (state = 'inactive'::minion_state))
   Rows Removed by Index Recheck: 46175
   Filter: (((finished <= (now() - '48:00:00'::interval)) AND (NOT (SubPlan 1)) AND (state = 'finished'::minion_state)) OR ((expires <= now()) AND (state = 'inactive'::minion_state)))
   Rows Removed by Filter: 26650
   Heap Blocks: exact=33512
   ->  BitmapOr  (cost=23079.63..23079.63 rows=135711 width=0) (actual time=100.451..100.452 rows=0 loops=1)
         ->  BitmapAnd  (cost=23058.24..23058.24 rows=135711 width=0) (actual time=99.631..99.632 rows=0 loops=1)
               ->  Bitmap Index Scan on minion_jobs_finished  (cost=0.00..2847.00 rows=153943 width=0) (actual time=32.276..32.276 rows=153756 loops=1)
                     Index Cond: (finished <= (now() - '48:00:00'::interval))
               ->  Bitmap Index Scan on minion_jobs_state_priority_id_idx  (cost=0.00..20177.06 rows=879551 width=0) (actual time=64.983..64.983 rows=883147 loops=1)
                     Index Cond: (state = 'finished'::minion_state)
         ->  Bitmap Index Scan on minion_jobs_state_priority_id_idx  (cost=0.00..4.43 rows=1 width=0) (actual time=0.819..0.819 rows=3 loops=1)
               Index Cond: (state = 'inactive'::minion_state)
   SubPlan 1
     ->  Bitmap Heap Scan on minion_jobs  (cost=42.04..16158.52 rows=591 width=0) (actual time=0.001..0.001 rows=1 loops=39416)
           Recheck Cond: (parents @> ARRAY[j.id])
           Filter: (state <> 'finished'::minion_state)
           Rows Removed by Filter: 0
           Heap Blocks: exact=26678
           ->  Bitmap Index Scan on minion_jobs_parents_idx  (cost=0.00..41.90 rows=4989 width=0) (actual time=0.001..0.001 rows=1 loops=39416)
                 Index Cond: (parents @> ARRAY[j.id])
 Planning Time: 0.322 ms
 Execution Time: 241.096 ms
(24 rows)

Ok, while collecting this information, and writting. another solution presented itself

explain analyze SELECT FROM minion_jobs
WHERE id IN (
(
SELECT id FROM minion_jobs WHERE state = 'finished' AND finished <= NOW() - INTERVAL '1 second' * 172800
EXCEPT
SELECT unnest(parents) AS id FROM minion_jobs WHERE state != 'finished'
)
UNION
SELECT id FROM minion_jobs WHERE state = 'inactive' AND expires <= NOW());
                                                                                        QUERY PLAN                                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=256494.17..285054.46 rows=138181 width=0) (actual time=427.742..482.616 rows=15610 loops=1)
   Hash Cond: (minion_jobs.id = "*SELECT* 1".id)
   ->  Index Only Scan using minion_jobs_pkey on minion_jobs  (cost=0.42..25941.71 rows=997714 width=8) (actual time=0.043..35.708 rows=999174 loops=1)
         Heap Fetches: 11
   ->  Hash  (cost=254766.48..254766.48 rows=138181 width=8) (actual time=417.175..417.176 rows=15610 loops=1)
         Buckets: 262144  Batches: 1  Memory Usage: 2658kB
         ->  HashAggregate  (cost=252305.13..254766.48 rows=138181 width=8) (actual time=415.115..415.903 rows=15610 loops=1)
               Group Key: "*SELECT* 1".id
               Planned Partitions: 4  Batches: 1  Memory Usage: 3857kB
               ->  Append  (cost=0.00..244877.90 rows=138181 width=8) (actual time=411.209..413.750 rows=15610 loops=1)
                     ->  Result  (cost=0.00..244178.55 rows=138180 width=8) (actual time=411.208..413.247 rows=15610 loops=1)
                           ->  HashSetOp Except  (cost=0.00..242796.75 rows=138180 width=12) (actual time=411.207..412.678 rows=15610 loops=1)
                                 ->  Append  (cost=0.00..239497.23 rows=1319810 width=12) (actual time=0.210..398.082 rows=69663 loops=1)
                                       ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..111576.08 rows=138180 width=12) (actual time=0.210..224.810 rows=42257 loops=1)
                                             ->  Seq Scan on minion_jobs minion_jobs_1  (cost=0.00..110194.28 rows=138180 width=8) (actual time=0.209..222.820 rows=42257 loops=1)
                                                   Filter: ((state = 'finished'::minion_state) AND (finished <= (now() - '48:00:00'::interval)))
                                                   Rows Removed by Filter: 956917
                                       ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..121322.10 rows=1181630 width=12) (actual time=0.046..170.768 rows=27406 loops=1)
                                             ->  ProjectSet  (cost=0.00..109505.80 rows=1181630 width=8) (actual time=0.046..169.658 rows=27406 loops=1)
                                                   ->  Seq Scan on minion_jobs minion_jobs_2  (cost=0.00..102711.43 rows=118163 width=13) (actual time=0.040..157.409 rows=116027 loops=1)
                                                         Filter: (state <> 'finished'::minion_state)
                                                         Rows Removed by Filter: 883147
                     ->  Index Scan using minion_jobs_state_priority_id_idx on minion_jobs minion_jobs_3  (cost=0.42..8.45 rows=1 width=8) (actual time=0.012..0.012 rows=0 loops=1)
                           Index Cond: (state = 'inactive'::minion_state)
                           Filter: (expires <= now())
                           Rows Removed by Filter: 3
 Planning Time: 0.520 ms
 Execution Time: 484.704 ms
(28 rows)

this runs fast on both pg13 and pg16 (above is from pg16)

if interested, I will make a proper PR for the latest solution we have

kraih · 2024-05-06T15:35:15Z

I'm not opposed to a PR if there really are no performance issues with the solution. But i'm still quite sceptical about the use cases. Do we really need this? If we treat this as a new feature, can we find two unique use cases for it?

kraih · 2024-05-06T18:20:51Z

For the first use case i will count debugging the whole chain of jobs that led to a failure somewhere in the middle. Do we have a second?

davidolrik · 2024-05-06T19:01:56Z

I once build a provisioning flow for a telco where we had automated the whole process of moving a cellphone number between providers and becoming a new customer utomated using minion.

The whole process could take anywhere from 1 day to several months to complete depending on notice periods.
For this type of flow it would have been catastrophic if the first part of the flow was deleted because it was finished.

Having the whole flow stored allowed us to render it in the "service agent ui", making it quite easy for everyone to see when the different task were performed and how many tasks there are left before the flow is done.

So second use-case would be possibility of "flow overview" (for non-technical people).

This together with the debugging case are my two favourite things about Minion.

christopherraa · 2024-05-06T19:10:52Z

At $work we are heavily using Minion for such flows, and for us dependencies have been valuable both for debugging and "flow overview".

At one client of ours there could also be pretty catastrophic if parts of a flow for long-running flows were deleted, as the later parts of a flow would compile an end-dataset based on data stored on the individual jobs in a flow. So missing parts of a chain would represent incomplete data. Naturally these datasets could be stored elsewhere, but that has not been required since minion has made this storage of job results so easy and convenient.

tobez · 2024-05-06T19:16:35Z

One thing to consider: the query in 10.22, the different query in 10.23, and both versions of the query that @HEM42 posted are equivalent to each other if we ignore the performance issues. The code introduced in 10.27 thus breaks POLA, hard.

kraih · 2024-05-06T23:14:06Z

I suppose that's good enough for two use cases. Now we just need a solution that 100% won't cause performance issues.

rshingleton · 2024-05-13T14:04:01Z

I was just coming to submit a bug report on this. The "old" routine with the union all causes our entire queue to get in a broken state and locks the entire queue.

rshingleton · 2024-05-13T14:51:28Z

Just as an aside to the OP's request, I had temporarily fixed this by simply creating a loop to delete with a limit. This allowed full deletion in "chunks". The following was a quick hack to get my queue back in working order, but could probably be refined some:

my $db = $minion->backend->pg->db;
my $limit = 50000;
my $deleteAfterSeconds = 86400;

my $data = getIds();
my $count = 0;
while (defined($data) && scalar(@{$data})) {
    my $records = scalar(@{$data});
    $count += scalar($records);
    say "deleting $records records";
    runRepairJobs($data);
    $data = getIds();
    say "deleted $count records";
    sleep 2;
}

sub runRepairJobs {
    my $results = shift;

    my @deletes;
    for my $item (@{$results}) {
        # say $item->{id};
        push(@deletes, $item->{id});
    }

    if (scalar(@deletes)) {
        my $ids = join(',', @deletes);
        my $delete = "DELETE FROM minion_jobs WHERE id IN ($ids)";
        $db->query($delete)
    }

}

sub getIds {
    my $query = qq{
        SELECT j.id FROM minion_jobs AS j
          LEFT JOIN minion_jobs AS children
             ON children.state != 'finished'
             AND ARRAY_LENGTH(children.parents, 1) > 0
             AND j.id = ANY(children.parents)
          WHERE j.state = 'finished'
             AND j.finished <= NOW() - INTERVAL '1 second' * $deleteAfterSeconds
             AND children.id IS NULL
        UNION ALL
        SELECT id FROM minion_jobs
             WHERE state = 'inactive'
             AND expires <= NOW()
             LIMIT $limit
    };

    my $results = $db->query($query)->hashes->to_array;

    return $results;
}

HEM42 mentioned this issue May 7, 2024

Reimplementation of previous repair, but without significant performance hit #130

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`remove_after` no longer respects task dependencies #129

`remove_after` no longer respects task dependencies #129

davidolrik commented May 6, 2024 •

edited

kraih commented May 6, 2024

davidolrik commented May 6, 2024

kraih commented May 6, 2024

kraih commented May 6, 2024

HEM42 commented May 6, 2024 •

edited

kraih commented May 6, 2024

kraih commented May 6, 2024

davidolrik commented May 6, 2024

christopherraa commented May 6, 2024

tobez commented May 6, 2024

kraih commented May 6, 2024

rshingleton commented May 13, 2024

rshingleton commented May 13, 2024

remove_after no longer respects task dependencies #129

remove_after no longer respects task dependencies #129

Comments

davidolrik commented May 6, 2024 • edited

Expected behavior

Actual behavior

kraih commented May 6, 2024

davidolrik commented May 6, 2024

kraih commented May 6, 2024

kraih commented May 6, 2024

HEM42 commented May 6, 2024 • edited

kraih commented May 6, 2024

kraih commented May 6, 2024

davidolrik commented May 6, 2024

christopherraa commented May 6, 2024

tobez commented May 6, 2024

kraih commented May 6, 2024

rshingleton commented May 13, 2024

rshingleton commented May 13, 2024

`remove_after` no longer respects task dependencies #129

`remove_after` no longer respects task dependencies #129

davidolrik commented May 6, 2024 •

edited

HEM42 commented May 6, 2024 •

edited