flux queue idle hangs with no active jobs #5964

Closed
garlick opened this issue May 13, 2024 · 3 comments

garlick commented May 13, 2024

Problem: when flux (0.62.0) was shut down on fluke, the flux queue idle command executed during cleanup hung.

A manual run of flux queue idle also hangs.

flux module stats job-manager shows zero active jobs:

# flux module stats job-manager
{
 "journal": {
  "listeners": 1
 },
 "active_jobs": 0,
 "inactive_jobs": 52,
 "max_jobid": 891940360861779968
}

But the debugger shows ctx->alloc.alloc_pending_count = 1. This value must be zero in order for the queue to be called idle.

(gdb) thread find job-manager  # thanks tom
Thread 15 has target name 'job-manager'
(gdb) t 15
[Switching to thread 15 (Thread 0x7fffb2098700 (LWP 2603579))]
#0  0x00007ffff62cf247 in epoll_wait () from /lib64/libc.so.6
(gdb) frame
#0  0x00007ffff62cf247 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff62cf247 in epoll_wait () from /lib64/libc.so.6
#1  0x00007ffff7b961ff in epoll_poll (loop=0x7fffa8002440, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00007ffff7b98dd6 in ev_run (flags=0, loop=0x7fffa8002440) at ev.c:4157
#3  ev_run (loop=0x7fffa8002440, flags=0) at ev.c:4021
#4  0x00007ffff7b6b7ff in flux_reactor_run (r=r@entry=0x7fffa8002420, 
    flags=flags@entry=0) at reactor.c:124
#5  0x00007fffc53ca649 in mod_main (h=0x7fffa8000be0, argc=<optimized out>, 
    argv=<optimized out>) at job-manager.c:239
#6  0x0000000000411ecf in module_thread (arg=0x7419f0) at module.c:221
#7  0x00007ffff79301ca in start_thread () from /lib64/libpthread.so.0
#8  0x00007ffff61d9e73 in clone () from /lib64/libc.so.6
(gdb) frame 5
#5  0x00007fffc53ca649 in mod_main (h=0x7fffa8000be0, argc=<optimized out>, 
    argv=<optimized out>) at job-manager.c:239
(gdb) p ctx
$5 = {h = 0x7fffa8000be0, handlers = 0x7fffa80109d0, 
  active_jobs = 0x7fffa8003cb0, inactive_jobs = 0x7fffa8003d40, 
  running_jobs = 0, max_jobid = 891940360861779968, owner = 505, 
  conf = 0x7fffa8003df0, start = 0x7fffa800f0c0, alloc = 0x7fffa800ea10, 
  event = 0x7fffa800d240, submit = 0x7fffa800e8f0, drain = 0x7fffa800f260, 
  wait = 0x7fffa800f4f0, raise = 0x7fffa800f760, kill = 0x7fffa800f960, 
  annotate = 0x7fffa800fea0, journal = 0x7fffa800ffc0, purge = 0x7fffa800d480, 
  queue = 0x7fffa800de00, update = 0x7fffa8010290, jobtap = 0x7fffa8003f40}
(gdb) p *ctx.alloc
$7 = {ctx = 0x7fffb2097740, handlers = 0x7fffa800eb50, queue = 0x7fffa800ea70, 
  pending_jobs = 0x7fffa800eae0, ready = true, stopped = false, 
  stopped_reason = 0x0, prep = 0x7fffa800ef90, check = 0x7fffa800efe0, 
  dle = 0x7fffa800f030, alloc_limit = 0, alloc_pending_count = 1, 
  sched_sender = 0x7fffa801b5e0 "ad2e7465-ba06-40ac-a2ee-60ffb4cce15d"}
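
For reference, here is a minimal standalone sketch of the condition flux queue idle appears to be waiting on (an assumption about the logic, not the actual flux-core source): the queue is only reported idle once there are no queued alloc requests, no alloc requests pending to the scheduler, and no running jobs. With alloc_pending_count stuck at 1 as in the dump above, that can never become true.

/* Toy model of the counters shown in the gdb dump; not flux-core code */
#include <stdbool.h>
#include <stdio.h>

struct alloc_state {
    int queued_count;          /* alloc requests queued in the job manager */
    int alloc_pending_count;   /* alloc requests sent to the scheduler */
    int running_jobs;          /* jobs currently holding an allocation */
};

static bool queue_is_idle (const struct alloc_state *a)
{
    return a->queued_count == 0
        && a->alloc_pending_count == 0
        && a->running_jobs == 0;
}

int main (void)
{
    /* values observed in this issue */
    struct alloc_state a = { .queued_count = 0,
                             .alloc_pending_count = 1,
                             .running_jobs = 0 };
    printf ("idle: %s\n", queue_is_idle (&a) ? "yes" : "no");   /* prints "no" */
    return 0;
}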

garlick commented May 13, 2024

flux queue stop, which ran earlier in the cleanup sequence, should have ensured that no alloc requests are pending to the scheduler.

And dmesg shows that it ran successfully:

[May13 13:49] broker[0]: shutdown: run->cleanup 5.13866d
[  +0.203507] broker[0]: cleanup.0: flux queue stop --quiet --all --nocheckpoint Exited (rc=0) 0.2s
[  +0.408389] broker[0]: cleanup.1: flux cancel --user=all --quiet --states RUN Exited (rc=0) 0.2s

The gdb dump above shows that ctx->alloc.stopped = false!

Edit: alloc->stopped appears to have been orphaned when we added queue support, so that doesn't mean anything. Nevermind.

garlick commented May 13, 2024

This also shows the problem:

$ flux queue status -v
batch: Job submission is enabled
batch: Scheduling is stopped
debug: Job submission is enabled
debug: Scheduling is stopped
0 alloc requests queued
1 alloc requests pending to scheduler
0 running jobs

When a queue is stopped, any pending sched.alloc requests for jobs in that queue are canceled; they are resent when the queue is started. However, I don't see the same cancellation happening during a queue reconfiguration, so if a job has an outstanding alloc request in a queue that gets deleted, the request may still be outstanding after flux queue stop --all. Reloading the scheduler would take care of it, as would stopping the queue before deleting it. Anyway, this may be where we should look, especially if fluke's queues were recently reconfigured.
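
To make the gap concrete, here is a self-contained toy model of that bookkeeping (purely illustrative; the queue names and single-request setup are assumptions, not flux-core code): stopping a queue cancels the sched.alloc requests tagged with that queue, but deleting the queue in a reconfiguration skips the cancel, so a later stop of the remaining queues never touches the orphaned request and the pending count stays above zero.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* One outstanding sched.alloc request, tagged with the queue it was submitted to */
struct alloc_request {
    const char *queue;
    bool outstanding;
};

/* Model of 'flux queue stop NAME': cancel pending requests for that queue */
static void queue_stop (struct alloc_request *reqs, int n, const char *name)
{
    for (int i = 0; i < n; i++)
        if (strcmp (reqs[i].queue, name) == 0)
            reqs[i].outstanding = false;
}

static int alloc_pending_count (const struct alloc_request *reqs, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (reqs[i].outstanding)
            count++;
    return count;
}

int main (void)
{
    /* A job in "batch" has an alloc request pending to the scheduler */
    struct alloc_request reqs[] = { { .queue = "batch", .outstanding = true } };

    /* Reconfiguration deletes "batch" without canceling the request, so
     * 'flux queue stop --all' afterward only covers the queues that still
     * exist (here just "debug") and the orphaned request survives. */
    queue_stop (reqs, 1, "debug");

    printf ("%d alloc requests pending to scheduler\n",
            alloc_pending_count (reqs, 1));   /* prints 1 */
    return 0;
}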

garlick commented May 24, 2024

Closing. The problem as stated in this issue's description was resolved by flux-framework/flux-sched#1209.

garlick closed this as completed May 24, 2024