flux queue idle hangs with no active jobs #5964

Closed
garlick opened this issue May 13, 2024 · 3 comments

garlick commented May 13, 2024

Problem: when flux (0.62.0) was shut down on fluke, the flux queue idle command executed during cleanup hung.

A manual run of flux queue idle also hangs.

flux module stats job-manager shows zero active jobs:

# flux module stats job-manager
{
 "journal": {
  "listeners": 1
 },
 "active_jobs": 0,
 "inactive_jobs": 52,
 "max_jobid": 891940360861779968
}

But the debugger shows ctx->alloc.alloc_pending_count = 1. This value must be zero in order for the queue to be called idle.

(gdb) thread find job-manager  # thanks tom
Thread 15 has target name 'job-manager'
(gdb) t 15
[Switching to thread 15 (Thread 0x7fffb2098700 (LWP 2603579))]
#0  0x00007ffff62cf247 in epoll_wait () from /lib64/libc.so.6
(gdb) frame
#0  0x00007ffff62cf247 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff62cf247 in epoll_wait () from /lib64/libc.so.6
#1  0x00007ffff7b961ff in epoll_poll (loop=0x7fffa8002440, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00007ffff7b98dd6 in ev_run (flags=0, loop=0x7fffa8002440) at ev.c:4157
#3  ev_run (loop=0x7fffa8002440, flags=0) at ev.c:4021
#4  0x00007ffff7b6b7ff in flux_reactor_run (r=r@entry=0x7fffa8002420, 
    flags=flags@entry=0) at reactor.c:124
#5  0x00007fffc53ca649 in mod_main (h=0x7fffa8000be0, argc=<optimized out>, 
    argv=<optimized out>) at job-manager.c:239
#6  0x0000000000411ecf in module_thread (arg=0x7419f0) at module.c:221
#7  0x00007ffff79301ca in start_thread () from /lib64/libpthread.so.0
#8  0x00007ffff61d9e73 in clone () from /lib64/libc.so.6
(gdb) frame 5
#5  0x00007fffc53ca649 in mod_main (h=0x7fffa8000be0, argc=<optimized out>, 
    argv=<optimized out>) at job-manager.c:239
(gdb) p ctx
$5 = {h = 0x7fffa8000be0, handlers = 0x7fffa80109d0, 
  active_jobs = 0x7fffa8003cb0, inactive_jobs = 0x7fffa8003d40, 
  running_jobs = 0, max_jobid = 891940360861779968, owner = 505, 
  conf = 0x7fffa8003df0, start = 0x7fffa800f0c0, alloc = 0x7fffa800ea10, 
  event = 0x7fffa800d240, submit = 0x7fffa800e8f0, drain = 0x7fffa800f260, 
  wait = 0x7fffa800f4f0, raise = 0x7fffa800f760, kill = 0x7fffa800f960, 
  annotate = 0x7fffa800fea0, journal = 0x7fffa800ffc0, purge = 0x7fffa800d480, 
  queue = 0x7fffa800de00, update = 0x7fffa8010290, jobtap = 0x7fffa8003f40}
(gdb) p *ctx.alloc
$7 = {ctx = 0x7fffb2097740, handlers = 0x7fffa800eb50, queue = 0x7fffa800ea70, 
  pending_jobs = 0x7fffa800eae0, ready = true, stopped = false, 
  stopped_reason = 0x0, prep = 0x7fffa800ef90, check = 0x7fffa800efe0, 
  dle = 0x7fffa800f030, alloc_limit = 0, alloc_pending_count = 1, 
  sched_sender = 0x7fffa801b5e0 "ad2e7465-ba06-40ac-a2ee-60ffb4cce15d"}
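
For reference, here is a minimal standalone sketch of the condition flux queue idle appears to be waiting on (an assumption about the logic, not the actual flux-core source): the queue is only reported idle once there are no queued alloc requests, no alloc requests pending to the scheduler, and no running jobs. With alloc_pending_count stuck at 1 as in the dump above, that can never become true.

/* Toy model of the counters shown in the gdb dump; not flux-core code */
#include <stdbool.h>
#include <stdio.h>

struct alloc_state {
    int queued_count;          /* alloc requests queued in the job manager */
    int alloc_pending_count;   /* alloc requests sent to the scheduler */
    int running_jobs;          /* jobs currently holding an allocation */
};

static bool queue_is_idle (const struct alloc_state *a)
{
    return a->queued_count == 0
        && a->alloc_pending_count == 0
        && a->running_jobs == 0;
}

int main (void)
{
    /* values observed in this issue */
    struct alloc_state a = { .queued_count = 0,
                             .alloc_pending_count = 1,
                             .running_jobs = 0 };
    printf ("idle: %s\n", queue_is_idle (&a) ? "yes" : "no");   /* prints "no" */
    return 0;
}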

garlick commented May 13, 2024

flux queue stop, which ran earlier in the cleanup sequence, should have ensured that no alloc requests are pending to the scheduler.

And dmesg shows that it ran successfully:

[May13 13:49] broker[0]: shutdown: run->cleanup 5.13866d
[  +0.203507] broker[0]: cleanup.0: flux queue stop --quiet --all --nocheckpoint Exited (rc=0) 0.2s
[  +0.408389] broker[0]: cleanup.1: flux cancel --user=all --quiet --states RUN Exited (rc=0) 0.2s

The gdb dump above shows that ctx->alloc.stopped = false!

Edit: alloc->stopped appears to have been orphaned when we added queue support, so that doesn't mean anything. Nevermind.

garlick commented May 13, 2024

This also shows the problem:

$ flux queue status -v
batch: Job submission is enabled
batch: Scheduling is stopped
debug: Job submission is enabled
debug: Scheduling is stopped
0 alloc requests queued
1 alloc requests pending to scheduler
0 running jobs

When a queue is stopped, any pending sched.alloc requests for jobs in that queue are canceled; they are resent when the queue is started. However, I don't see the same cancellation happening during a queue reconfiguration, so if a job has an outstanding alloc request in a queue that gets deleted, the request may still be outstanding after flux queue stop --all. Reloading the scheduler would take care of it, as would stopping the queue before deleting it. Anyway, this may be where we should look, especially if fluke's queues were recently reconfigured.
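
To make the gap concrete, here is a self-contained toy model of that bookkeeping (purely illustrative; the queue names and single-request setup are assumptions, not flux-core code): stopping a queue cancels the sched.alloc requests tagged with that queue, but deleting the queue in a reconfiguration skips the cancel, so a later stop of the remaining queues never touches the orphaned request and the pending count stays above zero.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* One outstanding sched.alloc request, tagged with the queue it was submitted to */
struct alloc_request {
    const char *queue;
    bool outstanding;
};

/* Model of 'flux queue stop NAME': cancel pending requests for that queue */
static void queue_stop (struct alloc_request *reqs, int n, const char *name)
{
    for (int i = 0; i < n; i++)
        if (strcmp (reqs[i].queue, name) == 0)
            reqs[i].outstanding = false;
}

static int alloc_pending_count (const struct alloc_request *reqs, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (reqs[i].outstanding)
            count++;
    return count;
}

int main (void)
{
    /* A job in "batch" has an alloc request pending to the scheduler */
    struct alloc_request reqs[] = { { .queue = "batch", .outstanding = true } };

    /* Reconfiguration deletes "batch" without canceling the request, so
     * 'flux queue stop --all' afterward only covers the queues that still
     * exist (here just "debug") and the orphaned request survives. */
    queue_stop (reqs, 1, "debug");

    printf ("%d alloc requests pending to scheduler\n",
            alloc_pending_count (reqs, 1));   /* prints 1 */
    return 0;
}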

garlick commented May 24, 2024

Closing. The problem as stated in this issue's description was resolved by flux-framework/flux-sched#1209.

garlick closed this as completed May 24, 2024