sdexec: transient units may not be getting cleaned up properly #5902

Open
garlick opened this issue Apr 19, 2024 · 0 comments

I've noticed a few transient units are not getting cleaned up. For example:

[root@tioga20:garlick]# ~garlick/bin/flist
  UNIT                          LOAD   ACTIVE SUB     DESCRIPTION              
  dbus.service                  loaded active running D-Bus User Message Bus   
  imp-shell-9-f2yHNSwmAhHR.service loaded active exited  User workload         >
● imp-shell-9-f2zEnyt2ai3q.service loaded failed failed  User workload      

The first job died with a "failed to create guest ns" exception:

[root@tioga20:garlick]# flux job eventlog -H f2yHNSwmAhHR
[Apr12 10:55] submit userid=27241 urgency=16 flags=0 version=1
[  +0.018449] jobspec-update attributes.system.bank="guests"
[  +0.018531] validate
[  +0.037056] depend
[  +0.037109] priority priority=623
[  +0.123069] alloc
[  +0.123345] prolog-start description="job-manager.prolog"
[  +0.123398] prolog-start description="cray-pals-port-distributor"
[  +0.282366] prolog-finish description="cray-pals-port-distributor" status=0
[  +1.222043] prolog-finish description="job-manager.prolog" status=0
[  +1.232569] start
[  +1.605169] memo uri="ssh://tioga20/var/tmp/lbannusr/flux-7chyxa/local-0"
[Apr12 12:26] exception type="exec" severity=0 userid=763 note="failed to create guest ns: No such file or directory"
[  +0.000119] release ranks="all" final=true
[  +0.000150] free
[  +0.000172] clean

The job ID in the second unit's name (f2zEnyt2ai3q) is not a valid job ID.

As a sanity check, I confirmed that no units are left behind when:

  • a job succeeds
  • a job fails
  • a job is canceled
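
To make that kind of check repeatable, here is a rough sketch that pulls the job IDs out of a unit listing so they can be compared against known jobs. It assumes the unit naming scheme is `imp-shell-<number>-<jobid>.service`, which is inferred only from the two unit names above. `parse_units` and `Leftover` are hypothetical helper names, and the sample input is a trimmed copy of the listing quoted earlier.

```python
import re
from typing import List, NamedTuple

# Assumed naming scheme, inferred only from the two unit names quoted in
# this issue: imp-shell-<number>-<jobid>.service
UNIT_RE = re.compile(r"^imp-shell-(\d+)-(\w+)\.service$")


class Leftover(NamedTuple):
    unit: str    # full unit name
    jobid: str   # job ID embedded in the unit name
    active: str  # ACTIVE column (e.g. "active", "failed")
    sub: str     # SUB column (e.g. "exited", "failed")


def parse_units(listing: str) -> List[Leftover]:
    """Extract imp-shell units from `systemctl list-units`-style text."""
    found = []
    for line in listing.splitlines():
        # Drop the failed-unit marker so the columns line up.
        fields = line.replace("●", " ").split()
        if len(fields) < 4:
            continue
        m = UNIT_RE.match(fields[0])
        if m is None:
            continue  # header line or an unrelated unit such as dbus.service
        found.append(Leftover(fields[0], m.group(2), fields[2], fields[3]))
    return found


SAMPLE = """\
  UNIT                          LOAD   ACTIVE SUB     DESCRIPTION
  dbus.service                  loaded active running D-Bus User Message Bus
  imp-shell-9-f2yHNSwmAhHR.service loaded active exited  User workload
● imp-shell-9-f2zEnyt2ai3q.service loaded failed failed  User workload
"""

for u in parse_units(SAMPLE):
    print(u.jobid, u.active, u.sub)
```

Each extracted ID could then be passed to `flux job eventlog` as shown above. A unit whose job already has a `clean` event, or whose ID is not recognized at all, is a candidate leftover.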