This repository has been archived by the owner on Sep 23, 2020. It is now read-only.

Nimbus allows qdels to fail in Pilot #87

Open
oldpatricka opened this issue Feb 15, 2012 · 0 comments

From a bug report by Sharon Goliath:

On a Nimbus 2.8 installation, the services.log file contains a few instances of the following error:

/usr/local/nimbus/var/services.log.litai02.20120118.000000:2012-01-18 22:17:39,092 INFO  workspace.WorkspaceUtil     [ServiceThread-164,runCommand:154] [NIMBUS-EVENT][id-25]: /opt/bin/qdel 4562168.moab01.**.**.**
/usr/local/nimbus/var/services.log.litai02.20120118.000000:qdel: Server could not connect to MOM 4562168.moab01.**.**.**
/usr/local/nimbus/var/services.log.litai02.20120118.000000:2012-01-18 22:17:39,107 ERROR pilot.PilotSlotManagement     [ServiceThread-164,releaseSpaceImpl:1077] Problem calling Torque qdel: return code = 222, stderr = 'qdel: Server could not connect to MOM 4562168.moab01.**.**.**', no stdout

The workspace service removes its record of the VM even though the pilot job has not been successfully terminated.

I think Nimbus should probably retry the qdel a number of times rather than simply logging an error. The current behaviour can leave zombie jobs in the PBS queue.
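
To make the suggestion concrete, here is a minimal sketch of the retry logic in Python. It is only an illustration and not Nimbus code (Nimbus itself is Java). The names `run_command` and `qdel_with_retry`, the attempt count, and the delay are all assumptions. The only details taken from the log above are the `/opt/bin/qdel` path and the fact that a failed qdel returns a non-zero code (222 in this case).

```python
import subprocess
import time


def run_command(argv):
    """Hypothetical helper: run a command and return (return_code, stderr)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stderr.strip()


def qdel_with_retry(job_id, runner=run_command, max_attempts=3,
                    delay_seconds=5.0, sleep=time.sleep,
                    qdel_path="/opt/bin/qdel"):
    """Try qdel up to max_attempts times.

    Returns True as soon as qdel exits with 0, and False if every
    attempt fails. The attempt count and delay are illustrative
    defaults, not values taken from Nimbus.
    """
    for attempt in range(1, max_attempts + 1):
        code, stderr = runner([qdel_path, job_id])
        if code == 0:
            return True
        # Do not sleep after the final attempt; there is nothing left to wait for.
        if attempt < max_attempts:
            sleep(delay_seconds)
    return False
```

Under this sketch, the caller would remove its record of the VM only when the function returns True. On False it could keep the record (or flag the job for later cleanup) so the pilot job is not silently orphaned in the queue.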
