Setting final outputs on active tasks: should kill first #6033

Open
hjoliver opened this issue Mar 22, 2024 · 4 comments
Labels
could be better: Not exactly a bug, but not ideal.
question: Flag this as a question for the next Cylc project meeting.
@hjoliver
Member

hjoliver commented Mar 22, 2024

The new(ish) Cylc 8 interventions proposal allows:

  • setting a final output on a task to abort a retry chain
    • (note this can always be done on a waiting task, between retries)
  • setting a final output on an active task to abandon a "stuck job submission"
    • (presumably the job platform or job runner has gone AWOL)

However, the proposal neglects to cover setting final outputs on active tasks that are not "stuck".

Hence #6025 (don't allow automatic (i.e., clock) expiry of an active task), and this issue.

If an active task is not "stuck" we should require the user to kill the job before setting a final output, because orphaned jobs are:

  • confusing to users who look at their batch queues
  • dangerous - if the task gets retriggered you end up with duplicate jobs

We can easily avoid this by simply not allowing manual setting of final outputs on active tasks.

If a user tries to do it, refer them to cylc kill. Note #5981 could handle both normal and "stuck" tasks: if the job kill fails, set the requested output anyway.
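A minimal sketch of the rule proposed above, not actual Cylc code. The state names, the `FINAL_OUTPUTS` set and the `check_set_final_output` helper are all assumptions made for illustration:

```python
from typing import Optional

# Hypothetical sketch: none of these names are taken from the Cylc code base.
ACTIVE_STATES = {"preparing", "submitted", "running"}
FINAL_OUTPUTS = {"succeeded", "failed", "submit-failed", "expired"}


def check_set_final_output(task_state: str, output: str) -> Optional[str]:
    """Return an error message if the set should be rejected, else None."""
    if output in FINAL_OUTPUTS and task_state in ACTIVE_STATES:
        # Refuse to orphan a live job: point the user at "cylc kill" instead.
        return (
            f"Cannot set final output '{output}' on an active task "
            f"({task_state}): kill the job first with 'cylc kill'."
        )
    # Non-final outputs, or tasks that are not active, are unaffected.
    return None
```

Under this sketch a waiting task between retries can still have a final output set; only active tasks are refused.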

@hjoliver hjoliver added the could be better Not exactly a bug, but not ideal. label Mar 22, 2024
@hjoliver hjoliver added this to the cylc-8.x milestone Mar 22, 2024
@oliver-sanders oliver-sanders added the question Flag this as a question for the next Cylc project meeting. label Mar 22, 2024
@oliver-sanders
Member

Considerations:

  • The task might not be killable.
  • The user may or may not want to try to kill the task.

Possible options to consider, when setting a final output on an active task:

  1. Try to kill by default, if the kill fails, continue anyway.
  2. Kill by default, if the kill fails, reject the set. Provide a --no-kill option.
  3. Don't kill by default, if the kill fails, reject the set. Provide a --kill option.
  4. Other?
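The three concrete options above can be compared with a small decision function. This is only a sketch under stated assumptions: `set_final_output_on_active` and the `try_kill` callback are hypothetical, and the `no_kill` / `kill` arguments stand in for the `--no-kill` / `--kill` options floated in this thread, which are proposals here rather than known existing flags.

```python
from typing import Callable


def set_final_output_on_active(
    option: int,
    try_kill: Callable[[], bool],  # hypothetical: returns True if the kill succeeded
    no_kill: bool = False,         # stands in for the proposed --no-kill (option 2)
    kill: bool = False,            # stands in for the proposed --kill (option 3)
) -> bool:
    """Return True if the requested final output should be set."""
    if option == 1:
        # Try to kill by default; continue regardless of the result.
        try_kill()
        return True
    if option == 2:
        # Kill by default; reject the set if the kill fails, unless --no-kill.
        return True if no_kill else try_kill()
    if option == 3:
        # Don't kill by default; with --kill, reject the set if the kill fails.
        return try_kill() if kill else True
    raise ValueError(f"unknown option: {option}")
```

For an unkillable job (`try_kill` returns False), option 1 always sets the output, option 2 sets it only with `no_kill=True`, and option 3 sets it unless `kill=True` was requested.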

@hjoliver
Member Author

hjoliver commented Mar 22, 2024

Yes I addressed Consideration 1. above:

Note #5981 could handle both normal and "stuck" tasks: if the job kill fails, set the requested output anyway.

Consideration 2. is interesting - can you think of a valid reason for taking that view?

Note I'm not suggesting we extend cylc set to do the (attempted) job kill - see #5981

@hjoliver
Member Author

hjoliver commented Mar 22, 2024

Ah, I wonder if I misinterpreted "stuck job submission" in the proposal.

I took it to mean a task that is "stuck" as submitted or running because the job was hard-killed, the host or job runner is down, or there are network issues.

I suppose it could also mean a job stuck in a batch queue indefinitely (or seemingly indefinitely)? Mind you, that's killable. I can see no good reason to just orphan it and risk a duplicate.

@oliver-sanders
Member

oliver-sanders commented Mar 22, 2024

To the best of my memory:

  • Jobs which poll as submitted but will never run (e.g. "held" in PBS?) and cannot be removed by the user. We've had some strange issues of this ilk.
  • Jobs which cannot be either polled or killed (e.g. network issues).
  • Jobs the batch system reincarnated?
