Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: replace cleanup scriptlets with shutdown script #5860

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

garlick
Copy link
Member

@garlick garlick commented Apr 5, 2024

This makes some improvement to the shutdown process:

  • eliminate the flux admin cleanup-push method of registering cleanup tasks in rc1, and instead have a shutdown script similar to other rc scripts
  • add flux shutdown --force which causes the shutdown script to be called with the -f option
  • in the shutdown script, add a 10s timeout to flux queue idle when -f is present
  • in the shutdown script, unload the perilog module before canceling jobs to skip the epilog (@grondo's idea - a good one IMHO)

Todo:

  • tests need to be written
  • document in man pages and admin guide
  • consider running flux shutdown --force in systemctl stop flux
  • I think flux-accounting's use of the "job" test personality is going to be a problem now that flux admin is removed so deal with that

I thought maybe a pause at this point to get feedback would be good. This had a bigger impact on tests than I was expecting, so maybe I took a wrong turn somewhere.

@garlick garlick force-pushed the shutdown branch 2 times, most recently from 69adb5e to 9883eec Compare April 5, 2024 15:00
@garlick
Copy link
Member Author

garlick commented Apr 5, 2024

Fixed some stuff

  • add shutdown script to dist manifest (whoops)
  • dropped commits that remove flux admin cleanup-push
  • temporarily add backwards compat code for cleanup scripts (avoids breaking flux-coral2 and flux-accounting)

@garlick garlick force-pushed the shutdown branch 3 times, most recently from 82bc90c to 579104e Compare April 8, 2024 20:14
@garlick garlick changed the title WIP: improve shutdown procedure WIP: replace cleanup scriptlets with shutdown script Apr 9, 2024
@garlick
Copy link
Member Author

garlick commented Apr 9, 2024

I dropped the actual changes to the shutdown process from this PR (including shutdown --force) leaving only the switch from cleanup-push to a shutdown script since that change alone is a pretty big one.

I think the script is going to be far more maintainable than the "cleanup scriptlets" pushed by rc1, and it opens the possibility of letting flux shutdown pass options to the script. For example, we may want to implement an option for shutdown at a future time that gives jobs more of a soft warning so they can checkpoint. Maybe the --force option should be reserved for when we don't automatically cancel running jobs at shutdown because they can be recovered.

A downside is that this impacts the tests that set rc1,rc3, including some in other projects. (Thus cleanup-push remains as an alias until the other projects can transition)

Problem: the shutdown script will coexist with rc1 and rc3 as
an alternative to the "cleanup" commands currently pushed into broker
memory by rc1, but it has no path configuration.

Add builtin 'shutdown_path' config key, like to 'rc1_path' and 'rc3_path'.
Add 'broker.shutdown_path' broker attribute, like 'broker.rc1_path'
and 'broker.rc3_path'.

Update flux-config(1) and flux-broker-attributes(7).
Problem: the cleanup commands pushed into broker memory in rc1
are not easily maintained or extended.

Create a new "shutdown" script that lives next to rc1 and rc3.
Run this script instead of the "cleanup" commands when the broker
CLEANUP state is entered.  Upon completion of the shutdown script,
the broker transitions to the SHUTDOWN state.
Problem: test "personalites" like "job" and "kvs" select
alternate rc1 and rc3 scripts, but not shutdown.

Require that a personality define a shutdown script even if empty,
and add them for "job" and "kvs".
Problem: some tests assume that clearing the rc1/rc3 paths
is sufficient to bypass all default startup and shutdown activities,
but now the shutdown script path must also be cleared.

Modify test scripts to take the shutdown script into
consideration when modifying rc1/rc3 paths.
Problem: a function call with a long parameter list is
not broken to one per line, as per project norms.

Fix line breaks.
Problem: rc1, rc3 scripts are run with shell -c <script> but
they could be executed directly.

Run scripts directly.
Adjust one test.
Problem: some framework projects are using "test personalities"
that call flux admin cleanup-push.

Accept "cleanup" as an alias for "shutdown" to allow those projects
to continue to work while transitioning to the shutdown script.

Note: although these commands are no longer interpreted by the shell,
none of those projects appear to be relying on that.
Copy link

codecov bot commented Apr 17, 2024

Codecov Report

Merging #5860 (587573b) into master (a7d5bb5) will decrease coverage by 0.05%.
Report is 2 commits behind head on master.
The diff coverage is 57.89%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5860      +/-   ##
==========================================
- Coverage   83.31%   83.27%   -0.05%     
==========================================
  Files         514      514              
  Lines       82801    82827      +26     
==========================================
- Hits        68986    68974      -12     
- Misses      13815    13853      +38     
Files Coverage Δ
src/common/libflux/conf.c 85.40% <ø> (ø)
src/broker/state_machine.c 80.91% <50.00%> (+0.73%) ⬆️
src/broker/runat.c 73.61% <62.50%> (-6.05%) ⬇️
src/broker/broker.c 77.92% <56.25%> (-0.27%) ⬇️

... and 14 files with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant