Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sdexec: do not clean up failed units #5960

Open
chu11 opened this issue May 12, 2024 · 2 comments
Open

sdexec: do not clean up failed units #5960

chu11 opened this issue May 12, 2024 · 2 comments

Comments

@chu11
Copy link
Member

chu11 commented May 12, 2024

Per discussion in #5956, it may be useful at times to not cleanup failed systemd units for post analysis later.

The way this was handled in the old libsdprocess was through:

    /* In sdprocess, we require the systemd unit to exist until the                                                                            
     * user cleans it up with sdprocess_systemd_cleanup().  This                                                                               
     * ensures consistent behavior in a number of functions.  For                                                                              
     * example, a function like sdprocess_wait() can be called                                                                                 
     * multiple times.  Therefore, we set RemainAfterExit to true                                                                              
     * for every process we start.                                                                                                             
     */                                                                                                                                        
    if ((ret = sd_bus_message_append (m,                                                                                                       
                                      "(sv)",                                                                                                  
                                      "RemainAfterExit",                                                                                       
                                      "b",                                                                                                     
                                      true)) < 0) {                                                                                            
        set_errno_log (sdp->h, ret, "sd_bus_message_append");                                                                                  
        return -1;                                                                                                                             
    }      

This apparently keeps all units around. This would pile up old units quickly and not be good, so perhaps we could make this a per-job configuration, in case we need to debug a very specific job that is failing in a unique way. Then after that job was done we'd clean up the successful units. Or somewhere in the sdexec module it would do the cleanup of successful units.

@garlick
Copy link
Member

garlick commented May 12, 2024

We do set RemainAfterExit on all units now:

https://github.com/flux-framework/flux-core/blob/master/src/common/libsdexec/start.c#L302

As I recall, successful units do not remain but failed ones do, until ResetFailedUnit is called:

https://github.com/flux-framework/flux-core/blob/master/src/modules/sdexec/sdexec.c#L283

@chu11
Copy link
Member Author

chu11 commented May 12, 2024

Ahh, hmmm then why the failed one from #5956 didn't stick around is a mystery (although it's possible I did not use quite the right systemctl runes ...) neverind, I misread your prior comment.

So perhaps we shouldn't called ResetFailedUnit on some special config.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants