Have scheduler lsf driver dump bhist summary to runpath #7794

jonathan-eq · 2024-04-29T08:14:25Z

Issue
Resolves #7694

Approach
The commit in this PR has the lsf_driver dump the output of bhist -l <JOB_ID> to runpath.

(Screenshot of new behavior in GUI if applicable)

PR title captures the intent of the changes, and is fitting for release notes.
Added appropriate release note label
Commit history is consistent and clean, in line with the contribution guidelines.
Make sure tests pass locally (after every commit!)

When applicable

When there are user facing changes: Updated documentation
New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
Large PR: Prepare changes in small commits for more convenient review
Bug fix: Add regression test for the bug
Bug fix: Create Backport PR to latest release

jonathan-eq · 2024-05-03T10:33:29Z

I have manually tested this, and it works :)

codecov-commenter · 2024-05-03T11:30:41Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.39%. Comparing base (bd52a37) to head (89c5349).
Report is 10 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7794      +/-   ##
==========================================
- Coverage   85.41%   85.39%   -0.03%     
==========================================
  Files         381      381              
  Lines       23565    23598      +33     
  Branches      887      888       +1     
==========================================
+ Hits        20128    20151      +23     
- Misses       3330     3333       +3     
- Partials      107      114       +7

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/ert/scheduler/lsf_driver.py

tests/integration_tests/job_queue/test_lsf_driver.py

tests/integration_tests/scheduler/bin/bhist.py

berland · 2024-05-10T13:37:25Z

Since we are adding a file to the runpath that we can never later rename or remove, we should have an approval from @sondreso as well.

sondreso · 2024-05-14T06:18:46Z

I am slightly skeptical to adding this file to the runpath: Contrary to the content of LSF-stderr, the output from bhist -l will be available also after the execution of the job itself. This means that either we or the user can run the command at any time to get the output from bhist (provided that you know the jobid). What is the benefit of writing this file instead?

berland · 2024-05-14T07:05:50Z

This is only for the convencience of the user. The inconvenience is to figure out which number to pick for -n when running this an arbitrary amount of time after finish.

The user is maybe not so interested in this output as Ert developers are, and runpath pollution is a real thing, so a suggested alternative is to log this is as a multiline string from the driver instead.

sondreso · 2024-05-14T08:47:58Z

suggested alternative is to log this is as a multiline string from the driver instead.

I think this is a better option, as we avoid making another file we have to keep in the runpath. Additionally we get logging to Azure for free with this option as well.

jonathan-eq · 2024-05-14T11:34:53Z

This is only for the convencience of the user. The inconvenience is to figure out which number to pick for -n when running this an arbitrary amount of time after finish.

The user is maybe not so interested in this output as Ert developers are, and runpath pollution is a real thing, so a suggested alternative is to log this is as a multiline string from the driver instead.

It will automatically be logged inside _execute_with_retry in this format:

2024-05-14 13:25:01,700 - ert.scheduler.driver - LegacyEnsemble - DEBUG - Command "/global/bin/bhist -l 185509" succeeded with exit code 0, output: "Job <185509>, Job Name <poly.ert-4>, User <jonak>, Project <default>, Command <
                     /private/jonak/ert/test-data/poly_example/poly_out/realiza
                     tion-4/iter-0/.lsf_submit_doabu68u.sh /private/jonak/ert/t
                     est-data/poly_example/poly_out/realization-4/iter-0>
Tue May 14 13:24:47: Submitted from host <st-grid03>, to Queue <normal>, CWD <$
                     HOME/ert/test-data/poly_example>, Output File </private/jo
                     nak/ert/test-data/poly_example/poly_out/realization-4/iter
                     -0/poly.ert-4.LSF-stdout>;
Tue May 14 13:24:47: Dispatched to <st-rsv17-17-09>, Effective RES_REQ <select[
                     (cs)&&(type == any )&&(mem>maxmem*1/12)] order[r15s:pg:bjo
                     bs] span[hosts=1] same[model] >;
Tue May 14 13:24:47: Starting (Pid 83393);
Tue May 14 13:24:47: Running with execution home </private/jonak>, Execution CW
                     D </private/jonak/ert/test-data/poly_example>, Execution P
                     id <83393>;
Tue May 14 13:24:48: Done successfully. The CPU time used is 0.3 seconds;
Tue May 14 13:24:49: Post job process done successfully;

Summary of time in seconds spent in various states by  Tue May 14 13:24:49
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  0        0        1        0        0        0        1", and error: "<empty>"
2024-05-14 13:25:01,700 - ert.scheduler.lsf_driver - LegacyEnsemble - INFO - BHIST SUMMARY FOR REALIZATION 4:
Job <185509>, Job Name <poly.ert-4>, User <jonak>, Project <default>, Command <
                     /private/jonak/ert/test-data/poly_example/poly_out/realiza
                     tion-4/iter-0/.lsf_submit_doabu68u.sh /private/jonak/ert/t
                     est-data/poly_example/poly_out/realization-4/iter-0>
Tue May 14 13:24:47: Submitted from host <st-00003>, to Queue <test_queue>, CWD <$
                     HOME/ert/test-data/poly_example>, Output File </private/jo
                     nak/ert/test-data/poly_example/poly_out/realization-4/iter
                     -0/poly.ert-4.LSF-stdout>;
Tue May 14 13:24:47: Dispatched to <st-rsv>, Effective RES_REQ <select[
                     (cs)&&(type == any )&&(mem>maxmem*1/12)] :span[hosts=1] same[model] >;
Tue May 14 13:24:47: Starting (Pid 811111);
Tue May 14 13:24:47: Running with execution home </private/jonak>, Execution CW
                     D </private/jonak/ert/test-data/poly_example>, Execution P
                     id <811111>;
Tue May 14 13:24:48: Done successfully. The CPU time used is 0.3 seconds;
Tue May 14 13:24:49: Post job process done successfully;

Summary of time in seconds spent in various states by  Tue May 14 13:24:49
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  0        0        1        0        0        0        1

Is this sufficient?

sondreso · 2024-05-14T11:52:47Z

I think debug log level means it will not be propagated to central logger (if that is installed) 🤔

jonathan-eq · 2024-05-15T08:05:40Z

I think debug log level means it will not be propagated to central logger (if that is installed) 🤔

I re-added the logger.info(...) line.
Should I prefix it with something to make it easier to find?

berland · 2024-05-16T10:31:59Z

Is this sufficient?

Is it effectively logged twice??

jonathan-eq · 2024-05-16T10:56:21Z

Is this sufficient?

Is it effectively logged twice??

Yes, to the two different loggers (lsf-driver and driver)

berland · 2024-05-16T12:00:43Z

Yes, to the two different loggers (lsf-driver and driver)

Why do we want that?

jonathan-eq · 2024-05-16T12:09:30Z

Yes, to the two different loggers (lsf-driver and driver)

Why do we want that?

The default logger in execute_with_retries logs to debug. Should I change this to info instead?

jonathan-eq · 2024-05-21T10:15:13Z

@berland I added log_to_debug=True as a parameter to execute_with_retry. I have manually tested that bhist -l output is only logged once, to logger.info.

berland

Nice!

berland · 2024-05-21T10:35:25Z

Remember to squash.

jonathan-eq · 2024-05-21T10:41:36Z

Should I add a "bhist -l summary: " prefix to the logged stdout? @berland
This is the current format:

2024-05-21 12:11:25,434 - ert.scheduler.lsf_driver - LegacyEnsemble - INFO - Job <111111>, Job Name <poly.ert-99>, User <jonak>, Project <default>, Command
                     </private/jonak/ert/test-data/poly_example/poly_out/realiz
                     ation-99/iter-0/.lsf_submit_11111.sh /private/jonak/ert
                     /test-data/poly_example/poly_out/realization-99/iter-0>
Tue May 21 12:07:44: Submitted from host <11111>, to Queue <normal>, CWD <$
                     HOME/ert/test-data/poly_example>, Output File </private/jo
                     nak/ert/test-data/poly_example/poly_out/realization-99/ite
                     r-0/poly.ert-99.LSF-stdout>;
Tue May 21 12:07:44: Dispatched to <111111>, Effective RES_REQ <select[
                     (cs)&&(type == any )&&(mem>maxmem*1/12)] order[r] span[hosts=1] same[model] >;
Tue May 21 12:07:44: Starting (Pid 111111);
Tue May 21 12:07:45: Running with execution home </private/jonak>, Execution CW
                     D </private/jonak/ert/test-data/poly_example>, Execution P
                     id <1111111>;
Tue May 21 12:07:45: Done successfully. The CPU time used is 0.3 seconds;
Tue May 21 12:07:45: Post job process done successfully;

Summary of time in seconds spent in various states by  Tue May 21 12:07:45
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  0        0        1        0        0        0        1

berland · 2024-05-21T11:33:33Z

Should I add a "bhist -l summary: " prefix to the logged stdout? @berland This is the current format:

Yes, I think that makes sense. Don't use the word "summary" since it is ad-hoc invented here, maybe just write "Output from bhist -l" or something?

jonathan-eq · 2024-05-21T12:56:20Z

Should I add a "bhist -l summary: " prefix to the logged stdout? @berland This is the current format:

Yes, I think that makes sense. Don't use the word "summary" since it is ad-hoc invented here, maybe just write "Output from bhist -l" or something?

I think this looks good:

2024-05-21 14:01:18,949 - ert.scheduler.lsf_driver - LegacyEnsemble - INFO - Output from bhist -l: Job <111111>, Job Name <poly.ert-5>, User <jonak>, Project <default>, Command <
                     /private/jonak/ert/test-data/poly_example/poly_out/realiza
                     tion-5/iter-0/.111111.sh /private/jonak/ert/t
                     est-data/poly_example/poly_out/realization-5/iter-0>
Tue May 21 14:01:09: Submitted from host <111111>, to Queue <normal>, CWD <$
                     HOME/ert/test-data/poly_example>, Output File </private/jo
                     nak/ert/test-data/poly_example/poly_out/realization-5/iter
                     -0/poly.ert-5.LSF-stdout>;
Tue May 21 14:01:10: Dispatched to <111111>, Effective RES_REQ <select[
                     (cs)&&(type == any )&&(mem>maxmem*1/12)]>;
Tue May 21 14:01:10: Starting (Pid 111111);
Tue May 21 14:01:10: Running with execution home </private/jonak>, Execution CW
                     D </private/jonak/ert/test-data/poly_example>, Execution P
                     id <111111>;
Tue May 21 14:01:14: Done successfully. The CPU time used is 0.4 seconds;
Tue May 21 14:01:14: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 21 Mbytes;  AVG MEM: 21 Mbytes

Summary of time in seconds spent in various states by  Tue May 21 14:01:14
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  1        0        4        0        0        0        5

jonathan-eq force-pushed the improve-lsf-bhist branch 2 times, most recently from e72660d to a0bcecc Compare April 29, 2024 08:49

jonathan-eq added release-notes:skip If there should be no mention of this in release notes improvement Something nice to have, that will make life easier for developers or users or both. labels Apr 29, 2024

jonathan-eq self-assigned this Apr 29, 2024

jonathan-eq force-pushed the improve-lsf-bhist branch 4 times, most recently from 7a2b476 to 8e4c8ac Compare May 3, 2024 10:28

jonathan-eq marked this pull request as ready for review May 3, 2024 10:29

xjules reviewed May 3, 2024

View reviewed changes

src/ert/scheduler/lsf_driver.py Outdated Show resolved Hide resolved

jonathan-eq force-pushed the improve-lsf-bhist branch 3 times, most recently from 43c9273 to 5293136 Compare May 3, 2024 12:20

xjules reviewed May 3, 2024

View reviewed changes

src/ert/scheduler/lsf_driver.py Outdated Show resolved Hide resolved

jonathan-eq force-pushed the improve-lsf-bhist branch from 5293136 to bfaaf30 Compare May 6, 2024 09:18

jonathan-eq requested a review from xjules May 6, 2024 12:02

xjules reviewed May 6, 2024

View reviewed changes

src/ert/scheduler/lsf_driver.py Outdated Show resolved Hide resolved