Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Have scheduler lsf driver dump bhist summary to runpath #7794

Merged
merged 1 commit into from
May 22, 2024

Conversation

jonathan-eq
Copy link
Contributor

@jonathan-eq jonathan-eq commented Apr 29, 2024

Issue
Resolves #7694

Approach
The commit in this PR has the lsf_driver dump the output of bhist -l <JOB_ID> to runpath.

(Screenshot of new behavior in GUI if applicable)

  • PR title captures the intent of the changes, and is fitting for release notes.
  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • Make sure tests pass locally (after every commit!)

When applicable

  • When there are user facing changes: Updated documentation
  • New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
  • Large PR: Prepare changes in small commits for more convenient review
  • Bug fix: Add regression test for the bug
  • Bug fix: Create Backport PR to latest release

@jonathan-eq jonathan-eq force-pushed the improve-lsf-bhist branch 2 times, most recently from e72660d to a0bcecc Compare April 29, 2024 08:49
@jonathan-eq jonathan-eq added release-notes:skip If there should be no mention of this in release notes improvement Something nice to have, that will make life easier for developers or users or both. labels Apr 29, 2024
@jonathan-eq jonathan-eq self-assigned this Apr 29, 2024
@jonathan-eq jonathan-eq force-pushed the improve-lsf-bhist branch 4 times, most recently from 7a2b476 to 8e4c8ac Compare May 3, 2024 10:28
@jonathan-eq jonathan-eq marked this pull request as ready for review May 3, 2024 10:29
@jonathan-eq
Copy link
Contributor Author

I have manually tested this, and it works :)

@codecov-commenter
Copy link

codecov-commenter commented May 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.39%. Comparing base (bd52a37) to head (89c5349).
Report is 10 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7794      +/-   ##
==========================================
- Coverage   85.41%   85.39%   -0.03%     
==========================================
  Files         381      381              
  Lines       23565    23598      +33     
  Branches      887      888       +1     
==========================================
+ Hits        20128    20151      +23     
- Misses       3330     3333       +3     
- Partials      107      114       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jonathan-eq jonathan-eq force-pushed the improve-lsf-bhist branch 3 times, most recently from 43c9273 to 5293136 Compare May 3, 2024 12:20
@berland
Copy link
Contributor

berland commented May 10, 2024

Since we are adding a file to the runpath that we can never later rename or remove, we should have an approval from @sondreso as well.

@jonathan-eq jonathan-eq force-pushed the improve-lsf-bhist branch 2 times, most recently from 739e1eb to 486674b Compare May 13, 2024 10:09
@sondreso
Copy link
Collaborator

sondreso commented May 14, 2024

I am slightly skeptical to adding this file to the runpath: Contrary to the content of LSF-stderr, the output from bhist -l will be available also after the execution of the job itself. This means that either we or the user can run the command at any time to get the output from bhist (provided that you know the jobid). What is the benefit of writing this file instead?

@berland
Copy link
Contributor

berland commented May 14, 2024

This is only for the convencience of the user. The inconvenience is to figure out which number to pick for -n when running this an arbitrary amount of time after finish.

The user is maybe not so interested in this output as Ert developers are, and runpath pollution is a real thing, so a suggested alternative is to log this is as a multiline string from the driver instead.

@sondreso
Copy link
Collaborator

suggested alternative is to log this is as a multiline string from the driver instead.

I think this is a better option, as we avoid making another file we have to keep in the runpath. Additionally we get logging to Azure for free with this option as well.

@jonathan-eq
Copy link
Contributor Author

jonathan-eq commented May 14, 2024

This is only for the convencience of the user. The inconvenience is to figure out which number to pick for -n when running this an arbitrary amount of time after finish.

The user is maybe not so interested in this output as Ert developers are, and runpath pollution is a real thing, so a suggested alternative is to log this is as a multiline string from the driver instead.

It will automatically be logged inside _execute_with_retry in this format:

2024-05-14 13:25:01,700 - ert.scheduler.driver - LegacyEnsemble - DEBUG - Command "/global/bin/bhist -l 185509" succeeded with exit code 0, output: "Job <185509>, Job Name <poly.ert-4>, User <jonak>, Project <default>, Command <
                     /private/jonak/ert/test-data/poly_example/poly_out/realiza
                     tion-4/iter-0/.lsf_submit_doabu68u.sh /private/jonak/ert/t
                     est-data/poly_example/poly_out/realization-4/iter-0>
Tue May 14 13:24:47: Submitted from host <st-grid03>, to Queue <normal>, CWD <$
                     HOME/ert/test-data/poly_example>, Output File </private/jo
                     nak/ert/test-data/poly_example/poly_out/realization-4/iter
                     -0/poly.ert-4.LSF-stdout>;
Tue May 14 13:24:47: Dispatched to <st-rsv17-17-09>, Effective RES_REQ <select[
                     (cs)&&(type == any )&&(mem>maxmem*1/12)] order[r15s:pg:bjo
                     bs] span[hosts=1] same[model] >;
Tue May 14 13:24:47: Starting (Pid 83393);
Tue May 14 13:24:47: Running with execution home </private/jonak>, Execution CW
                     D </private/jonak/ert/test-data/poly_example>, Execution P
                     id <83393>;
Tue May 14 13:24:48: Done successfully. The CPU time used is 0.3 seconds;
Tue May 14 13:24:49: Post job process done successfully;

Summary of time in seconds spent in various states by  Tue May 14 13:24:49
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  0        0        1        0        0        0        1", and error: "<empty>"
2024-05-14 13:25:01,700 - ert.scheduler.lsf_driver - LegacyEnsemble - INFO - BHIST SUMMARY FOR REALIZATION 4:
Job <185509>, Job Name <poly.ert-4>, User <jonak>, Project <default>, Command <
                     /private/jonak/ert/test-data/poly_example/poly_out/realiza
                     tion-4/iter-0/.lsf_submit_doabu68u.sh /private/jonak/ert/t
                     est-data/poly_example/poly_out/realization-4/iter-0>
Tue May 14 13:24:47: Submitted from host <st-00003>, to Queue <test_queue>, CWD <$
                     HOME/ert/test-data/poly_example>, Output File </private/jo
                     nak/ert/test-data/poly_example/poly_out/realization-4/iter
                     -0/poly.ert-4.LSF-stdout>;
Tue May 14 13:24:47: Dispatched to <st-rsv>, Effective RES_REQ <select[
                     (cs)&&(type == any )&&(mem>maxmem*1/12)] :span[hosts=1] same[model] >;
Tue May 14 13:24:47: Starting (Pid 811111);
Tue May 14 13:24:47: Running with execution home </private/jonak>, Execution CW
                     D </private/jonak/ert/test-data/poly_example>, Execution P
                     id <811111>;
Tue May 14 13:24:48: Done successfully. The CPU time used is 0.3 seconds;
Tue May 14 13:24:49: Post job process done successfully;

Summary of time in seconds spent in various states by  Tue May 14 13:24:49
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  0        0        1        0        0        0        1

Is this sufficient?

@sondreso
Copy link
Collaborator

I think debug log level means it will not be propagated to central logger (if that is installed) 🤔

@jonathan-eq jonathan-eq force-pushed the improve-lsf-bhist branch 2 times, most recently from e939904 to 430994d Compare May 14, 2024 13:50
@jonathan-eq
Copy link
Contributor Author

I think debug log level means it will not be propagated to central logger (if that is installed) 🤔

I re-added the logger.info(...) line.
Should I prefix it with something to make it easier to find?

@berland
Copy link
Contributor

berland commented May 16, 2024

Is this sufficient?

Is it effectively logged twice??

@jonathan-eq
Copy link
Contributor Author

Is this sufficient?

Is it effectively logged twice??

Yes, to the two different loggers (lsf-driver and driver)

@berland
Copy link
Contributor

berland commented May 16, 2024

Yes, to the two different loggers (lsf-driver and driver)

Why do we want that?

@jonathan-eq
Copy link
Contributor Author

Yes, to the two different loggers (lsf-driver and driver)

Why do we want that?

The default logger in execute_with_retries logs to debug. Should I change this to info instead?

@jonathan-eq
Copy link
Contributor Author

@berland I added log_to_debug=True as a parameter to execute_with_retry. I have manually tested that bhist -l output is only logged once, to logger.info.

Copy link
Contributor

@berland berland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@berland
Copy link
Contributor

berland commented May 21, 2024

Remember to squash.

@jonathan-eq jonathan-eq force-pushed the improve-lsf-bhist branch 2 times, most recently from 71f5050 to d007e20 Compare May 21, 2024 10:38
@jonathan-eq
Copy link
Contributor Author

Should I add a "bhist -l summary: " prefix to the logged stdout? @berland
This is the current format:

2024-05-21 12:11:25,434 - ert.scheduler.lsf_driver - LegacyEnsemble - INFO - Job <111111>, Job Name <poly.ert-99>, User <jonak>, Project <default>, Command
                     </private/jonak/ert/test-data/poly_example/poly_out/realiz
                     ation-99/iter-0/.lsf_submit_11111.sh /private/jonak/ert
                     /test-data/poly_example/poly_out/realization-99/iter-0>
Tue May 21 12:07:44: Submitted from host <11111>, to Queue <normal>, CWD <$
                     HOME/ert/test-data/poly_example>, Output File </private/jo
                     nak/ert/test-data/poly_example/poly_out/realization-99/ite
                     r-0/poly.ert-99.LSF-stdout>;
Tue May 21 12:07:44: Dispatched to <111111>, Effective RES_REQ <select[
                     (cs)&&(type == any )&&(mem>maxmem*1/12)] order[r] span[hosts=1] same[model] >;
Tue May 21 12:07:44: Starting (Pid 111111);
Tue May 21 12:07:45: Running with execution home </private/jonak>, Execution CW
                     D </private/jonak/ert/test-data/poly_example>, Execution P
                     id <1111111>;
Tue May 21 12:07:45: Done successfully. The CPU time used is 0.3 seconds;
Tue May 21 12:07:45: Post job process done successfully;

Summary of time in seconds spent in various states by  Tue May 21 12:07:45
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  0        0        1        0        0        0        1

@berland
Copy link
Contributor

berland commented May 21, 2024

Should I add a "bhist -l summary: " prefix to the logged stdout? @berland This is the current format:

Yes, I think that makes sense. Don't use the word "summary" since it is ad-hoc invented here, maybe just write "Output from bhist -l" or something?

@jonathan-eq
Copy link
Contributor Author

Should I add a "bhist -l summary: " prefix to the logged stdout? @berland This is the current format:

Yes, I think that makes sense. Don't use the word "summary" since it is ad-hoc invented here, maybe just write "Output from bhist -l" or something?

I think this looks good:

2024-05-21 14:01:18,949 - ert.scheduler.lsf_driver - LegacyEnsemble - INFO - Output from bhist -l: Job <111111>, Job Name <poly.ert-5>, User <jonak>, Project <default>, Command <
                     /private/jonak/ert/test-data/poly_example/poly_out/realiza
                     tion-5/iter-0/.111111.sh /private/jonak/ert/t
                     est-data/poly_example/poly_out/realization-5/iter-0>
Tue May 21 14:01:09: Submitted from host <111111>, to Queue <normal>, CWD <$
                     HOME/ert/test-data/poly_example>, Output File </private/jo
                     nak/ert/test-data/poly_example/poly_out/realization-5/iter
                     -0/poly.ert-5.LSF-stdout>;
Tue May 21 14:01:10: Dispatched to <111111>, Effective RES_REQ <select[
                     (cs)&&(type == any )&&(mem>maxmem*1/12)]>;
Tue May 21 14:01:10: Starting (Pid 111111);
Tue May 21 14:01:10: Running with execution home </private/jonak>, Execution CW
                     D </private/jonak/ert/test-data/poly_example>, Execution P
                     id <111111>;
Tue May 21 14:01:14: Done successfully. The CPU time used is 0.4 seconds;
Tue May 21 14:01:14: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 21 Mbytes;  AVG MEM: 21 Mbytes

Summary of time in seconds spent in various states by  Tue May 21 14:01:14
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  1        0        4        0        0        0        5

@jonathan-eq jonathan-eq merged commit 1a8f31f into equinor:main May 22, 2024
38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Something nice to have, that will make life easier for developers or users or both. release-notes:skip If there should be no mention of this in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dump output from bhist -l <lsfjobid>` to runpath
5 participants