Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add driver related failure info when job fails #7836

Merged
merged 1 commit into from
May 24, 2024

Conversation

xjules
Copy link
Contributor

@xjules xjules commented May 5, 2024

Issue
Resolves #7759

Approach
This adds two information explicitly to the logs:

  1. when driver.submit fails (ie. non-zero exit code) the error message will be kept and logged later on in handle_failure in job.py
  2. when job_runner fails, the stdout and stderr files (if provided by the queue), will be read and logged on handle_failure in job.py.

(Screenshot of new behavior in GUI if applicable)

  • PR title captures the intent of the changes, and is fitting for release notes.
  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • Make sure tests pass locally (after every commit!)

When applicable

  • When there are user facing changes: Updated documentation
  • New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
  • Large PR: Prepare changes in small commits for more convenient review
  • Bug fix: Add regression test for the bug
  • Bug fix: Create Backport PR to latest release

@xjules xjules self-assigned this May 5, 2024
@xjules xjules changed the title WIP: add driver status message to FinishedEvent WIP: add driver status message when job fails May 5, 2024
@xjules xjules marked this pull request as ready for review May 13, 2024 07:46
@xjules xjules changed the title WIP: add driver status message when job fails Add driver status message when job fails to submit May 13, 2024
@xjules xjules force-pushed the msg_finished branch 2 times, most recently from e9fa302 to 470a8a4 Compare May 13, 2024 07:50
@codecov-commenter
Copy link

codecov-commenter commented May 14, 2024

Codecov Report

Attention: Patch coverage is 46.66667% with 16 lines in your changes are missing coverage. Please review.

Project coverage is 85.75%. Comparing base (54cc5d6) to head (93d1994).
Report is 1 commits behind head on main.

Files Patch % Lines
src/ert/scheduler/lsf_driver.py 20.00% 16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7836      +/-   ##
==========================================
- Coverage   85.81%   85.75%   -0.06%     
==========================================
  Files         378      378              
  Lines       23075    23103      +28     
  Branches      636      631       -5     
==========================================
+ Hits        19801    19812      +11     
- Misses       3201     3213      +12     
- Partials       73       78       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@xjules xjules changed the title Add driver status message when job fails to submit Add driver related failure info when job fails May 15, 2024
@xjules xjules force-pushed the msg_finished branch 2 times, most recently from ad46cb7 to ea304f0 Compare May 15, 2024 11:02
Comment on lines 190 to 196
if (
self.iens in self.driver._job_error_message
and self.driver._job_error_message[self.iens]
):
error_msg += (
f"\n\tDriver reported: {self.driver._job_error_message[self.iens]}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not enough to only check if error message for specific iens exists?

Suggested change
if (
self.iens in self.driver._job_error_message
and self.driver._job_error_message[self.iens]
):
error_msg += (
f"\n\tDriver reported: {self.driver._job_error_message[self.iens]}"
)
if (iens_error_msg := self.driver._job_error_message.get(self.iens)):
error_msg += f"\n\tDriver reported: {iens_error_msg}"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially those are the same. Yes, it should be enough.

@jonathan-eq
Copy link
Contributor

👍

@berland
Copy link
Contributor

berland commented May 16, 2024

If this PR solves the referred issue, the issue should be rephrased and leftover things to do should be in a new issue?

src/ert/scheduler/job.py Outdated Show resolved Hide resolved


@pytest.mark.parametrize("tail_chars_to_read", [(5), (50), (500)])
async def test_lsf_read_output_files(tmp_path, job_name, tail_chars_to_read):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_lsf_can_retrieve_stdout_and_stderr, is that more precise?

@@ -61,6 +63,12 @@ async def poll(self) -> None:
async def finish(self) -> None:
"""make sure that all the jobs / realizations are complete."""

def read_output_files(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is to be used in "case of failure", it should be reflected in the function name.

The driver could have a specific API for stdout and stderr, perhaps a failure message code can be made driver independent based on stdout/stderr api?

@xjules
Copy link
Contributor Author

xjules commented May 16, 2024

If this PR solves the referred issue, the issue should be rephrased and leftover things to do should be in a new issue?

I added definition of done section to do issue.

@xjules xjules added the improvement Something nice to have, that will make life easier for developers or users or both. label May 16, 2024
return error_msg


def get_info_from_text_outfile(file_path: Path, num_chars: int) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is essentially the tail shell command. Maybe just call the function that, or tail_textfile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will update, although @jonathan-eq didn't like the tail in the function name. I'd prefer tail_textfile

return "".join(random.choice(letters) for i in range(size))


@pytest.mark.parametrize("tail_chars_to_read", [(5), (50), (500)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be tested for a number higher than 600 also.

if msg := get_info_from_text_outfile(
stderr_file, num_characters_to_read_from_end
):
error_msg += f"\n\t LSF-err: {msg}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tab characters...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These I kept on purpose but can remove them 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use 4 or 8 explicit spaces if you want to do formatting. Will lineshift and indents always look good? It might make it more difficult to parse the logs, but it will look good to humans.


def get_info_from_text_outfile(file_path: Path, num_chars: int) -> str:
if not file_path.exists():
return f"No output files for {file_path}"
Copy link
Contributor

@berland berland May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line probably has no test coverage

@@ -18,6 +18,8 @@ class Driver(ABC):

def __init__(self, **kwargs: Dict[str, str]) -> None:
self._event_queue: Optional[asyncio.Queue[Event]] = None
# we will keep the error messages coming from the driver
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(should not need this comment)

Copy link
Contributor

@berland berland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe fine to merge and then iterate when we have done more testing of error scenarios. Good Job!

@xjules
Copy link
Contributor Author

xjules commented May 24, 2024

This is the current log output:
image

This includes a log from the actual submit command and the output of the
job files; ie. append logging information from the LSF-out and LSF-err
files into the logging.
@xjules xjules enabled auto-merge (rebase) May 24, 2024 13:46
@xjules xjules merged commit 1119617 into equinor:main May 24, 2024
38 checks passed
@xjules xjules deleted the msg_finished branch May 24, 2024 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Something nice to have, that will make life easier for developers or users or both. scheduler
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve ambiguous logging when max_submit is 1
4 participants