-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add driver related failure info when job fails #7836
Conversation
e9fa302
to
470a8a4
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #7836 +/- ##
==========================================
- Coverage 85.81% 85.75% -0.06%
==========================================
Files 378 378
Lines 23075 23103 +28
Branches 636 631 -5
==========================================
+ Hits 19801 19812 +11
- Misses 3201 3213 +12
- Partials 73 78 +5 ☔ View full report in Codecov by Sentry. |
ad46cb7
to
ea304f0
Compare
src/ert/scheduler/job.py
Outdated
if ( | ||
self.iens in self.driver._job_error_message | ||
and self.driver._job_error_message[self.iens] | ||
): | ||
error_msg += ( | ||
f"\n\tDriver reported: {self.driver._job_error_message[self.iens]}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not enough to only check if error message for specific iens exists?
if ( | |
self.iens in self.driver._job_error_message | |
and self.driver._job_error_message[self.iens] | |
): | |
error_msg += ( | |
f"\n\tDriver reported: {self.driver._job_error_message[self.iens]}" | |
) | |
if (iens_error_msg := self.driver._job_error_message.get(self.iens)): | |
error_msg += f"\n\tDriver reported: {iens_error_msg}" | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially those are the same. Yes, it should be enough.
36d7b04
to
603a6e8
Compare
👍 |
If this PR solves the referred issue, the issue should be rephrased and leftover things to do should be in a new issue? |
|
||
|
||
@pytest.mark.parametrize("tail_chars_to_read", [(5), (50), (500)]) | ||
async def test_lsf_read_output_files(tmp_path, job_name, tail_chars_to_read): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_lsf_can_retrieve_stdout_and_stderr
, is that more precise?
src/ert/scheduler/driver.py
Outdated
@@ -61,6 +63,12 @@ async def poll(self) -> None: | |||
async def finish(self) -> None: | |||
"""make sure that all the jobs / realizations are complete.""" | |||
|
|||
def read_output_files( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is to be used in "case of failure", it should be reflected in the function name.
The driver could have a specific API for stdout and stderr, perhaps a failure message code can be made driver independent based on stdout/stderr api?
I added |
src/ert/scheduler/lsf_driver.py
Outdated
return error_msg | ||
|
||
|
||
def get_info_from_text_outfile(file_path: Path, num_chars: int) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is essentially the tail
shell command. Maybe just call the function that, or tail_textfile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will update, although @jonathan-eq didn't like the tail in the function name. I'd prefer tail_textfile
return "".join(random.choice(letters) for i in range(size)) | ||
|
||
|
||
@pytest.mark.parametrize("tail_chars_to_read", [(5), (50), (500)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be tested for a number higher than 600 also.
src/ert/scheduler/lsf_driver.py
Outdated
if msg := get_info_from_text_outfile( | ||
stderr_file, num_characters_to_read_from_end | ||
): | ||
error_msg += f"\n\t LSF-err: {msg}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tab characters...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These I kept on purpose but can remove them 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use 4 or 8 explicit spaces if you want to do formatting. Will lineshift and indents always look good? It might make it more difficult to parse the logs, but it will look good to humans.
src/ert/scheduler/lsf_driver.py
Outdated
|
||
def get_info_from_text_outfile(file_path: Path, num_chars: int) -> str: | ||
if not file_path.exists(): | ||
return f"No output files for {file_path}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line probably has no test coverage
src/ert/scheduler/driver.py
Outdated
@@ -18,6 +18,8 @@ class Driver(ABC): | |||
|
|||
def __init__(self, **kwargs: Dict[str, str]) -> None: | |||
self._event_queue: Optional[asyncio.Queue[Event]] = None | |||
# we will keep the error messages coming from the driver |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(should not need this comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe fine to merge and then iterate when we have done more testing of error scenarios. Good Job!
This includes a log from the actual submit command and the output of the job files; ie. append logging information from the LSF-out and LSF-err files into the logging.
Issue
Resolves #7759
Approach
This adds two information explicitly to the logs:
(Screenshot of new behavior in GUI if applicable)
When applicable