Add driver related failure info when job fails #7836

xjules · 2024-05-05T21:22:45Z

Issue
Resolves #7759

Approach
This adds two pieces of information explicitly to the logs:

When driver.submit fails (i.e. returns a non-zero exit code), the error message is kept and logged later in handle_failure in job.py.
When job_runner fails, the stdout and stderr files (if provided by the queue) are read and logged in handle_failure in job.py.
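
A minimal sketch of the flow described above, not the code in this PR: a driver keeps the message from a failed submit per realization, and the failure handler appends it to the logged error. `SketchDriver`, its `submit` signature, and the `handle_failure` signature are simplified stand-ins invented for illustration; only `_job_error_message` and the `Driver reported:` prefix appear in the diff under review.

```python
import logging
from typing import Dict

logger = logging.getLogger(__name__)


class SketchDriver:
    """Hypothetical stand-in for a scheduler driver (not ert's Driver class)."""

    def __init__(self) -> None:
        # Maps realization index (iens) to the last submit error, if any.
        self._job_error_message: Dict[int, str] = {}

    def submit(self, iens: int, exit_code: int, stderr: str) -> bool:
        # A real driver would run the submit command itself; here the
        # outcome is passed in so the sketch stays self-contained.
        if exit_code != 0:
            self._job_error_message[iens] = stderr.strip()
            return False
        return True


def handle_failure(driver: SketchDriver, iens: int) -> str:
    """Build and log the failure message for one realization."""
    error_msg = f"Realization {iens} failed"
    if driver_msg := driver._job_error_message.get(iens):
        error_msg += f"\n\tDriver reported: {driver_msg}"
    logger.error(error_msg)
    return error_msg
```

With this shape, a realization whose submit succeeded simply gets the base message, since no entry exists for it in `_job_error_message`.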

(Screenshot of new behavior in GUI if applicable)

PR title captures the intent of the changes, and is fitting for release notes.
Added appropriate release note label
Commit history is consistent and clean, in line with the contribution guidelines.
Make sure tests pass locally (after every commit!)

When applicable

When there are user facing changes: Updated documentation
New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
Large PR: Prepare changes in small commits for more convenient review
Bug fix: Add regression test for the bug
Bug fix: Create Backport PR to latest release

src/ert/scheduler/driver.py

src/ert/scheduler/lsf_driver.py

codecov-commenter · 2024-05-14T13:45:19Z

Codecov Report

Attention: Patch coverage is 46.66667%, with 16 lines in your changes missing coverage. Please review.

Project coverage is 85.75%. Comparing base (54cc5d6) to head (93d1994).
Report is 1 commit behind head on main.

Files	Patch %	Lines
src/ert/scheduler/lsf_driver.py	20.00%	16 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7836      +/-   ##
==========================================
- Coverage   85.81%   85.75%   -0.06%     
==========================================
  Files         378      378              
  Lines       23075    23103      +28     
  Branches      636      631       -5     
==========================================
+ Hits        19801    19812      +11     
- Misses       3201     3213      +12     
- Partials       73       78       +5


src/ert/scheduler/driver.py

jonathan-eq · 2024-05-16T06:04:13Z

src/ert/scheduler/job.py

+        if (
+            self.iens in self.driver._job_error_message
+            and self.driver._job_error_message[self.iens]
+        ):
+            error_msg += (
+                f"\n\tDriver reported: {self.driver._job_error_message[self.iens]}"
+            )


Is it not enough to only check if error message for specific iens exists?

Suggested change

-        if (
-            self.iens in self.driver._job_error_message
-            and self.driver._job_error_message[self.iens]
-        ):
-            error_msg += (
-                f"\n\tDriver reported: {self.driver._job_error_message[self.iens]}"
-            )
+        if (iens_error_msg := self.driver._job_error_message.get(self.iens)):
+            error_msg += f"\n\tDriver reported: {iens_error_msg}"

Essentially those are the same. Yes, it should be enough.
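
As a quick check of the equivalence discussed here, the sketch below runs both forms of the condition over the three cases that matter: missing key, empty message, and a real message. It assumes `_job_error_message` is a plain `dict` keyed by realization index; the function names and the sample message are made up for the comparison.

```python
from typing import Dict


def with_membership_check(messages: Dict[int, str], iens: int) -> str:
    # Original form: explicit membership test followed by a truthiness test.
    error_msg = "Realization failed"
    if iens in messages and messages[iens]:
        error_msg += f"\n\tDriver reported: {messages[iens]}"
    return error_msg


def with_get(messages: Dict[int, str], iens: int) -> str:
    # Suggested form: dict.get returns None for a missing key, and both
    # None and "" are falsy, so one test covers both cases.
    error_msg = "Realization failed"
    if iens_error_msg := messages.get(iens):
        error_msg += f"\n\tDriver reported: {iens_error_msg}"
    return error_msg


# Missing key, empty message, and a real message.
cases = [({}, 0), ({0: ""}, 0), ({0: "submit command failed"}, 0)]
results_match = all(
    with_membership_check(m, i) == with_get(m, i) for m, i in cases
)
```

The assignment expression requires Python 3.8 or later.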

src/ert/scheduler/lsf_driver.py

tests/unit_tests/scheduler/test_job.py

tests/unit_tests/scheduler/test_lsf_driver.py

src/ert/scheduler/driver.py

jonathan-eq · 2024-05-16T10:11:38Z

👍

berland · 2024-05-16T10:37:08Z

If this PR solves the referred issue, the issue should be rephrased and leftover things to do should be in a new issue?

src/ert/scheduler/job.py

berland · 2024-05-16T10:52:09Z

tests/integration_tests/scheduler/test_lsf_driver.py

+
+
+@pytest.mark.parametrize("tail_chars_to_read", [(5), (50), (500)])
+async def test_lsf_read_output_files(tmp_path, job_name, tail_chars_to_read):


test_lsf_can_retrieve_stdout_and_stderr, is that more precise?

src/ert/scheduler/lsf_driver.py

berland · 2024-05-16T10:57:03Z

src/ert/scheduler/driver.py

@@ -61,6 +63,12 @@ async def poll(self) -> None:
    async def finish(self) -> None:
        """make sure that all the jobs / realizations are complete."""

+    def read_output_files(


If this is to be used in "case of failure", it should be reflected in the function name.

The driver could have a specific API for stdout and stderr, perhaps a failure message code can be made driver independent based on stdout/stderr api?
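
One way the idea in this comment could look, purely as a sketch: each driver exposes stdout and stderr through a small interface, and a single function builds the failure message without knowing which queue system is behind it. `OutputAwareDriver`, `read_stdout`, `read_stderr`, `InMemoryDriver`, and `failure_message` are all hypothetical names; none of them exist in the PR.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class OutputAwareDriver(ABC):
    """Hypothetical interface: every driver can hand back job output."""

    @abstractmethod
    def read_stdout(self, iens: int) -> Optional[str]:
        """Return the job's stdout, or None if the queue provides none."""

    @abstractmethod
    def read_stderr(self, iens: int) -> Optional[str]:
        """Return the job's stderr, or None if the queue provides none."""


class InMemoryDriver(OutputAwareDriver):
    """Toy driver used only to exercise the sketch."""

    def __init__(self, stdout: Dict[int, str], stderr: Dict[int, str]) -> None:
        self._stdout = stdout
        self._stderr = stderr

    def read_stdout(self, iens: int) -> Optional[str]:
        return self._stdout.get(iens)

    def read_stderr(self, iens: int) -> Optional[str]:
        return self._stderr.get(iens)


def failure_message(driver: OutputAwareDriver, iens: int) -> str:
    """Driver-independent: relies only on the two read methods."""
    parts: List[str] = []
    if out := driver.read_stdout(iens):
        parts.append(f"stdout: {out}")
    if err := driver.read_stderr(iens):
        parts.append(f"stderr: {err}")
    return "\n".join(parts)
```

The trade-off is that each driver must then implement both methods, even when its queue system offers no output files, in which case returning `None` keeps the message empty.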

xjules · 2024-05-16T11:15:16Z

If this PR solves the referred issue, the issue should be rephrased and leftover things to do should be in a new issue?

I added a definition-of-done section to the issue.

berland · 2024-05-23T12:53:01Z

src/ert/scheduler/lsf_driver.py

+        return error_msg
+
+
+def get_info_from_text_outfile(file_path: Path, num_chars: int) -> str:


this is essentially the tail shell command. Maybe just call the function that, or tail_textfile

will update, although @jonathan-eq didn't like the tail in the function name. I'd prefer tail_textfile
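
For reference, a character-based `tail` in Python can be as small as the sketch below. It is not the implementation in this PR: the body after the existence check is an assumption, and it reads the whole file, which is simple but would be wasteful for very large output files.

```python
from pathlib import Path


def tail_textfile(file_path: Path, num_chars: int) -> str:
    """Return at most the last num_chars characters of a text file."""
    if not file_path.exists():
        return f"No output files for {file_path}"
    if num_chars <= 0:
        # Guard needed because text[-0:] would return the whole text.
        return ""
    # Undecodable bytes are replaced rather than raising, since scheduler
    # output is not guaranteed to be valid UTF-8.
    text = file_path.read_text(encoding="utf-8", errors="replace")
    return text[-num_chars:]
```

Asking for more characters than the file holds returns the whole file, because Python slicing clamps out-of-range indices.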

berland · 2024-05-23T12:53:46Z

tests/integration_tests/scheduler/test_lsf_driver.py

+    return "".join(random.choice(letters) for i in range(size))
+
+
+@pytest.mark.parametrize("tail_chars_to_read", [(5), (50), (500)])


This should be tested for a number higher than 600 also.

berland · 2024-05-23T12:56:55Z

src/ert/scheduler/lsf_driver.py

+        if msg := get_info_from_text_outfile(
+            stderr_file, num_characters_to_read_from_end
+        ):
+            error_msg += f"\n\t LSF-err: {msg}"


tab characters...

These I kept on purpose but can remove them 👍

Use 4 or 8 explicit spaces if you want to do formatting. Will line breaks and indents always look good? It might make the logs more difficult to parse, but it will look good to humans.
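
A possible way to follow this suggestion is sketched below: indent with explicit spaces via `textwrap.indent`, which also keeps multi-line output aligned. `format_failure`, `INDENT`, and the sample labels are illustrative names, not code from the PR.

```python
import textwrap
from typing import Dict

INDENT = " " * 4  # explicit spaces instead of "\t"


def format_failure(header: str, details: Dict[str, str]) -> str:
    """Put each labelled detail on its own indented line under the header."""
    lines = [header]
    for label, text in details.items():
        # textwrap.indent prefixes every non-blank line, so multi-line
        # output stays aligned under the header.
        lines.append(textwrap.indent(f"{label}: {text}", INDENT))
    return "\n".join(lines)
```

If machine parsing of the logs matters more than readability, an alternative would be to log each detail as a separate single-line record.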

berland · 2024-05-23T12:58:01Z

src/ert/scheduler/lsf_driver.py

+
+def get_info_from_text_outfile(file_path: Path, num_chars: int) -> str:
+    if not file_path.exists():
+        return f"No output files for {file_path}"


This line probably has no test coverage
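
A test for that branch could look roughly like the sketch below. The first two lines of the function are copied from the diff; the final `return` is an assumed stand-in body so that the example runs on its own. The test name and file name are made up, and `tmp_path` is pytest's built-in temporary-directory fixture.

```python
from pathlib import Path


def get_info_from_text_outfile(file_path: Path, num_chars: int) -> str:
    # Existence check as shown in the diff; the rest is a stand-in body.
    if not file_path.exists():
        return f"No output files for {file_path}"
    return file_path.read_text(encoding="utf-8")[-num_chars:]


def test_missing_outfile_is_reported(tmp_path: Path) -> None:
    # The file is never created, so the early-return branch must be taken.
    missing = tmp_path / "does_not_exist.LSF-err"
    expected = f"No output files for {missing}"
    assert get_info_from_text_outfile(missing, 50) == expected
```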

berland · 2024-05-24T10:20:40Z

src/ert/scheduler/driver.py

@@ -18,6 +18,8 @@ class Driver(ABC):

    def __init__(self, **kwargs: Dict[str, str]) -> None:
        self._event_queue: Optional[asyncio.Queue[Event]] = None
+        # we will keep the error messages coming from the driver


(should not need this comment)

berland

Maybe fine to merge and then iterate when we have done more testing of error scenarios. Good Job!

xjules · 2024-05-24T12:57:15Z

This is the current log output:

This includes a log from the actual submit command and the output of the job files, i.e. information from the LSF-out and LSF-err files is appended to the log.

xjules self-assigned this May 5, 2024

xjules added the scheduler label May 5, 2024

xjules changed the title ~~WIP: add driver status message to FinishedEvent~~ WIP: add driver status message when job fails May 5, 2024

xjules mentioned this pull request May 6, 2024

Have scheduler lsf driver dump bhist summary to runpath #7794

Merged

9 tasks

xjules marked this pull request as ready for review May 13, 2024 07:46

xjules changed the title ~~WIP: add driver status message when job fails~~ Add driver status message when job fails to submit May 13, 2024

xjules force-pushed the msg_finished branch 2 times, most recently from e9fa302 to 470a8a4 Compare May 13, 2024 07:50

jonathan-eq reviewed May 13, 2024

src/ert/scheduler/driver.py Outdated

jonathan-eq reviewed May 13, 2024

src/ert/scheduler/lsf_driver.py

jonathan-eq reviewed May 13, 2024

src/ert/scheduler/lsf_driver.py Outdated

xjules force-pushed the msg_finished branch from 2826695 to e4937aa Compare May 14, 2024 13:14

xjules changed the title ~~Add driver status message when job fails to submit~~ Add driver related failure info when job fails May 15, 2024

xjules force-pushed the msg_finished branch 2 times, most recently from ad46cb7 to ea304f0 Compare May 15, 2024 11:02