
Improve ambiguous logging when max_submit is 1 #7759

Closed
xjules opened this issue Apr 24, 2024 · 5 comments · Fixed by #7836
Assignees
Labels
release-notes:skip If there should be no mention of this in release notes scheduler

Comments

@xjules
Contributor

xjules commented Apr 24, 2024

Currently, even though MAX_SUBMIT is set to 1, we log the failure with "failed after reaching max submit". We should provide a more detailed explanation in job.handle_failure.
Also, job._callback_status_msg might be empty, which produces empty output:

Realization: 29 failed after reaching max submit (1):
	
Realization: 44 failed after reaching max submit (1):
	
Realization: 30 failed after reaching max submit (1):

Definition of done:
If a job fails, we should provide all the relevant information coming from the driver and the queue's stdout and stderr files.

@xjules xjules added release-notes:skip If there should be no mention of this in release notes scheduler labels Apr 24, 2024
@xjules
Contributor Author

xjules commented Apr 25, 2024

Suggestion for what to log:

  • exit code of the last successful / failed job
  • the last known state

@xjules
Contributor Author

xjules commented Apr 29, 2024

Apparently, when mimicking NFS syncing issues, we do not get any logs either. Copied from @eivindjahren's message:


If _ert_forward_model_runner crashes (for instance due to missing jobs.json because of NFS sync issues) then you get no indication of what happened. Just the empty failure message:

Realization: 44 failed after reaching max submit (1):

You can reproduce it with fault injecting not writing the jobs.json file:

--- a/src/ert/enkf_main.py
+++ b/src/ert/enkf_main.py
@@ -231,7 +231,7 @@ def create_run_path(
                     run_context.iteration,
                 )
 
-                json.dump(forward_model_output, fptr)
+                # json.dump(forward_model_output, fptr)
 
     run_context.runpaths.write_runpath_list(
         [run_context.iteration], run_context.active_realizations
 class LegacyEnsemble(Ensemble):
@@ -226,7 +227,7 @@ async def _evaluate_inner(  # pylint: disable=too-many-branches
                 self.min_required_realizations if self.stop_long_running else 0
             )
 
-            queue.add_dispatch_information_to_jobs_file()
+            # queue.add_dispatch_information_to_jobs_file()
             result = await queue.execute(min_required_realizations)
 
         except Exception:

@xjules xjules self-assigned this May 2, 2024
@xjules
Contributor Author

xjules commented May 2, 2024

The logging might already be fixed by 50a4421. We just need to test it.

@jonathan-eq
Contributor

The logging might already be fixed by 50a4421. We just need to test it.

It did not fix it.

@xjules
Contributor Author

xjules commented May 3, 2024

What we should do is "find out" that the job does not run and get the LSF stdout into the logs.
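One possible shape for that (a hypothetical helper; the path argument and logger wiring are assumptions, not ert's actual code):

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def log_lsf_stdout(stdout_path: Path, realization: int) -> None:
    """Surface the LSF stdout file in the logs when a realization fails.

    An empty or missing file is itself a signal that the job never ran,
    e.g. due to NFS sync issues with jobs.json.
    """
    if not stdout_path.exists() or stdout_path.stat().st_size == 0:
        logger.error(
            "Realization %d: no LSF stdout found at %s "
            "(job may never have started; check NFS sync)",
            realization,
            stdout_path,
        )
        return
    logger.error(
        "Realization %d failed; LSF stdout:\n%s",
        realization,
        stdout_path.read_text(errors="replace"),
    )
```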
