
Improve ambiguous logging when max_submit is 1 #7759

Closed
xjules opened this issue Apr 24, 2024 · 5 comments · Fixed by #7836
Assignees
Labels
release-notes:skip If there should be no mention of this in release notes scheduler

Comments

@xjules
Contributor

xjules commented Apr 24, 2024

Currently, even though MAX_SUBMIT is set to 1, we log the failure with "failed after reaching max submit". We should provide a more detailed explanation in job.handle_failure.
Also, job._callback_status_msg might be empty, which produces empty output:

Realization: 29 failed after reaching max submit (1):
	
Realization: 44 failed after reaching max submit (1):
	
Realization: 30 failed after reaching max submit (1):

Definition of done:
If a job fails, we should provide all the relevant information coming from the driver and the queue's stdout and stderr files.

@xjules xjules added release-notes:skip If there should be no mention of this in release notes scheduler labels Apr 24, 2024
@xjules
Contributor Author

xjules commented Apr 25, 2024

Suggestion for what to log:

  • exit code of the last successful / failed job
  • the last known state

@xjules
Contributor Author

xjules commented Apr 29, 2024

Apparently, when mimicking NFS syncing issues, we do not get any logs either. Copied from @eivindjahren's message:


If _ert_forward_model_runner crashes (for instance due to missing jobs.json because of NFS sync issues) then you get no indication of what happened. Just the empty failure message:

Realization: 44 failed after reaching max submit (1):

You can reproduce it with fault injecting not writing the jobs.json file:

--- a/src/ert/enkf_main.py
+++ b/src/ert/enkf_main.py
@@ -231,7 +231,7 @@ def create_run_path(
                     run_context.iteration,
                 )
 
-                json.dump(forward_model_output, fptr)
+                # json.dump(forward_model_output, fptr)
 
     run_context.runpaths.write_runpath_list(
         [run_context.iteration], run_context.active_realizations
 class LegacyEnsemble(Ensemble):
@@ -226,7 +227,7 @@ async def _evaluate_inner(  # pylint: disable=too-many-branches
                 self.min_required_realizations if self.stop_long_running else 0
             )
 
-            queue.add_dispatch_information_to_jobs_file()
+            # queue.add_dispatch_information_to_jobs_file()
             result = await queue.execute(min_required_realizations)
 
         except Exception:

@xjules xjules self-assigned this May 2, 2024
@xjules
Contributor Author

xjules commented May 2, 2024

The logging might already be fixed by 50a4421. We just need to test it.

@jonathan-eq
Contributor

The logging might already be fixed by 50a4421. We just need to test it.

It did not fix it.

@xjules
Contributor Author

xjules commented May 3, 2024

What we should do is "find out" that the job does not run and get the LSF stdout into the logs.
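One possible shape for that (a hypothetical helper; the path argument and logger wiring are assumptions, not ert's actual code):

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def log_lsf_stdout(stdout_path: Path, realization: int) -> None:
    """Surface the LSF stdout file in the logs when a realization fails.

    An empty or missing file is itself a signal that the job never ran,
    e.g. due to NFS sync issues with jobs.json.
    """
    if not stdout_path.exists() or stdout_path.stat().st_size == 0:
        logger.error(
            "Realization %d: no LSF stdout found at %s "
            "(job may never have started; check NFS sync)",
            realization,
            stdout_path,
        )
        return
    logger.error(
        "Realization %d failed; LSF stdout:\n%s",
        realization,
        stdout_path.read_text(errors="replace"),
    )
```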
