New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug fix and enhancement for error catching #181
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is close, but the multi write thing should be fixed first.
with open(traceback_file_path, 'a') as f: | ||
txt = get_error_details() | ||
txt = "\n" + "#" * 20 + "\n" + f"Traceback for building{i}\n" + txt | ||
f.write(txt) | ||
del txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a problem. This code is run in parallel across several processes. More than one of those processes could be trying to write to this file at the same time. What about writing to different files and then concatenating them when moving to lustre?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simultaneously open the same file for append form two different process in python. And it looks like small appends to the files are atomic in linux system: https://stackoverflow.com/questions/1154446/is-file-append-atomic-in-unix, so this should work fine. If not, the worst that will happen is the writes from two different process will be interleaved (if they happen to append at exactly the same time and the OS does't do the append atomically). This seems pretty low risk.
@@ -231,3 +232,41 @@ def test_qos_high_job_submit(mock_subprocess, basic_residential_project_file, mo | |||
batch.queue_post_processing() | |||
mock_subprocess.run.assert_called_once() | |||
assert '--qos=high' in mock_subprocess.run.call_args[0][0] | |||
|
|||
|
|||
def test_run_building_error_caught(mocker, basic_residential_project_file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test. Feel free to expand as you see fit.
@nmerket I did test on a small run in Eagle for both cases where BSB crashes or doesn't crash, and it works fine. It looks fine to merge from my side now. |
Fixes #158, part 2.
Pull Request Description
The buildstockbatch crashes that happened in the parallel execution of run_building were being silently discarded.
(because the function was returning from 'finally' clause. See: https://www.python.org/dev/peps/pep-0601/)
This would result in a completely empty simulation_output directory and no trace of what went wrong.
This fix will create traceback{job_id).out files in the simulation_outout directory that has detailed error logging for what went wrong for each attempted simulation.
To clarify, this was an issue when the simulation failed because of error in buildstockbatch; when the simulation fails due to error in OS, there will be singularity_output.log and other files to help debug.
Checklist
Not all may apply