Bug fix and enhancement for error catching #181

rajeee · 2020-08-28T02:36:58Z

Fixes #158, part 2.

Pull Request Description

Crashes in buildstockbatch during the parallel execution of run_building were being silently discarded, because the function was returning from a 'finally' clause (see https://www.python.org/dev/peps/pep-0601/).
This would result in a completely empty simulation_output directory and no trace of what went wrong.
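
The failure mode is easy to reproduce in isolation. The sketch below is not the buildstockbatch code; `run_swallowed` and `run_logged` are hypothetical names used only to illustrate how a `return` inside `finally` drops the in-flight exception, and how recording the traceback first leaves a trace.

```python
import traceback


def run_swallowed():
    """A `return` inside `finally` discards the exception being propagated."""
    try:
        raise RuntimeError("simulation setup failed")
    finally:
        return "done"  # the RuntimeError is silently dropped here


def run_logged(log):
    """Record the traceback before cleanup so the failure leaves a trace."""
    try:
        raise RuntimeError("simulation setup failed")
    except Exception:
        # Capture the full traceback text, then let the error propagate.
        log.append(traceback.format_exc())
        raise
    finally:
        pass  # cleanup only; no return statement here
```

Calling `run_swallowed()` returns normally, with no sign that anything failed, while `run_logged(log)` both re-raises the error and stores the traceback text in `log`.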

This fix creates traceback{job_id}.out files in the simulation_output directory, containing detailed error logging of what went wrong in each attempted simulation.

To clarify, this was an issue only when the simulation failed because of an error in buildstockbatch; when the simulation fails due to an error in OS, there will be singularity_output.log and other files to help debug.

Checklist

Not all may apply

Code changes (must work)
Tests exercising your feature/bug fix (check coverage report on CircleCI build -> Artifacts)
All other unit tests passing
Update validation for project config yaml file changes
Update existing documentation
Run a small batch run to make sure it all works (local is fine, unless an Eagle specific feature)
Add to the changelog_dev.rst file and propose migration text in the pull request

nmerket

This is close, but the multi write thing should be fixed first.

nmerket · 2020-08-28T21:35:12Z

buildstockbatch/eagle.py

+                with open(traceback_file_path, 'a') as f:
+                    txt = get_error_details()
+                    txt = "\n" + "#" * 20 + "\n" + f"Traceback for building{i}\n" + txt
+                    f.write(txt)
+                    del txt


This could be a problem. This code is run in parallel across several processes. More than one of those processes could be trying to write to this file at the same time. What about writing to different files and then concatenating them when moving to lustre?
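
The alternative suggested here could look roughly like the following. This is a sketch under assumptions, not code from the PR: the function names, the `.part` suffix, and the combined file name are all hypothetical.

```python
from pathlib import Path


def write_worker_traceback(out_dir, worker_id, text):
    """Each worker appends only to its own file, so writers never collide."""
    path = Path(out_dir) / f"traceback_{worker_id}.part"
    with open(path, "a") as f:
        f.write(text)
    return path


def concatenate_tracebacks(out_dir, combined_name="traceback_all.out"):
    """Merge the per-worker files into one file once all workers are done."""
    out_dir = Path(out_dir)
    combined = out_dir / combined_name
    with open(combined, "w") as dst:
        # Sort by file name so the combined file has a stable order.
        for part in sorted(out_dir.glob("traceback_*.part")):
            dst.write(part.read_text())
            part.unlink()  # remove the per-worker file after merging
    return combined
```

The merge step would run in a single process after the parallel work finishes, for example just before the results are moved.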

rajeee · 2020-08-28

You can open the same file for append from two different processes simultaneously in Python. And it looks like small appends to a file are atomic on Linux systems: https://stackoverflow.com/questions/1154446/is-file-append-atomic-in-unix, so this should work fine. If not, the worst that will happen is that the writes from two different processes will be interleaved (if they happen to append at exactly the same time and the OS doesn't do the append atomically). This seems pretty low risk.
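
The argument above depends on each record reaching the file in one append. A minimal sketch of that idea follows, assuming small records; `append_record` is a hypothetical helper, not part of buildstockbatch, and it does not verify how any particular filesystem handles concurrent appends.

```python
import os


def append_record(path, building_id, details):
    """Append one traceback record to `path` with a single os.write call."""
    # Build the whole record first so it goes out as one write, not several.
    record = "\n" + "#" * 20 + "\n" + f"Traceback for building{building_id}\n" + details
    data = record.encode("utf-8")
    # O_APPEND asks the OS to position every write at the current end of file.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        # os.write may write fewer bytes than requested; return the count
        # so a caller can detect a short write.
        return os.write(fd, data)
    finally:
        os.close(fd)
```

Assembling the record before writing keeps the number of write calls per record at one, which is the case the linked discussion is about.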

nmerket · 2020-08-28T21:52:39Z

buildstockbatch/test/test_eagle.py

@@ -231,3 +232,41 @@ def test_qos_high_job_submit(mock_subprocess, basic_residential_project_file, mo
        batch.queue_post_processing()
        mock_subprocess.run.assert_called_once()
        assert '--qos=high' in mock_subprocess.run.call_args[0][0]
+
+
+def test_run_building_error_caught(mocker,  basic_residential_project_file):


I added a test. Feel free to expand as you see fit.

rajeee · 2020-09-01T19:59:09Z

@nmerket I tested a small run on Eagle for both cases, where BSB crashes and where it doesn't, and it works fine. It looks fine to merge from my side now.
I will update https://github.com/NREL/buildstockbatch/wiki/Simulation-Troubleshooting after it's merged.

rajeee added 2 commits August 27, 2020 20:28

Bug fix and enhancement for error catching

dce52e0

style fix

5994d9e

rajeee requested a review from nmerket August 28, 2020 05:58

adding a test for this error catching

962fc76

nmerket requested changes Aug 28, 2020

nmerket approved these changes Sep 17, 2020

nmerket merged commit f34d634 into develop Sep 17, 2020

nmerket deleted the error_catching branch September 17, 2020 23:05
