Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix and enhancement for error catching #181

Merged
merged 3 commits into from Sep 17, 2020
Merged

Bug fix and enhancement for error catching #181

merged 3 commits into from Sep 17, 2020

Conversation

rajeee
Copy link
Contributor

@rajeee rajeee commented Aug 28, 2020

Fixes #158, part 2.

Pull Request Description

The buildstockbatch crashes that happened in the parallel execution of run_building were being silently discarded.
(because the function was returning from 'finally' clause. See: https://www.python.org/dev/peps/pep-0601/)
This would result in a completely empty simulation_output directory and no trace of what went wrong.

This fix will create traceback{job_id).out files in the simulation_outout directory that has detailed error logging for what went wrong for each attempted simulation.

To clarify, this was an issue when the simulation failed because of error in buildstockbatch; when the simulation fails due to error in OS, there will be singularity_output.log and other files to help debug.

Checklist

Not all may apply

  • Code changes (must work)
  • Tests exercising your feature/bug fix (check coverage report on CircleCI build -> Artifacts)
  • All other unit tests passing
  • Update validation for project config yaml file changes
  • Update existing documentation
  • Run a small batch run to make sure it all works (local is fine, unless an Eagle specific feature)
  • Add to the changelog_dev.rst file and propose migration text in the pull request

@rajeee rajeee requested a review from nmerket August 28, 2020 05:58
Copy link
Member

@nmerket nmerket left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is close, but the multi write thing should be fixed first.

Comment on lines +301 to +305
with open(traceback_file_path, 'a') as f:
txt = get_error_details()
txt = "\n" + "#" * 20 + "\n" + f"Traceback for building{i}\n" + txt
f.write(txt)
del txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a problem. This code is run in parallel across several processes. More than one of those processes could be trying to write to this file at the same time. What about writing to different files and then concatenating them when moving to lustre?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simultaneously open the same file for append form two different process in python. And it looks like small appends to the files are atomic in linux system: https://stackoverflow.com/questions/1154446/is-file-append-atomic-in-unix, so this should work fine. If not, the worst that will happen is the writes from two different process will be interleaved (if they happen to append at exactly the same time and the OS does't do the append atomically). This seems pretty low risk.

@@ -231,3 +232,41 @@ def test_qos_high_job_submit(mock_subprocess, basic_residential_project_file, mo
batch.queue_post_processing()
mock_subprocess.run.assert_called_once()
assert '--qos=high' in mock_subprocess.run.call_args[0][0]


def test_run_building_error_caught(mocker, basic_residential_project_file):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test. Feel free to expand as you see fit.

@rajeee
Copy link
Contributor Author

rajeee commented Sep 1, 2020

@nmerket I did test on a small run in Eagle for both cases where BSB crashes or doesn't crash, and it works fine. It looks fine to merge from my side now.
I will update the https://github.com/NREL/buildstockbatch/wiki/Simulation-Troubleshooting after it's merged.

@nmerket nmerket merged commit f34d634 into develop Sep 17, 2020
@nmerket nmerket deleted the error_catching branch September 17, 2020 23:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Empty simulation output for failed runs in v0.18
2 participants