Post-processing timeout/memory error for large runs #253

Closed
aspeake opened this issue Nov 12, 2021 · 1 comment · Fixed by #258
aspeake commented Nov 12, 2021

Describe the bug
I have encountered issues post-processing large ResStock runs (~1.5 million simulations) on Eagle.

Initial error in postprocessing.out:
slurmstepd: error: Detected 1 oom-kill event(s) in step 7788440.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Requesting bigmem nodes circumvents this error. However, running --postprocessonly that way has timed out, and it requires a large number of bigmem nodes, which prevented the job from being scheduled in a timely manner.

To Reproduce

  1. Project yaml
  2. Run --postprocessonly
    2.1 If requesting bigmem nodes, the job will take a very long time to schedule
    2.2 If requesting standard nodes, the out-of-memory error will likely occur

Logs
Memory error: /lustre/eaglefs/projects/scout/flex_measures/flex_full/postprocessing_202110291115.out
Time out error: /lustre/eaglefs/projects/scout/flex_measures/flex_full/postprocessing_202111080407.out

Platform:

@aspeake aspeake added the bug Something isn't working label Nov 12, 2021
rajeee commented Nov 12, 2021

I think the OOM error is occurring because postprocessing attempts to compose a single gigantic results_csv dataframe here:

results_df = pd.DataFrame(dpouts).rename(columns=to_camelcase)

If we can break this up to process only one results_csv_jobx at a time, peak memory would be bounded by a single job's results, so it should not result in OOM.
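A rough sketch of that per-job idea, using only the standard library so it runs anywhere. Everything here is an assumption for illustration rather than the project's actual code: the `results_job*.json.gz` file layout, the functions `write_job_results` and `postprocess_per_job`, and the `to_camelcase` stand-in are all hypothetical. In the real code each job's rows would presumably go into their own small DataFrame instead of `csv.DictWriter`.

```python
import csv
import gzip
import json
import re
from pathlib import Path


def to_camelcase(name):
    """Hypothetical stand-in for the project's helper: 'a_b' -> 'aB'."""
    head, *rest = re.split(r"[_\s]+", name)
    return head + "".join(part.capitalize() for part in rest)


def write_job_results(job_file, out_dir):
    """Convert one job's results into its own CSV and return the row count."""
    with gzip.open(job_file, "rt", encoding="utf-8") as f:
        dpouts = json.load(f)  # assumed: a list of flat dicts, one per simulation
    rows = [{to_camelcase(k): v for k, v in d.items()} for d in dpouts]
    # Union of keys in first-seen order, since simulations may report different columns.
    columns = list(dict.fromkeys(k for row in rows for k in row))
    out_file = Path(out_dir) / (Path(job_file).name.split(".")[0] + ".csv")
    with open(out_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def postprocess_per_job(results_dir, out_dir):
    """Process job files one at a time so only one job's rows are in memory."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    total = 0
    for job_file in sorted(Path(results_dir).glob("results_job*.json.gz")):
        total += write_job_results(job_file, out_dir)
    return total
```

The trade-off is that the output becomes one file per job rather than a single combined table, so any downstream step that expects one results file would need to read them in sequence (or in parallel) instead.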
