Describe the bug
I have encountered issues post-processing large resstock runs (~1.5MM sims) on Eagle.
Initial error in postprocessing.out: slurmstepd: error: Detected 1 oom-kill event(s) in step 7788440.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
Requesting bigmem nodes circumvented this error. However, the --postprocessonly run then timed out, and it required a large number of bigmem nodes, which prevented the job from being scheduled in a timely manner.
To Reproduce
Run --postprocessonly
2.1 If requesting bigmem nodes, the job will take a very long time to schedule
2.2 If requesting standard nodes, the out-of-memory error will likely occur
Logs
Memory error: /lustre/eaglefs/projects/scout/flex_measures/flex_full/postprocessing_202110291115.out
Time out error: /lustre/eaglefs/projects/scout/flex_measures/flex_full/postprocessing_202111080407.out
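For triage, here is a minimal sketch of how the two failure modes could be told apart from a postprocessing log. It is not part of buildstockbatch; the function name classify_slurm_log is hypothetical. The out-of-memory pattern is copied from the slurmstepd message quoted above, while the "DUE TO TIME LIMIT" marker is an assumption about how Slurm words its time-limit cancellation and may need adjusting.

```python
import re

# Pattern taken from the slurmstepd message quoted in this issue.
OOM_RE = re.compile(r"Detected (\d+) oom-kill event\(s\) in step (\S+) cgroup")

# Assumption: Slurm's time-limit cancellation line contains this phrase.
TIMEOUT_MARKER = "DUE TO TIME LIMIT"


def classify_slurm_log(text):
    """Summarize OOM and time-limit failures found in Slurm log text."""
    # Collect the step IDs (e.g. "7788440.batch") that reported oom-kill events.
    oom_steps = [step for _count, step in OOM_RE.findall(text)]
    return {"oom_steps": oom_steps, "timed_out": TIMEOUT_MARKER in text}


# Example using the error line reported above.
sample = (
    "slurmstepd: error: Detected 1 oom-kill event(s) in step 7788440.batch "
    "cgroup. Some of your processes may have been killed by the cgroup "
    "out-of-memory handler."
)
result = classify_slurm_log(sample)
print(result)
```

To check a real log, read the file contents (for example with pathlib.Path(...).read_text()) and pass them to the function.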
Platform:
Simulation platform: Eagle
buildstockbatch branch: restructure-v3-flex (https://github.com/NREL/buildstockbatch/tree/restructure-v3-flex)
resstock branch: restructure-v3-flex (https://github.com/NREL/resstock/tree/restructure-v3-flex)