
Segmentation Fault With Large Runs of gen_lib.py #111

Open
CayenneMatt opened this issue Apr 22, 2024 · 5 comments
Labels: bug (Something isn't working)

@CayenneMatt
Collaborator

Running gen_lib.py with large values of NSAMPS (or NREALS) on a cluster leads to a segmentation fault. The definition of "large" will vary from cluster to cluster, but for me runs max out at just under NSAMPS = 2500 (with NREALS = 100).

The error message returned by the Great Lakes cluster reads: mpiexec noticed that process rank 26 with PID 0 on node gl3160 exited on signal 11 (Segmentation fault).

CayenneMatt added the bug (Something isn't working) label Apr 22, 2024
CayenneMatt self-assigned this Apr 22, 2024
@lzkelley
Member

Hmmm. I've been able to run up to ~20k samples with up to ~1k realizations, but the code has changed since then. Still, I suspect it's not something intrinsic to the large numbers. My guess would be a memory error or something like that, although in general the number of samples shouldn't increase the amount of memory being used.

If you check the outputs and logs from individual processors, do you see any other error messages, or hints at where the segmentation fault is coming from?
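
If nothing useful shows up there, one cheap option (assuming gen_lib.py doesn't already do this) is to enable Python's built-in faulthandler near the top of the script, pointed at a per-rank file, so a segfault at least leaves a Python-level traceback behind. A minimal sketch (the log file name is just a placeholder):

```python
# Sketch: dump a Python traceback on SIGSEGV to a per-rank file.
# `faulthandler` is standard library; the log path is a placeholder.
import faulthandler
from mpi4py import MPI

rank = MPI.COMM_WORLD.rank
# keep this file handle open for the lifetime of the process
_fault_file = open(f"faulthandler_rank{rank:03d}.log", "w")
faulthandler.enable(file=_fault_file, all_threads=True)
```

That won't show native (C-level) frames, but it should at least point at the Python call that was in progress when the signal hit.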

@CayenneMatt
Collaborator Author

This is the only line indicating an error that I have been able to find. Here is the longer version of this error message (top part truncated because it's long):

...
  0%|          | 32/20000 [1:02:37<665:48:27, 120.04s/it]
  0%|          | 34/20000 [1:02:38<395:26:30, 71.30s/it]
  0%|          | 33/20000 [1:02:45<477:51:39, 86.16s/it]
  0%|          | 39/20000 [1:02:45<503:08:22, 90.74s/it]
  0%|          | 30/20000 [1:02:47<829:10:27, 149.48s/it]
  0%|          | 31/20000 [1:02:49<708:32:10, 127.73s/it]
  0%|          | 27/20000 [1:02:46<817:06:57, 147.28s/it]--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 65 with PID 1162569 on node gl3357 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The other outputs/logs just stop after their most recent simulation, and the output/sims files look as expected, except there are fewer of them than there should be. For completeness, here are the last few lines of the output file from the most recent run to return this type of error:

06:46:43 DEBUG : Saving 13481 to file | args.gwb_flag=True args.ss_flag=True args.params_flag=True [gen_lib.py:run_sam_at_pspace_num]
06:46:43 DEBUG : data has keys: ['fobs_cents', 'fobs_edges', 'sspar', 'bgpar', 'hc_ss', 'hc_bg', 'gwb'] [gen_lib.py:run_sam_at_pspace_num]
06:46:43 INFO : Saved to /home/cayenne/output/Evolution_up_to_3classic-phenom-uniform_new_n20000_r100_f20/sims/sam-lib__p013481.npz, size 533.3 KiB after 0:02:39.140451 [gen_lib.py:run_sam_at_pspace_num]
06:46:43 INFO : comm.rank=0 par_num=4560 [gen_lib.py:main]
06:46:43 INFO : gsmf_phi0_log10=-2.6189e+00, gsmf_mchar0_log10=1.2246e+01, mmb_mamp_log10=8.8216e+00, mmb_scatter_dex=6.9975e-01, hard_time=8.4084e+00, hard_gamma_inner=-1.1059e+00 [gen_lib.py:main]
06:46:43 INFO : pnum=4560 :: sim_fname=PosixPath('/home/cayenne/output/Evolution_up_to_3classic-phenom-uniform_new_n20000_r100_f20/sims/sam-lib__p004560.npz') beginning at 2024-04-24 06:46:43.126533 [gen_lib.py:run_sam_at_pspace_num]
06:46:43 DEBUG : Selecting `sam` and `hard` instances [gen_lib.py:run_sam_at_pspace_num]
06:46:43 DEBUG : params 4560 :: {'gsmf_phi0_log10': -2.6188518474115825, 'gsmf_mchar0_log10': 12.246288505178201, 'mmb_mamp_log10': 8.821624510622835, 'mmb_scatter_dex': 0.6997485137450986, 'hard_time': 8.40837032060143, 'hard_gamma_inner': -1.1058762639444715} [libraries.py:model_for_sample_number]
06:46:44 INFO : Scatter added after 42.528816 sec [sam.py:static_binary_density]
06:46:45 INFO : 	dens aft: (5.65e-29, 3.47e-13, 1.61e-08, 5.47e-04, 1.59e-02, 6.07e-02, 2.38e-01) [sam.py:static_binary_density]
06:46:45 INFO : 	mass: 2.40e-01 ==> 2.52e-01 || change = 5.1893e-02 [sam.py:static_binary_density]
06:46:45 INFO : zeroing out 5.60e+05/7.44e+05 = 7.53e-01 bins stalled from GMT [sam.py:static_binary_density]

I have run some tests to try to narrow down exactly where this issue is coming from, but so far I haven't found a reliable pattern. Runs always fail when I use 4 nodes, but I have had 2-node runs fail this way as well; recently one 2-node run completed all 20,000 samples (and it looks like a second might make it too). The runs that fail (regardless of how many nodes / how much memory I request) always die after 40-50 minutes of wall time. The one consistent thing I see is that they either fail at ~45 minutes or complete, never in between.

I'm still trying to dig into this and I will update if I have a breakthrough.

@lzkelley
Member

Thanks for the additional info. These types of issues are always a huge pain to debug! It still sounds like it could conceivably be a memory issue of some sort. In my experience, memory errors with Python can be sporadic/chaotic, possibly because of its complex garbage-collection system. If you're already using the full memory on each node, you could try requesting full nodes but using only some of the processors (so that the memory per processor is higher).

You could also try adding extra memory-usage logging. There's a function holodeck.librarian.libraries.log_mem_usage that does this.
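
If it's easier to bolt something on directly, here's a rough sketch of the same idea using psutil (not the holodeck helper); you could call it before and after each sample to see whether memory is creeping up per rank:

```python
# Sketch of per-process memory logging using psutil (not the holodeck
# helper).  `log` is whatever logger the script already uses.
import os
import psutil

def log_memory(log, label=""):
    """Log resident and virtual memory of the current process, in GiB."""
    mem = psutil.Process(os.getpid()).memory_info()
    log.info(
        f"memory {label}: rss={mem.rss / 1024**3:.2f} GiB, "
        f"vms={mem.vms / 1024**3:.2f} GiB"
    )
```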

And just to check: you know about the --resume argument when generating libraries, right? That way you don't have to completely restart jobs that fail.

@CayenneMatt
Collaborator Author

I typically request memory per CPU and have seen no difference between, e.g., 3 GB vs 5 GB per CPU; I will look into the memory logging to get more information. It does seem to be somewhat random, but given the all-or-nothing pattern it might be something that only happens in the early stages...

Thanks for checking. I am able to resume jobs and pick up where things left off, but they all still hit the segmentation fault after about the same amount of time. I resumed one run a few times and got up to roughly 7,000 samples, but at that rate I would have to resume ~10 times to reach the full 20,000 samples 🤷

@lzkelley
Member

I don't have much useful to suggest, sorry! But a few grasping-at-straws thoughts:

  • Is it possible that particular parameters are causing the problem? The parameter sets are ordered (and correspond to the output file numbers that are eventually produced), but they are randomized when handed to processors (both which number goes to which processor, and the ordering of the numbers). So it's possible (maybe) that there are one or a few "bad" parameter sets, and when a processor hits one of those, the job dies. If that's the case, then resuming should actually fail faster and faster as more simulations succeed. It would also be consistent with the problem tending to require larger numbers of samples, such that some weird corner of parameter space happens to get explored.

  • You could see if you still get errors when speeding up the simulations, particularly by decreasing the number of realizations and the SAM grid size ('shape'). If you're able to reproduce the error with small nreals and small shape, it could help debugging. For example, you might even be able to run a full library on a single core (or a small number of cores), which would make it easier to find a particular bad parameter set; a rough sketch of that kind of isolation is below.
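
As a starting point for that isolation, something along these lines might work: loop over candidate sample numbers in serial, run each one in its own child process, and check whether the child died on SIGSEGV. This is only a sketch; run_one_sample is a placeholder for whatever call actually builds and runs a single parameter set (e.g. via model_for_sample_number), ideally with small nreals and shape:

```python
# Sketch: hunt for a "bad" parameter set by running each sample in a child
# process, so a segfault only kills the child.  `run_one_sample` is a
# placeholder for whatever actually runs a single sample.
import signal
import multiprocessing as mp

def run_one_sample(pnum):
    # placeholder: build and run the model for sample `pnum`
    # (e.g. via libraries.model_for_sample_number) with small nreals/shape
    pass

def find_bad_samples(pnums):
    bad = []
    for pnum in pnums:
        child = mp.Process(target=run_one_sample, args=(pnum,))
        child.start()
        child.join()
        # a negative exitcode is the signal number that killed the child
        if child.exitcode == -signal.SIGSEGV:
            print(f"sample {pnum} segfaulted")
            bad.append(pnum)
        elif child.exitcode != 0:
            print(f"sample {pnum} exited with code {child.exitcode}")
    return bad

if __name__ == "__main__":
    print(find_bad_samples(range(100)))
```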
