
Segmentation Fault With Large Runs of gen_lib.py #111

Open
CayenneMatt opened this issue Apr 22, 2024 · 5 comments
Labels: bug (Something isn't working)

@CayenneMatt
Collaborator

Running gen_lib.py with large values of NSAMPS (or NREALS) on a cluster leads to a segmentation fault. The definition of "large" will vary from cluster to cluster, but for me runs max out at just under NSAMPS = 2500 (with NREALS = 100).

The error message returned by the Great Lakes cluster reads: mpiexec noticed that process rank 26 with PID 0 on node gl3160 exited on signal 11 (Segmentation fault).

CayenneMatt added the bug (Something isn't working) label Apr 22, 2024
CayenneMatt self-assigned this Apr 22, 2024
@lzkelley
Member

Hmmm. I've been able to run up to ~20k samples with up to ~1k realizations, but the code has changed since then. Still, I suspect it's not something intrinsic to the large numbers. My guess would be a memory error or something like that, although in general the number of samples shouldn't increase the amount of memory being used.

If you check the outputs and logs from individual processors, do you see any other error messages, or hints at where the segmentation fault is coming from?
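
If nothing useful shows up there, one cheap option (assuming gen_lib.py doesn't already do this) is to enable Python's built-in faulthandler near the top of the script, pointed at a per-rank file, so a segfault at least leaves a Python-level traceback behind. A minimal sketch (the log file name is just a placeholder):

```python
# Sketch: dump a Python traceback on SIGSEGV to a per-rank file.
# `faulthandler` is standard library; the log path is a placeholder.
import faulthandler
from mpi4py import MPI

rank = MPI.COMM_WORLD.rank
# keep this file handle open for the lifetime of the process
_fault_file = open(f"faulthandler_rank{rank:03d}.log", "w")
faulthandler.enable(file=_fault_file, all_threads=True)
```

That won't show native (C-level) frames, but it should at least point at the Python call that was in progress when the signal hit.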

@CayenneMatt
Collaborator Author

This is the only line indicating an error that I have been able to find. Here is the longer version of this error message (top part truncated because it's long):

...
  0%|          | 32/20000 [1:02:37<665:48:27, 120.04s/it]
  0%|          | 34/20000 [1:02:38<395:26:30, 71.30s/it]
  0%|          | 33/20000 [1:02:45<477:51:39, 86.16s/it]
  0%|          | 39/20000 [1:02:45<503:08:22, 90.74s/it]
  0%|          | 30/20000 [1:02:47<829:10:27, 149.48s/it]
  0%|          | 31/20000 [1:02:49<708:32:10, 127.73s/it]
  0%|          | 27/20000 [1:02:46<817:06:57, 147.28s/it]--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 65 with PID 1162569 on node gl3357 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The other outputs/logs just stop after their most recent simulation, and the output/sims files look as expected, except there are fewer of them than there should be. For completeness, here are the last few lines of the output file from the most recent run to return this type of error:

06:46:43 DEBUG : Saving 13481 to file | args.gwb_flag=True args.ss_flag=True args.params_flag=True [gen_lib.py:run_sam_at_pspace_num]
06:46:43 DEBUG : data has keys: ['fobs_cents', 'fobs_edges', 'sspar', 'bgpar', 'hc_ss', 'hc_bg', 'gwb'] [gen_lib.py:run_sam_at_pspace_num]
06:46:43 INFO : Saved to /home/cayenne/output/Evolution_up_to_3classic-phenom-uniform_new_n20000_r100_f20/sims/sam-lib__p013481.npz, size 533.3 KiB after 0:02:39.140451 [gen_lib.py:run_sam_at_pspace_num]
06:46:43 INFO : comm.rank=0 par_num=4560 [gen_lib.py:main]
06:46:43 INFO : gsmf_phi0_log10=-2.6189e+00, gsmf_mchar0_log10=1.2246e+01, mmb_mamp_log10=8.8216e+00, mmb_scatter_dex=6.9975e-01, hard_time=8.4084e+00, hard_gamma_inner=-1.1059e+00 [gen_lib.py:main]
06:46:43 INFO : pnum=4560 :: sim_fname=PosixPath('/home/cayenne/output/Evolution_up_to_3classic-phenom-uniform_new_n20000_r100_f20/sims/sam-lib__p004560.npz') beginning at 2024-04-24 06:46:43.126533 [gen_lib.py:run_sam_at_pspace_num]
06:46:43 DEBUG : Selecting `sam` and `hard` instances [gen_lib.py:run_sam_at_pspace_num]
06:46:43 DEBUG : params 4560 :: {'gsmf_phi0_log10': -2.6188518474115825, 'gsmf_mchar0_log10': 12.246288505178201, 'mmb_mamp_log10': 8.821624510622835, 'mmb_scatter_dex': 0.6997485137450986, 'hard_time': 8.40837032060143, 'hard_gamma_inner': -1.1058762639444715} [libraries.py:model_for_sample_number]
06:46:44 INFO : Scatter added after 42.528816 sec [sam.py:static_binary_density]
06:46:45 INFO : 	dens aft: (5.65e-29, 3.47e-13, 1.61e-08, 5.47e-04, 1.59e-02, 6.07e-02, 2.38e-01) [sam.py:static_binary_density]
06:46:45 INFO : 	mass: 2.40e-01 ==> 2.52e-01 || change = 5.1893e-02 [sam.py:static_binary_density]
06:46:45 INFO : zeroing out 5.60e+05/7.44e+05 = 7.53e-01 bins stalled from GMT [sam.py:static_binary_density]

I have run some tests to try to narrow down exactly where this issue is coming from, but so far I haven't found a reliable pattern. Runs always fail when I use 4 nodes, but I have had 2-node runs fail this way as well; recently one 2-node run completed all 20,000 samples (and it looks like a second might make it too). The runs that fail (regardless of how many nodes / how much memory I request) always die after 40-50 minutes of wall time. The one consistent thing I see is that they either fail at ~45 minutes or complete, never in between.

I'm still trying to dig into this and I will update if I have a breakthrough.

@lzkelley
Member

Thanks for the additional info. These types of issues are always a huge pain to debug! It still sounds like it could conceivably be a memory issue of some sort. In my experience, memory errors with Python can be sporadic/chaotic, possibly because of its complex garbage-collection system. If you're already using the full memory on each node, you could try requesting full nodes but using only some of the processors (so that the memory per processor is higher).

You could also try adding extra memory-usage logging. There's a function holodeck.librarian.libraries.log_mem_usage that does this.
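
If it's easier to bolt something on directly, here's a rough sketch of the same idea using psutil (not the holodeck helper); you could call it before and after each sample to see whether memory is creeping up per rank:

```python
# Sketch of per-process memory logging using psutil (not the holodeck
# helper).  `log` is whatever logger the script already uses.
import os
import psutil

def log_memory(log, label=""):
    """Log resident and virtual memory of the current process, in GiB."""
    mem = psutil.Process(os.getpid()).memory_info()
    log.info(
        f"memory {label}: rss={mem.rss / 1024**3:.2f} GiB, "
        f"vms={mem.vms / 1024**3:.2f} GiB"
    )
```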

And just to check: you know about the --resume argument when generating libraries, right? That way you don't have to completely restart jobs that fail.

@CayenneMatt
Collaborator Author

I typically request memory per CPU and have seen no difference between, e.g., 3 GB vs 5 GB per CPU; I will look into the memory logging to get more information. It does seem to be somewhat random, but given the all-or-nothing pattern it might be something that only happens in the early stages...

Thanks for checking. I am able to resume jobs and pick up where things left off, but they all still hit the segmentation fault after about the same amount of time. I resumed one run a few times and got up to roughly 7,000 samples, but at that rate I would have to resume ~10 times to reach the full 20,000 samples 🤷

@lzkelley
Member

I don't have much useful to suggest, sorry! But a few grasping-at-straws thoughts:

  • Is it possible that particular parameters are causing the problem? The parameter sets are ordered (and correspond to the output file numbers that are eventually produced), but they are randomized when handed to processors (both which number goes to which processor, and the ordering of the numbers). So it's possible (maybe) that there are one or a few "bad" parameter sets, and when a processor hits one of those, the job dies. If that's the case, then resuming should actually fail faster and faster as more simulations succeed. It would also be consistent with the problem tending to require larger numbers of samples, such that some weird corner of parameter space happens to get explored.

  • You could see if you still get errors when speeding up the simulations, particularly by decreasing the number of realizations and the SAM grid size ('shape'). If you're able to reproduce the error with small nreals and small shape, it could help debugging. For example, you might even be able to run a full library on a single core (or a small number of cores), which would make it easier to find a particular bad parameter set; a rough sketch of that kind of isolation is below.
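
As a starting point for that isolation, something along these lines might work: loop over candidate sample numbers in serial, run each one in its own child process, and check whether the child died on SIGSEGV. This is only a sketch; run_one_sample is a placeholder for whatever call actually builds and runs a single parameter set (e.g. via model_for_sample_number), ideally with small nreals and shape:

```python
# Sketch: hunt for a "bad" parameter set by running each sample in a child
# process, so a segfault only kills the child.  `run_one_sample` is a
# placeholder for whatever actually runs a single sample.
import signal
import multiprocessing as mp

def run_one_sample(pnum):
    # placeholder: build and run the model for sample `pnum`
    # (e.g. via libraries.model_for_sample_number) with small nreals/shape
    pass

def find_bad_samples(pnums):
    bad = []
    for pnum in pnums:
        child = mp.Process(target=run_one_sample, args=(pnum,))
        child.start()
        child.join()
        # a negative exitcode is the signal number that killed the child
        if child.exitcode == -signal.SIGSEGV:
            print(f"sample {pnum} segfaulted")
            bad.append(pnum)
        elif child.exitcode != 0:
            print(f"sample {pnum} exited with code {child.exitcode}")
    return bad

if __name__ == "__main__":
    print(find_bad_samples(range(100)))
```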
