Segmentation faults #704

Open
Tissot11 opened this issue Mar 20, 2024 · 29 comments

@Tissot11

Hi,

I could run one simulation with three restarts successfully, but on the fourth restart I see segmentation faults after 6 hours of runtime. I attach the err and out files. Please let me know what the reason could be, since the physical results look fine up to the crash.

On another machine, running Smilei sometimes triggers a kernel panic bug in the InfiniBand drivers, leading to node failures, as the support team told me. Is this a common occurrence, and are there remedies for avoiding this sort of crash?

tjob_hybrid.err.9777446.txt
tjob_hybrid.out.9777446.txt

@mccoys
Contributor

mccoys commented Mar 20, 2024

I see "Address not mapped to object [0xfffffffffffffffd]" and "failed: Cannot allocate memory" in the error file.

You probably ran out of memory.

@Tissot11
Author

The Smilei output file shows very little memory usage, e.g. 60 GB, while the nodes have 256 GB of memory each. In the past I did encounter memory issues, but then the Smilei output file would also show it.

@beck-llr
Contributor

I agree with @mccoys, it looks like a memory problem. Where did you see a memory occupation of 60 GB?

In any case, the memory occupation is always underestimated because of many temporary buffers. A more accurate (but still underestimated) way to measure memory occupation is to use the Performance diagnostic. A possible scenario is that a strong load imbalance drives a peak of memory occupation on a single node and crashes it.

I notice that you are using very small patches with respect to your number of threads (more than 100 patches per openMP thread). You can try using larger patches. This should reduce the memory overhead induced by patch communication.

If you detect a peak of memory occupation somewhere that crashes a node you can also consider using the particle merging feature to mitigate that effect.
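
For reference, the pieces involved look roughly like this in a Smilei namelist (a minimal sketch with placeholder grid values, to be adapted and checked against the documentation, not the settings of the simulation discussed here):

```python
# Minimal sketch, not the actual namelist of this simulation.
Main(
    geometry = "2Dcartesian",
    cell_length = [0.1, 0.1],
    grid_length = [512., 512.],
    # Fewer, larger patches reduce per-patch communication overhead;
    # aim for a few tens of patches per OpenMP thread rather than >100.
    number_of_patches = [64, 64],
    timestep = 0.05,
    simulation_time = 1000.,
)

# Per-rank / per-patch timing and memory information, useful to spot
# load-imbalance-driven memory peaks.
DiagPerformances(
    every = 100,
)
```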

@Tissot11
Author

Tissot11 commented Mar 20, 2024

It's in the stdout file I attached earlier (see the first message). It says 60 GB. I can use the performance diagnostic to see whether memory is indeed the issue.

Last year I asked about a memory issue and followed up on your suggestion to use larger patches. However, the runtime gets really slow and I couldn't finish the simulations even after restarting them a few times. I then tried a large number of processors, e.g. 35000, for this problem. I could finish the simulations in a shorter time, albeit with somewhat low CPU usage. Last year I also tried the particle merging feature, but I couldn't optimize the merging parameters very well for my simulations.

@Tissot11
Author

Looking at the memory bandwidth per socket, I see very little memory usage (see the attached file).

9777446.pdf

@beck-llr
Contributor

If you need small patches for performance, that confirms that your case is strongly imbalanced. It also explains why you have poor CPU usage when scaling. It should show on the performance diagnostic. Any chance you could use more openMP threads and fewer MPI processes? Or are you already bound by the number of cores per socket on your system?

@mccoys
Contributor

mccoys commented Mar 20, 2024

At the end of the stdout, it says:

Maximum memory per node: 57.321124 GB (defined as MaxRSS*Ntasks/NNodes)

Is that used memory or available memory? I ask because in your document, the maximum memory per node appears to be about 50 GB, which is dangerously close to that limit above.

@Tissot11
Author

Tissot11 commented Mar 20, 2024

@mccoys Maximum memory per node is 256 GB.
@beck-llr it's a collisionless shock simulation, so of course it can be imbalanced. I tried vectorization, SDMD, particle merging and OpenMP tasks to speed things up, but with limited success so far. I'm only using either 4 or 6 MPI processes per node, with 12 or 19 OpenMP threads, on two different machines, because this gives the best performance.

@Tissot11
Author

Just to add that vectorization does help: the compute time improves by 2x.

@mccoys
Contributor

mccoys commented Mar 20, 2024

Note that load balancing produces a memory spike that can be very substantial. The crash appears at that moment and seems related to MPI not being able to send all the data between ranks. Have you tried doing load balancing more/less often?

@Tissot11
Author

I do load balancing rather often, every 150 iterations. Should I increase it even more? I can try that tonight.

@mccoys
Contributor

mccoys commented Mar 20, 2024

No, I bet you should reduce it. If you do it rarely, it has to do a lot of patch swaps, meaning a lot of memory allocation.

The default is 20, but that may not be optimal for your case.

@Tissot11
Author

Tissot11 commented Mar 21, 2024

OK. I have launched a job and I'll let you know if it works with more aggressive load balancing; I have set every=40. Just to be sure: the default load balancing is every=150, as written on the documentation page? I'm using vectorization with every=20.
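
For clarity, the settings described above would correspond roughly to the following namelist blocks (a sketch; parameter names as I read them in the Smilei documentation):

```python
# Load balancing every 40 iterations, as set for this run.
LoadBalancing(
    initial_balance = True,
    every = 40,
)

# Adaptive vectorization, reconfigured every 20 iterations.
Vectorization(
    mode = "adaptive",
    reconfigure_every = 20,
)
```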

@beck-llr
Contributor

beck-llr commented Mar 21, 2024

Yes, the default is 150 according to pyinit.py. Another metric you can monitor is the number of patches per MPI process. You can check it directly in the patch_load.txt file, which displays the number of patches per MPI process after each load-balancing operation. You have a problem if an MPI process ends up with only a couple of patches.
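
If it helps, a small throwaway script along these lines can summarize that file (a sketch only: it assumes each line of patch_load.txt carries the per-rank patch counts as integers, which may differ between Smilei versions):

```python
# Hypothetical helper: report min/max patches per MPI rank after each
# load-balancing step, assuming the counts appear as integers on each line.
import re

with open("patch_load.txt") as f:
    for step, line in enumerate(f):
        counts = [int(x) for x in re.findall(r"\d+", line)]
        if counts:
            print(f"balance #{step}: min={min(counts)}  max={max(counts)}  ranks={len(counts)}")
```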

@Tissot11
Author

Unfortunately this simulation failed even earlier than before. I attach the err, out and patch_load files. From the patch_load file, I see almost 200 to 1000 patches per thread. So I guess this is fine?

Although the simulation is imbalanced, when I plot the results up to the crash I don't see any unexpected behaviour. Everything seems physical and as expected. This is why I'm worried. I asked the technical support team and they also seem to suggest that debugging this would be very hard.

tjob_hybrid.err.9787451.txt
tjob_hybrid.out.9787451.txt

patch_load.txt

@mccoys
Contributor

mccoys commented May 25, 2024

I had another quick look at this issue. UCX errors are usually related to MPI or network settings that allow different memory or cache amounts for MPI transfers. It is not directly a Smilei issue, so I am closing this.

@mccoys
Contributor

mccoys commented Jun 3, 2024

Reopening following an indication from @Tissot11 elsewhere that this is a regression, as it used to work before v5.0. Can you confirm this? Do you have a case we could test?

@mccoys mccoys reopened this Jun 3, 2024
@Tissot11
Author

Tissot11 commented Jun 4, 2024

Yes, I do have a case... After I switched to Smilei v5.0 last year, I have seen numerous segmentation faults (with 2D simulations) on different machines with different compilers and libraries. Last month I managed to run the same simulation I complained about at the beginning of this thread with Smilei v4.7, without a segmentation fault or memory-related crash.

Because of these widespread segmentation faults, I started using other codes for my simulations. If you investigate this issue and we can hope to resolve it quickly, then I can prepare a case and give it to you...

@mccoys
Contributor

mccoys commented Jun 4, 2024

It depends on whether we are able to reproduce the error. If this error requires a large allocation to reproduce, it will of course take longer.

@beck-llr
Contributor

beck-llr commented Jun 4, 2024

Hi. It is indeed a large simulation and it will be difficult to provide a fix if one is really required.

@Tissot11 are you positive that there is a regression and that you observe the crash in a configuration exactly identical to before (same simulation size, number of patches, physical configuration, compiler, MPI module, etc.)?

I had a look at the logs you provided and it is indeed an extremely unbalanced simulation. After the last load balancing, the number of patches per MPI rank spans from 176 to 4680!! I assume this puts a lot of pressure on the dynamic load balancing process and the MPI exchanges.
Moreover, you are using a very high number of patches, which also increases memory and communication overheads. Even 176 is a lot of patches when you have only 12 openMP threads.

I would strongly advise dividing your total number of patches by at least a factor of 4. You previously answered that this would slow down your simulation too much. By how much did you decrease your number of patches? Did you check the minimum number of patches per MPI? As long as you have at least 24 patches per MPI (with 12 openMP threads) it should not slow down dramatically. It is only when you go down to less than one patch per thread that you are going too far.

P.S.: You may observe a serious slowdown because of cache effects beyond a certain patch size. In that case you could try dividing your number of patches by only a factor of 2 instead. I'd be really surprised if it didn't help, but you can never know for sure :-)

@beck-llr
Contributor

beck-llr commented Jun 4, 2024

Also, for the particle merging to be efficient, you need to know what the distribution of your macro-particles looks like in your most populated patches/cells. I'm still convinced it could be very helpful in your case, but it does require a bit of tuning.
Note that the default merge_momentum_cell_size is VERY conservative. Do not hesitate to reduce it significantly. Conversely, make sure that merge_min_particles_per_cell is not too low: you are only interested in merging particles in cells with many more particles than average.
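
For illustration, these knobs live in the Species block; a rough sketch (the values are arbitrary starting points to be tuned, not recommendations):

```python
Species(
    name = "ion",
    # ... usual species definition (mass, charge, density, etc.) ...
    # Vranic-type particle merging, per the parameters discussed above:
    merging_method = "vranic_spherical",
    merge_every = 100,
    # Keep this well above the average particles per cell, so that only
    # the most populated cells are merged.
    merge_min_particles_per_cell = 64,
    # Default is [16, 16, 16]; a coarser momentum grid merges more aggressively.
    merge_momentum_cell_size = [8, 8, 8],
)
```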

@Tissot11
Author

Tissot11 commented Jun 4, 2024

Indeed, the problems I first reported were with large 2D simulations. One of these simulations, with a larger domain and >25K CPUs, I managed to run with Smilei v4.7 (without any filtering, and not so efficiently, as you explained, because of the patches) using an older Intel compiler and libraries (compiler/intel/2022.0.2, numlib/mkl/2022.0.2, mpi/impi/2021.5.1, lib/hdf5/1.12) on Horeka. I should emphasise that I mostly use interpolation order 4, but sometimes I also use order 2.

However, I have now prepared a simple (2D) case that I ran on 8 nodes of Hawk at HLRS and on 4 nodes of another HPC machine. To summarize:

  1. This simulation runs fine with the custom MPT MPI library at HLRS, and with OpenMPI 5.0 and gcc 10.2. However, it starts showing segmentation faults with OpenMPI if I just change the mass ratio and nothing else in the namelist.
  2. Even with the MPT library, it shows segmentation faults (also with Smilei 4.7) if I enable the Friedman filter (see the sketch below). With the Intel MPI library on another machine I get the same segmentation faults. Please see the attached namelist.
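
By "enabling the Friedman filter" we mean roughly this namelist block (block and parameter names per my reading of the Smilei documentation; theta is just an example value, not the one in the attached namelist):

```python
# Friedman temporal filter on the electric field.
# Names per the Smilei documentation; theta is only an example value.
FieldFilter(
    model = "Friedman",
    theta = 0.3,
)
```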

I fear that newer compilers and the changes made in Smilei 5.0 have some subtle issues, at least for 2D simulations, since I do not see any issues in 1D simulations. I have spent a lot of time trying to run the same and similar 2D simulations with several combinations of libraries and compilers, and have spent the last few months talking to technical support, and nothing has come of it. This is why I have started using other codes.

I will be very happy if we could figure this out so that I can use Smilei for 2D simulations.

namelist.py.txt
Shock_test.e2581658.txt
Shock_test.e2581722.txt
Shock_test.e2581744.txt
Shock_test.e2583577.txt
Shock_test.e2583698.txt
tjob_hybrid.err.12833017.txt

@Tissot11
Author

Tissot11 commented Jun 4, 2024

> Also, for the particle merging to be efficient, you need to know what the distribution of your macro-particles looks like in your most populated patches/cells. I'm still convinced it could be very helpful in your case, but it does require a bit of tuning. Note that the default merge_momentum_cell_size is VERY conservative. Do not hesitate to reduce it significantly. Conversely, make sure that merge_min_particles_per_cell is not too low: you are only interested in merging particles in cells with many more particles than average.

I had this problem with memory last year. I started using interpolation order 4 and fewer particles per cell, and launching 4-6 MPI processes with 12 OpenMP threads on a single node. With this approach I no longer had memory issues, as the memory usage reported by every tool remained below 256 GB per node. However, I sometimes saw memory-related segmentation faults, as I reported before, which you and @mccoys attributed to intermittent memory spikes that I could not catch with any performance monitoring tool. I suspect the problem is with the MPI communication, and that's why segmentation faults have become a very frequent occurrence with these 2D simulations.

@beck-llr
Contributor

beck-llr commented Jun 4, 2024

mi and vUPC are undefined in the namelist you provided.

@Tissot11
Author

Tissot11 commented Jun 4, 2024

Sorry! This is a redacted version and I forgot that I still use these parameters later in the diagnostics.

namelist.py.txt

@beck-llr
Contributor

beck-llr commented Jun 4, 2024

Thanks. I have tried with the dummy values mi=50 and vUPC=0.01 and was able to reproduce a problem. I will look into it.
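
For anyone reproducing this with the redacted namelist, the two missing definitions can be stubbed at the top of the file with the dummy values mentioned above (their physical meaning is an assumption):

```python
# Dummy values used to reproduce the issue with the redacted namelist.
mi   = 50.    # presumably the ion-to-electron mass ratio
vUPC = 0.01   # presumably an upstream velocity in units of c
```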

@mccoys
Contributor

mccoys commented Jun 4, 2024

@beck-llr would it be possible to have a maximum_npatch_per_MPI option in the load balancing? It would prevent overloading ranks when there is a strong load imbalance. Maybe this is not the issue here, but the older logs really look like MPI is overloaded.

Now the new logs are different, so we have to see (errors in the projectors usually mean that particles are not where they are supposed to be).

@beck-llr
Contributor

beck-llr commented Jun 5, 2024

@mccoys There are already options to tune the dynamic load balancing, like cell_load for instance, which will influence the min and max number of patches per MPI. In the present case I am more concerned with enforcing a minimum number of patches (which can be achieved by increasing the cell load). In fact the min and the max are linked: if you increase the min, you mechanically decrease the max.
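
For concreteness, this is the kind of knob meant here (a sketch; the value is arbitrary and only illustrates giving cells more weight in the load estimate):

```python
LoadBalancing(
    every = 150,
    # Raising cell_load makes particle-poor patches count more in the load
    # estimate, so patch counts per rank become more uniform (higher minimum,
    # lower maximum). Default is 1.0; 2.0 here is an arbitrary example.
    cell_load = 2.0,
)
```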

From my first tests, the problem now lies within the Friedman filter. I think it has been problematic for a while. This is a good opportunity to take a close look at it.

@Tissot11
Author

Tissot11 commented Jun 5, 2024

@beck-llr, so should I change cell_load for my simulations? I have never set it in my simulations. As @mccoys says, something automatic to reduce load imbalance would be useful, since most plasma physics simulations develop load imbalance after a short interaction time. With laser-solid interactions this could be even more demanding than shock simulations...

Besides the Friedman filter, I have also seen segmentation faults with different MPI libraries. In general, it would be nice to have Smilei always work with OpenMPI and show no segmentation faults, except for obvious, understandable reasons...
