Segmentation faults #704
I see. You probably ran out of memory.
The Smilei output file shows very little memory usage, e.g. 60 GB, while the nodes have 256 GB of memory each. In the past I did encounter memory issues, but then the Smilei output file would also show it.
I agree with @mccoys, it looks like a memory problem. Where did you see a memory occupation of 60 GB? In any case, the memory occupation is always underestimated because of many temporary buffers. A more accurate (but still underestimated) way to measure memory occupation is to use the Performance diagnostic. A possible scenario is that a strong load imbalance drives a peak of memory occupation on a single node and crashes it. I notice that you are using very small patches with respect to your number of threads (more than 100 patches per openMP thread). You can try using larger patches. This should reduce the memory overhead induced by patch communication. If you detect a peak of memory occupation somewhere that crashes a node, you can also consider using the particle merging feature to mitigate that effect.
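For context, both suggestions above (larger patches, Performance diagnostic) are set in the Smilei namelist. A minimal sketch, assuming Smilei's `Main` and `DiagPerformances` blocks; the patch counts and the `every` period are illustrative placeholders, not values from this case:

```python
# Hedged sketch of a Smilei namelist fragment (values are illustrative only).
Main(
    # ... other required Main parameters ...
    # Fewer, larger patches: keep patches-per-thread in the single digits,
    # not hundreds, to limit the patch-communication memory overhead.
    number_of_patches = [64, 64],
)

# The Performance diagnostic reports per-MPI-rank memory, timing and load,
# which is more accurate than the startup memory estimate in the stdout.
DiagPerformances(
    every = 100,
)
```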
It's in the stdout files I attached with this message earlier (see the first message). It says 60 GB. I can use the Performance diagnostic to see if memory is indeed the issue. Last year, I asked about a memory issue and followed up on your suggestion to use larger patches. However, the runtimes got really slow and I couldn't finish simulations even after restarting them a few times. Then I tried a large number of processors, e.g. 35000, for this problem. I could finish simulations in a shorter time, albeit with somewhat low CPU usage. Last year I also tried the particle merging feature, but I couldn't optimize the merging parameters very well for my simulations.
Looking at the memory bandwidth per socket, I see very little memory usage (see the attached file).
If you need small patches for performance, it confirms that your case is strongly imbalanced. It also explains why you have poor CPU usage when scaling. It should show in the Performance diagnostic. Any chance you could use more openMP threads and fewer MPI processes? Or are you already bound by the number of cores per socket on your system?
At the end of the stdout, it says:
Is that used memory or available memory? I ask because in your document, the maximum memory per node appears to be about 50 GB, which is dangerously close to that limit above. |
@mccoys Maximum memory per node is 256 GB. |
Just to add that vectorization does help and compute time improves by 2x. |
Note that load balancing produces a memory spike that can be very substantial. The crash appears at that moment and seems related to MPI not being able to send all the data between MPI processes. Have you tried doing load balancing more/less often?
I do load balancing rather often, every 150 iterations. Should I increase it even more? I can try it tonight.
No, I bet you should reduce it. If you do it rarely, it has to do a lot of patch swaps, meaning a lot of memory allocation. The default is 20, but maybe that is not optimal for your case.
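For reference, the load-balancing period discussed above is set in the namelist's `LoadBalancing` block. A hedged sketch, assuming Smilei's documented parameters; the values are illustrative, not a recommendation for this case:

```python
# Sketch of Smilei's LoadBalancing block (illustrative values).
LoadBalancing(
    initial_balance = True,
    every = 20,        # balancing more often means fewer patch swaps per event
    cell_load = 1.,    # relative computational cost of a cell vs. a particle
)
```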
Ok. I have launched a job and I'll let you know if it works with aggressive load balancing. I have set
Yes the default is 150 according to |
Unfortunately this simulation failed even earlier than before. I attach the
Although the simulation is imbalanced, when I plot the results up to the crash I don't see any unexpected behaviour. Everything seems physical and expected. This is why I'm worried. I asked the technical support and they also seem to suggest that debugging this would be very hard.
I had another quick look at this issue and UCX errors are usually related to MPI or network settings, allowing for different memory or cache amounts for MPI transfers. It is not directly a Smilei issue, so I am closing this. |
Reopening based on an indication from @Tissot11 elsewhere that this is a regression, as it used to work before v5.0. Can you confirm this? Do you have a case we could test?
Yeah, I do have a case... After I switched to Smilei v5.0 last year, I have seen numerous segmentation faults (with 2D simulations) on different machines with different compilers and libraries. Last month, I managed to run the same simulation I complained about at the beginning of this thread with Smilei v4.7, without a segmentation fault or memory-related crash. Because of these widespread segmentation faults, I started using other codes for simulations. If you investigate this issue and we can hope to resolve it quickly, then I can prepare a case and give it to you...
It depends on whether we are able to reproduce the error. If this error requires a large allocation to reproduce, it will of course take longer.
Hi. It is indeed a large simulation and it will be difficult to provide a fix if one is really required. @Tissot11, are you positive that there is a regression and that you observe the crash in an exactly identical configuration as before (same simulation size, number of patches, physical configuration, compiler, MPI module, etc.)? I had a look at the logs you provided and it is indeed an extremely unbalanced simulation. After the last load balancing, the number of patches per MPI rank spans from 176 to 4680!! I assume this puts a lot of pressure on the dynamic load balancing process and MPI exchanges. I would strongly advise dividing your total number of patches by at least a factor of 4. You previously answered that this would slow down your simulation too much. By how much did you decrease your number of patches? Did you check the minimum number of patches per MPI rank? As long as you have at least 24 patches per MPI process (with 12 openMP threads) it should not slow down dramatically. Going below one patch per thread is going too far. P.S.: You may observe a serious slowdown because of cache effects beyond a certain patch size. In that case you could try reducing your number of patches by only a factor of 2. I'd be really surprised if it didn't help, but you can never know for sure :-)
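The rule of thumb above (at least 24 patches per MPI process with 12 openMP threads, i.e. roughly 2 per thread) can be sketched as a quick sanity check. The numbers below are hypothetical, not taken from the failing run:

```python
# Quick sanity check for the patches-per-rank rule of thumb discussed above.
# All values are illustrative, not from the actual failing simulation.

def patches_per_rank_check(total_patches, mpi_ranks, threads_per_rank,
                           min_per_thread=2):
    """Return (patches_per_rank, ok) for a perfectly balanced decomposition."""
    patches_per_rank = total_patches // mpi_ranks
    ok = patches_per_rank >= min_per_thread * threads_per_rank
    return patches_per_rank, ok

# Example: a 64x64 patch grid over 128 MPI ranks, 12 openMP threads each.
per_rank, ok = patches_per_rank_check(64 * 64, 128, 12)
print(per_rank, ok)  # 32 patches per rank, 32 >= 24 -> OK
```

Note that this only checks the perfectly balanced average; in a strongly imbalanced run the minimum per-rank count after load balancing (176 in the logs above) is the number that matters.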
Also, for the particle merging to be efficient, you need to know what the distribution of macro-particles in your most populated patches/cells looks like. I'm still convinced it could be very helpful in your case, but it does require a bit of tuning.
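For context, particle merging in Smilei is configured per species in the namelist. A hedged sketch, assuming the Vranic-style merging parameters described in the Smilei documentation; all values are illustrative and would need the tuning mentioned above:

```python
# Sketch of merging parameters on a Smilei Species block (illustrative values).
Species(
    name = "electrons",
    # ... other required Species parameters ...
    merging_method = "vranic_spherical",
    merge_every = 5,                         # attempt merging every 5 iterations
    merge_min_particles_per_cell = 16,       # only merge in crowded cells
    merge_momentum_cell_size = [16, 16, 16], # momentum-space discretization
)
```

Coarser momentum cells merge more aggressively but distort the distribution more, which is why the momentum distribution in the densest cells matters.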
Indeed, the problems I first reported were with large 2D simulations. One of these simulations, with a larger domain and >25K CPUs, I managed to run with Smilei v4.7 (without any filtering; not so efficient, as you explained, due to the patches) using older Intel compilers and libraries (compiler/intel/2022.0.2 numlib/mkl/2022.0.2 mpi/impi/2021.5.1 lib/hdf5/1.12) on Horeka. I should emphasise that I mostly use interpolation order 4, but sometimes I also use order 2. However, I have now prepared a simple 2D case that I ran on 8 nodes of Hawk at HLRS and on 4 nodes of another HPC machine. To summarize
I fear that newer compilers and the changes made in Smilei 5.0 have some subtle issues, at least for 2D simulations, since in 1D simulations I do not see any issues. I have spent a lot of time trying to run the same and similar 2D simulations with several combinations of libraries and compilers, and spent the last few months talking with technical support, and nothing came of it. This is why I have started using other codes. I will be very happy if we can figure this out so that I can use Smilei for 2D simulations. namelist.py.txt
I had this problem with memory last year. I started using interpolation order 4 and fewer particles per cell, and also launching 4-6 MPI processes with 12 OpenMP threads on a single node. With this approach, I no longer had memory issues, as the memory usage reported by every tool remained below 256 GB per node. However, I sometimes saw memory-related segmentation faults, as I reported before, which you and @mccoys attributed to intermittent memory spikes that I could not catch in any performance monitoring tool. I suspect the problem is with the MPI communication, and that's why segmentation faults have become a very frequent occurrence with these 2D simulations.
|
Sorry! This is a redacted version and I forgot that I still use these parameters later in the diagnostics.
Thanks. I have tried with the dummy values |
@beck-llr would it be possible to have a
Now the new logs are different, so we have to see (errors in the projectors usually mean that particles are not where they are supposed to be).
@mccoys There are already options to tune the dynamic load balancing, like
From my first tests, the problem here now lies within the Friedman filter. I think it has been problematic for a while. This is a good opportunity to have a close look at it.
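For reference, the Friedman filter mentioned above is a time filter on the electric field, enabled through a namelist block. A hedged sketch, assuming Smilei's documented `FieldFilter` interface; the `theta` value is purely illustrative:

```python
# Sketch of enabling the Friedman time filter (illustrative value).
FieldFilter(
    model = "Friedman",
    theta = 0.3,   # filtering strength; 0. effectively disables the filter
)
```

Removing this block (or setting the strength to zero) is a simple way to test whether the crashes are tied to the filter.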
@beck-llr, so should I change the cell_load for my simulations? I never set it in my simulations. As @mccoys says, something automatic to reduce load imbalance would be useful, since most plasma physics simulations develop load imbalance after a short interaction time. With laser-solid interactions, this could be even more demanding than shock simulations... Besides the Friedman filter, I have also seen segmentation faults with different MPI libraries. In general, it would be nice to have Smilei always work with OpenMPI and show no segmentation faults except for obvious, understandable reasons...
Hi,
I could run one simulation with three restarts successfully, but on the fourth restart I see segmentation faults after 6 hours of runtime. I attach the err and out files. Please let me know what the reason for this could be, since the physical results look fine until the crash.
On another machine, running Smilei sometimes triggers a kernel panic bug in the InfiniBand drivers, leading to node failures, as the support team told me. Is this common, and could there be some remedy for avoiding this sort of crash?
tjob_hybrid.err.9777446.txt
tjob_hybrid.out.9777446.txt