Unable to set checkpoints using mpi #5505

Open · zhizhizhuzhu opened this issue Nov 28, 2023 · 7 comments
@zhizhizhuzhu

[Two attached screenshots of the run's terminal output: 微信图片_20231128105019, 微信图片_20231128105014]

error message: mv: cannot stat ‘final-refine2/restart.mesh.new’: No such file or directory

The same calculation runs fine with a single process, but I am unable to set checkpoints when running with multiple processes.
Is anyone else getting this error?
I have been plagued by this problem for a long time and would sincerely appreciate your help!

@bangerth
Contributor

Does the final-refine2 directory exist? And if so, can you show us what is in it?

@zhizhizhuzhu
Author

Yes.
[Screenshot attached]
final-refine2.zip
The attached archive contains the result files.

@tjhei
Member

tjhei commented Nov 29, 2023

What kind of machine is this on? What kind of filesystem is the output on (network/local/etc.)?

@zhizhizhuzhu
Author

> What kind of machine is this on? What kind of filesystem is the output on (network/local/etc.)?

Cloud machines. The output directory is on a network filesystem.

@gassmoeller
Member

Hi @zhizhizhuzhu: I am not completely sure what is happening with your run, but my guess would be that something goes wrong with the parallelization of your run on the cloud system. You can see that in the first screenshot you posted above you get the ASPECT header output twice (there are two lines that say `This is ASPECT ...`). Both of these instances of ASPECT seem to run with 1 MPI process (in your output there are two lines saying `running with 1 MPI process`). This is not what a correct ASPECT run with MPI and 2 processes should look like. Instead you should see 1 output line that says `running with 2 MPI processes`.

So you should probably investigate whether your cloud system supports MPI parallelization, and how to correctly start MPI jobs on that system. On desktop computers you can just do `mpirun -np 2 ./aspect ...` to start ASPECT in parallel, but some clusters or cloud systems require special instructions to start MPI jobs.
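(As a quick way to check the point above, independent of ASPECT: a minimal MPI program that only reports its rank and communicator size shows whether `mpirun` and the MPI library it was built against actually cooperate. This is a generic sketch; the file name `check_mpi.cc` and the `mpicxx` build line are assumptions about the cloud environment, not something from this thread.)

```cpp
// check_mpi.cc -- generic MPI sanity check, not part of ASPECT.
// Build with the same MPI wrapper that was used to build ASPECT, e.g.
//   mpicxx check_mpi.cc -o check_mpi
// and launch it exactly like the real job:
//   mpirun -np 2 ./check_mpi
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // A working setup prints two lines that both report "of 2".
  // Two lines that each report "of 1" mean mpirun started two
  // independent 1-process jobs -- the duplicated-run symptom
  // described in the comment above.
  std::printf("rank %d of %d\n", rank, size);

  MPI_Finalize();
  return 0;
}
```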

@zhizhizhuzhu
Author

> Hi @zhizhizhuzhu: I am not completely sure what is happening with your run, but my guess would be that something goes wrong with the parallelization of your run on the cloud system. [...]

Thanks, I understand what you mean. But when checkpointing is not enabled, the run works normally if I just do `mpirun -np 2 ./aspect ...` on the cloud system. Such a strange problem!

@gassmoeller
Member

> the run works normally if I just do `mpirun -np 2 ./aspect ...`

If your output in these normal cases looks like the one you posted above (the two lines with 1 MPI process each), then that just means your system is computing the same model twice, each model using 1 process. Normally that is fine (except that it wastes the compute time of 1 process), because both processes just add new output files. However, if you use checkpoints, both processes will delete old files and write new files. So you run into the following situation:

  1. MPI process 1 deletes the old files and starts writing new files.
  2. MPI process 2 checks for the existence of the old files and notices that they are missing. Then it crashes.

You should not focus on the checkpoint error itself. Your problem originates in that duplicated model run. If you want to run in parallel, your output has to contain the line `running with 2 MPI processes` (or however many processes you start with). All following errors are just consequences of this problem.
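(To illustrate why the duplicated run breaks checkpointing but not ordinary output: checkpoint files are typically rewritten in place, written under a temporary name and then moved over the old file, and that step is meant to be performed by exactly one process. The sketch below is illustrative only; the file names are taken from the error message in this issue, and it is not ASPECT's actual checkpoint code.)

```cpp
// Illustration only -- not ASPECT's checkpoint code.
// Shows why two *independent* 1-process runs race on checkpoint files,
// while a single correct 2-process run does not.
// Assumes the output directory final-refine2 exists (it does, per the thread).
#include <mpi.h>
#include <filesystem>
#include <fstream>
#include <iostream>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  namespace fs = std::filesystem;
  const fs::path tmp  = "final-refine2/restart.mesh.new"; // names from the error message
  const fs::path dest = "final-refine2/restart.mesh";

  if (rank == 0)   // intended: exactly one process touches the checkpoint files
    {
      // Write the new checkpoint under a temporary name, then move it into place.
      { std::ofstream out(tmp); out << "checkpoint data\n"; }

      std::error_code ec;
      fs::rename(tmp, dest, ec);  // analogous to the failing `mv`
      if (ec)
        std::cerr << "rename failed: " << ec.message() << '\n';
    }

  // In a correct "mpirun -np 2" run only one process has rank 0.
  // In two independent 1-process runs, *both* have rank 0, both enter the
  // block above, and depending on timing one of them finds the temporary
  // file already renamed away by the other -- which is exactly a
  // "cannot stat ... restart.mesh.new: No such file or directory" failure.
  MPI_Finalize();
  return 0;
}
```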
