Parallel post-processing with CGNS : performance issues #746

eliesaikali · 2024-01-16T20:13:08Z

Dear all,

I recently started using CGNS as a post-processing format for our CFD code.
I am really happy with CGNS: it's well documented and clear to use. However, my objective was to find the best open-source library for writing the results of a simulation to a "single file" in an efficient, parallel way.

After finishing my C++ class, I ran a simple weak-scaling test of my new post-processing class that uses CGNS, and I noted the following: the time cost increases with the number of procs. Here is the graph:

[Figure: weak-scaling graph, not reproduced here]

I don't know if this is expected or typical for CGNS, but I was expecting the cost to stay roughly stable. So what will happen if I run my code on 50000 MPI cores? This will be too costly!

Can anyone give me some advice? Have you seen the same issue, or have you managed to obtain efficient parallel writes with CGNS?

Here are some more details about my code:

* We use a domain-decomposition approach, so I write the data of each proc in parallel.
* I use INDEPENDENT MPI mode, since I can have situations where some processors don't need to write to the file.
* I open the HDF file only once and close it at the end of the simulation (honestly, I don't know whether this is good practice with HDF5 or not).

Thanks
Regards

MicK7 (Maintainer) · 2024-01-17T09:02:53Z

If you have 50000 cores, does it mean that you will be writing 50000 CGNS Zones? If this is the case, then it is expected that the CGNS library will perform badly when creating all the Zones.
The classical pattern is to have a reduced number of zones in the file and to dynamically define zone tiles in memory, so that the tiles can be written in parallel.
For instance, if I have a Zone with lots of cells, then when reading the file I create a distributed representation of the Zone in memory, so each process has a Zone_t restricted to a cell range (so it is not a true CGNS Zone_t anymore, just a chunk of it). Then the partitioner and load balancer can gather, split, and do their job to run the computation on convex partitions of cells.
When writing, some work is done to recombine in the opposite direction, so that each process only writes its part of a zone in parallel.
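As a rough illustration of the "one zone, many chunks" idea above, here is a minimal sketch of how each rank could derive its own contiguous cell range inside a single shared zone. The helper `block_range` is hypothetical (it is not part of CGNS), and an even block split is only one possible assumption; a real partitioner would produce its own ranges. The ranges are inclusive and 1-based, the convention used by the range arguments of the parallel write calls such as `cgp_coord_write_data`.

```python
def block_range(n_global, n_ranks, rank):
    """Return the inclusive, 1-based (first, last) range owned by `rank`.

    Hypothetical helper: splits `n_global` items of one zone into
    contiguous blocks. A rank with no items gets (first, first - 1).
    """
    base, extra = divmod(n_global, n_ranks)
    # The first `extra` ranks each take one additional item.
    offset = rank * base + min(rank, extra)
    n_local = base + (1 if rank < extra else 0)
    return offset + 1, offset + n_local


# Example: 10 cells of a single zone shared by 4 ranks.
for rank in range(4):
    print(rank, block_range(10, 4, rank))
```

With this layout the file contains one zone whose size is the global cell count, and each rank only needs its own `(first, last)` pair when writing its chunk.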

Concerning keeping the file open during the simulation, I will let @brtnfld answer, because I tend to write each time step to a separate file, so I have no feedback on writing all the time steps of a huge computation to the same file. How do you handle corruption in this case?


eliesaikali (Author) · 2024-01-17T09:22:21Z

Hello Mickael, and thanks for your reply!

For the moment, I only write via the CGNS library; I am not yet using the reading API.
If I understand correctly, my code is not structured properly for parallel writing.
Here is what it currently does.

Say I have n procs.

1 - I loop over the n procs and write only the metadata (so all procs write the same information):

* I call cg_zone_write (so yes, I create one zone per proc!)
* I call cg_grid_write
* I call cgp_coord_write
* I call cgp_section_write

2 - Then each proc writes its own data:

* I call cgp_coord_write_data
* I call cgp_elements_write_data

So if I understand correctly, you are telling me that calling cg_zone_write and cg_grid_write inside the loop is bad practice, and that they should be called once (for all processors) with the global nb_elem and nb_som (the numbers of elements and vertices)?
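To make the "global nb_elem" variant concrete, here is a minimal sketch of how per-rank element counts could be turned into ranges inside one shared zone. It assumes every element is owned by exactly one rank; vertices shared between subdomains would first need to be assigned to a single owner, which is not shown. The function `global_ranges` and the sample counts are made up for illustration. In an MPI code, the exclusive prefix sum would typically come from a call such as `MPI_Exscan`.

```python
from itertools import accumulate


def global_ranges(local_counts):
    """Map per-rank counts to inclusive, 1-based (first, last) ranges.

    Hypothetical helper: an idle rank (count 0) gets an empty range,
    reported as (first, first - 1).
    """
    # Exclusive prefix sum: number of items owned by lower ranks.
    offsets = [0] + list(accumulate(local_counts))[:-1]
    return [(off + 1, off + n) for off, n in zip(offsets, local_counts)]


# Made-up local element counts for 4 ranks; rank 1 has nothing to write.
nb_elem_per_rank = [4, 0, 3, 5]
nb_elem_global = sum(nb_elem_per_rank)  # size of the single zone
for rank, (first, last) in enumerate(global_ranges(nb_elem_per_rank)):
    print(rank, first, last)
```

Under these assumptions, the zone is created once with `nb_elem_global`, and each rank uses only its own range when writing its part.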

Thanks


eliesaikali Jan 21, 2024
Author

I tried running the code on a cluster. I see the same issue when the code starts using multiple nodes: both approaches (parallel in a zone or over zones) become expensive. Any hints? The cluster offers 32 procs per node. Everything goes fine up to 32 procs, but CGNS becomes costly on 64 procs, for example.

Is there any way to improve that? Maybe by writing one file per node (I don't know if that is possible)?

Thanks

brtnfld Jan 21, 2024
Maintainer

The CGNS "subfiling" branch uses an HDF5 v1.14 feature, which splits an HDF5 file among the nodes. Each node writes a section of the CGNS/HDF5 file. If you have node-local storage, you can write to that and, in the background, merge the subfiles into a single CGNS/HDF5 file if needed. That feature shines at more than 10k nodes, but it may not improve much for two nodes. There might be another issue if you see a drop-off at two nodes. Does your system create Darshan logs (https://wordpress.cels.anl.gov/darshan/)? Can you give the Lustre(?) parameters that you are using? I happen to be holding an open questions session this Tuesday if you would rather discuss it in person: https://www.hdfgroup.org/call-the-doctor/

Also, not all ranks need to participate in I/O, only those that are members of the MPI communicator used for I/O. That way, you can use either independent or collective mode. Sometimes, the I/O times can vary dramatically between the two.

You can also create the CGNS "skeleton" on one rank, close the file, and open it on all the ranks to write the raw data. You can then have ranks make the write calls but not write any data.
What versions of CGNS and HDF5 are you using?

eliesaikali Jan 22, 2024
Author

Hello Scot and thanks for your answer.
I am using independent MPI mode since I sometimes have processors that do nothing.
At present, I open the CGNS file once in parallel and close it at the end, just before the simulation finishes. All the ranks write the same skeleton.

So first I will test what you proposed, i.e. writing the skeleton only if rank == 0. I hope that this will solve the issue. (I just tested quickly, but I get an MPI error return when using CG_MODE_MODIFY, which is not the case with CG_MODE_WRITE. I hope that I can find the reason for this error!)

And thanks for the link to tomorrow's discussion; I will be there to ask my questions live. Thanks!

brtnfld Jan 22, 2024
Maintainer

Remember, if you use only one rank, call the cg_ APIs, not the cgp_ APIs. Otherwise, you must split the MPI communicator (or use MPI_COMM_SELF) to allow one rank access. Take a look at fopenclose.F90 in ptests.

If a rank has nothing to write, pass NULL for the data buffer.

eliesaikali Jan 22, 2024
Author

Oh thanks, it does in fact work with MPI_COMM_SELF!

But I don't think this is a good approach for my case: I clearly see the simulation becoming slower and slower as the number of time steps increases.

I tested by writing the results at each time step (it's fast at the beginning, but slows down progressively, which is striking). That is clearly the cost of closing/opening at each time step.

Would it be more appropriate to use the subfiling branch, or would the code become more complicated?

MicK7 (Maintainer) · 2024-01-17T12:58:59Z

@brtnfld: Would it be possible to write a list of Zone_t nodes with a single call at the HDF5 level to get better performance?


brtnfld Jan 21, 2024
Maintainer

Currently, the only way to combine I/O at the HDF5 level is by combining multiple datasets into a single I/O request at the MPI-IO level. CGNS has multi-dataset APIs for coordinates, arrays, and fields only.
