Expecting one or two elements per part from split and zsplit #424

Open · eisungy opened this issue Apr 2, 2024 · 2 comments

eisungy commented Apr 2, 2024

Hi. One of the application codes I'm involved with uses split and zsplit to partition a serial mesh of 982 faces. The application is based on the discontinuous Galerkin method, and its users want to distribute the mesh so that each MPI rank holds about one or two elements.
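For reference, we invoke the split and zsplit drivers from SCOREC/core roughly as follows (the argument order here is from memory of the drivers' usage messages, so treat it as approximate):

```
mpirun -np <total parts> ./split  <model> <mesh.smb> <out.smb> <split factor>
mpirun -np <total parts> ./zsplit <model> <mesh.smb> <out.smb> <split factor>
```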

However, for some part counts, split and zsplit fail. Below are the results of a parameter scan (O = success, X = failure).

(1) When the number of parts is a power of 2:

| # of parts | split | zsplit |
| ---------- | ----- | ------ |
| 32         | O     | O      |
| 256        | O     | O      |
| 512        | O     | X      |

(2) When the number of parts is NOT a power of 2:

| # of parts | split | zsplit |
| ---------- | ----- | ------ |
| 48         | X     | O      |
| 96         | X     | O      |
| 288        | X     | O      |
| 336        | X     | X      |
| 384        | X     | X      |

Since the users' cluster has 48 cores per compute node, they want to partition the mesh into a multiple of 48 parts. Because the mesh has so few elements, one idea I'm considering is to have each rank load the entire mesh without any partitioning (a sketch of what I mean follows below). But I'm worried that PUMI won't work in that case, because every rank would hold a duplicate of the whole mesh with no partition map or any partitioning.
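A minimal sketch of that idea, assuming the classic PCU C API (PCU_Switch_Comm and friends); this is untested, so treat it as an illustration only:

```cpp
// Untested sketch: every rank loads the same serial mesh by pointing PCU
// at MPI_COMM_SELF, so PUMI treats each rank as an independent serial run.
#include <mpi.h>
#include <PCU.h>
#include <gmi_mesh.h>
#include <apf.h>
#include <apfMDS.h>
#include <apfMesh2.h>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  PCU_Comm_Init();
  gmi_register_mesh();
  PCU_Switch_Comm(MPI_COMM_SELF); // each rank pretends to be a serial run
  apf::Mesh2* m = apf::loadMdsMesh(argv[1], argv[2]); // same files on all ranks
  // ... per-rank DG work on the full mesh goes here; PUMI's distributed
  // functions are meaningless since the copies are unrelated ...
  m->destroyNative();
  apf::destroyMesh(m);
  PCU_Switch_Comm(MPI_COMM_WORLD);
  PCU_Comm_Free();
  MPI_Finalize();
  return 0;
}
```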

In sum, I have two questions.

  1. Is there a limitation in split and zsplit when the ratio of the number of elements to the total number of MPI ranks is close to 1?
  2. If such a limitation exists, is there a recommended way to handle this kind of application with PUMI?

Thanks.

cwsmith (Contributor) commented Apr 2, 2024

Hi @eisungy,

We typically don't run PUMI with so few elements per part (MPI rank).

> I'm worried that PUMI won't work in that case, because every rank would hold a duplicate of the whole mesh with no partition map or any partitioning.

Your concern is correct; without a partition of the mesh, none of PUMI's distributed functions will work as expected.

  1. IIRC, there is no guarantee that Zoltan/ParMETIS (used by zsplit) won't create empty parts. We have not tested split down to the part sizes described here. I'd have to see the error logs to say more. For one of the failed cases, would you please provide the input mesh, build info, execution command (split/zsplit and the arguments), and error logs? I can't give an estimate of how soon someone will be able to do a deep dive on the bug, but maybe we'll see something in the error log. (A rough sketch of the zsplit code path follows below.)

  2. I can't think of anything offhand.
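For context, the core of what zsplit does is roughly the following, loosely paraphrasing the test driver in SCOREC/core (treat the details as approximate rather than the exact driver code):

```cpp
// Approximate core of the zsplit driver: build a Zoltan-backed splitter,
// weigh the elements, and request `factor` parts per input part.
// Zoltan/ParMETIS may still produce empty parts on very small meshes.
#include <apf.h>
#include <apfMesh2.h>
#include <apfZoltan.h>
#include <parma.h>

apf::Migration* getPlan(apf::Mesh2* m, int factor)
{
  apf::Splitter* splitter =
      apf::makeZoltanSplitter(m, apf::GRAPH, apf::PARTITION, /*debug=*/false);
  apf::MeshTag* weights = Parma_WeighByMemory(m);
  apf::Migration* plan = splitter->split(weights, 1.05, factor);
  apf::removeTagFromDimension(m, weights, m->getDimension());
  m->destroyTag(weights);
  delete splitter;
  return plan;
}
```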


eisungy commented May 7, 2024

split_err_test.tar.gz

Hi @cwsmith,
Thank you for your answer. I have uploaded the mesh files together with the error messages returned by split for 48/96/144 parts (the split.err.XX files).

All runs printed only the single message below.

(1 << depth) == multiple failed at /home/esyoon/src/core/core-master-20240315/parma/rib/parma_mesh_rib.cc + 69
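If I read that assertion correctly, split's RIB splitter bisects the mesh recursively, so the split factor must be an exact power of two; that would explain why 48/96/288 fail while 32/256/512 pass in the scan above. A minimal sketch of the asserted condition (my paraphrase, not the actual PUMI code):

```cpp
// Paraphrase of the check in parma/rib/parma_mesh_rib.cc: recursive
// inertial bisection halves the mesh `depth` times, so the split factor
// ("multiple") must equal 1 << depth for some integer depth.
bool ribFactorIsValid(int multiple)
{
  int depth = 0;
  while ((1 << depth) < multiple)
    ++depth;
  return (1 << depth) == multiple; // true for 32/256/512, false for 48/96/288
}
```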

I couldn't include the error message from zsplit for the 336-part case in the attached file, but it is shown below.

APF warning: 9 empty parts
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124

The first line is only a warning, but no partitioned output files were produced. Presumably the 9 empty parts are what trip the numDc+numIso >= 1 assertion, since an empty part has no components at all.

Thank you for your investigation.
