The 'mesh.read()' operation encounters errors when executed with 2048 processors #3817

Open
Ricardotu opened this issue Apr 2, 2024 · 1 comment

Comments

@Ricardotu

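// Headers and using-directive were omitted in the original post; these are the assumed ones.
#include <iostream>
#include <string>

#include "libmesh/libmesh.h"
#include "libmesh/enum_solver_package.h"
#include "libmesh/getpot.h"
#include "libmesh/mesh.h"

using namespace libMesh;

int main(int argc, char **argv)
{
  LibMeshInit init(argc, argv);
  libmesh_example_requires(libMesh::default_solver_package() == PETSC_SOLVERS,
                           "--enable-petsc");
  // get the input parameters
  GetPot input_file("input_file.in");

  // define the mesh
  Mesh mesh(init.comm());
  std::string mesh_file = input_file("mesh_file", "no_generate");
  std::string meshfile_dir = "./mesh/" + mesh_file;
  unsigned int dim = input_file("dim", 0);
  // every rank calls read() here; this is the call that fails at 2048 cores
  mesh.read(meshfile_dir);
  std::cout << "hello world "<< std::endl;
  return 0;
}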

The 'mesh.read()' call fails when the program is run on 2048 cores, whereas it works without issue on 1024 cores. The mesh file is a *.e file. I have tested this on two platforms; on the other platform the error already appears with 512 cores.

e1009.para.bscc:UCM:2fca9:ee53dc40: 71638176 us(71638176 us!!!):  qp_alloc ERR 22 Invalid argument on device mlx5_0
e1009.para.bscc:UCM:2fca9:ee53dc40: 71638201 us(25 us):  qp_attr: SQ 432,1 cq 0xa62500 RQ 8,1 cq 0xa62500 SRQ (nil) [inl 64 typ 2]
e1009.para.bscc:UCM:2fca9:ee53dc40: 71638206 us(5 us):  DAPL ERR create_qp Invalid argument
[9:e1009][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:621] error(0x60000): ofa-v2-mlx5_0-1u: could not create DAPL endpoint: DAT_INVALID_PARAMETER()
e1009.para.bscc:UCM:2fcac:c18c0c40: 71655904 us(71655904 us!!!):  qp_alloc ERR 22 Invalid argument on device mlx5_0
e1009.para.bscc:UCM:2fcac:c18c0c40: 71655926 us(22 us):  qp_attr: SQ 432,1 cq 0x10c3060 RQ 8,1 cq 0x10c3060 SRQ (nil) [inl 64 typ 2]
e1009.para.bscc:UCM:2fcac:c18c0c40: 71655930 us(4 us):  DAPL ERR create_qp Invalid argument
e1009.para.bscc:UCM:2fd18:d9e4fc40: 71849340 us(71849340 us!!!):  qp_alloc ERR 22 Invalid argument on device mlx5_0
e1009.para.bscc:UCM:2fd18:d9e4fc40: 71849366 us(26 us):  qp_attr: SQ 432,1 cq 0x1eb0540 RQ 8,1 cq 0x1eb0540 SRQ (nil) [inl 64 typ 2]
e1009.para.bscc:UCM:2fd18:d9e4fc40: 71849370 us(4 us):  DAPL ERR create_qp Invalid argument
[120:e1009][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:621] error(0x60000): ofa-v2-mlx5_0-1u: could not create DAPL endpoint: DAT_INVALID_PARAMETER()
[12:e1009][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:621] error(0x60000): ofa-v2-mlx5_0-1u: could not create DAPL endpoint: DAT_INVALID_PARAMETER()
e1009.para.bscc:UCM:2fcd1:59dfbc40: 71853882 us(71853882 us!!!):  qp_alloc ERR 22 Invalid argument on device mlx5_0
e1009.para.bscc:UCM:2fcd1:59dfbc40: 71853905 us(23 us):  qp_attr: SQ 432,1 cq 0xfd6020 RQ 8,1 cq 0xfd6020 SRQ (nil) [inl 64 typ 2]
e1009.para.bscc:UCM:2fcd1:59dfbc40: 71853912 us(7 us):  DAPL ERR create_qp Invalid argument
[49:e1009][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:621] error(0x60000): ofa-v2-mlx5_0-1u: could not create DAPL endpoint: DAT_INVALID_PARAMETER()
e1009.para.bscc:UCM:2fcad:67670c40: 71652369 us(71652369 us!!!):  qp_alloc ERR 22 Invalid argument on device mlx5_0
e1009.para.bscc:UCM:2fcad:67670c40: 71652386 us(17 us):  qp_attr: SQ 432,1 cq 0x1636500 RQ 8,1 cq 0x1636500 SRQ (nil) [inl 64 typ 2]
e1009.para.bscc:UCM:2fcad:67670c40: 71652390 us(4 us):  DAPL ERR create_qp Invalid argument
[13:e1009][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:621] error(0x60000): ofa-v2-mlx5_0-1u: could not create DAPL endpoint: DAT_INVALID_PARAMETER()
e1009.para.bscc:UCM:2fcbb:2edddc40: 71652158 us(71652158 us!!!):  qp_alloc ERR 22 Invalid argument on device mlx5_0
e1009.para.bscc:UCM:2fcbb:2edddc40: 71652181 us(23 us):  qp_attr: SQ 432,1 cq 0x23b60e0 RQ 8,1 cq 0x23b60e0 SRQ (nil) [inl 64 typ 2]

@Ricardotu changed the title from "The 'mesh.read()' operation encounters errors when executed with 2048 cores" to "The 'mesh.read()' operation encounters errors when executed with 2048 processors" on Apr 2, 2024

@jwpeterson
Member

Hmm... I think if you do it this way, the file is opened and read simultaneously on all N cores, which is probably not great for the filesystem. Can you try wrapping the mesh.read() line in if (comm().rank() == 0), and then, on all processors, calling:

MeshCommunication().broadcast(mesh);
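
To make the suggested pattern concrete, here is a minimal sketch of "read on rank 0, then broadcast on every rank". The helper read_on_root_then_broadcast is hypothetical (it is not part of libMesh) and only takes the rank and two callables, so it can be tried without MPI. The libMesh call site shown in the trailing comment is an untested assumption based on the reporter's code and the suggestion above; in particular, it assumes that mesh.read() can safely be called on a single rank in your libMesh version.

```cpp
// Hypothetical helper, not a libMesh function: run `read` on rank 0 only,
// then run `broadcast` on every rank (broadcast is a collective operation,
// so it must not be placed inside the rank check).
template <typename ReadFn, typename BroadcastFn>
void read_on_root_then_broadcast(unsigned int rank,
                                 ReadFn read,
                                 BroadcastFn broadcast)
{
  if (rank == 0)
    read();      // only one process opens the mesh file

  broadcast();   // all ranks take part in distributing the data
}

// Assumed libMesh call site (untested sketch, names taken from the issue):
//
//   Mesh mesh(init.comm());
//   read_on_root_then_broadcast(
//     init.comm().rank(),
//     [&] { mesh.read(meshfile_dir); },
//     [&] { MeshCommunication().broadcast(mesh); });
```

The point of the design is that the filesystem sees a single reader regardless of the number of ranks, while every rank still participates in the broadcast.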
