matchedNodeElmReader alloc crash #338

Open
KennethEJansen opened this issue May 28, 2021 · 94 comments

Comments

@KennethEJansen
Contributor

Sorry if this has been covered before, but how big of a mesh should I be able to stream in with this build setting?

kjansen@pfe26:~/SCOREC-core/buildMGEN_write3D> more doConfigure14_18
#!/bin/bash -ex

# For Chef

cmake \
  -DCMAKE_C_COMPILER=mpicc \
  -DCMAKE_CXX_COMPILER=mpicxx \
  -DSCOREC_CXX_WARNINGS=OFF \
  -DSCOREC_CXX_OPTIMIZE=ON \
  -DSCOREC_CXX_SYMBOLS=ON \
  -DENABLE_ZOLTAN=ON \
  -DENABLE_SIMMETRIX=ON \
  -DPCU_COMPRESS=ON \
  -DSIM_MPI="mpt" \
  -DSIM_PARASOLID=ON \
  -DMDS_SET_MAX=1024 \
  -DMDS_ID_TYPE=long \
  -DIS_TESTING=ON \
  -DMESHES=/projects/tools/SCOREC-core/meshes \
  -DCMAKE_INSTALL_PREFIX=$PWD/install \
  ../core

#-DMDS_ID_TYPE=int or long
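
For context on that last commented line: as far as I can tell, -DMDS_ID_TYPE only controls the width of the local entity id used inside MDS, while the global ids the reader deals in (apf::Gid) are a separate typedef in apf. This is my reading of the build, roughly the shape below, not a confirmed description.

/* rough sketch of what -DMDS_ID_TYPE=long selects (local MDS ids) */
typedef long mds_id;
/* global ids used by the converter and apf::construct are a separate typedef in apf */
typedef long Gid;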

I have progressively thrown more nodes at it (doubling compute nodes 4 times now even though the number of mesh nodes only doubled from a prior successful run) but keep getting this stack trace

MPT: #7 0x00002aaab019c61a in abort () from /lib64/libc.so.6
MPT: #8 0x000000000049109b in reel_fail (
MPT: format=format@entry=0x49be6e "realloc(%p, %lu) failed")
MPT: at /home5/kjansen/SCOREC-core/core/pcu/reel/reel.c:24
MPT: #9 0x0000000000490fc8 in noto_realloc (p=0x5a756a90,
MPT: size=18446744056901788296)
MPT: at /home5/kjansen/SCOREC-core/core/pcu/noto/noto_malloc.c:60
MPT: #10 0x0000000000490626 in pcu_push_buffer (b=0x3afd84b8,
MPT: size=size@entry=18446744056901788288)
MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu_buffer.c:37
MPT: #11 0x0000000000490917 in pcu_msg_pack (m=m@entry=0x6b79c0 <global_pmsg>,
MPT: id=id@entry=639, size=size@entry=18446744056901788288)
MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu_msg.c:133
MPT: #12 0x000000000048eb5f in PCU_Comm_Pack (to_rank=to_rank@entry=639,
MPT: data=data@entry=0x2357cb30, size=18446744056901788288)
MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu.c:141
MPT: #13 0x000000000047e733 in apf::setCoords (m=m@entry=0x3312d300,
MPT: coords=0x2357cb30, nverts=3015292, globalToVert=...)
MPT: at /home5/kjansen/SCOREC-core/core/apf/apfConstruct.cc:202
MPT: #14 0x000000000043722c in main (argc=, argv=)
MPT: at /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:832

@KennethEJansen
Contributor Author

I seem to be crashing when trying to stream in an 8,437,865,894-element, 1.9B-node mesh. It is a mix of wedges and tets. I was trying to stream it onto 160 Broadwell nodes each running 4 processes, and I think that since I told PBS this:
#PBS -l select=160:ncpus=4:mpiprocs=4:model=bro
it should have given each process 192G/4 = 48G, so that is 13.2M elements and 3M nodes per process, which is better than what I was last successful with (half the mesh size in both elements and nodes ran through on 40 nodes with 1 process per core). I guess this means it is either an index limit, or a balloon of memory usage with more processes, or PBS not really doing what it should in distributing the processes (though I have sshed to a node and found the expected 4 processes running there).

@KennethEJansen
Contributor Author

If I am reading the output right,

MPT: #13 0x000000000047e733 in apf::setCoords (m=m@entry=0x3312d300,
MPT: coords=0x2357cb30, nverts=3015292, globalToVert=...)

is trying to handle 3M verts, which I would not expect to be a problem in terms of memory usage, so I think I am hitting an index-size issue.
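
A back-of-envelope check on the failing size, assuming the pack size is computed as n*3*sizeof(double): 18446744056901788288 = 2^64 - 16807763328, i.e. a negative byte count reinterpreted as size_t. Dividing -16807763328 by 24 gives roughly -7.0e8, so it looks more like the count n feeding the pack went negative (or wrapped) than like nverts itself being too large.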

@cwsmith
Contributor

cwsmith commented May 28, 2021

What branch/commit were these tests using?

@cwsmith
Contributor

cwsmith commented May 28, 2021

Discussion notes:

@KennethEJansen
Contributor Author

KennethEJansen commented May 28, 2021

MGEN_write3D (c1d05c1)

@KennethEJansen
Contributor Author

Is there a quick way to convert from long to int? As you feared, there is propagation:

[ 38%] Building CXX object mds/CMakeFiles/mds.dir/apfBox.cc.o
[ 57%] Built target parma
/home5/kjansen/SCOREC-core/core/mds/apfMDS.cc: In function 'void apf::deriveMdlFromManifold(apf::Mesh2*, bool*, int, int ()[5], apf::GlobalToVert&, std::map<int, apf::MeshEntity>&)':
/home5/kjansen/SCOREC-core/core/mds/apfMDS.cc:1054:55: error: no matching function for call to 'apf::Mesh2::setIntTag(apf::MeshEntity*&, apf::MeshTag*&, const long int*)'
mesh->setIntTag(vit->second, vIDTag, &(vit->first));
^
In file included from /home5/kjansen/SCOREC-core/core/apf/apfMesh2.h:14:0,
from /home5/kjansen/SCOREC-core/core/mds/apfPM.h:14,
from /home5/kjansen/SCOREC-core/core/mds/apfMDS.cc:15:
/home5/kjansen/SCOREC-core/core/apf/apfMesh.h:245:18: note: candidate: virtual void apf::Mesh::setIntTag(apf::MeshEntity*, apf::MeshTag*, const int*)
virtual void setIntTag(MeshEntity* e, MeshTag* tag, int const* data) = 0;
^~~~~~~~~
but I guess I am hoping that with a cast of some sort on the usage of GlobalToVert, the propagation can be stopped, at least for now, in routines I am hopefully not using.

@KennethEJansen
Contributor Author

Note I also had to create a PCU_Max_Long which I hopefully replicated correctly from PCU_Max_Int
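
In case it helps review, this is the shape I assumed for it; a minimal sketch using a plain MPI allreduce rather than PCU's own message layer, so treat it as an approximation of what a PCU_Max_Long should compute, not its actual implementation:

#include <mpi.h>

/* sketch: max of a long across all ranks (the PCU_Max_Int analog for long) */
long max_long_sketch(long x)
{
  long result;
  MPI_Allreduce(&x, &result, 1, MPI_LONG, MPI_MAX, MPI_COMM_WORLD);
  return result;
}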

@KennethEJansen
Contributor Author

I got around the casting issue above but am hitting new issues:
Scanning dependencies of target matchedNodeElmReader
[100%] Building CXX object test/CMakeFiles/matchedNodeElmReader.dir/matchedNodeElmReader.cc.o
In file included from /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:6:0:
/home5/kjansen/SCOREC-core/core/apf/apfConvertTags.h: In function '{anonymous}::Gid {anonymous}::getMax(const GlobalToVert&)':
/home5/kjansen/SCOREC-core/core/apf/apfConvertTags.h:14:36: error: no matching function for call to 'max({anonymous}::Gid&, const long int&)'
max = std::max(max, it->first);
^
In file included from /nasa/pkgsrc/sles12/2016Q4/gcc6/include/c++/bits/stl_tree.h:63:0,
from /nasa/pkgsrc/sles12/2016Q4/gcc6/include/c++/map:60,
from /home5/kjansen/SCOREC-core/core/mds/apfMDS.h:33,
from /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:3:
/nasa/pkgsrc/sles12/2016Q4/gcc6/include/c++/bits/stl_algobase.h:219:5: note: candidate: template constexpr const _Tp& std::max(const _Tp&, const _Tp&)
max(const _Tp& __a, const _Tp& __b)
^~~
/nasa/pkgsrc/sles12/2016Q4/gcc6/include/c++/bits/stl_algobase.h:219:5: note: template argument deduction/substitution failed:
In file included from /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:6:0:
/home5/kjansen/SCOREC-core/core/apf/apfConvertTags.h:14:36: note: deduced conflicting types for parameter 'const _Tp' ('int' and 'long int')
max = std::max(max, it->first);
^
In file included from /nasa/pkgsrc/sles12/2016Q4/gcc6/include/c++/bits/stl_tree.h:63:0,
from /nasa/pkgsrc/sles12/2016Q4/gcc6/include/c++/map:60,
from /home5/kjansen/SCOREC-core/core/mds/apfMDS.h:33,
from /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:3:
/nasa/pkgsrc/sles12/2016Q4/gcc6/include/c++/bits/stl_algobase.h:265:5: note: candidate: template<class _Tp, class _Compare> constexpr const _Tp& std::max(const _Tp&, const _Tp&, _Compare)
max(const _Tp& __a, const _Tp& __b, _Compare __comp)
^~~
/nasa/pkgsrc/sles12/2016Q4/gcc6/include/c++/bits/stl_algobase.h:265:5: note: template argument deduction/substitution failed:
In file included from /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:6:0:
/home5/kjansen/SCOREC-core/core/apf/apfConvertTags.h:14:36: note: deduced conflicting types for parameter 'const _Tp' ('int' and 'long int')
max = std::max(max, it->first);
^
/home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc: In function 'int main(int, char**)':
/home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:825:73: error: cannot convert '{anonymous}::Gid* {aka int*}' to 'const Gid* {aka const long int*}' for argument '2' to 'void apf::construct(apf::Mesh2*, const Gid*, int, int, apf::GlobalToVert&)'
apf::construct(mesh, m.elements, m.localNumElms, m.elementType, outMap);
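
For reference, the std::max failure is just the two arguments having different widths (the local Gid in apfConvertTags.h is still int while the map key is long); a minimal self-contained sketch of the fixed shape, with the map's value type simplified to void*:

#include <algorithm>
#include <map>

typedef long Gid;                          /* matches apf's "typedef long Gid;" */
typedef std::map<Gid, void*> GlobalToVert; /* stand-in for the real map */

Gid getMaxSketch(const GlobalToVert& g2v)
{
  Gid max = -1;
  for (GlobalToVert::const_iterator it = g2v.begin(); it != g2v.end(); ++it)
    max = std::max(max, it->first);        /* both sides are Gid now, so deduction succeeds */
  return max;
}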

@jacobmerson
Contributor

Where do you define the Gid typedef? The error makes it look like it's defined in an anonymous namespace.

@cwsmith
Contributor

cwsmith commented May 28, 2021

It is/was in apf here:

typedef long Gid;

I'm hacking at this now.

@KennethEJansen
Contributor Author

KennethEJansen commented May 28, 2021

Reproducer: It is not as small as I would like, but here is a path to the case on the viz nodes for you to grab (note it grabs inputs from one directory above in the run line, so it is safest to just grab the dir above where I am running it).

  1. Runs successfully if I roll back to where we were when we tried to make everything with GID long (e.g., here is the hash)
    git checkout c1d05c1

  2. Here is the case and the successful run:

Case DIR /projects/tools/Models/BoeingBump/LES_DNS_Meshing/FPS-MTW-12-30/MGEN4mner_noIDX/mner

 mpirun -np 8 /projects/tools/SCOREC-core/buildLastWorking_mner/test/matchedNodeElmReader ../geom3D.cnn_data ../geom3D.crd ../geom3D.match ../geom3D.class ../geom3D.fathers2D NULL ../geom3DHead.cnn outModel.dmg outMesh/
numVerts 2560663
0 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 0 lastVtx 320082
1 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 320082 lastVtx 640164
2 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 640164 lastVtx 960246
7 readMatches numvtx 2560663 localnumvtx 320089 firstVtx 2240574 lastVtx 2560663
5 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1600410 lastVtx 1920492
6 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1920492 lastVtx 2240574
3 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 960246 lastVtx 1280328
4 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1280328 lastVtx 1600410
isMatched 1
seconds to create mesh 117.841
  - verifying tags: fathers2D
mesh verified in 47.882082 seconds
mesh outMesh/ written in 12.792607 seconds
writeVtuFile into buffers: 5.066701 seconds
writeVtuFile buffers to disk: 1.829939 seconds
vtk files rendered written in 7.314078 seconds

Note, though, that the convert takes 2 minutes (with -g).

  3. Our current code (run here with output) takes only 20 seconds to crash as follows (same dir, same data):
mpirun -np 8 /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader ../geom3D.cnn_data ../geom3D.crd ../geom3D.match ../geom3D.class ../geom3D.fathers2D NULL ../geom3DHead.cnn outModel.dmg outMesh/

numVerts 2560663
0 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 0 lastVtx 320082
1 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 320082 lastVtx 640164
2 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 640164 lastVtx 960246
3 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 960246 lastVtx 1280328
4 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1280328 lastVtx 1600410
5 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1600410 lastVtx 1920492
6 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1920492 lastVtx 2240574
7 readMatches numvtx 2560663 localnumvtx 320089 firstVtx 2240574 lastVtx 2560663
isMatched 1
[viz002:25725] *** Process received signal ***
[viz002:25725] Signal: Segmentation fault (11)
[viz002:25725] Signal code: Address not mapped (1)
[viz002:25725] Failing at address: (nil)
[viz002:25725] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f70e3098890]
[viz002:25725] [ 1] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x4ab948]
[viz002:25725] [ 2] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x4abb2d]
[viz002:25725] [ 3] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x4abc66]
[viz002:25725] [ 4] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x4ac630]
[viz002:25725] [ 5] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(mds_create_entity+0xe1)[0x4accbc]
[viz002:25725] [ 6] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(mds_apf_create_entity+0x49)[0x4adaca]
[viz002:25725] [ 7] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf7MeshMDS13createEntity_EiPNS_11ModelEntityEPPNS_10MeshEntityE+0x13f)[0x49b579]
[viz002:25725] [ 8] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf5Mesh212createEntityEiPNS_11ModelEntityEPPNS_10MeshEntityE+0x44)[0x4d04bc]
[viz002:25725] [ 9] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf10makeOrFindEPNS_5Mesh2EPNS_11ModelEntityEiPPNS_10MeshEntityEPNS_13BuildCallbackEPb+0x6a)[0x4ceda0]
[viz002:25725] [10] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf14ElementBuilder5applyEiPPNS_10MeshEntityE+0x43)[0x4d0559]
[viz002:25725] [11] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf13ElementVertOp3runEiPPNS_10MeshEntityE+0x44)[0x4c8a2a]
[viz002:25725] [12] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x4c84c5]
[viz002:25725] [13] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf13ElementVertOp7runDownEiPPNS_10MeshEntityES3_+0x3d)[0x4c89e3]
[viz002:25725] [14] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf13ElementVertOp3runEiPPNS_10MeshEntityE+0x2a)[0x4c8a10]
[viz002:25725] [15] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x4c8658]
[viz002:25725] [16] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf13ElementVertOp7runDownEiPPNS_10MeshEntityES3_+0x3d)[0x4c89e3]
[viz002:25725] [17] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf13ElementVertOp3runEiPPNS_10MeshEntityE+0x2a)[0x4c8a10]
[viz002:25725] [18] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf12buildElementEPNS_5Mesh2EPNS_11ModelEntityEiPPNS_10MeshEntityEPNS_13BuildCallbackE+0x48)[0x4cee21]
[viz002:25725] [19] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x501d44]
[viz002:25725] [20] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf9constructEPNS_5Mesh2EPKliiRSt3mapIlPNS_10MeshEntityESt4lessIlESaISt4pairIS2_S6_EEE+0x54)[0x5024e6]
[viz002:25725] [21] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(main+0x1ea)[0x48eb2c]
[viz002:25725] [22] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f70e2cffb45]
[viz002:25725] [23] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x48b488]
[viz002:25725] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 25725 on node viz002 exited on signal 11 (Segmentation fault).

edit: formatting

@KennethEJansen
Contributor Author

I decided to try serial. The long-mod code makes it through construct in serial but then segfaults on line 138 of mds_apf.c, on return gmi_find(m->user_model, ....
I made a debug executable rolled back to before we moved from int to long.

What TotalView showed me for the mesh (m) was pretty messed up, so it's not REALLY getting through construct cleanly. I have a new run stopping just after construct to check that, but it is probably time to build with a memory sanitizer.

OK, it got to my breakpoint just before delete [] m.elements on line 826 of matchedNodeElmReader.

m.elements is junk, as the first entry is 140007343915089 according to TotalView.

I will move my search into where this is assigned to see what is going wrong.

@KennethEJansen
Contributor Author

KennethEJansen commented May 29, 2021

Looks like line 705 is a problem.

gmi_fscanf(f, 1, "%u", elmVtx+j);

does not put a long into elmVtx according to TotalView, which does seem to know that elmVtx is apf::Gid[6]. Here is a screenshot. I suppose there is a different format to read long ints? I will try to dig and find it, but if someone can share that, I will see if that is the last problem.

[screenshot attached: Screen Shot 2021-05-28 at 6 19 37 PM]

@KennethEJansen
Contributor Author

Found it. Changed %u to %ld and it seems to be working on the small case with 8 processes.

Still crashing on NAS for the big case in the same place. What does C++ do when q = t/p and t is long, p is an int, and q is an int? Does it automatically cast? This is what we are doing in setCoords where it is crashing.

@jacobmerson
Contributor

jacobmerson commented May 29, 2021

There is an implicit conversion happening, so t is converted to an int and the division happens. This will obviously cause issues if t > INT_MAX (2^31 - 1 for a 32-bit integer, which is about 2 billion). If you are using gcc or clang you can use -Wconversion, which will find this error. I'm not sure if there is any similar warning on XL.

See here for an example on compiler explorer.

EDIT: I was initially wrong. Both t and p end up as long (p is converted), the division is done in long, and then the implicit narrowing conversion happens during the assignment.

See cppreference section on arithmetic operator conversion.

This can be verified with the following code: static_assert(std::is_same_v<std::common_type_t<long,int>,long>);
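
A compilable variant of that check (sticking to C++11 since the NAS gcc is 6.2); the values are made up just to push the quotient past INT_MAX:

#include <cstdio>
#include <type_traits>

int main() {
  long t = 400000000000L;  /* 4e11, so t/p exceeds INT_MAX */
  int p = 160;
  static_assert(std::is_same<std::common_type<long, int>::type, long>::value,
                "the division itself is done in long");
  long ql = t / p;         /* 2,500,000,000: correct */
  int q = t / p;           /* the narrowing happens here, at the assignment */
  std::printf("ql=%ld q=%d\n", ql, q);
  return 0;
}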

@KennethEJansen
Contributor Author

KennethEJansen commented May 29, 2021

Thanks for the advice. Before I went to sleep last night (and before I saw this), I put the following conditional print statements into setCoords:
Gid max = getMax(globalToVert);
Gid total = max + 1;
int peers = PCU_Comm_Peers();
int quotient = total / peers;
int remainder = total % peers;
int mySize = quotient;
int self = PCU_Comm_Self();
if (self == (peers - 1))
mySize += remainder;
int myOffset = self * quotient;

/* Force each peer to have exactly mySize verts.
This means we might need to send and recv some coords */
double* c = new double[mySize*3];

int start = PCU_Exscan_Int(nverts);

PCU_Comm_Begin();
int to = std::min(peers - 1, start / quotient);
int n = std::min((to+1)*quotient-start, nverts);
if(n > 100000000) {
lion_eprint(1, "setCoords int overflow of: self=%d,mySize=%d,total=%ld, n=%d,to=%d, quotient=%d, remainder=%d start=%d, peers=%d \n",self,mySize,total,n,to,quotient,remainder,start,peers);
Gid peersG = PCU_Comm_Peers();
Gid quotientG = total / peersG;
Gid remainderG = total % peersG;
lion_eprint(1, "setCoords Gid0test: self=%d,mySize=%d,total=%ld, quotientG=%ld, peers=%ld \n",self,mySize,total,quotientG,remainderG,peersG);
}

while (nverts > 0) {
PCU_COMM_PACK(to, start);
PCU_COMM_PACK(to, n);
PCU_Comm_Pack(to, coords, n*3*sizeof(double));

They produced no output. Maybe my code was wrong, or perhaps lion_eprint is buffering? But if it is not, then the theory that it is n that is blowing up is false. Is there a way to force lion_eprint to empty its buffer?

Note I suppose I can also dig back to past messages from Cameron to try an addr2line command to confirm what line number we are actually crashing on in setCoords. We also have core files if someone wants to tell me how to mine them for information.

I guess I can also try to use the hints from Jacob but I will need to read up there too. NAS does not provide very modern gcc (6.2 is what they provide and what I am using).

@KennethEJansen
Contributor Author

KennethEJansen commented May 29, 2021

addr2line is not helping (or I am not using it correctly)

MPT: #12 0x0000000000523ab2 in PCU_Comm_Pack ()
MPT: #13 0x0000000000508bd8 in apf::setCoords(apf::Mesh2*, double const*, int, std::map<long, apf::MeshEntity*, std::less, std::allocator<std::pair<long const, apf::MeshEntity*> > >&) ()
MPT: #14 0x0000000000496fd4 in main ()
MPT: (gdb) A debugging session is active.

kjansen@pfe27:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN/mner> addr2line -e /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader 0x0000000000508bd8
??:?
kjansen@pfe27:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN/mner> addr2line -e /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader 0x0000000000523ab2
??:?
kjansen@pfe27:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN/mner>

That said, setCoords calls PCU_Comm_Pack only twice: the place we have been staring at

PCU_Comm_Pack(to, coords, n*3*sizeof(double));

and
PCU_Comm_Pack(to, &c[i*3], 3*sizeof(double));

Since that last one is only 3 doubles, I don't see any way it could not be the first one, which means n MUST be blowing up, which I guess means lion_eprint is not emptying its buffer to help me debug this.

I am not sure what changed, but I am also no longer getting arguments (e.g., nverts=xxx) in the stack trace, viz.

MPT: #9 0x0000000000527ddd in noto_realloc ()
MPT: #10 0x0000000000526d11 in pcu_push_buffer ()
MPT: #11 0x0000000000527268 in pcu_msg_pack ()
MPT: #12 0x0000000000523ab2 in PCU_Comm_Pack ()
MPT: #13 0x0000000000508bd8 in apf::setCoords(apf::Mesh2*, double const*, int, std::map<long, apf::MeshEntity*, std::less, std::allocator<std::pair<long const, apf::MeshEntity*> > >&) ()
MPT: #14 0x0000000000496fd4 in main ()
MPT: (gdb) A debugging session is active.

But I am still getting this as the first ERROR output:
PUMI error: realloc(0x2405ebce0, 18446744058928065000) failed
PUMI error: realloc(0x3ccbd4ed0, 18446744061822745920) failed
PUMI error: realloc(0x2d6178d60, 18446744065875299208) failed
PUMI error: realloc(0xeb3a6050, 18446744057770192632) failed
PUMI error: realloc(0x200168420, 18446744062401682104) failed
PUMI error: realloc(0x799ff4820, 18446744064717426840) failed
PUMI error: realloc(0x3d737ab50, 18446744061243809736) failed
PUMI error: realloc(0x21136aef0, 18446744069348916312) failed
PUMI error: realloc(0x2f8aa4e60, 18446744060664873552) failed
PUMI error: realloc(0x1c9a2c570, 18446744058349128816) failed
PUMI error: realloc(0x5a8591d80, 18446744064138490656) failed
PUMI error: realloc(0x2401c2f70, 18446744072243597232) failed
PUMI error: realloc(0x2d38cb810, 18446744063559554472) failed
PUMI error: realloc(0x3ccbc6d20, 18446744071085724864) failed
PUMI error: realloc(0x3d2f91fb0, 18446744062980618288) failed
PUMI error: realloc(0x226d3dd70, 18446744066454235392) failed
PUMI error: realloc(0x5bcd4f1c0, 18446744073401469600) failed
PUMI error: realloc(0x259cedc60, 18446744060085937368) failed
PUMI error: realloc(0x59fe5e550, 18446744071664661048) failed
PUMI error: realloc(0x3ccbc5da0, 18446744067033171576) failed
PUMI error: realloc(0x226bd3430, 18446744069927852496) failed
PUMI error: realloc(0x2d34b28d0, 18446744068191043944) failed
PUMI error: realloc(0x1a42608a0, 18446744068769980128) failed
PUMI error: realloc(0x3d0f92750, 18446744072822533416) failed
PUMI error: realloc(0x2d3603280, 18446744070506788680) failed
PUMI error: realloc(0x1b680a1c0, 18446744065296363024) failed
PUMI error: realloc(0x2d34a3810, 18446744067612107760) failed
PUMI error: realloc(0xeb376ba0, 18446744057191256448) failed
PUMI error: realloc(0x2e07a4db0, 18446744059507001184) failed

There are 29 of them. Not sure if that tells us anything, but at least not all 80 are reporting this error before crashing.

@KennethEJansen KennethEJansen assigned cwsmith and matthb2 and unassigned cwsmith May 29, 2021
@KennethEJansen
Contributor Author

Core file interrogation with gdb is not helping either.

kjansen@pfe27:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN/mner> gdb /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader core.19660
GNU gdb (GDB; SUSE Linux Enterprise 12) 8.3.1
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://bugs.opensuse.org/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader...
BFD: warning: /nobackupp2/kjansen/SeparatedBump/DNS/ReL_2M/MGEN/mner/core.19660 is truncated: expected core file size >= 35391074304, found: 32594722816

warning: core file may not match specified executable file.
[New LWP 19660]
Cannot access memory at address 0x2aaaaacce128
Cannot access memory at address 0x2aaaaacce120
Failed to read a valid object file image from memory.
Core was generated by `/home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader geom3D.c'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00002aaab019b247 in ?? ()

@cwsmith
Contributor

cwsmith commented May 29, 2021

lion_eprint

Are you capturing stdout and stderr? lion_eprint writes to stderr; e stands for error.

The verbosity level set here

lion_set_verbosity(1);

is correct for the added call(s) to lion_eprint; i.e., lion_verbosity_level >= the level set for the given print statement.

if(lion_verbosity_level >= lvl) \

addr2line and gdb

The usage looks OK to me. As a sanity check, what is the output of the following command?

file /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader

@KennethEJansen
Contributor Author

kjansen@pfe27:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN/mner> file /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader
/home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.0.0, not stripped

@KennethEJansen
Contributor Author

Yes, I am looking in both stdout and stderr.
Yes, line 790 has what you show, so I would expect that my lion_eprint statements are not being stopped by the verbosity level.

Does lion_eprint flush its buffer? If not, given that these are likely the lines written right before a crash, they may not get flushed before MPI kills the job?

Is there a function to flush the buffer?
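
For what it's worth, lion_eprint goes to stderr, which is normally unbuffered, but an explicit flush after the prints is cheap insurance when the job is about to be killed; plain C, nothing lion-specific:

#include <stdio.h>

/* sketch: push any pending diagnostics out before a possible abort */
void flush_diagnostics(void)
{
  fflush(stderr);  /* usually a no-op; stderr is unbuffered by default */
  fflush(stdout);  /* stdout is line- or fully-buffered, so this one matters */
}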

@cwsmith
Contributor

cwsmith commented Jun 3, 2021

Nice.

setMappedTag is templated on the tag type (int, long, etc.). My simple explanation, which likely misses important facts, is that each time the function is called with a different tag type the compiler needs to create an instance of the code for the specific type if it has not done so already. I think there is a way around putting it into a header by declaring the needed variants... but I didn't do that.
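
For reference, the "declare the needed variants" alternative is explicit instantiation; a toy, self-contained sketch (hypothetical triple() function, not the real setMappedTag):

// header (foo.h): declaration only
template <class T> T triple(const T& x);

// source (foo.cc): definition plus explicit instantiations for the types callers use
template <class T> T triple(const T& x) { return 3 * x; }
template int  triple<int>(const int&);
template long triple<long>(const long&);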

@KennethEJansen
Contributor Author

KennethEJansen commented Jun 3, 2021

Distributed read is done in 15 seconds on 160 cores and has already overtaken a single-file read that started an hour earlier.
This is just the read phase of course.

@KennethEJansen
Contributor Author

Sadly:

MPT: #8 0x000000000048f485 in PCU_Comm_Pack (to_rank=to_rank@entry=159,
MPT: data=data@entry=0x36ba9240, size=18446744073709551548)
MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu.c:145
MPT: #9 0x000000000043aedc in apf::setMappedTag (m=m@entry=0x12b6ada10,
MPT: tagName=tagName@entry=0x4922a2 "classification", vals=0x36ba9240,
MPT: entries=entries@entry=1, nverts=13461799, globalToVert=...)
MPT: at /home5/kjansen/SCOREC-core/core/apf/apfConvertTags.h:91
MPT: #10 0x0000000000437578 in main (argc=, argv=)
MPT: at /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:820

@KennethEJansen
Contributor Author

OPTIMIZE=OFF version

*** Error in `/home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader': free(): invalid pointer: 0x0000000276f5ea00 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7749f)[0x2aaab01da49f]
/lib64/libc.so.6(+0x7c916)[0x2aaab01df916]
/lib64/libc.so.6(+0x7d75b)[0x2aaab01e075b]
/home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader[0x491749]
/home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader[0x49173d]
/home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader(pcu_order_receive+0xfc)[0x4918ab]
/home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader(PCU_Comm_Receive+0xb)[0x48ff66]
/home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader(_ZN3apf12setMappedTagIiEEPNS_7MeshTagEPNS_5Mesh2EPKcPKT_iiRSt3mapIlPNS_10MeshEntityESt4lessIlESaISt4pairIKlSC_EEE+0x227)[0x43b1b7]
/home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader(main+0x1d8)[0x4375f8]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaab0186a35]
/home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader(_start+0x29)[0x437a19]
======= Memory map: ========

@KennethEJansen
Contributor Author

KennethEJansen commented Jun 3, 2021

Actually, I confused myself. That is the code that reads a single file, and now that I point it at the right executable it is line 100 of the OPTIMIZE=ON exec:

 59 template <class T>
 60 apf::MeshTag* setMappedTag(Mesh2* m, const char* tagName,
 61     const T* vals, const int entries,
 62     int nverts, GlobalToVert& globalToVert)
 63 {
 64   apf::Gid max = getMax(globalToVert);
 65   apf::Gid total = max + 1;
 66   int peers = PCU_Comm_Peers();
 67   int quotient = total / peers;
 68   int remainder = total % peers;
 69   int mySize = quotient;
 70   int self = PCU_Comm_Self();
 71   if (self == (peers - 1))
 72     mySize += remainder;
 73   apf::Gid myOffset = self * quotient;
 74
 75   /* Force each peer to have exactly mySize verts.
 76      This means we might need to send and recv some coords */
 77   T* c = new T[mySize*entries];
 78
 79   apf::Gid start = PCU_Exscan_Int(nverts);
 80
 81   PCU_Comm_Begin();
 82   apf::Gid tmpL=start / quotient;
 83   int tmpI=tmpL;
 84   int to = std::min(peers - 1, tmpI);
 85   tmpL=(to+1)*quotient-start;
 86   tmpI=tmpL;
 87   int n = std::min(tmpI, nverts);
 88   while (nverts > 0) {
 89     PCU_COMM_PACK(to, start);
 90     PCU_COMM_PACK(to, n);
 91     PCU_Comm_Pack(to, vals, n*entries*sizeof(T));
 92
 93     nverts -= n;
 94     start += n;
 95     vals += n*entries;
 96     to = std::min(peers - 1, to + 1);
 97     n = std::min(quotient, nverts);
 98   }
 99   PCU_Comm_Send();
100   while (PCU_Comm_Receive()) {
101     PCU_COMM_UNPACK(start);

So we are still having an issue here with tags.

The OPTIMIZE=OFF is churning slowly and will hopefully tell us more and/or be fast enough to allow me to launch TotalView on a devel <=2hour qsub

@KennethEJansen
Contributor Author

KennethEJansen commented Jun 3, 2021

OPTIMIZE=OFF says this, so size is blowing up:

MPT: #6 0x00002aaaac3e0247 in raise () from /lib64/libc.so.6
MPT: #7 0x00002aaaac3e161a in abort () from /lib64/libc.so.6
MPT: #8 0x0000000000525c1d in PCU_Comm_Pack (to_rank=159, data=0x36c61240,
MPT: size=18446744073709551548) at /home5/kjansen/SCOREC-core/core/pcu/pcu.c:145
MPT: #9 0x000000000049b5a6 in apf::setMappedTag (m=0x12b765a10,
MPT: tagName=0x52ac86 "classification", vals=0x36c61240, entries=1,
MPT: nverts=13461799, globalToVert=...)
MPT: at /home5/kjansen/SCOREC-core/core/apf/apfConvertTags.h:91
MPT: #10 0x0000000000498385 in main (argc=10, argv=0x7fffffffde88)
MPT: at /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:820
MPT: (gdb) A debugging session is active.

And it is being caught by the check @cwsmith added here at line 144 (sure enough, I do get that error message):

137 int PCU_Comm_Pack(int to_rank, const void* data, size_t size)
138 {
139 if (global_state == uninit)
140 reel_fail("Comm_Pack called before Comm_Init");
141 if ((to_rank < 0)||(to_rank >= pcu_mpi_size()))
142 reel_fail("Invalid rank in Comm_Pack");
143 if ( size > (size_t)INT_MAX ) {
144 fprintf(stderr, "ERROR Attempting to pack a PCU message whose size exceeds INT_MAX... exiting\n");
145 abort();
146 }
147 memcpy(pcu_msg_pack(get_msg(),to_rank,size),data,size);
148 return PCU_SUCCESS;
149 }
150

I have a TotalView session on 160 processes to try to find the issue here, but I am pretty certain that I am feeding it a classification array that is int*, yet the template seems stuck on the data being long*.

@jacobmerson
Contributor

I took a skim through the templated code you were asking about:

globalToVert is a map whose key is Gid (long), so you need the long version here.

return PCU_Max_Int(max); // this is type-dependent

Not necessarily a bug, but the fact that you commented out the failure indicates you hit this in the past! Essentially this is your base template, so anything that doesn't match the other specializations (only int and double are provided) will use the definition there. So createTag with, say, long will just silently do nothing. The easiest and safest thing to do is to completely get rid of the definition but leave the declaration. This will give a linker error rather than the runtime error or silent failure that were employed there in the past.

template <class T> inline
apf::MeshTag* createTag(apf::Mesh*,
const char*, const int) {
// exit(EXIT_FAILURE);
return 0;
}
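
A small sketch of that "declaration only" safety net, mirroring the shape of the createTag helper (not a drop-in patch):

template <class T>
apf::MeshTag* createTag(apf::Mesh* m, const char* name, const int entries); // no definition

template <> inline
apf::MeshTag* createTag<int>(apf::Mesh* m, const char* name, const int entries) {
  return m->createIntTag(name, entries);
}
// now createTag<long>(...) fails at link time ("undefined reference")
// instead of silently returning 0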

Possibly a bug because the sum of a bunch of ints might not fit into an int. You set it to a long implying this is the case, but the PCU function returns an int, so the overflow already happened.

apf::Gid start = PCU_Exscan_Int(nverts);
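
If it matters, a 64-bit exclusive scan is easy to sketch with plain MPI (the PCU equivalent would need a long flavor of PCU_Exscan_Int); just an illustration, not PCU's API:

#include <mpi.h>

/* sketch: exclusive prefix sum of per-rank vertex counts, kept in 64 bits */
long exscan_long(long nverts)
{
  long start = 0;
  MPI_Exscan(&nverts, &start, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)
    start = 0;  /* MPI_Exscan leaves rank 0's result undefined */
  return start;
}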

The following isn't a bug at the moment since Gid is the same as long, but is likely to become a nasty bug if the types ever change because this is a violation of the one definition rule.

typedef std::map<Gid, MeshEntity*> GlobalToVert;

typedef std::map<long, MeshEntity*> GlobalToVert;

@jacobmerson
Contributor

@KennethEJansen Point me at the code where you are giving an int* but the template is matching with long*; I can probably tell you what's going on with the template.

@KennethEJansen
Contributor Author

KennethEJansen commented Jun 3, 2021

Thanks Jacob. I am not sure if @cwsmith has opened up access to this but the code is here
kjansen@viz003: /projects/tools/SCOREC-core/core (MGEN_Parted)$ git remote -v
andy https://github.com/a-jp/core.git (fetch)
andy https://github.com/a-jp/core.git (push)
cws git@github.com:cwsmith/core.git (fetch)
cws git@github.com:cwsmith/core.git (push)
origin git@github.com:SCOREC/core.git (fetch)
origin git@github.com:SCOREC/core.git (push)

Or here is the calling code on line 819
818 }
819 apf::MeshTag* tc = setMappedTag(mesh, "classification", m.classification, 1,
820 m.localNumVerts, outMap);
821 setClassification(model,mesh,tc);
822 apf::removeTagFromDimension(mesh, tc, 0);
823 mesh->destroyTag(tc);

where

736 readClassification(ff, mesh.localNumVerts, &(mesh.classification));

and that routine allocated the memory with

597 void readClassification(FILE* f, int localNumVtx, int** classification) {
598 *classification = new int[localNumVtx];
599 rewind(f);
600 int mdlId;
601 for(int i=0; i<localNumVtx; i++) {
602 gmi_fscanf(f, 1, "%d", &mdlId);
603 (*classification)[i] = mdlId;
604 }
605 }

and
683
684 struct MeshInfo {
685 double* coords;
686 double* solution;
687 apf::Gid* elements;
688 apf::Gid* matches;
689 int* classification;
690 int* fathers2D;
691 unsigned dim;
692 unsigned elementType;
693 apf::Gid numVerts;
694 int localNumVerts;
695 apf::Gid numElms;
696 int localNumElms;
697 unsigned numVtxPerElm;
698 };

@jacobmerson
Contributor

Is either the MGEN_Parted or MGEN_write3D branch on the SCOREC github up to date? I need to take a look at the template definition.

@KennethEJansen
Contributor Author

KennethEJansen commented Jun 3, 2021

It has the bugs you noted above. I will push my interpretation of your fixes to _Parted now.

I pushed without checking that it compiles, which it does not, so I will try to fix that.

@jacobmerson
Contributor

I don't see how it's possible that

apf::MeshTag* tc = setMappedTag(mesh, "classification", m.classification, 1,
m.localNumVerts, outMap);

Is using long * with the following template definition, since the m.classification should be int*.

template<class T>
apf::MeshTag* setMappedTag(Mesh2* m, const char* tagName,
const T* vals, const int entries,
int nverts, GlobalToVert& globalToVert)

What's making you think that when you call this that T is long not int? I probably misunderstand what you are asking.
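
A toy check of that point, i.e. that deduction takes T from the pointer you pass (hypothetical probe(), not the real setMappedTag):

#include <type_traits>

template <class T> void probe(const T* /*vals*/) {
  static_assert(std::is_same<T, int>::value, "deduced something other than int");
}

int main() {
  int classification[4] = {0, 1, 2, 3};
  probe(classification);  /* T deduced as int; compiles */
  /* long gids[4] = {0}; probe(gids); would trip the static_assert */
  return 0;
}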

@KennethEJansen
Contributor Author

KennethEJansen commented Jun 3, 2021

I was just going by the fact that when it crashed size > INT_MAX but maybe that is not the reason.

I am in TotalView at the scene of the crime, and start is negative, so my assumption above was wrong. Digging through the rubble to find the bomb.

@KennethEJansen
Contributor Author

We have a bigger problem. total = max + 1, where max is supposed to be the max across ranks of the globalToVert keys, is returning 2140417137, while m.numVerts = 21538878936.

I have verified that the last element does have that node in it, so I guess I need to figure out what is going wrong in the construction of globalToVert, since that is where it is supposed to come from.

@KennethEJansen
Contributor Author

The first computation of max is correct, but it seems that in the process getMax or PCU_Max_Long destroys globalToVert, as it has junk soon after. Running again to see how that happens.

Is this routine intended to operate on key-value stores?

@KennethEJansen
Contributor Author

TotalView licenses were all taken, so I wrote some code to try to check this theory:
Gid ifirst=0;
int self2 = PCU_Comm_Self();
APF_ITERATE(GlobalToVert, globalToVert, it) {
Gid gid = it->first;
if(ifirst==0 || ifirst==13437400 ) {
lion_eprint(1, "constructResidence: self=%ld,gid=%ld,ifirst=%ld \n",self2,gid,ifirst);
}
ifirst++;
}

Gid max = getMax(globalToVert);

APF_ITERATE(GlobalToVert, globalToVert, it) {
Gid gid = it->first;
if(ifirst==0 || ifirst==13437400 ) {
lion_eprint(1, "constructResidence: self=%ld,gid=%ld,ifirst=%ld,max=%ld \n",self2,gid,ifirst,max);
}
ifirst++;
}

The idea is to print the first and near-last "key" before and after the computation of the max that TotalView showed destroying the globalToVert data structure.

Curiously, I got this type of output
276 constructResidence: self=138,gid=1847913712,ifirst=0
277 constructResidence: self=138,gid=1861351112,ifirst=13437400

from ranks 0 to 138 for the stuff before the max computation (but nothing from ranks higher and, since the count is odd, one rank only did one of the two writes).

This is not exactly a proof that the max is munging globalToVert, but it is odd, unless you see something I have done badly (yes, I do see that I have a %ld for self, but ...).

@jacobmerson
Contributor

Ok. I might be missing something, but you don't do ifirst=0 after the call to getMax, so I would only expect the second loop to print once. I guess what you are saying is that you get a segfault or something when getMax is called? The getMax function doesn't do anything suspect although I'm not familiar with the details of PCU. The tricky thing about memory errors is that where it crashes may or may not be where the bug is. Print debugging isn't really your friend here because with memory errors something simple like changing the order of statements can change where the bug ends up crashing your code...

For memory bugs like this I've had luck with any of the following approaches in the past:

  1. Use the memory debugger that's part of totalview. When you are starting totalview there is a button to enable the memory debugger. It has various options that have different runtime costs.
  2. Address and undefined-behavior sanitizers. I actually checked this time and the basic ones are part of gcc (example flags below).
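
A way to wire the sanitizers into the existing doConfigure, assuming gcc (both sanitizers exist in gcc 6.2); only the two flag lines are new, the rest of the script stays as it is:

cmake \
  -DCMAKE_C_COMPILER=mpicc \
  -DCMAKE_CXX_COMPILER=mpicxx \
  -DCMAKE_CXX_FLAGS="-g -fno-omit-frame-pointer -fsanitize=address,undefined" \
  -DCMAKE_EXE_LINKER_FLAGS="-fsanitize=address,undefined" \
  ...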

@KennethEJansen
Contributor Author

You are correct. The code did have the ifirst=0.

Magically, I ran it outside of TotalView as I went to bed last night and it ran through. Note this had all your bug fixes except these, which I rolled back because I could not get them to compile. I am offline most of today but am leaving the smoking gun here for further review and/or help with how to get long tags working properly. Also pasting a grep that shows their prolific use.

diff --git a/apf/apfConvertTags.h b/apf/apfConvertTags.h
index 44da7db6..18199918 100644
--- a/apf/apfConvertTags.h
+++ b/apf/apfConvertTags.h
@@ -27,11 +27,11 @@ namespace {
     return m->createIntTag(name,entries);
   }
 
-  template <> inline
-  apf::MeshTag* createTag(apf::Mesh* m,
-      const char* name, const long entries) {
-    return m->createLongTag(name,entries);
-  }
+//  template <> inline
+//  apf::MeshTag* createTag(apf::Mesh* m,
+//      const char* name, const long entries) {
+//    return m->createLongTag(name,entries);
+//  }
 
   template <> inline
   apf::MeshTag* createTag(apf::Mesh* m,
@@ -39,10 +39,10 @@ namespace {
     return m->createDoubleTag(name,entries);
   }
 
-  inline void setEntTag(apf::Mesh* m, apf::MeshTag* t,
-      apf::MeshEntity* e, long* vals) {
-    return m->setLongTag(e,t,vals);
-  }
+//  inline void setEntTag(apf::Mesh* m, apf::MeshTag* t,
+//      apf::MeshEntity* e, long* vals) {
+//    return m->setLongTag(e,t,vals);
+//  }
 
   inline void setEntTag(apf::Mesh* m, apf::MeshTag* t,
       apf::MeshEntity* e, int* vals) {

kjansen@pfe27:~/SCOREC-core/core> grep -ri createLong *
apf/apfMesh2.cc: return m->createLongTag(name.c_str(), size);
apf/apfConvert.cc: out = outMesh->createLongTag(tagName, tagSize);
apf/apfTagData.h: return m->createLongTag(n,s);
apf/apfConvertTags.h:// return m->createLongTag(name,entries);
apf/apfMesh.h: virtual MeshTag* createLongTag(const char* name, int size) = 0;
apf/apfConstruct.cc: MeshTag* matchGidTag = m->createLongTag("matchGids", 1);
apf_sim/apfSIM.cc:MeshTag* MeshSIM::createLongTag(const char* name, int size)
apf_sim/apfSIM.h: MeshTag* createLongTag(const char* name, int size);
mds/apfMDS.cc: MeshTag* createLongTag(const char* name, int size)
mds/apfMDS.cc: apf::MeshTag* vIDTag = mesh->createLongTag("_vert_id", 1);
mds/apfMDS.cc: apf::MeshTag* vIDTag = mesh->createLongTag("_vert_id", 1);
pumi/pumi_mtag.cc:pMeshTag pumi_mesh_createLongTag(pMesh m, const char* name, int size)
pumi/pumi_mtag.cc: return m->createLongTag(name, size);
pumi/pumi.h:pMeshTag pumi_mesh_createLongTag(pMesh m, const char* name, int size);
test/pumi.cc: pMeshTag long_tag = pumi_mesh_createLongTag(m, "long", 1);
test/pumi.cc: pMeshTag longarr_tag=pumi_mesh_createLongTag(m, "long array", 3);
zoltan/apfInterElement.cc: MeshTag* t = m->createLongTag(name, 1);
kjansen@pfe27:~/SCOREC-core/core> grep -ri setLong *
apf/apfConvert.cc: outMesh->setLongTag(newFromOld[e], out, lngData);
apf/apfTagData.h: return m->setLongTag(e,t,data);
apf/apfConvertTags.h:// return m->setLongTag(e,t,vals);
apf/apfMesh.h: virtual void setLongTag(MeshEntity* e, MeshTag* tag, long const* data) = 0;
apf/apfConstruct.cc: m->setLongTag(globalToVert[gid], matchGidTag, &match);
apf_sim/apfSIM.cc:void MeshSIM::setLongTag(MeshEntity* e, MeshTag* tag, long const* data)
apf_sim/apfSIM.h: void setLongTag(MeshEntity* e, MeshTag* tag, long const* data);
mds/apfMDS.cc: void setLongTag(MeshEntity* e, MeshTag* tag, long const* data)
mds/apfMDS.cc: mesh->setLongTag(vit->second, vIDTag, &(vit->first));
mds/apfMDS.cc: mesh->setLongTag(vit->second, vIDTag, &(vit->first));
pumi/pumi_gtag.cc:void pumi_gent_setLongTag (pGeomEnt ent, pTag tag, const long data)
pumi/pumi_gtag.cc:void pumi_gent_setLongArrTag (pGeomEnt ent, pTag tag, const long* data)
pumi/pumi_mtag.cc:void pumi_ment_setLongTag(pMeshEnt e, pMeshTag t, long const* data)
pumi/pumi_mtag.cc: pumi::instance()->mesh->setLongTag(e, t, data);
pumi/pumi.h:void pumi_gent_setLongTag (pGeomEnt ent, pTag tag, const long data);
pumi/pumi.h:void pumi_gent_setLongArrTag (pGeomEnt ent, pTag tag, const long* data);
pumi/pumi.h:void pumi_ment_setLongTag(pMeshEnt e, pMeshTag tag, long const* data);
test/pumi.cc: pumi_gent_setLongTag(ent, long_tag, 3000);
test/pumi.cc: pumi_gent_setLongArrTag (ent, longarr_tag, long_arr);
test/pumi.cc: pumi_ment_setLongTag(ent, long_tag, &long_value);
test/pumi.cc: pumi_ment_setLongTag (ent, longarr_tag, long_arr);
zoltan/apfInterElement.cc: m->setLongTag(s, t, &gid);
kjansen@pfe27:~/SCOREC-core/core>

@KennethEJansen
Contributor Author

Apparently we are NOT out of the woods yet. Chef did not like the mesh, it seems:

kjansen@pfe26:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN190_Parted/Chef/4320-160-Chef> tail -f output.11201203
PUMI Git hash 2.2.0
PUMI version 2.2.0 Git hash a3ee8c9
"../outModel.dmg" and "../LES_Periodic.spj" loaded in 0.003206 seconds
mesh ../../mner/outMesh/ loaded in 551.383934 seconds
number of tet 1768308546 hex 0 prism 3703246344 pyramid 0
mesh entity counts: v 2151489252 e 9490228518 f 12807904473 r 5471554890
signal 11 caught by pcu
/home5/kjansen/SCOREC-core/buildMT_3/test/chef(reel_trace+0x13)[0x13c7fde]
/home5/kjansen/SCOREC-core/buildMT_3/test/chef[0x13c802f]
/lib64/libc.so.6(+0x382c0)[0x2aaab019b2c0]
/nasa/hpe/mpt/2.17r13/lib/libmpi.so(+0x11ec8e)[0x2aaaaf624c8e]
MPT ERROR: 21:12:22: Rank 1755: Send completion to rank 0, status 9: remote invalid request error
1755->0 uses r589i6n4:mlx4_0:1:0x40000000000c0fe -> r509i0n9:0:1:0x40000000000c0fe
MPT ERROR: Rank 1755(g:1755) is aborting with error code 0.
Process ID: 21402, Host: r589i6n4, Program: /home5/kjansen/SCOREC-core/buildMT_3/test/chef
MPT Version: HPE MPT 2.17 11/30/17 08:08:29

MPT: --------stack traceback-------
MPT ERROR: 21:12:22: Rank 1242: Send completion to rank 0, status 9: remote invalid request error
1242->0 uses r513i0n7:mlx4_0:1:0x40000000000c0fe -> r509i0n9:0:1:0x40000000000c0fe
MPT ERROR: Rank 1242(g:1242) is aborting with error code 0.
Process ID: 17826, Host: r513i0n7, Program: /home5/kjansen/SCOREC-core/buildMT_3/test/chef
MPT Version: HPE MPT 2.17 11/30/17 08:08:29

MPT: --------stack traceback-------
MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
aborting job
MPT: Received signal 11

Note that this is an OLD version of chef, but I am not thinking that any of our changes really affect Chef. I am pretty sure we have worked with meshes whose GLOBAL entity counts would overflow ints or even unsigned ints but, if successfully split across 160 ranks, the on-part entity counts are far below that.

I can try to break it into smaller steps. This was breaking it from 160 to 4320 (splitFactor 27) if we think that will make a difference. The strange split factor is an attempt to find a final number close to 64K (160*27*15 = 64800, which is 98.8% of 64K, so I would only waste 1.2% of my resources on Theta if I move the job there), and yet it fits nicely into several of NAS's core counts (20, 24, and 40, but not the 28-core Broadwell).

@KennethEJansen
Contributor Author

KennethEJansen commented Jun 5, 2021

Cut back to a splitFactor of 9 and got this error:

PUMI version 2.2.0 Git hash ba64b1d
"../outModel.dmg" and "../LES_Periodic.spj" loaded in 0.004470 seconds
mesh ../../mner/outMesh/ loaded in 236.142403 seconds
number of tet 1768308546 hex 0 prism 3703246344 pyramid 0
mesh entity counts: v 2151489252 e 9490228518 f 12807904473 r 5471554890
chef: ibdev_multirail.c:4853: ib_progress_receive: Assertion `sender == gps->grank' failed.
signal 6 caught by pcu
/home5/kjansen/SCOREC-core/buildMGEN_Parted/test/chef(reel_trace+0x13)[0x13c842e]
/home5/kjansen/SCOREC-core/buildMGEN_Parted/test/chef[0x13c847f]
/lib64/libc.so.6(+0x382c0)[0x2aaab019b2c0]
/lib64/libc.so.6(gsignal+0x37)[0x2aaab019b247]
/lib64/libc.so.6(abort+0x13a)[0x2aaab019c61a]
/lib64/libc.so.6(+0x31036)[0x2aaab0194036]
/lib64/libc.so.6(+0x310e2)[0x2aaab01940e2]
/nasa/hpe/mpt/2.17r13/lib/libmpi.so(+0x48818)[0x2aaaaf54e818]
MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
aborting job
MPT: Received signal 6

on Cascade Lake. I have never seen anything like this before, so perhaps it is a hardware/software mismatch. I am trying one more time with splitFactor 2 before retreating back to more familiar hardware.

That failed, and so did the familiar hardware, so I think we have to conclude that matchedNodeElmReader is making an error that verify did not find.

I am going back to try to fix the LongTag issue. I uncommented those lines of code and a compile generates this error:

[100%] Building CXX object test/CMakeFiles/matchedNodeElmReader.dir/matchedNodeElmReader.cc.o
In file included from /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:6:0:
/home5/kjansen/SCOREC-core/core/apf/apfConvertTags.h:31:17: error: template-id 'createTag' for 'apf::MeshTag* {anonymous}::createTag(apf::Mesh*, const char*, long int)' does not match any template declaration
apf::MeshTag* createTag(apf::Mesh* m,
^~~~~~~~~~~~~~~
/home5/kjansen/SCOREC-core/core/apf/apfConvertTags.h:18:17: note: candidate is: template apf::MeshTag* {anonymous}::createTag(apf::Mesh*, const char*, int)
apf::MeshTag* createTag(apf::Mesh*,
^~~~~~~~~
test/CMakeFiles/matchedNodeElmReader.dir/build.make:62: recipe for target 'test/CMakeFiles/matchedNodeElmReader.dir/matchedNodeElmReader.cc.o' failed
make[3]: *** [test/CMakeFiles/matchedNodeElmReader.dir/matchedNodeElmReader.cc.o] Error 1
CMakeFiles/Makefile2:3574: recipe for target 'test/CMakeFiles/matchedNodeElmReader.dir/all' failed
make[2]: *** [test/CMakeFiles/matchedNodeElmReader.dir/all] Error 2
CMakeFiles/Makefile2:3586: recipe for target 'test/CMakeFiles/matchedNodeElmReader.dir/rule' failed
make[1]: *** [test/CMakeFiles/matchedNodeElmReader.dir/rule] Error 2
Makefile:940: recipe for target 'matchedNodeElmReader' failed
make: *** [matchedNodeElmReader] Error 2
kjansen@pfe27:~/SCOREC-core/buildMGEN_Parted>

With no formal training in C++ and not even grasping what a templated function is, I am out of my element here.

I suppose it is obvious that I copied and pasted the int form of the functions and replaced int with long, which is obviously not sufficient.

@KennethEJansen
Contributor Author

KennethEJansen commented Jun 5, 2021

I just noticed that the double specialization used int for entries, so maybe that is my problem. This compiled, so I will give it a go.

diff --git a/apf/apfConvertTags.h b/apf/apfConvertTags.h
index 44da7db6..020916fa 100644
--- a/apf/apfConvertTags.h
+++ b/apf/apfConvertTags.h
@@ -29,7 +29,7 @@ namespace {
   
   template <> inline
   apf::MeshTag* createTag<long>(apf::Mesh* m,
-      const char* name, const long entries) {
+      const char* name, const int entries) {
     return m->createLongTag(name,entries);
   }

kjansen@pfe27:~/SCOREC-core/core>
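
The underlying rule, for anyone hitting this later: an explicit specialization has to match the primary template's parameter list exactly, with only T changing; since the primary createTag takes "const int entries", the long specialization must too. Toy sketch mirroring the helper:

template <class T>
apf::MeshTag* createTag(apf::Mesh* m, const char* name, const int entries);

template <> inline
apf::MeshTag* createTag<long>(apf::Mesh* m, const char* name, const int entries) {
  return m->createLongTag(name, entries);   // OK: same signature, T = long
}
// writing "const long entries" there does not match any declared template,
// which is exactly the compile error above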

@KennethEJansen
Contributor Author

That build leads to different crashing in Chef (after rerunning matchedNodeElmReader, of course):

[0] Zoltan ERROR in DD_Update_Local (line 278 of ../../src/Utilities/DDirectory/DD_Update.c): Invalid input parameter
[0] Zoltan ERROR in DD_Update_Local (line 278 of ../../src/Utilities/DDirectory/DD_Update.c): Invalid input parameter
[0] Zoltan ERROR in DD_Update_Local (line 278 of ../../src/Utilities/DDirectory/DD_Update.c): Invalid input parameter

a gazillion times.

Is Zoltan trying to go back to global ID's?

I will try some alternate partitioning approaches.

@KennethEJansen
Contributor Author

KennethEJansen commented Jun 5, 2021

Switching from graph to zrib dodged that issue and successfully planned the partition. I suppose I should not be surprised that the actual migration is taking forever.

kjansen@pfe26:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN190_Parted/Chef/4320-160-Chef> tail -f output.11203347
PUMI Git hash 2.2.0
PUMI version 2.2.0 Git hash 81fe4b9cc8713722c8878c3d866231c1fe2086bd
"../outModel.dmg" and "../LES_Periodic.spj" loaded in 0.042673 seconds
mesh ../../mner/outMesh/ loaded in 1195.702922 seconds
number of tet 1768308546 hex 0 prism 3703246344 pyramid 0
mesh entity counts: v 2151489252 e 9490228518 f 12807904473 r 5471554890
planned Zoltan split factor 3 to target imbalance 1.010000 in 518.890855 seconds
mesh expanded from 160 to 480 parts in 790.456072 seconds

and nothing for about an hour. I only have 26 minutes left according to top, and I am guessing it won't complete. Even if it does, with the two reading processes consuming 38 and 35 GB and needing to push 1/3 of that onto two others, that puts me at 38*(1+2/3) + 35*(1+2/3) = 121 GB, which is perilously close to the 128 GB on this node. Other nodes might be worse, so I am going to have to find beefier nodes for this first step, since I can't do the 1->2 split and still land on the part count I want (only odd factors get me to 64800 parts starting from 160, e.g., 160*3*9*15 = 64800, which is the closest I can get to 64*1024 if I want this to be a balanced optimum between ALCF's 64 cores per node and NAS's 40 cores per node).

iB Mem:  13111882+total, 11654528+used, 14573548 free,        0 buffers
KiB Swap:        0 total,        0 used,        0 free.   170956 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND                                                                    
70519 kjansen   20   0  0.162t 0.038t  28940 R 100.00 30.94  93:47.33 chef                                                                       
70522 kjansen   20   0  0.160t 0.035t  28936 R 100.00 28.68  93:47.62 chef                                                                       
70520 kjansen   20   0  0.132t 9.313g  28900 R 100.00 7.448  93:55.64 chef                                                                       
70523 kjansen   20   0  0.132t 9.088g  29416 R 100.00 7.267  93:55.67 chef                                                                       
70521 kjansen   20   0  0.132t 8.730g  29732 R 100.00 6.981  93:55.68 chef                                                                       
70524 kjansen   20   0  0.130t 6.865g  28268 R 100.00 5.490  93:55.71 chef  

So it is really hard to guess how long this is going to take as this is the (lack of) progress when I am about out of time

   PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND                                                                    
70519 kjansen   20   0  0.162t 0.038t  28940 R 100.00 30.94 118:08.04 chef                                                                       
70522 kjansen   20   0  0.160t 0.035t  28936 R 100.00 28.80 118:08.68 chef                                                                       
70524 kjansen   20   0  0.133t 9.431g  28268 R 100.00 7.542 118:16.77 chef                                                                       
70520 kjansen   20   0  0.132t 9.313g  28900 R 100.00 7.448 118:16.72 chef                                                                       
70523 kjansen   20   0  0.132t 9.133g  29416 R 100.00 7.304 118:16.74 chef                                                                       
70521 kjansen   20   0  0.132t 8.730g  29732 R 100.00 6.981 118:16.75 chef  

@KennethEJansen
Contributor Author

Local zrib came close to working before running out of memory AFTER the mesh had been migrated. I am trying it again on a larger number of nodes and am hopeful it will work.

This raises an efficiency question. @matthb2 and @cwsmith, do either of you know how to control the process layout to make something like local zrib completely "on-node"? It looks to me like the first stage "works". That is to say, if I want to part from 160 to 480 and I run this on 160 nodes, chef is smart enough to put one part on each node so that, on each node, I have one process reading the super-heavy part and 2 processes "waiting" until the read and the partition from 160 to 480 are planned. My question is, is it possible to force that local partitioning to be completely local to a node -- that is, that the two processes waiting to receive the read mesh split into 3 parts get the local part numbers derived from the "read" part number?

I guess this must mean that I need my process id to follow chef's notion of new part numbering, right?

Maybe this is already the case, but if it is not, I would think it could be a big win (though I guess only if the code exploited this locality and/or MPI was smart enough to see that the message did not have to leave the node).

@KennethEJansen
Contributor Author

The following worked:

  1. Mesh generation of a 2.15B-node, 5.47B-element (3.70B wedges and 1.77B tets) mesh in serial, streamed to 160 files with linearly distributed nodes and approximately matching ranges of elements.

  2. matchedNodeElmReader ingested those files in parallel and produced an MDS mesh using 60 Cascade Lake nodes (each restricted to 3 processes). The PBS log reported the following:
    seconds to create mesh 919.473

  • verifying tags: fathers2D
    mesh verified in 389.217627 seconds
    mesh outMesh/ written in 48.277404 seconds
    writeVtuFile into buffers: 5.725096 seconds
    writeVtuFile buffers to disk: 1.328838 seconds
    vtk files rendered written in 8.456587 seconds
    CPU Time Used : 01:06:49
    Real Memory Used : 84629872kb
    Walltime Used : 00:23:44
  3. chef initially struggled to part this mesh until I retreated to local zrib, which allowed me to get it parted from 160 to 480 using 160 Broadwell nodes, each restricted to 3 processes. Logged info:
    Current directory is /nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN190_Parted/Chef/4320-160-Chef
    PUMI Git hash 2.2.0
    PUMI version 2.2.0 Git hash 81fe4b9
    "../outModel.dmg" and "../LES_Periodic.spj" loaded in 0.002827 seconds
    mesh ../../mner/outMesh/ loaded in 246.321636 seconds
    number of tet 1768308546 hex 0 prism 3703246344 pyramid 0
    mesh entity counts: v 2151489252 e 9490228518 f 12807904473 r 5471554890
    planned Zoltan split factor 3 to target imbalance 1.010000 in 22.594112 seconds
    mesh expanded from 160 to 480 parts in 110.375169 seconds
    mesh migrated from 160 to 480 in 2317.685050 seconds
    PARMA_STATUS postSplit disconnected 386 0.804
    PARMA_STATUS postSplit neighbors 8 4.642
    PARMA_STATUS postSplit smallest side of max neighbor part 1253
    PARMA_STATUS postSplit num parts with max neighbors 12
    PARMA_STATUS postSplit empty parts 0
    PARMA_STATUS postSplit small neighbor counts 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0
    PARMA_STATUS postSplit weighted vtx 3603574603.0 8977855.0 2439156.0 7507447.090
    PARMA_STATUS postSplit weighted edge 13834161337.0 44416947.0 8528092.0 28821169.452
    PARMA_STATUS postSplit weighted face 15702142104.0 63000269.0 8523617.0 32712796.050
    PARMA_STATUS postSplit weighted rgn 5471554890.0 27434732.0 2434680.0 11399072.688
    PARMA_STATUS postSplit owned bdry vtx 1448573566 8976010 0 3017861.596
    PARMA_STATUS postSplit shared bdry vtx 2898269233 8977855 118213 6038060.902
    PARMA_STATUS postSplit model bdry vtx 18066815 3502417 9495 37639.198
    PARMA_STATUS postSplit sharedSidesToElements 2.001 0.009 1.285
    PARMA_STATUS postSplit entity imbalance : 1.20 1.54 1.93 2.41
    mesh reordered in 221.988584 seconds
    max vertex load imbalance of partitioned mesh = 1.195860
    ratio of sum of all vertices to sum of owned vertices = 1.674921
    max region (3D) or face (2D) load imbalance of partitioned mesh = 2.406751
    getGrowthCurves: warning! not implemented for MDS mesh
    Found tag fathers2D
    generated output structs in 78.074114 seconds
    solution written in 87.502694 seconds
    mesh bz2:mdsMesh_bz2/ written in 417.615845 seconds
    geombc file written in 3.826407 seconds
  • verifying tags: fathers2D
    mesh verified in 234.789929 seconds

Job Resource Usage Summary for 11205702.pbspl1.nas.nasa.gov

CPU Time Used            : 03:18:45
Real Memory Used         : 21050824kb
Walltime Used            : 01:07:31
Exit Status              : 0
  4. Splitting by 9 from 480 to 4320 worked with global graph (I am really unsure why global graph was blowing up on 160 to 480) using 240 Broadwell nodes, each running 18 processes (note: 160 Broadwell nodes with 27 processes ran out of memory, so loading 3 initial parts per node and asking each to split by 9 failed, but dropping back to 2 initial parts per node worked). Here is the log from that:

Current directory is /nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN190_Parted/Chef/4320-480-Chef
PUMI Git hash 2.2.0
PUMI version 2.2.0 Git hash 81fe4b9
"../outModel.dmg" and "../LES_Periodic.spj" loaded in 0.004401 seconds
mesh bz2:../480-160-Chef/mdsMesh_bz2/ loaded in 151.345034 seconds
number of tet 1768308546 hex 0 prism 3703246344 pyramid 0
mesh entity counts: v 2151489252 e 9490228518 f 12807904473 r 5471554890
planned Zoltan split factor 9 to target imbalance 1.010000 in 244.990717 seconds
mesh expanded from 480 to 4320 parts in 43.342638 seconds
mesh migrated from 480 to 4320 in 1383.880790 seconds
PARMA_STATUS postSplit disconnected 1 0.003
PARMA_STATUS postSplit neighbors 40 12.977
PARMA_STATUS postSplit smallest side of max neighbor part 1
PARMA_STATUS postSplit num parts with max neighbors 2
PARMA_STATUS postSplit empty parts 0
PARMA_STATUS postSplit small neighbor counts 1:16 2:10 3:10 4:4 5:16 6:4 7:6 8:8 9:8 10:4
PARMA_STATUS postSplit weighted vtx 2254357585.0 529733.0 441757.0 521842.034
PARMA_STATUS postSplit weighted edge 9728657223.0 3578498.0 1740384.0 2252003.987
PARMA_STATUS postSplit weighted face 12945858856.0 6047156.0 2152503.0 2996726.587
PARMA_STATUS postSplit weighted rgn 5471554890.0 2997581.0 853875.0 1266563.632
PARMA_STATUS postSplit owned bdry vtx 98104860 59598 0 22709.458
PARMA_STATUS postSplit shared bdry vtx 198583509 62550 21623 45968.405
PARMA_STATUS postSplit model bdry vtx 15468842 26756 0 3580.750
PARMA_STATUS postSplit sharedSidesToElements 0.083 0.014 0.054
PARMA_STATUS postSplit entity imbalance : 1.02 1.59 2.02 2.37
mesh reordered in 28.269127 seconds
max vertex load imbalance of partitioned mesh = 1.015121
ratio of sum of all vertices to sum of owned vertices = 1.047813
max region (3D) or face (2D) load imbalance of partitioned mesh = 2.366704
getGrowthCurves: warning! not implemented for MDS mesh
Found tag fathers2D
generated output structs in 10.640220 seconds
pcu peers 4320 max group size 4096 posix groups 2
solution written in 6.323300 seconds
pcu peers 4320 max group size 4096 posix groups 2
mesh bz2:mdsMesh_bz2/ written in 50.941237 seconds
pcu peers 4320 max group size 4096 posix groups 2
geombc file written in 2.029042 seconds

  • verifying tags: fathers2D
    mesh verified in 14.829367 seconds

Job Resource Usage Summary for 11207073.pbspl1.nas.nasa.gov

CPU Time Used            : 09:39:42
Real Memory Used         : 44068708kb
Walltime Used            : 00:33:14
Exit Status              : 0
  5. The final partition (for now) from 4320 to 38880 (splitFactor 9) ran on 486 Skylake nodes, each running 80 processes on 40 cores, with the following log:
    Current directory is /nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN190_Parted/Chef/38880-4320-Chef
    PUMI Git hash 2.2.0
    PUMI version 2.2.0 Git hash 81fe4b9
    "../outModel.dmg" and "../LES_Periodic.spj" loaded in 2.583941 seconds
    pcu peers 4320 max group size 4096 posix groups 2
    mesh bz2:../4320-480-Chef/mdsMesh_bz2/ loaded in 19.140128 seconds
    number of tet 1768308546 hex 0 prism 3703246344 pyramid 0
    mesh entity counts: v 2151489252 e 9490228518 f 12807904473 r 5471554890
    planned Zoltan split factor 9 to target imbalance 1.010000 in 37.364211 seconds
    mesh expanded from 4320 to 38880 parts in 6.185460 seconds
    mesh migrated from 4320 to 38880 in 106.696375 seconds
    PARMA_STATUS postSplit disconnected 1 0.003
    PARMA_STATUS postSplit neighbors 46 14.141
    PARMA_STATUS postSplit smallest side of max neighbor part 17
    PARMA_STATUS postSplit num parts with max neighbors 1
    PARMA_STATUS postSplit empty parts 0
    PARMA_STATUS postSplit small neighbor counts 1:656 2:478 3:394 4:448 5:324 6:398 7:348 8:454 9:350 10:460
    PARMA_STATUS postSplit weighted vtx 2382598233.0 63391.0 51160.0 61280.819
    PARMA_STATUS postSplit weighted edge 10010836664.0 410226.0 195085.0 257480.367
    PARMA_STATUS postSplit weighted face 13099832291.0 679345.0 235601.0 336929.843
    PARMA_STATUS postSplit weighted rgn 5471554890.0 333226.0 91675.0 140729.292
    PARMA_STATUS postSplit owned bdry vtx 216506303 14542 0 5568.578
    PARMA_STATUS postSplit shared bdry vtx 445225600 15466 3877 11451.276
    PARMA_STATUS postSplit model bdry vtx 16059495 7160 0 413.053
    PARMA_STATUS postSplit sharedSidesToElements 0.169 0.022 0.115
    PARMA_STATUS postSplit entity imbalance : 1.03 1.59 2.02 2.37
    mesh reordered in 4.683730 seconds
    max vertex load imbalance of partitioned mesh = 1.034435
    ratio of sum of all vertices to sum of owned vertices = 1.107418
    max region (3D) or face (2D) load imbalance of partitioned mesh = 2.367851
    getGrowthCurves: warning! not implemented for MDS mesh
    Found tag fathers2D
    generated output structs in 2.497168 seconds
    pcu peers 38880 max group size 4096 posix groups 10
    solution written in 5.776681 seconds
    pcu peers 38880 max group size 4096 posix groups 10
    mesh bz2:mdsMesh_bz2/ written in 33.627953 seconds
    pcu peers 38880 max group size 4096 posix groups 10
    geombc file written in 16.964271 seconds
  • verifying tags: fathers2D
    mesh verified in 12.251053 seconds
    =>> PBS: job killed: walltime 7254 exceeded limit 7200

Job Resource Usage Summary for 11207325.pbspl1.nas.nasa.gov

CPU Time Used            : 06:48:06
Real Memory Used         : 81796536kb
Walltime Used            : 02:01:07
Exit Status              : 271

Note that if you add up the chef times you see that chef completed in a couple of minutes, but I followed the chef run with a conversion from posix to syncio format and this apparently produced a hang, viz.:
kjansen@pfe27:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN190_Parted/Chef/38880-4320-Chef> ls -alt |more
total 48
drwxrwx---+ 4 kjansen a1983 4096 Jun 7 05:03 .
-rwxrwx---+ 1 kjansen a1983 1329 Jun 7 04:38 runC.sh
-rw------- 1 kjansen a1983 3470 Jun 7 03:59 ChefTetDz6Bump.o11207325
-rw-rw----+ 1 kjansen a1983 0 Jun 7 02:02 outputC.11207325
-rw-rw----+ 1 kjansen a1983 2453 Jun 7 02:02 output.11207325
drwxrwx---+ 21 kjansen a1983 4096 Jun 7 02:02 38880-procs_case
drwxr-x---+ 21 kjansen a1983 4096 Jun 7 02:01 mdsMesh_bz2
-rwxrwx---+ 1 kjansen a1983 1323 Jun 7 01:57 run.sh.11207325

My forensics suggest that the chef part of the job started at 1:57 and finished at 2:02 (5 minutes), and the converter, which pipes its log to a separate file, opened that log file but did nothing for the remaining 115 minutes. Stderr reports nothing beyond the PBS job-killed line.
The converter is not SCOREC-core code, so I think we can declare victory on this ticket (though it is perhaps still interesting to have a dialog about my previous post on process/partition layout).

I really want to thank you both for your help with this very challenging case.

@KennethEJansen
Copy link
Contributor Author

KennethEJansen commented Jul 16, 2021

UPDATE3: Realizing I should tell you all what branch these fixes (and the remaining problem) are on:
globalToVertCorruption

Not exactly an apt name but that is what it looked like from the first signs of a problem.

The above celebration was premature. A grid twice as big has been failing miserably, and we have found several more overflows. @jacobmerson, we did finally get the -fsanitize=undefined flag added with @cwsmith's help, and it was great at finding a few overflowing int products that I did not realize needed casting so that the product is computed as a long and does not overflow before the assignment.
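For anyone following along, the pattern the sanitizer kept catching looks like the toy example below (not SCOREC code; the counts are made up but sized like this mesh): the multiply of two ints is done in 32-bit arithmetic and overflows before the result ever reaches the long, so the cast has to be applied to an operand, not to the product.

#include <cstdio>

int main() {
  int nVertsPerPart = 13454275;   // hypothetical per-part vertex count
  int bytesPerVert  = 160;        // hypothetical record size
  // Wrong: 13454275 * 160 = 2152684000 > INT_MAX, so this signed multiply
  // overflows (exactly what -fsanitize=undefined reports) before the
  // assignment to a long can help.
  long bad  = nVertsPerPart * bytesPerVert;
  // Right: cast one operand first so the multiply is done as a long.
  long good = static_cast<long>(nVertsPerPart) * bytesPerVert;
  printf("bad=%ld good=%ld\n", bad, good);
  return 0;
}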

However, we are apparently still missing something as verify is still reporting
3205 seconds to create mesh 1848.589
3206 - verifying tags: fathers2D
3207 ERROR PCU message size exceeds INT_MAX... exiting
3208 MPT ERROR: Rank 297(g:297) received signal SIGABRT/SIGIOT(6).

And a stack trace of:

3245 MPT: (gdb) #0 0x00002aaaaecaa7da in waitpid () from /lib64/libpthread.so.0
3246 MPT: #1 0x00002aaaaf60d906 in mpi_sgi_system (
3247 MPT: #2 MPI_SGI_stacktraceback (
3248 MPT: header=header@entry=0x7fffffff74c0 "MPT ERROR: Rank 297(g:297) rec eived signal SIGABRT/SIGIOT(6).\n\tProcess ID: 30670, Host: r135i6n9, Progr am: /home5/kjansen/SCOREC-core/buildMoreSan/test/matchedNodeElmReader\n\tMP T Version: HPE MPT 2.17 "...) at sig.c:339
3249 MPT: #3 0x00002aaaaf60db08 in first_arriver_handler (signo=signo@entry=6,
3250 MPT: stack_trace_sem=stack_trace_sem@entry=0x2aaab2d80080) at sig.c:488
3251 MPT: #4 0x00002aaaaf60deeb in slave_sig_handler (signo=6, siginfo=,
3252 MPT: extra=) at sig.c:563
3253 MPT: #5
3254 MPT: #6 0x00002aaab0eb0247 in raise () from /lib64/libc.so.6
3255 MPT: #7 0x00002aaab0eb161a in abort () from /lib64/libc.so.6
3256 MPT: #8 0x00000000006ee72e in pcu_pmpi_send2 (m=,
3257 MPT: tag=, comm=)
3258 MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu_pmpi.c:64
3259 MPT: #9 0x00000000006ed3f4 in send_peers (t=)
3260 MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu_msg.c:152
3261 MPT: #10 0x00000000006ed803 in pcu_msg_send (m=m@entry=0xc51d80 <global_pms g>)
3262 MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu_msg.c:161
3263 MPT: #11 0x00000000006e9062 in PCU_Comm_Send ()
3264 MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu.c:162
3265 MPT: #12 0x00000000006a8d81 in apf::verifyRemoteCopies (m=0x11194ce10)
3266 MPT: at /home5/kjansen/SCOREC-core/core/apf/apfVerify.cc:294
3267 MPT: #13 apf::verify (m=0x11194ce10, abort_on_error=true)
3268 MPT: at /home5/kjansen/SCOREC-core/core/apf/apfVerify.cc:829
3269 MPT: #14 0x000000000043f66e in main (argc=, argv=)
3270 MPT: at /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:85 4

The following errors were also reported early in the run (presumably from -fsanitize=undefined):

/home5/kjansen/SCOREC-core/core/pcu/pcu_coll.c:34:3: runtime error: null pointer passed as argument 1, which is declared to never be null
96 /home5/kjansen/SCOREC-core/core/pcu/pcu_coll.c:34:3: runtime error: null pointer passed as argument 2, which is declared to never be null

but it does not crash or give me the stack that leads to this code. grep also did not give me clues how the code gets here.

I can't get TotalView to launch on this executable either. I guess I will try an executable built without -fsanitize=undefined.
UPDATE: I did that and determined that this may be coming from line 718, as those error lines occur NRANKS times (once per rank):

714 FILE* fc = fopen(filename , "r");
715 PCU_ALWAYS_ASSERT(fc);
716 getNumVerts(fc,mesh.numVerts);
717 mesh.localNumVerts=mesh.numVerts;
718 mesh.numVerts=PCU_Add_Long(mesh.numVerts);
719

Though, going further, I also see that line 83 of apfBox.cc hits that function. My TotalView (really my C++) skills are not good enough to discern which one is getting to that point with NULLs to merge, but I suppose both are one-per-rank events and thus both are candidates. Probably neither is causing the INT_MAX issue though.

UPDATE2: TotalView confirms that apf::BoxBuilder is the stack source that gets there with NULL for the first 2 args. Unless someone thinks this could screw things up, I am going to assume this message from UBSAN is not affecting my problem.
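For reference, the UBSAN complaint above is the one emitted when a nonnull-attributed call such as memcpy receives a null pointer, even with a zero size; a two-line repro (my guess at the mechanism, not a confirmed diagnosis of pcu_coll.c:34) is:

#include <cstring>

int main() {
  char* dst = nullptr;
  char* src = nullptr;
  // With -fsanitize=undefined this prints:
  //   runtime error: null pointer passed as argument 1, which is declared to never be null
  std::memcpy(dst, src, 0);   // effectively a no-op, but still flagged
  return 0;
}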

The above is all with the GNU 6.2 compiler. I am also contemplating a 9.3 build, as NAS now has it available. I can dig back over your messages, but if you have recommendations regarding flags to try, that would be great (since, thanks to your ticket, I now know how to set "extra" flags and not have them clobbered).

@KennethEJansen
Copy link
Contributor Author

My conversion from 320 parts to 160 parts completed, and this was the case I fed to TotalView that I described above. The conversion code was written two hours past when I usually sleep, so it could be wrong, BUT it does crash during the creation of the remotes rather than during verify:
2920 MPT: #7 0x00002aaab0eb161a in abort () from /lib64/libc.so.6
2921 MPT: #8 0x00000000006ee72e in pcu_pmpi_send2 (m=,
2922 MPT: tag=, comm=)
2923 MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu_pmpi.c:64
2924 MPT: #9 0x00000000006ed3f4 in send_peers (t=)
2925 MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu_msg.c:152
2926 MPT: #10 0x00000000006ed803 in pcu_msg_send (m=m@entry=0xc51d80 <global_pm sg>)
2927 MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu_msg.c:161
2928 MPT: #11 0x00000000006e9062 in PCU_Comm_Send ()
2929 MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu.c:162
2930 MPT: #12 0x00000000004a8ab0 in align_copies (net=0xf6107c68, m=0xf6107740)
2931 MPT: at /home5/kjansen/SCOREC-core/core/mds/mds_apf.c:281
2932 MPT: #13 0x00000000004a9efe in mds_align_remotes (m=)
2933 MPT: at /home5/kjansen/SCOREC-core/core/mds/mds_apf.c:301
2934 MPT: #14 0x0000000000458ba5 in apf::alignMdsRemotes (in=in@entry=0xf610391 0)
2935 MPT: at /home5/kjansen/SCOREC-core/core/mds/apfMDS.cc:1026
2936 MPT: #15 0x000000000043f47c in main (argc=, argv=)
2937 MPT: at /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:8 08

So there's that. While waiting for another 160-node allocation I have been reviewing the mesh, and I don't see any errors in it. On the plus side, this is the only mesh I can get into TotalView (256-core license, so the 320-part original mesh is a no-go).

@KennethEJansen
Copy link
Contributor Author

KennethEJansen commented Jul 17, 2021

I restored the writing of debug information, compiled with gcc 9.3 (the newest available), searched the docs for the flags @jacobmerson listed above (found none of them, so I stuck with -fsanitize=undefined), and reran the 320-part case (no surgery to come down to 160). It looks like pretty much the same crash, BUT now we have debug.txt information that was written at the time of the remote construction. Please critique the following autopsy:
1) seconds to create mesh 2286.417

  • verifying tags: fathers2D
    ERROR PCU message size exceeds INT_MAX... exiting
    MPT ERROR: Rank 297(g:297) received signal SIGABRT/SIGIOT(6).
    Process ID: 34985, Host: r145i6n1, Program: /home5/kjansen/SCOREC-core/build9pt3/test/matchedNodeElmReader
    MPT Version: HPE MPT 2.17 11/30/17 08:08:29
2) Can I trust that Rank 297 was the one reporting an INT_MAX, given the following stack trace?

MPT: --------stack traceback-------
MPT: Attaching to program: /proc/34985/exe, process 34985
MPT: (No debugging symbols found in /home5/kjansen/Utilities/simModSuite/14.0-180813dev/lib/x64_rhel7_gcc48/psKrnl/libpskernel.so)
MPT: (No debugging symbols found in /usr/lib64/libbz2.so.1)
MPT: (No debugging symbols found in /lib64/libpthread.so.0)
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: (No debugging symbols found in /usr/lib64/libcpuset.so.1)
MPT: (No debugging symbols found in /usr/lib64/libbitmask.so.1)
MPT: (No debugging symbols found in /lib64/libm.so.6)
MPT: (No debugging symbols found in /lib64/libc.so.6)
MPT: (No debugging symbols found in /lib64/ld-linux-x86-64.so.2)
MPT: (No debugging symbols found in /lib64/libdl.so.2)
MPT: (No debugging symbols found in /lib64/librt.so.1)
MPT: (No debugging symbols found in /usr/lib64/libibverbs.so.1)
MPT: (No debugging symbols found in /usr/lib64/libnl-route-3.so.200)
MPT: (No debugging symbols found in /usr/lib64/libnl-3.so.200)
MPT: (No debugging symbols found in /usr/lib64/libmthca-rdmav2.so)
MPT: (No debugging symbols found in /usr/lib64/libmlx5-rdmav2.so)
MPT: (No debugging symbols found in /usr/lib64/libmlx4-rdmav2.so)
MPT: (No debugging symbols found in /usr/lib64/libcxgb3-rdmav2.so)
MPT: 0x00002aaaaecaa7da in waitpid () from /lib64/libpthread.so.0
MPT: warning: File "/nasa/pkgsrc/sles12/2021Q1/gcc9/lib64/libstdc++.so.6.0.28-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
MPT: To enable execution of this file add
MPT: add-auto-load-safe-path /nasa/pkgsrc/sles12/2021Q1/gcc9/lib64/libstdc++.so.6.0.28-gdb.py
MPT: line to your configuration file "/home5/kjansen/.gdbinit".
MPT: To completely disable this security protection add
MPT: set auto-load safe-path /
MPT: line to your configuration file "/home5/kjansen/.gdbinit".
MPT: For more information about this security protection see the
MPT: "Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
MPT: info "(gdb)Auto-loading safe path"
MPT: Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-114.8.3.x86_64 libbz2-1-debuginfo-1.0.6-30.8.1.x86_64 libcxgb3-rdmav2-debuginfo-1.3.1-6.2.x86_64 libibverbs-debuginfo-41mlnx1-OFED.4.9.0.0.7.49017.x86_64 libmlx4-debuginfo-41mlnx1-OFED.4.7.3.0.3.49017.x86_64 libmlx5-debuginfo-41mlnx1-OFED.4.9.0.1.2.49017.x86_64 libmthca-rdmav2-debuginfo-1.0.6-5.2.x86_64 libnl3-200-debuginfo-3.2.23-4.4.6.x86_64
MPT: (gdb) #0 0x00002aaaaecaa7da in waitpid () from /lib64/libpthread.so.0
MPT: #1 0x00002aaaaf60d906 in mpi_sgi_system (
MPT: #2 MPI_SGI_stacktraceback (
MPT: header=header@entry=0x7fffffffbf80 "MPT ERROR: Rank 297(g:297) received signal SIGABRT/SIGIOT(6).\n\tProcess ID: 34985, Host: r145i6n1, Program: /home5/kjansen/SCOREC-core/build9pt3/test/matchedNodeElmReader\n\tMPT Version: HPE MPT 2.17 11"...) at sig.c:339
MPT: #3 0x00002aaaaf60db08 in first_arriver_handler (signo=signo@entry=6,
MPT: stack_trace_sem=stack_trace_sem@entry=0x2aaab2c20080) at sig.c:488
MPT: #4 0x00002aaaaf60deeb in slave_sig_handler (signo=6, siginfo=,
MPT: extra=) at sig.c:563
MPT: #5
MPT: #6 0x00002aaab0d53247 in raise () from /lib64/libc.so.6
MPT: #7 0x00002aaab0d5461a in abort () from /lib64/libc.so.6
MPT: #8 0x00000000007b7ee0 in pcu_pmpi_send2 (m=0x4244ba628, tag=,
MPT: comm=) at /home5/kjansen/SCOREC-core/core/pcu/pcu_pmpi.c:64
MPT: #9 pcu_pmpi_send2 (m=0x4244ba628, tag=, comm=)
MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu_pmpi.c:60
MPT: #10 0x00000000007b5c7a in send_peers (t=)
MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu_msg.c:152
MPT: #11 0x00000000007b6439 in pcu_msg_send (m=m@entry=0xc731c0 <global_pmsg>)
MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu_msg.c:161
MPT: #12 0x00000000007b047d in PCU_Comm_Send ()
MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu.c:162
MPT: #13 0x000000000074b467 in apf::verifyRemoteCopies (m=0x11196c560)
MPT: at /home5/kjansen/SCOREC-core/core/apf/apfVerify.cc:294
MPT: #14 apf::verify (m=, abort_on_error=true)
MPT: at /home5/kjansen/SCOREC-core/core/apf/apfVerify.cc:829
MPT: #15 0x000000000044b029 in main (argc=, argv=)
MPT: at /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:854
MPT: (gdb) A debugging session is active.
MPT:
MPT: Inferior 1 [process 34985] will be detached.
MPT:
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/34985/exe, process 34985
MPT: [Inferior 1 (process 34985) detached]

MPT: -----stack traceback ends-----

3) If yes, then looking at debug297.txt I find the following logging (line#, rank, what, who, info), shown only where "what" changes (lines with dots mean the same "what" continues).
    1 sending remotes: gid 3985683327 to 296
    2 sending remotes: gid 3985683328 to 296
    .
    .
    10236348 sending remotes: gid 3995919674 to 296
    10236349 sending remotes: gid 3999724462 to 298
    .
    .
    19885836 sending remotes: gid 4009373949 to 298
    19885837 receive remotes: from 296 gid 3985683327
    .
    .
    30122184 receive remotes: from 296 gid 3995919674
    30122185 receive remotes: from 298 gid 3999724462
    .
    .
    39771672 receive remotes: from 298 gid 4009373949
    39771673 297 mysize 13454275
    39771674 297 sending start 3995919675 n 13454275 to 297
    39771675 297 receiving start 3995919675 n 13454275 from 297
    39771676 297 found match 3995919217 at gid 3995920277
    .
    .
    39793573 297 found match 4009374714 at gid 4009373062
    39793574 297 requesting matches of gid 3985683327 isShared 1 isOwned 0 from 296
    .
    .
    63484196 297 requesting matches of gid 4009373949 isShared 1 isOwned 0 from 297
    63484197 297 packing i 602 gid 3995920277 matchGid 3995919217 to 297
    .
    .
    63521295 297 packing i 13453387 gid 4009373062 matchGid 4009374714 to 298
    63521296 297 attaching match 3985684419 to gid 3985683327
    .
    .
    63558726 297 attaching match 4009374714 to gid 4009373062
    63558727 297 packing pointer to 296 gid 3985683327 vert 0x3941
    .
    .
    87249349 297 packing pointer to 297 gid 4009373949 vert 0xb4bebf1
    87249350 297 unpacking pointer from 297 gid 3995919675 vert 0x19
    .
    .
    110353112 297 unpacking pointer from 298 gid 4009373949 vert 0x9
    110353113 297 packing req ptr to 296 gid 3985683327 matchGid 3985684419
    .
    .
    110390543 297 packing req ptr to 298 gid 4009373062 matchGid 4009374714
    110390544 297 unpacking ptr req from 296 gid 3995919217 matchGid 3995920277
    .
    .
    110427642 297 unpacking ptr req from 298 gid 4009374714 matchGid 4009373062
    110427643 297 packing match ptr to 296 gid 3995919217 matchGid 3995920277 vert 0x3919 owner 297
    .
    .
    110495142 297 packing match ptr to 298 gid 4009374714 matchGid 4009373062 vert 0x4881 owner 298
    110495143 297 unpacked match ptr from 296 gid 3985683327 matchGid 3985684419 matchPtr 0x321b4d1 owner 296
    110495144 297 unpacked match ptr from 296 gid 3985683327 matchGid 3985684419 matchPtr 0x3921 owner 297
    .
    .
    110532905 297 unpacked match ptr from 297 gid 3999724461 matchGid 3999722761 matchPtr 0x3b8b191 owner 297
    110532906 297 unpacked match ptr from 297 gid 3999724462 matchGid 3999726162 matchPtr 0x3b95281 owner 297
    110532907 297 unpacked match ptr from 297 gid 3999724462 matchGid 3999726162 matchPtr 0x4871 owner 298
    110532908 297 unpacked match ptr from 297 gid 3999726162 matchGid 3999724462 matchPtr 0x3b91d61 owner 297
    110532909 297 unpacked match ptr from 297 gid 3999726162 matchGid 3999724462 matchPtr 0x4891 owner 298
    .
    .
    110563306 297 unpacked match ptr from 298 gid 4009373062 matchGid 4009374714 ma chPtr 0x4869 owner 298
    110563307 297 add remote copy match ptr to 296 gid 3985683327
    .
    .
    110578839 297 add remote copy match ptr to 296 gid 3995919217
    110578840 297 add remote copy match ptr to 298 gid 3999724462
    .
    .
    110594040 297 add remote copy match ptr to 298 gid 4009373062

Note that the slab nature of the mesh and its ordering is going to make process m share a boundary with m-1 and m+1. Looking at these numbers they seem reasonable, and I don't see any reason why verify should come to think that there are messages greater than INT_MAX. The above shows 110,594,040 TOTAL messages, which is far less than INT_MAX. The mesh has 4B vertices distributed across 320 processes in this first stream-partition, so that is about 13M vertices per part. It seems impossible to generate 2B packets of information, so I guess something is again getting corrupted.

That said, I feel like we are probably logging the PROCESS of finding the remotes, but I am not sure we are actually logging the remotes of a given part, which is what verify is checking, right? That is, once the process is complete, the part-boundary information knows nothing about these long global IDs; they are only a crutch to allow parts to recognize when they have remotes on other ranks. So, while I think we now have that process working correctly with longs, it is still not clear to me why verify thinks there is a problem (but this is likely because I don't understand what PCU is doing and lack the C++ skills to figure it out).

Sleep may help but I doubt it. I am also running a case with verify turned off just for grins.

@KennethEJansen
Copy link
Contributor Author

KennethEJansen commented Jul 17, 2021

Humbly reporting that the above is full of misstatements. While there are ONLY 13.5M verts on a part, I failed to remember that mds tracks edge and face remotes as well (e = 4.5v, f = 6v, both approximate for this mesh, which is a mix of wedges and tets) AND the code also sends messages to self. Finally, it packs this all into ONE message for all entities and overflows INT_MAX. I broke the messages up into one entity dimension per message, and it looks like this keeps the message size < INT_MAX.
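A back-of-the-envelope check of that (the per-record byte counts below are my assumptions, not the actual PCU packing layout) shows how easily one combined message blows past INT_MAX:

#include <cstdio>

int main() {
  long v = 13454275;        // owned vertices on rank 297, from the log above
  long e = (9 * v) / 2;     // ~4.5 edges per vertex for this wedge/tet mesh
  long f = 6 * v;           // ~6 faces per vertex
  long entities = v + e + f;        // ~155M entities being reported
  long bytesPerRecord = 24;         // assumed: gid + pointer + rank, roughly
  long copiesToSend   = 2;          // assumed: at least two copies per shared entity
  long bytes = entities * bytesPerRecord * copiesToSend;   // ~7.4e9 > 2.1e9
  printf("entities=%ld, estimated bytes=%ld, INT_MAX=%d\n",
         entities, bytes, 2147483647);
  return 0;
}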

I think it would not be TOO hard to put in some logic to do or not do this split based on an upper bound on the message size computed from #e, #f, and #v on a part. Of course some headroom could also be gained by not sending messages to yourself, as I think this is ALWAYS going to be the largest message and would seem to be altogether unnecessary. That is pervasive in the code though, so, unless I am missing a need for such a thing, this was a deep-rooted mistake that will not be easy to remove. It seems like it would be influencing performance pretty heavily as well?
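A sketch of that guard (not existing PUMI code; the bytes-per-entity bound is an assumption) could be as simple as:

#include <climits>
#include <cstdio>

// Decide whether to fall back to one message per entity dimension.
// nVerts/nEdges/nFaces are per-part entity counts; bytesPerEntity is an
// assumed upper bound on the packed record size per entity.
bool needPerDimensionSends(long nVerts, long nEdges, long nFaces,
                           long bytesPerEntity) {
  long upperBound = (nVerts + nEdges + nFaces) * bytesPerEntity;
  return upperBound > static_cast<long>(INT_MAX);
}

int main() {
  long v = 13454275, e = (9 * v) / 2, f = 6 * v;   // counts from the log above
  printf("split by dimension? %d\n", needPerDimensionSends(v, e, f, 24));
  return 0;
}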

@cwsmith
Copy link
Contributor

cwsmith commented Jul 18, 2021

I think some of the self sends are used to deal with matched meshes.
