Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove workRequest layer from GPU code #166

Merged
merged 48 commits into from
Apr 22, 2024
Merged

Conversation

spencerw
Copy link
Member

@spencerw spencerw commented Feb 7, 2024

This gets rid of the workRequest functions which allowed ChaNGa to asynchronously interact with the GPU. Instead, GPU requests are more directly handled by submitting memory transfer/allocation requests, kernel launches and callback assignments to CUDA streams.

Each TreePiece is assigned a CUDA stream in a round-robin fashion. The amount of available streams can be controlled via the runtime parameter 'numStreams'.

Remove WorkRequest code from TreePiecePartListDataTransferLocal
@spencerw
Copy link
Member Author

spencerw commented Feb 23, 2024

@trquinn I'm having a look through the PR changes to see where it would be helpful to add comments and further documentation. Obviously, a docstring with a @brief and some @param descriptions would probably be useful above any functions I added, especially ones that get called from another module.

I'm wondering what to do about most of the functions in HostCUDA.cu, such as TreePieceCellListDataTransferLocal. The function parameters are all contained with CudaRequest objects. I'm not sure what the best way to use doxygen here would be. Would just a '@brief' comment above each be sufficient? I didn't any examples elsewhere in the code to follow.

On a related note: do you know how doxygen handles ifdefs (such as TreePiece::commenceCalculateGravityLocal)? If there are multiple versions of a function, does it work to have multiple docstrings?

@spencerw
Copy link
Member Author

spencerw commented Apr 1, 2024

@trquinn I think this PR is ready for review and merge. I misplaced an #ifdef inside of commenceCalculateGravityLocal that was preventing the dual tree walk from initializing properly when CUDA was defined, but GPU_LOCAL_TREE_WALK was not, hence some TreePieces skipping their gravity calculation.

'teststep' and 'testcosmo' are now passing on Frontera and Lonestar, with and without the local gpu tree walk enabled, on both single and multiple nodes.

@trquinn
Copy link
Member

trquinn commented Apr 2, 2024

I'm getting an abort on line 844 of DataManager.cpp. "partIndex" is vestigial, and its declaration at line 791 should also be deleted.
This problem must have been introduced in an earlier pull request, but we should take the opportunity to fix it.

@spencerw
Copy link
Member Author

spencerw commented Apr 2, 2024

I'm getting an abort on line 844 of DataManager.cpp. "partIndex" is vestigial, and its declaration at line 791 should also be deleted. This problem must have been introduced in an earlier pull request, but we should take the opportunity to fix it.

For some reason that 'CkAssert' on line 844 isn't firing when I run my tests. I just tried stepping through the function with GDB and it goes straight from line 839 to line 848 without a clear reason why. This was compiled with gcc/12.2.0 on Fontera with optimization disabled.

Breakpoint 4, DataManager::serializeLocal (this=0x2aaab80c8d90, nodeRoot=0x2aaab8dbd6d0) at DataManager.cpp:838
838	  savedNumTotalParticles = numParticles;
(gdb) n
839	  savedNumTotalNodes = localMoments.length();
(gdb) n
848	  size_t sLocalParts = numParticles*sizeof(CompactPartData);
(gdb) list
843	#endif
844	  CkAssert(partIndex == numParticles);
845	#if COSMO_PRINT_BK > 1
846	  CkPrintf("(%d): DM->GPU local tree\n", CkMyPe());
847	#endif
848	  size_t sLocalParts = numParticles*sizeof(CompactPartData);
849	  size_t sLocalMoments = localMoments.length()*sizeof(CudaMultipoleMoments);
850	  allocatePinnedHostMemory((void **)&bufLocalParts, sLocalParts);
851	  allocatePinnedHostMemory((void **)&bufLocalMoments, sLocalMoments);

In any case, I've removed the unneeded references to partIndex in serializeLocal. I already did an upstream merge of the gpu_multicopy changes into this PR and the auto-merge created a bit of a mess I had to clean up.

@trquinn
Copy link
Member

trquinn commented Apr 3, 2024

CkAssert()s don't fire if Charm is compiled "--with-production"

@spencerw
Copy link
Member Author

spencerw commented Apr 4, 2024

CkAssert()s don't fire if Charm is compiled "--with-production"

That was indeed the issue. Just re-ran the tests on Frontera without '--with-production' and confirmed everything is passing.

Copy link
Member

@trquinn trquinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bunch of code cleanups recommended.

CudaFunctions.h Show resolved Hide resolved
CudaFunctions.h Show resolved Hide resolved
HostCUDA.cu Outdated Show resolved Hide resolved
HostCUDA.cu Outdated Show resolved Hide resolved
HostCUDA.cu Outdated Show resolved Hide resolved
ParallelGravity.cpp Outdated Show resolved Hide resolved
ParallelGravity.cpp Outdated Show resolved Hide resolved
ParallelGravity.h Outdated Show resolved Hide resolved
ParallelGravity.h Outdated Show resolved Hide resolved
ParallelGravity.h Outdated Show resolved Hide resolved
@spencerw
Copy link
Member Author

Requested changes have been made and I think this is ready for merge.

See comment about docstrings for TreePieceDataTransferBasic. Following the old convention, I added these to HostCUDA.cu, but let me know if you would prefer to move this and the other function documentation to the respective .h files instead.

@trquinn trquinn merged commit 6b41f91 into N-BodyShop:master Apr 22, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants