Remove workRequest layer from GPU code #166

spencerw · 2024-02-07T07:26:10Z

This gets rid of the workRequest functions which allowed ChaNGa to asynchronously interact with the GPU. Instead, GPU requests are more directly handled by submitting memory transfer/allocation requests, kernel launches and callback assignments to CUDA streams.

Each TreePiece is assigned a CUDA stream in a round-robin fashion. The number of available streams can be controlled via the runtime parameter 'numStreams'.
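For readers unfamiliar with the round-robin scheme, here is a minimal sketch of the idea in plain C++. It is not the code from this PR: `assignStreams` is a hypothetical helper, and plain integer slots stand in for the `cudaStream_t` handles a real GPU build would use, so that the example runs without CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from ChaNGa): map each TreePiece index to one of
// numStreams stream slots in round-robin order. In real GPU code each slot
// would hold a cudaStream_t; integers keep the sketch runnable anywhere.
std::vector<int> assignStreams(std::size_t numPieces, int numStreams) {
    assert(numStreams > 0);  // at least one stream is required
    std::vector<int> streamOf(numPieces);
    for (std::size_t i = 0; i < numPieces; ++i) {
        // Piece i takes slot (i mod numStreams), so pieces cycle through slots.
        streamOf[i] = static_cast<int>(i % static_cast<std::size_t>(numStreams));
    }
    return streamOf;
}
```

Calling `assignStreams(5, 2)` would place pieces 0, 2 and 4 on slot 0 and pieces 1 and 3 on slot 1. Work submitted to the same slot would run in order, while work on different slots could overlap.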

spencerw · 2024-02-23T22:06:57Z

@trquinn I'm having a look through the PR changes to see where it would be helpful to add comments and further documentation. Obviously, a docstring with a @brief and some @param descriptions would probably be useful above any functions I added, especially ones that get called from another module.

I'm wondering what to do about most of the functions in HostCUDA.cu, such as TreePieceCellListDataTransferLocal. The function parameters are all contained within CudaRequest objects. I'm not sure what the best way to use doxygen here would be. Would just a '@brief' comment above each be sufficient? I didn't find any examples elsewhere in the code to follow.

On a related note: do you know how doxygen handles ifdefs (such as TreePiece::commenceCalculateGravityLocal)? If there are multiple versions of a function, does it work to have multiple docstrings?
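One possible shape for such a docstring is sketched below. Everything in it is an assumption made for illustration: `CudaRequestSketch`, `bucketsToTransfer` and the `SKETCH_USE_GPU` macro are hypothetical and are not ChaNGa names. The sketch places a single comment above an `#ifdef`-selected function. As far as I know, Doxygen runs its own preprocessor pass (controlled by `ENABLE_PREPROCESSING` and `PREDEFINED` in the Doxyfile), so only the branch that is active under that configuration would be documented.

```cpp
#include <cassert>

// Hypothetical stand-in for a request object; the real CudaRequest differs.
struct CudaRequestSketch {
    int numBuckets;  ///< number of buckets covered by this request
    bool remote;     ///< true if the request refers to remote data
};

/// @brief Count the buckets a request would transfer to the device.
///
/// The values of interest live inside the request object, so they are
/// documented on the struct fields rather than as separate @param entries.
///
/// @param data Request describing the transfer (hypothetical type).
/// @return Number of buckets, or 0 when the GPU path is compiled out.
#ifdef SKETCH_USE_GPU
int bucketsToTransfer(const CudaRequestSketch &data) {
    assert(data.numBuckets >= 0);
    return data.numBuckets;
}
#else
int bucketsToTransfer(const CudaRequestSketch &data) {
    assert(data.numBuckets >= 0);
    return 0;  // GPU path disabled in this build of the sketch
}
#endif
```

With `SKETCH_USE_GPU` left undefined, as here, the second definition is the one that is compiled.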

spencerw · 2024-04-01T05:56:51Z

@trquinn I think this PR is ready for review and merge. I misplaced an #ifdef inside of commenceCalculateGravityLocal that prevented the dual tree walk from initializing properly when CUDA was defined but GPU_LOCAL_TREE_WALK was not, which caused some TreePieces to skip their gravity calculation.
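The general failure mode can be illustrated with a small sketch. It does not reproduce the actual code in commenceCalculateGravityLocal, which is not shown in this thread. The macros `SKETCH_CUDA` and `SKETCH_GPU_LOCAL_TREE_WALK` and both functions are hypothetical stand-ins, and the sketch only assumes that walk state must be initialized whenever the GPU is not doing the tree walk itself.

```cpp
#include <cassert>

#define SKETCH_CUDA 1  // GPU build enabled in this sketch
// SKETCH_GPU_LOCAL_TREE_WALK is deliberately left undefined.

// Misplaced guard: initialization sits in the non-CUDA branch only, so a
// CUDA build without the GPU tree walk never initializes the walk state.
bool initWalkMisplaced() {
    bool walkInitialized = false;
#ifdef SKETCH_CUDA
#ifdef SKETCH_GPU_LOCAL_TREE_WALK
    // GPU would walk the tree itself; no CPU walk state needed here.
#endif
#else
    walkInitialized = true;  // only reached in a CPU-only build
#endif
    return walkInitialized;
}

// Corrected guard: skip initialization only when both macros are defined.
bool initWalkCorrected() {
    bool walkInitialized = false;
#if defined(SKETCH_CUDA) && defined(SKETCH_GPU_LOCAL_TREE_WALK)
    // GPU would walk the tree itself; no CPU walk state needed here.
#else
    walkInitialized = true;  // CPU walk builds the interaction lists
#endif
    return walkInitialized;
}
```

Under this macro combination the first function leaves the state uninitialized while the second initializes it, which mirrors the symptom described above.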

'teststep' and 'testcosmo' are now passing on Frontera and Lonestar, with and without the local gpu tree walk enabled, on both single and multiple nodes.

trquinn · 2024-04-02T00:03:18Z

I'm getting an abort on line 844 of DataManager.cpp. "partIndex" is vestigial, and its declaration at line 791 should also be deleted.
This problem must have been introduced in an earlier pull request, but we should take the opportunity to fix it.

spencerw · 2024-04-02T00:45:47Z

> I'm getting an abort on line 844 of DataManager.cpp. "partIndex" is vestigial, and its declaration at line 791 should also be deleted. This problem must have been introduced in an earlier pull request, but we should take the opportunity to fix it.

For some reason that 'CkAssert' on line 844 isn't firing when I run my tests. I just tried stepping through the function with GDB and it goes straight from line 839 to line 848 without a clear reason why. This was compiled with gcc/12.2.0 on Frontera with optimization disabled.

Breakpoint 4, DataManager::serializeLocal (this=0x2aaab80c8d90, nodeRoot=0x2aaab8dbd6d0) at DataManager.cpp:838
838	  savedNumTotalParticles = numParticles;
(gdb) n
839	  savedNumTotalNodes = localMoments.length();
(gdb) n
848	  size_t sLocalParts = numParticles*sizeof(CompactPartData);
(gdb) list
843	#endif
844	  CkAssert(partIndex == numParticles);
845	#if COSMO_PRINT_BK > 1
846	  CkPrintf("(%d): DM->GPU local tree\n", CkMyPe());
847	#endif
848	  size_t sLocalParts = numParticles*sizeof(CompactPartData);
849	  size_t sLocalMoments = localMoments.length()*sizeof(CudaMultipoleMoments);
850	  allocatePinnedHostMemory((void **)&bufLocalParts, sLocalParts);
851	  allocatePinnedHostMemory((void **)&bufLocalMoments, sLocalMoments);

In any case, I've removed the unneeded references to partIndex in serializeLocal. I already did an upstream merge of the gpu_multicopy changes into this PR and the auto-merge created a bit of a mess I had to clean up.

trquinn · 2024-04-03T16:39:53Z

CkAssert()s don't fire if Charm is compiled "--with-production"

spencerw · 2024-04-04T01:56:33Z

> CkAssert()s don't fire if Charm is compiled "--with-production"

That was indeed the issue. Just re-ran the tests on Frontera without '--with-production' and confirmed everything is passing.
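The behavior is analogous to the standard `assert` macro under `NDEBUG`. The sketch below shows the pattern with a hypothetical `SKETCH_ASSERT` macro and a hypothetical `SKETCH_PRODUCTION` flag. It is not Charm++'s actual definition of CkAssert, which is not shown in this thread. When the check expands to nothing, no code is generated for that source line, which would be consistent with GDB stepping from line 839 directly to line 848.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

#define SKETCH_PRODUCTION 1  // stand-in for a production-style build

#ifdef SKETCH_PRODUCTION
// Production: the check expands to nothing and the condition is never evaluated.
#define SKETCH_ASSERT(cond) ((void)0)
#else
// Debug: report the failed condition and abort.
#define SKETCH_ASSERT(cond)                                       \
    do {                                                          \
        if (!(cond)) {                                            \
            std::fprintf(stderr, "assertion failed: %s\n", #cond); \
            std::abort();                                         \
        }                                                         \
    } while (0)
#endif

// Hypothetical function with a stale check, loosely modeled on the
// vestigial partIndex comparison discussed above.
int serializeSketch(int numParticles) {
    int partIndex = 0;  // never updated, so the check below is stale
    SKETCH_ASSERT(partIndex == numParticles);
    (void)partIndex;    // silence unused-variable warnings in production
    return numParticles * 2;
}
```

With `SKETCH_PRODUCTION` defined, `serializeSketch(10)` returns normally even though the stale condition is false. Removing the define would make the same call abort.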

trquinn

A bunch of code cleanups recommended.

CudaFunctions.h

HostCUDA.cu

ParallelGravity.cpp

ParallelGravity.h

spencerw · 2024-04-21T21:23:02Z

Requested changes have been made and I think this is ready for merge.

See comment about docstrings for TreePieceDataTransferBasic. Following the old convention, I added these to HostCUDA.cu, but let me know if you would prefer to move this and the other function documentation to the respective .h files instead.

spencerw added 30 commits November 17, 2023 13:12

Invoke gpuLocalTreeWalk without WorkRequest, device memory pointer managed by DataManager

bce639e

Stream handled by DataManager

82551eb

Transfer particle data back to host and deallocate device memory

2ef0f19

Rewrite Ewald calculation without WR

a5831db

Finish Ewald GPU calculation without workRequest

a0a764d

Use double-pointers for device memory

75dd54a

All kernel launches now managed by a single stream

3f741ac

cudaStreamSynchronize before data transfer callback

d3ef55b

Finish ewald implementation

bd3bb6a

Fix segfaults during ewald kernel

8596999

Decouple streams from TreePieces

7b38be6

Finish rewrite of TreePieceCellListDataTransferLocal

ead9ab9

Device pointer and stream information now stored by TreePieces

19619c9

Remove WorkRequest code from TreePiecePartListDataTransferLocal

GPU device pointers for remote functions handled explicitly

347d56c

Various cleanup

67677c4

Add user events

62a6e2d

Add EwaldGPU trace event

083f8c6

Merge in gpu_multicopy changes

bacd69a

Missing callback for TreePieceCellListDataTransferRemote

8366958

Update verbose HostCUDA function outputs

bacfb5f

Uncomment and rewrite TreePiecePartListDataTransferLocalSmallPhase

260be44

Consistent placing of HAPI_TRACE functions

081ae76

Tracing event for TreePiecePartListDataTransferLocalSmallPhase

6209ce0

Missing callback for TreePiecePartListDataTransferRemoteResume

4909005

Update docstrings for HostCUDA functions

b0b2e1f

Rearrange HostCUDA functions for consistency

fa65cc5

Error checking for CUDA functions

a7594a5

Remove workRequest instrumentation

1398d22

Error checking for cudaFree calls outside of HostCUDA

b112203

Cleanup formatting and indentation

a00bd1a

spencerw added 6 commits February 6, 2024 16:15

Remove unnecessary cudaStreamSynchronize calls

627f884

Device bucket and list pointers handled by HostCUDA functions

92044d8

Remove unused code associated with freeing memory after GPU phase

a8bea29

Remove unused arguments for TransferParticleVarsBack

b60bd10

d_missedParts, d_missedNodes handled entirely within HostCUDA.cu

b8203bc

Misplaced function header from prior upstream merge

16bf606

spencerw added 3 commits February 27, 2024 12:01

Add more comments

3a7c12a

Remove more unused code

0ca8444

Merge remote-tracking branch 'upstream/master' into nowr

611ed4b

Incorporate gpu_multicopy changes

spencerw force-pushed the nowr branch from 8b4f999 to 611ed4b Compare March 2, 2024 06:33

spencerw added 2 commits March 29, 2024 18:17

TreeWalk state not properly initalized when GPU used for interaction list calculation

50c3ac5

Remove unnecessary cudaStreamSynchronize calls

83bb251

Remove unused partIndex variable in DataManager::serializeLocal

6768f6d

trquinn requested changes Apr 18, 2024

spencerw added 6 commits April 21, 2024 15:46

CudaDevPtr allocation handled by local stack

fff7110

Add docstrings for TreePieceDataTransferBasic and TreePieceDataTransferBasicCleanup

9cc0d46

Whitespace cleanup

3bcf1be

Fix out of order comments in adjust()

9c38c5b

Fix docstring for TransferParticleVarsBack

47b32e1

Remove unused function freeDeviceMemory

0ee3d4d

trquinn approved these changes Apr 22, 2024

trquinn merged commit 6b41f91 into N-BodyShop:master Apr 22, 2024
2 checks passed
