Performance issue with road_usa when using SSSP algorithm. #938

Open

neoblizz opened this issue Jan 31, 2022 · 0 comments
Labels
🐲 enhancement: Add or request enhancements to existing functionalities within gunrock.
🍻 help wanted: Extra attention is needed.
❓ question: Usage or code base related questions.

Comments

@neoblizz
Member

Only tested and profiled with road_usa so far. The problem is that the algorithm takes 7-8 seconds on my machine with a GTX 1080, which is much slower than the CPU equivalent. When profiled over that range (i.e., timing just enactor.enact()), the NVTX range reports roughly the same time, around 8 seconds. However, the GPU activities and API calls below only total around 800 ms (about 10x less than the wall-clock time).
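
For context, this is roughly the instrumentation being described: timing just enactor.enact() inside an NVTX range. A minimal sketch; everything except enactor.enact() (header choice, helper name, template parameter) is assumed for illustration and is not taken from the gunrock API:

```cpp
// Sketch: time enactor.enact() and expose it as an NVTX range named "SSSP".
// Only enactor.enact() comes from the report above; the rest is illustrative.
#include <cuda_runtime.h>
#include <nvToolsExt.h>  // NVTX v2 header; NVTX3 users would include <nvtx3/nvToolsExt.h>
#include <chrono>

template <typename enactor_t>
float time_sssp(enactor_t& enactor) {
  nvtxRangePushA("SSSP");  // shows up as the "SSSP" range in the profile
  auto t0 = std::chrono::high_resolution_clock::now();

  enactor.enact();          // run the algorithm
  cudaDeviceSynchronize();  // include all outstanding GPU work in the measurement

  auto t1 = std::chrono::high_resolution_clock::now();
  nvtxRangePop();
  return std::chrono::duration<float, std::milli>(t1 - t0).count();  // elapsed ms
}
```

Measured this way, the host-side elapsed time and the NVTX range should agree, which is consistent with both reporting ~8.3 s below.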

Time(%) Time (ms) Calls Avg Min Max Name
Range:
100.00% 8312.74 1 8.31274s 8.31274s 8.31274s SSSP
GPU activities:
40.98% 102.32 6262 16.339us 6.2720us 28.737us ZN7gunrock9operators7advance12block_mapped19block_mapped_kernelILj256ELj1ELNS0_17advance_io_type_tE1ELS4_1ENS_5graph7graph_tILNS_6memory14memory_space_tE0EiifJNS5_11graph_csr_tIiifEENS5_11empty_csc_tENS5_11empty_coo_tEEEEiiZNS_4sssp9enactor_tINSE_9problem_tISD_NSE_7param_tIiEENSE_8result_tIifEEEEE4loopERNS_4cuda15multi_context_tEEUlRKiSR_SR_RKfE_EEvT3_T6_PT4_SY_mPT5
26.43% 65.998 5129 12.867us 11.296us 14.625us ZN3cub18DeviceReduceKernelINS_18DeviceReducePolicyIiiiN6thrust4plusIiEEE9Policy600ENS2_8cuda_cub26transform_input_iterator_tIiNS2_17counting_iteratorImNS2_11use_defaultESA_SA_EEZN7gunrock9operators7advance21compute_output_lengthINSC_5graph7graph_tILNSC_6memory14memory_space_tE0EiifJNSG_11graph_csr_tIiifEENSG_11empty_csc_tENSG_11empty_coo_tEEEENSC_8frontier10frontier_tIiiLNSP_15frontier_kind_tE0ELNSP_15frontier_view_tE0EEEEEmRT_RT0_RNSC_4cuda18standard_context_tEbEUlRKmE_EEPiiS4_EEvSW_T1_T2_NS_13GridEvenShareIS17_EET3
17.13% 42.779 6262 6.8310us 2.2080us 14.145us ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorImNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN7gunrock9operators6filter6bypass7executeINSC_5graph7graph_tILNSC_6memory14memory_space_tE0EiifJNSH_11graph_csr_tIiifEENSH_11empty_csc_tENSH_11empty_coo_tEEEEZNSC_4sssp9enactor_tINSQ_9problem_tISP_NSQ_7param_tIiEENSQ_8result_tIifEEEEE4loopERNSC_4cuda15multi_context_tEEUlRKiE_NSC_8frontier10frontier_tIiiLNS15_15frontier_kind_tE0ELNS15_15frontier_view_tE0EEEEEvRT_T0_PT1_S1E_RNSZ_18standard_context_tEEUlRKmE_NS5_21always_true_predicateEEExEES1L_xEEvS1C_S1D
5.60% 13.987 6263 2.2330us 1.8240us 2.9440us void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr, int>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr, int>, unsigned long>(thrust::device_ptr, int)
4.39% 10.959 5129 2.1360us 1.8880us 3.6160us void cub::DeviceReduceSingleTileKernel<cub::DeviceReducePolicy<int, int, int, thrust::plus>::Policy600, int*, int*, int, thrust::plus, int>(int, int, int, thrust::plus, cub::DeviceReducePolicy<int, int, int, thrust::plus>::Policy600)
3.10% 7.7297 1134 6.8160us 2.6880us 13.953us ZN3cub28DeviceReduceSingleTileKernelINS_18DeviceReducePolicyIiiiN6thrust4plusIiEEE9Policy600ENS2_8cuda_cub26transform_input_iterator_tIiNS2_17counting_iteratorImNS2_11use_defaultESA_SA_EEZN7gunrock9operators7advance21compute_output_lengthINSC_5graph7graph_tILNSC_6memory14memory_space_tE0EiifJNSG_11graph_csr_tIiifEENSG_11empty_csc_tENSG_11empty_coo_tEEEENSC_8frontier10frontier_tIiiLNSP_15frontier_kind_tE0ELNSP_15frontier_view_tE0EEEEEmRT_RT0_RNSC_4cuda18standard_context_tEbEUlRKmE_EEPiiS4_iEEvSW_T1_T2_T3_T4
2.37% 5.9053 6263 942ns 832ns 2.4320us [CUDA memcpy DtoH]
API calls:
50.11% 290.03 30179 9.6100us 3.1000us 1.0450ms cudaLaunchKernel
49.89% 288.71 6263 46.097us 33.000us 706.20us cudaMemcpyAsync
Total: 828.418 ms
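
(Side note: the long Name entries above are mangled C++ symbols, and the paste appears to have dropped their leading underscore. To map a kernel back to a gunrock operator, the symbol can be demangled, e.g. with c++filt on the command line or, as sketched below, with abi::__cxa_demangle; the symbol used here is a short hypothetical placeholder, not one of the actual kernels.)

```cpp
// Sketch: demangle an Itanium-ABI symbol to a readable C++ name.
// The symbol below is a hypothetical placeholder ("gunrock::sssp::run()"), not a real
// kernel; for the real kernels, restore the leading '_' dropped in the paste and pass
// the full name from the profile.
#include <cxxabi.h>
#include <cstdlib>
#include <iostream>

int main() {
  const char* mangled = "_ZN7gunrock4sssp3runEv";  // hypothetical example symbol
  int status = 0;
  char* readable = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status == 0 && readable) {
    std::cout << readable << "\n";  // prints: gunrock::sssp::run()
    std::free(readable);
  } else {
    std::cout << "demangling failed (status " << status << ")\n";
  }
  return 0;
}
```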

This issue was pointed out by @jdwapman. I think further profiling is required to figure out where on the CPU side most of this slowdown is coming from (that is, if I am understanding the profiled result above correctly). My initial thought was that the ~30,000 calls to cudaLaunchKernel, and their launch overhead, may be the cause. But at around 10 us each in the worst case that only adds up to ~0.3 seconds, which does not come close to explaining the 8-second total.
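
One way to sanity-check the launch-overhead hypothesis is a quick empty-kernel micro-benchmark on the same machine. A rough sketch (the launch count mirrors the ~30,000 cudaLaunchKernel calls above; actual per-launch latency depends on the driver and GPU):

```cpp
// Sketch: estimate average kernel-launch latency with an empty kernel.
// If this reports ~10 us per launch, ~30,000 launches account for only ~0.3 s,
// so launch overhead alone cannot explain the ~8 s wall-clock time.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void empty_kernel() {}

int main() {
  const int num_launches = 30000;  // roughly the number of cudaLaunchKernel calls above

  empty_kernel<<<1, 256>>>();  // warm-up so context creation is not measured
  cudaDeviceSynchronize();

  auto t0 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < num_launches; ++i)
    empty_kernel<<<1, 256>>>();
  cudaDeviceSynchronize();
  auto t1 = std::chrono::high_resolution_clock::now();

  double total_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
  std::printf("total %.1f ms, %.2f us per launch\n",
              total_ms, 1e3 * total_ms / num_launches);
  return 0;
}
```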

@neoblizz added the ❓ question, 🐲 enhancement, and 🍻 help wanted labels on Jan 31, 2022
@neoblizz transferred this issue from gunrock/essentials on Nov 5, 2022