Performance issue with road_usa when using SSSP algorithm. #938

Open

neoblizz opened this issue Jan 31, 2022 · 0 comments
Labels
🐲 enhancement: Add or request enhancements to existing functionalities within gunrock.
🍻 help wanted: Extra attention is needed.
❓ question: Usage or code base related questions.

Comments

@neoblizz
Member

Only tested and profiled with road_usa so far. The problem is that the algorithm takes 7-8 seconds on my machine with a GTX 1080, which is much slower than the CPU equivalent. When profiled over that range (i.e., timing just enactor.enact()), the NVTX range reports roughly the same time, around 8 seconds. However, the GPU activities and API calls below only total around 800 ms (about 10x less than the wall-clock time).
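
For context, this is roughly the instrumentation being described: timing just enactor.enact() inside an NVTX range. A minimal sketch; everything except enactor.enact() (header choice, helper name, template parameter) is assumed for illustration and is not taken from the gunrock API:

```cpp
// Sketch: time enactor.enact() and expose it as an NVTX range named "SSSP".
// Only enactor.enact() comes from the report above; the rest is illustrative.
#include <cuda_runtime.h>
#include <nvToolsExt.h>  // NVTX v2 header; NVTX3 users would include <nvtx3/nvToolsExt.h>
#include <chrono>

template <typename enactor_t>
float time_sssp(enactor_t& enactor) {
  nvtxRangePushA("SSSP");  // shows up as the "SSSP" range in the profile
  auto t0 = std::chrono::high_resolution_clock::now();

  enactor.enact();          // run the algorithm
  cudaDeviceSynchronize();  // include all outstanding GPU work in the measurement

  auto t1 = std::chrono::high_resolution_clock::now();
  nvtxRangePop();
  return std::chrono::duration<float, std::milli>(t1 - t0).count();  // elapsed ms
}
```

Measured this way, the host-side elapsed time and the NVTX range should agree, which is consistent with both reporting ~8.3 s below.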

Time(%) Time (ms) Calls Avg Min Max Name
Range:
100.00% 8312.74 1 8.31274s 8.31274s 8.31274s SSSP
GPU activities:
40.98% 102.32 6262 16.339us 6.2720us 28.737us ZN7gunrock9operators7advance12block_mapped19block_mapped_kernelILj256ELj1ELNS0_17advance_io_type_tE1ELS4_1ENS_5graph7graph_tILNS_6memory14memory_space_tE0EiifJNS5_11graph_csr_tIiifEENS5_11empty_csc_tENS5_11empty_coo_tEEEEiiZNS_4sssp9enactor_tINSE_9problem_tISD_NSE_7param_tIiEENSE_8result_tIifEEEEE4loopERNS_4cuda15multi_context_tEEUlRKiSR_SR_RKfE_EEvT3_T6_PT4_SY_mPT5
26.43% 65.998 5129 12.867us 11.296us 14.625us ZN3cub18DeviceReduceKernelINS_18DeviceReducePolicyIiiiN6thrust4plusIiEEE9Policy600ENS2_8cuda_cub26transform_input_iterator_tIiNS2_17counting_iteratorImNS2_11use_defaultESA_SA_EEZN7gunrock9operators7advance21compute_output_lengthINSC_5graph7graph_tILNSC_6memory14memory_space_tE0EiifJNSG_11graph_csr_tIiifEENSG_11empty_csc_tENSG_11empty_coo_tEEEENSC_8frontier10frontier_tIiiLNSP_15frontier_kind_tE0ELNSP_15frontier_view_tE0EEEEEmRT_RT0_RNSC_4cuda18standard_context_tEbEUlRKmE_EEPiiS4_EEvSW_T1_T2_NS_13GridEvenShareIS17_EET3
17.13% 42.779 6262 6.8310us 2.2080us 14.145us ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fINS_17counting_iteratorImNS_11use_defaultES8_S8_EEPiNS5_14no_stencil_tagEZN7gunrock9operators6filter6bypass7executeINSC_5graph7graph_tILNSC_6memory14memory_space_tE0EiifJNSH_11graph_csr_tIiifEENSH_11empty_csc_tENSH_11empty_coo_tEEEEZNSC_4sssp9enactor_tINSQ_9problem_tISP_NSQ_7param_tIiEENSQ_8result_tIifEEEEE4loopERNSC_4cuda15multi_context_tEEUlRKiE_NSC_8frontier10frontier_tIiiLNS15_15frontier_kind_tE0ELNS15_15frontier_view_tE0EEEEEvRT_T0_PT1_S1E_RNSZ_18standard_context_tEEUlRKmE_NS5_21always_true_predicateEEExEES1L_xEEvS1C_S1D
5.60% 13.987 6263 2.2330us 1.8240us 2.9440us void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr, int>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr, int>, unsigned long>(thrust::device_ptr, int)
4.39% 10.959 5129 2.1360us 1.8880us 3.6160us void cub::DeviceReduceSingleTileKernel<cub::DeviceReducePolicy<int, int, int, thrust::plus>::Policy600, int*, int*, int, thrust::plus, int>(int, int, int, thrust::plus, cub::DeviceReducePolicy<int, int, int, thrust::plus>::Policy600)
3.10% 7.7297 1134 6.8160us 2.6880us 13.953us ZN3cub28DeviceReduceSingleTileKernelINS_18DeviceReducePolicyIiiiN6thrust4plusIiEEE9Policy600ENS2_8cuda_cub26transform_input_iterator_tIiNS2_17counting_iteratorImNS2_11use_defaultESA_SA_EEZN7gunrock9operators7advance21compute_output_lengthINSC_5graph7graph_tILNSC_6memory14memory_space_tE0EiifJNSG_11graph_csr_tIiifEENSG_11empty_csc_tENSG_11empty_coo_tEEEENSC_8frontier10frontier_tIiiLNSP_15frontier_kind_tE0ELNSP_15frontier_view_tE0EEEEEmRT_RT0_RNSC_4cuda18standard_context_tEbEUlRKmE_EEPiiS4_iEEvSW_T1_T2_T3_T4
2.37% 5.9053 6263 942ns 832ns 2.4320us [CUDA memcpy DtoH]
API calls:
50.11% 290.03 30179 9.6100us 3.1000us 1.0450ms cudaLaunchKernel
49.89% 288.71 6263 46.097us 33.000us 706.20us cudaMemcpyAsync
Total: 828.418 ms
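
(Side note: the long Name entries above are mangled C++ symbols, and the paste appears to have dropped their leading underscore. To map a kernel back to a gunrock operator, the symbol can be demangled, e.g. with c++filt on the command line or, as sketched below, with abi::__cxa_demangle; the symbol used here is a short hypothetical placeholder, not one of the actual kernels.)

```cpp
// Sketch: demangle an Itanium-ABI symbol to a readable C++ name.
// The symbol below is a hypothetical placeholder ("gunrock::sssp::run()"), not a real
// kernel; for the real kernels, restore the leading '_' dropped in the paste and pass
// the full name from the profile.
#include <cxxabi.h>
#include <cstdlib>
#include <iostream>

int main() {
  const char* mangled = "_ZN7gunrock4sssp3runEv";  // hypothetical example symbol
  int status = 0;
  char* readable = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status == 0 && readable) {
    std::cout << readable << "\n";  // prints: gunrock::sssp::run()
    std::free(readable);
  } else {
    std::cout << "demangling failed (status " << status << ")\n";
  }
  return 0;
}
```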

This issue was pointed out by @jdwapman. I think further profiling is required to figure out where on the CPU side most of this slowdown is coming from (that is, if I am understanding the profiled result above correctly). My initial thought was that the ~30,000 calls to cudaLaunchKernel, and their launch overhead, may be the cause. But at around 10 us each in the worst case that only adds up to ~0.3 seconds, which does not come close to explaining the 8-second total.
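
One way to sanity-check the launch-overhead hypothesis is a quick empty-kernel micro-benchmark on the same machine. A rough sketch (the launch count mirrors the ~30,000 cudaLaunchKernel calls above; actual per-launch latency depends on the driver and GPU):

```cpp
// Sketch: estimate average kernel-launch latency with an empty kernel.
// If this reports ~10 us per launch, ~30,000 launches account for only ~0.3 s,
// so launch overhead alone cannot explain the ~8 s wall-clock time.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void empty_kernel() {}

int main() {
  const int num_launches = 30000;  // roughly the number of cudaLaunchKernel calls above

  empty_kernel<<<1, 256>>>();  // warm-up so context creation is not measured
  cudaDeviceSynchronize();

  auto t0 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < num_launches; ++i)
    empty_kernel<<<1, 256>>>();
  cudaDeviceSynchronize();
  auto t1 = std::chrono::high_resolution_clock::now();

  double total_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
  std::printf("total %.1f ms, %.2f us per launch\n",
              total_ms, 1e3 * total_ms / num_launches);
  return 0;
}
```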

@neoblizz added the ❓ question, 🐲 enhancement, and 🍻 help wanted labels on Jan 31, 2022
@neoblizz transferred this issue from gunrock/essentials on Nov 5, 2022