[QST] Strided dgrad conv epilogue does not use fast divmod #1436

Open
ZelboK opened this issue Mar 28, 2024 · 12 comments

ZelboK commented Mar 28, 2024

https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/predicated_tile_iterator_strided_dgrad.h#L315-L318

This piece of code can be replaced by using fast divmod. The same can be applied to the store function below.
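For readers following along, here is a minimal standalone sketch of the proposed substitution. It assumes FastDivmod from cutlass/fast_math.h with its operator()(quotient, remainder, dividend); the decomposition of the residual into p and q via Q is inferred from the discussion below rather than copied from the iterator:

    // Sketch only, not the iterator code: compares the existing div/mod formulation
    // against precomputed FastDivmod objects (assumed to live in Params, as proposed below).
    #include <cassert>
    #include "cutlass/fast_math.h"

    int main() {
      int const P = 28, Q = 28;  // illustrative output extent

      // Precomputed once, e.g. in the host-side Params constructor.
      cutlass::FastDivmod divmod_pq(P * Q);
      cutlass::FastDivmod divmod_q(Q);

      for (int npq_offset = 0; npq_offset < 34 * P * Q; ++npq_offset) {
        // Current formulation: integer division and modulo in the hot path.
        int n_ref        = npq_offset / (P * Q);
        int residual_ref = npq_offset % (P * Q);
        int p_ref        = residual_ref / Q;
        int q_ref        = residual_ref % Q;

        // Fast divmod formulation: multiply-high plus shift with precomputed constants.
        int n, residual, p, q;
        divmod_pq(n, residual, npq_offset);
        divmod_q(p, q, residual);

        assert(n == n_ref && residual == residual_ref && p == p_ref && q == q_ref);
      }
      return 0;
    }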

ZelboK (Author) commented Mar 28, 2024

@hwu36

hwu36 changed the title from "[QST] Which strided conv dgrad kernels need to be refactored to using fast_divmod?" to "[QST] Strided dgrad conv epilogue does not use fast divmod" on Mar 28, 2024
hwu36 (Collaborator) commented Mar 28, 2024

@manishucsd

hwu36 (Collaborator) commented Mar 28, 2024

https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md
https://github.com/NVIDIA/cutlass/blob/main/media/docs/implicit_gemm_convolution.md

Our GTC CUTLASS talks also cover the basic concepts of conv and strided dgrad in detail.

manishucsd (Contributor) commented Mar 28, 2024

Yes, you can replace these lines with fast_divmod. I remember making this change and, surprisingly, the performance dropped by a couple of TFLOP/s. Feel free to make the change and report before/after performance numbers.

Direct link to strided dgrad GTC talk.

mammoth831 commented Mar 28, 2024

Yes, you can replace these lines with fast_divmod.

Hi, p_ and q_ are computed from start_r and start_s (start_r -> start_h_ -> p_), while start_r and start_s are determined by the thread block index, which is dynamic.

Do you mean constructing the fast_divmod object directly on the device? If so, I think we may have to keep the raw divmod here: an int32 divmod takes only about 20 instructions, which is much cheaper than calculating the multiplier and shift of fast_divmod.
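For reference, a rough host-side sketch of the standard round-up magic-number construction (the generic technique, not necessarily the exact FastDivmod constructor): building the multiplier and shift involves a ceil-log2 loop plus a 64-bit division, which is why it is normally done once on the host rather than per thread block on the device, where the plain ~20-instruction divmod may indeed be cheaper:

    // Sketch of magic-number division: divisor d is replaced by a precomputed
    // (multiplier, shift) pair so that n / d becomes a multiply-high plus a shift.
    #include <cassert>
    #include <cstdint>

    // ceil(log2(x)) for x >= 1.
    static int ceil_log2(uint32_t x) {
      int n = 0;
      while ((uint32_t(1) << n) < x) {
        ++n;
      }
      return n;
    }

    struct MagicDivmod {
      int divisor;
      uint32_t multiplier;
      uint32_t shift;

      explicit MagicDivmod(int d) : divisor(d), multiplier(0), shift(0) {
        if (d != 1) {
          uint32_t p = 31 + ceil_log2(uint32_t(d));                  // loop over bit positions
          multiplier = uint32_t(((uint64_t(1) << p) + d - 1) / d);   // 64-bit division
          shift      = p - 32;
        }
      }

      // Valid for non-negative int32 dividends.
      void operator()(int &quotient, int &remainder, int dividend) const {
        quotient  = (divisor != 1)
                        ? int((uint64_t(uint32_t(dividend)) * multiplier) >> 32 >> shift)
                        : dividend;
        remainder = dividend - quotient * divisor;
      }
    };

    int main() {
      for (int d : {2, 3, 7, 28, 28 * 28, 1000}) {
        MagicDivmod dm(d);
        for (int n : {0, 1, d - 1, d, d + 1, 123456, 2147483647}) {
          int q, r;
          dm(q, r, n);
          assert(q == n / d && r == n % d);
        }
      }
      return 0;
    }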

ZelboK (Author) commented Apr 3, 2024

On my 3080:

BEFORE

The line

int n = npq_offset / (p_ * q_);

translates to before_first_line_sass.txt

The line

int residual = npq_offset % (p_ * q_);

translates to before_second_line_sass.txt

(I'll omit the assembly for the other two lines for brevity for now.)

AFTER

This code:

params_.divmod(n, residual, npq_offset);
params_.divmod_two(p, q, residual);

leads to

2651	0000000f 00c699a0	      ISETP.NE.AND P4, PT, R9, 0x1, PT 	133	0	0									


2720	0000000f 00c69df0	      ISETP.NE.AND P0, PT, R42, 0x1, PT 	149	0	0									
2721	0000000f 00c69e00	      IMAD.MOV.U32 R40, RZ, RZ, R11 	150	0	0									
2722	0000000f 00c69e10	@P0   IMAD.HI.U32 R2, R40, R2, RZ 	149	0	0									
2723	0000000f 00c69e20	      MOV R11, R7 	150	0	0									
2724	0000000f 00c69e30	      IMAD.MOV.U32 R7, RZ, RZ, R0 	150	0	0									
2725	0000000f 00c69e40	      IMAD.MOV.U32 R0, RZ, RZ, R40 	150	0	0									
2726	0000000f 00c69e50	@P0   SHF.R.U32.HI R0, RZ, R43, R2 	150	0	0

assembly

The last three columns are: Live Registers, Warp Stall Sampling, Instructions Executed.

The FastDivmod objects were constructed like this:

    params_.divmod = FastDivmod(p_*q_);
    params_.divmod_two = FastDivmod(params_.problem_size.Q);

All tests pass. @hwu36

Tangential: does anyone know of a more convenient way to extract the relevant assembly in ncu-ui? I like how it correlates your source code with the assembly, but it doesn't give you an option to extract only those lines... feels like an easy feature to add.

ZelboK (Author) commented Apr 3, 2024

Like @manishucsd stated, I am consistently seeing that these changes perform worse.
Edit: if I refactor store_with_byte_offset to also use FastDivmod, the performance improves.
(GFLOPS)

Original | Load only | Load & store
-- | -- | --
42438.4 | 42347.2 | 44057.6
45593.2 | 40910 | 47485.4
46520.2 | 35192.5 | 45204.9
51508.7 | 50302.6 | 51564
50227 | 49132.7 | 50302.6
46430.3 | 45555.7 | 46096.4
50418 | 47765.6 | 49985.4
37307.6 | 34911.3 | 37978.4
3561.23 | 3234.04 | 3644.71
47676.7 | 47130.3 | 49490
29084.6 | 28995.8 | 29655.5
3403.02 | 3120.96 | 3496.45
52486.1 | 49904.3 | 51803.7
50597.1 | 48242 | 50220.4
36889 | 38074.3 | 39168.1
53879.7 | 56330.6 | 56537.7
49410.5 | 51800.2 | 52172.7
44888.3 | 40922.8 | 46548.4
48512.8 | 47529.4 | 50144.7
43376.9 | 41792.6 | 43077.8
34642.4 | 33419.9 | 34349.9
50375.1 | 48202.7 | 49962.7
41098 | 41014.7 | 42323.9
33170.5 | 33063.6 | 33800.4
44329.3 | 42541.7 | 43949.3
41890.5 | 38951.8 | 41537.5
30802.9 | 29376.8 | 30572.7
45720.9 | 43954.4 | 47196.9
43627.7 | 42055.4 | 45114.7
39782 | 38482.1 | 39485.8
35453.7 | 33411.2 | 37222.9
33715.9 | 30371.1 | 33430.1
26631.1 | 24955.4 | 27061.2
47365.5 | 47304.3 | 49073.2
40151.7 | 36906.7 | 41618.4
6243.1 | 6177.41 | 6145.52
44085.4 | 42379.9 | 45434.6
40475 | 39074.5 | 41674.8
8948.19 | 8967.81 | 8928.76
47597.1 | 44222.3 | 47315.9
43240.1 | 41711 | 44627.8
37502.5 | 36286.8 | 38464.7
37327.7 | 34334.6 | 37970.9
36970.6 | 33005.4 | 35697.2
26558.4 | 25751.8 | 27999.2

from running

./tools/profiler/cutlass_profiler  --kernels=cutlass_tensorop_f16_s16816dgrad_optimized_f16_* --n=34 --h=28 --w=28 --c=256 --k=256 --r=3 --s=3 --pad_h=1 --pad_w=1 --stride_h=2 --stride_w=2 --dilation_h=1 --dilation_w=1

with these variables

cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS="cutlass_tensorop_f16_s16816dgrad_optimized_f16_*" -DCMAKE_CUDA_FLAGS="-lineinfo"

(no debug flag)

hwu36 (Collaborator) commented Apr 3, 2024

Thank you @ZelboK.

I refactor store_with_byte_offset to also use FastDivmod the performance improves

We need it for both store and load. Store is actually more important.

On my 3080

FP32 accumulation is throttled, so let us just use fp16 accumulation. The kernel name is cutlass_tensorop_h16816dgrad_optimized_*

cmake .. -DCUTLASS_NVCC_ARCHS=80

IIRC, the 3080 can be compiled with arch 86.

What problem size and kernel does each line in your performance table correspond to?

Also, could you please run this problem size

--n=34 --h=28 --w=28 --c=512 --k=1024 --r=1 --s=1 --pad_h=0 --pad_w=0 --stride_h=2 --stride_w=2 --dilation_h=1 --dilation_w=1

manishucsd (Contributor) commented:

Is it also worth having store-only numbers in the above table?

hwu36 (Collaborator) commented Apr 3, 2024

The follow-up needs some exploration on your part. You may not need to do anything, or you may need to make some small changes like the first one. I have not looked into it very deeply myself.

Dgrad can be used as deconv (or transposed conv). @masahi contributed deconv in CUTLASS 2.9 (https://github.com/NVIDIA/cutlass/tree/main/examples/34_transposed_conv2d). We want the output of deconv to not be packed. For example, the output problem size is 34x28x28x256, but the output tensor specified by the user can be 34x32x32x512. The user wants to have some bubbles (e.g. 0s) in the output data. This may already be supported, or it may not be. That is the first thing that needs to be figured out.

We have a similar requirement for the regular fprop conv. #1437 was written to meet this request. When we use packed output, fprop conv can reuse the same epilogue as GEMM, which is https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/predicated_tile_iterator.h (ignore the scatter and permute code in this file). The main difference between the packed and non-packed code is what you are already familiar with:

Stride tensor_coord = CoordinateDecompositionLittleEndian<kStrideRank>(row_offset + thread_start_row_, params_.divmod);

LongIndex tensor_offset = dot(tensor_coord, params_.tensor_stride);

It decomposes the row number into n, p, q and uses the strides to compute the new row offset.
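To make that concrete, here is a tiny standalone sketch (not the CUTLASS implementation; P, Q and the stride values are illustrative) of decomposing a packed row index into (n, p, q) and re-linearizing it with user-provided strides, which is what lets the output tensor be larger than the packed problem size:

    // Sketch of the decompose-then-dot idea behind CoordinateDecompositionLittleEndian
    // followed by dot(tensor_coord, tensor_stride).
    #include <cassert>
    #include <cstdint>

    int64_t remap_row(int row, int P, int Q,
                      int64_t stride_q, int64_t stride_p, int64_t stride_n) {
      // Little-endian decomposition of a packed NPQ row index: q varies fastest.
      int q = row % Q;
      int p = (row / Q) % P;
      int n = row / (Q * P);

      // Re-linearize with the (possibly non-packed) output strides.
      return int64_t(n) * stride_n + int64_t(p) * stride_p + int64_t(q) * stride_q;
    }

    int main() {
      // A packed 34x28x28 output embedded in a larger 34x32x32 allocation.
      int P = 28, Q = 28;
      int64_t stride_q = 1, stride_p = 32, stride_n = 32 * 32;
      assert(remap_row(0,     P, Q, stride_q, stride_p, stride_n) == 0);
      assert(remap_row(Q,     P, Q, stride_q, stride_p, stride_n) == stride_p);  // p advances by 1
      assert(remap_row(P * Q, P, Q, stride_q, stride_p, stride_n) == stride_n);  // n advances by 1
      return 0;
    }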


github-actions bot commented May 3, 2024

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
