[QST] Strided dgrad conv epilogue does not use fast divmod #1436

Open
ZelboK opened this issue Mar 28, 2024 · 12 comments

ZelboK commented Mar 28, 2024

https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/predicated_tile_iterator_strided_dgrad.h#L315-L318

This piece of code can be replaced by using fast divmod. The same can be applied to the store function below.
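For readers following along, here is a minimal standalone sketch of the proposed substitution. It assumes FastDivmod from cutlass/fast_math.h with its operator()(quotient, remainder, dividend); the decomposition of the residual into p and q via Q is inferred from the discussion below rather than copied from the iterator:

    // Sketch only, not the iterator code: compares the existing div/mod formulation
    // against precomputed FastDivmod objects (assumed to live in Params, as proposed below).
    #include <cassert>
    #include "cutlass/fast_math.h"

    int main() {
      int const P = 28, Q = 28;  // illustrative output extent

      // Precomputed once, e.g. in the host-side Params constructor.
      cutlass::FastDivmod divmod_pq(P * Q);
      cutlass::FastDivmod divmod_q(Q);

      for (int npq_offset = 0; npq_offset < 34 * P * Q; ++npq_offset) {
        // Current formulation: integer division and modulo in the hot path.
        int n_ref        = npq_offset / (P * Q);
        int residual_ref = npq_offset % (P * Q);
        int p_ref        = residual_ref / Q;
        int q_ref        = residual_ref % Q;

        // Fast divmod formulation: multiply-high plus shift with precomputed constants.
        int n, residual, p, q;
        divmod_pq(n, residual, npq_offset);
        divmod_q(p, q, residual);

        assert(n == n_ref && residual == residual_ref && p == p_ref && q == q_ref);
      }
      return 0;
    }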

ZelboK (Author) commented Mar 28, 2024

@hwu36

hwu36 changed the title from "[QST] Which strided conv dgrad kernels need to be refactored to using fast_divmod?" to "[QST] Strided dgrad conv epilogue does not use fast divmod" on Mar 28, 2024
hwu36 (Collaborator) commented Mar 28, 2024

@manishucsd

hwu36 (Collaborator) commented Mar 28, 2024

https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md
https://github.com/NVIDIA/cutlass/blob/main/media/docs/implicit_gemm_convolution.md

Our GTC CUTLASS talks also cover the basic concepts of conv and strided dgrad in detail.

manishucsd (Contributor) commented Mar 28, 2024

Yes, you can replace these lines with fast_divmod. I remember making this change and, surprisingly, the performance dropped by a couple of TFLOP/s. Feel free to make the change and report before/after performance numbers.

Direct link to strided dgrad GTC talk.

mammoth831 commented Mar 28, 2024

Yes, you can replace these lines with fast_divmod.

Hi, p_ and q_ are computed from start_r and start_s (start_r -> start_h_ -> p_), while start_r and start_s are determined by the thread block index, which is dynamic.

Do you mean constructing the fast_divmod object directly on the device? If so, I think we may have to keep the raw divmod here: an int32 divmod takes only about 20 instructions, which is much cheaper than calculating the multiplier and shift of fast_divmod.
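For reference, a rough host-side sketch of the standard round-up magic-number construction (the generic technique, not necessarily the exact FastDivmod constructor): building the multiplier and shift involves a ceil-log2 loop plus a 64-bit division, which is why it is normally done once on the host rather than per thread block on the device, where the plain ~20-instruction divmod may indeed be cheaper:

    // Sketch of magic-number division: divisor d is replaced by a precomputed
    // (multiplier, shift) pair so that n / d becomes a multiply-high plus a shift.
    #include <cassert>
    #include <cstdint>

    // ceil(log2(x)) for x >= 1.
    static int ceil_log2(uint32_t x) {
      int n = 0;
      while ((uint32_t(1) << n) < x) {
        ++n;
      }
      return n;
    }

    struct MagicDivmod {
      int divisor;
      uint32_t multiplier;
      uint32_t shift;

      explicit MagicDivmod(int d) : divisor(d), multiplier(0), shift(0) {
        if (d != 1) {
          uint32_t p = 31 + ceil_log2(uint32_t(d));                  // loop over bit positions
          multiplier = uint32_t(((uint64_t(1) << p) + d - 1) / d);   // 64-bit division
          shift      = p - 32;
        }
      }

      // Valid for non-negative int32 dividends.
      void operator()(int &quotient, int &remainder, int dividend) const {
        quotient  = (divisor != 1)
                        ? int((uint64_t(uint32_t(dividend)) * multiplier) >> 32 >> shift)
                        : dividend;
        remainder = dividend - quotient * divisor;
      }
    };

    int main() {
      for (int d : {2, 3, 7, 28, 28 * 28, 1000}) {
        MagicDivmod dm(d);
        for (int n : {0, 1, d - 1, d, d + 1, 123456, 2147483647}) {
          int q, r;
          dm(q, r, n);
          assert(q == n / d && r == n % d);
        }
      }
      return 0;
    }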

ZelboK (Author) commented Apr 3, 2024

On my 3080:

BEFORE

The line

int n = npq_offset / (p_ * q_);

translates to before_first_line_sass.txt

The line

int residual = npq_offset % (p_ * q_);

translates to before_second_line_sass.txt

(I'll omit the assembly for the other two lines for brevity for now.)

AFTER

This code:

params_.divmod(n, residual, npq_offset);
params_.divmod_two(p, q, residual);

leads to

2651	0000000f 00c699a0	      ISETP.NE.AND P4, PT, R9, 0x1, PT 	133	0	0									


2720	0000000f 00c69df0	      ISETP.NE.AND P0, PT, R42, 0x1, PT 	149	0	0									
2721	0000000f 00c69e00	      IMAD.MOV.U32 R40, RZ, RZ, R11 	150	0	0									
2722	0000000f 00c69e10	@P0   IMAD.HI.U32 R2, R40, R2, RZ 	149	0	0									
2723	0000000f 00c69e20	      MOV R11, R7 	150	0	0									
2724	0000000f 00c69e30	      IMAD.MOV.U32 R7, RZ, RZ, R0 	150	0	0									
2725	0000000f 00c69e40	      IMAD.MOV.U32 R0, RZ, RZ, R40 	150	0	0									
2726	0000000f 00c69e50	@P0   SHF.R.U32.HI R0, RZ, R43, R2 	150	0	0

assembly

The last three columns are: Live Registers, Warp Stall Sampling, Instructions Executed.

The FastDivmod objects were constructed like this:

    params_.divmod = FastDivmod(p_*q_);
    params_.divmod_two = FastDivmod(params_.problem_size.Q);

All tests pass. @hwu36

Tangential: does anyone know of a more convenient way to extract the relevant assembly in ncu-ui? I like how it correlates your source code with the assembly, but it doesn't give you an option to extract only those lines... feels like an easy feature to add.

ZelboK (Author) commented Apr 3, 2024

Like @manishucsd stated, I am consistently seeing that these changes perform worse.
Edit: if I refactor store_with_byte_offset to also use FastDivmod, the performance improves.
(GFLOPS)

Original | Load only | Load & store
-- | -- | --
42438.4 | 42347.2 | 44057.6
45593.2 | 40910 | 47485.4
46520.2 | 35192.5 | 45204.9
51508.7 | 50302.6 | 51564
50227 | 49132.7 | 50302.6
46430.3 | 45555.7 | 46096.4
50418 | 47765.6 | 49985.4
37307.6 | 34911.3 | 37978.4
3561.23 | 3234.04 | 3644.71
47676.7 | 47130.3 | 49490
29084.6 | 28995.8 | 29655.5
3403.02 | 3120.96 | 3496.45
52486.1 | 49904.3 | 51803.7
50597.1 | 48242 | 50220.4
36889 | 38074.3 | 39168.1
53879.7 | 56330.6 | 56537.7
49410.5 | 51800.2 | 52172.7
44888.3 | 40922.8 | 46548.4
48512.8 | 47529.4 | 50144.7
43376.9 | 41792.6 | 43077.8
34642.4 | 33419.9 | 34349.9
50375.1 | 48202.7 | 49962.7
41098 | 41014.7 | 42323.9
33170.5 | 33063.6 | 33800.4
44329.3 | 42541.7 | 43949.3
41890.5 | 38951.8 | 41537.5
30802.9 | 29376.8 | 30572.7
45720.9 | 43954.4 | 47196.9
43627.7 | 42055.4 | 45114.7
39782 | 38482.1 | 39485.8
35453.7 | 33411.2 | 37222.9
33715.9 | 30371.1 | 33430.1
26631.1 | 24955.4 | 27061.2
47365.5 | 47304.3 | 49073.2
40151.7 | 36906.7 | 41618.4
6243.1 | 6177.41 | 6145.52
44085.4 | 42379.9 | 45434.6
40475 | 39074.5 | 41674.8
8948.19 | 8967.81 | 8928.76
47597.1 | 44222.3 | 47315.9
43240.1 | 41711 | 44627.8
37502.5 | 36286.8 | 38464.7
37327.7 | 34334.6 | 37970.9
36970.6 | 33005.4 | 35697.2
26558.4 | 25751.8 | 27999.2

from running

./tools/profiler/cutlass_profiler  --kernels=cutlass_tensorop_f16_s16816dgrad_optimized_f16_* --n=34 --h=28 --w=28 --c=256 --k=256 --r=3 --s=3 --pad_h=1 --pad_w=1 --stride_h=2 --stride_w=2 --dilation_h=1 --dilation_w=1

with these variables

cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS="cutlass_tensorop_f16_s16816dgrad_optimized_f16_*" -DCMAKE_CUDA_FLAGS="-lineinfo"

(no debug flag)

hwu36 (Collaborator) commented Apr 3, 2024

Thank you @ZelboK.

I refactor store_with_byte_offset to also use FastDivmod the performance improves

We need it for both store and load. Store is actually more important.

On my 3080

FP32 accumulation is throttled, so let us just use fp16 accumulation. The kernel name is cutlass_tensorop_h16816dgrad_optimized_*

cmake .. -DCUTLASS_NVCC_ARCHS=80

IIRC, the 3080 can be compiled with arch 86.

What problem size and kernel does each line in your performance table correspond to?

Also, could you please run this problem size

--n=34 --h=28 --w=28 --c=512 --k=1024 --r=1 --s=1 --pad_h=0 --pad_w=0 --stride_h=2 --stride_w=2 --dilation_h=1 --dilation_w=1

manishucsd (Contributor) commented:

Is it also worth having store-only numbers in the above table?

hwu36 (Collaborator) commented Apr 3, 2024

The follow-up needs some exploration on your part. You may not need to do anything, or you may need to make some small changes like the first one. I have not looked into it very deeply myself.

Dgrad can be used as deconv (or transposed conv). @masahi contributed deconv in CUTLASS 2.9 (https://github.com/NVIDIA/cutlass/tree/main/examples/34_transposed_conv2d). We want the output of deconv to not be packed. For example, the output problem size is 34x28x28x256, but the output tensor specified by the user can be 34x32x32x512. The user wants to have some bubbles (e.g. 0s) in the output data. This may already be supported, or it may not be. That is the first thing that needs to be figured out.

We have a similar requirement for the regular fprop conv. #1437 was written to meet this request. When we use packed output, fprop conv can reuse the same epilogue as GEMM, which is https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/predicated_tile_iterator.h (ignore the scatter and permute code in this file). The main difference between the packed and non-packed code is what you are already familiar with:

Stride tensor_coord = CoordinateDecompositionLittleEndian<kStrideRank>(row_offset + thread_start_row_, params_.divmod);

LongIndex tensor_offset = dot(tensor_coord, params_.tensor_stride);

It decomposes the row number into n, p, q and uses the strides to compute the new row offset.
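To make that concrete, here is a tiny standalone sketch (not the CUTLASS implementation; P, Q and the stride values are illustrative) of decomposing a packed row index into (n, p, q) and re-linearizing it with user-provided strides, which is what lets the output tensor be larger than the packed problem size:

    // Sketch of the decompose-then-dot idea behind CoordinateDecompositionLittleEndian
    // followed by dot(tensor_coord, tensor_stride).
    #include <cassert>
    #include <cstdint>

    int64_t remap_row(int row, int P, int Q,
                      int64_t stride_q, int64_t stride_p, int64_t stride_n) {
      // Little-endian decomposition of a packed NPQ row index: q varies fastest.
      int q = row % Q;
      int p = (row / Q) % P;
      int n = row / (Q * P);

      // Re-linearize with the (possibly non-packed) output strides.
      return int64_t(n) * stride_n + int64_t(p) * stride_p + int64_t(q) * stride_q;
    }

    int main() {
      // A packed 34x28x28 output embedded in a larger 34x32x32 allocation.
      int P = 28, Q = 28;
      int64_t stride_q = 1, stride_p = 32, stride_n = 32 * 32;
      assert(remap_row(0,     P, Q, stride_q, stride_p, stride_n) == 0);
      assert(remap_row(Q,     P, Q, stride_q, stride_p, stride_n) == stride_p);  // p advances by 1
      assert(remap_row(P * Q, P, Q, stride_q, stride_p, stride_n) == stride_n);  // n advances by 1
      return 0;
    }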


github-actions bot commented May 3, 2024

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
