[QST] Strided dgrad conv epilogue does not use fast divmod #1436
https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md covers the basic concepts; our GTC CUTLASS talks also discuss conv and strided dgrad in detail.
Yes, you can replace these lines with ... Direct link to the strided dgrad GTC talk: ...
Hi, the ... Do you mean to construct the ...?
On my 3080 (I'll omit the assembly for the other two lines for brevity for now), this leads to the assembly ... The last 3 columns are the ... All tests pass. @hwu36
Tangential: does anyone know of a more convenient way to extract the relevant assembly in ncu-ui? I like how it correlates your source code to the assembly, but it doesn't give you an option to extract exclusively those lines... Feels like an easy feature.
Like @manishucsd has stated, I am consistently seeing that these changes perform more poorly, from running ... with these variables ... (no debug flag).
Thank you @ZelboK.
We need it for both store and load; store is actually more important.
FP32 accumulation is throttled, so let us just use FP16 accumulation. Kernel name is ...
IIRC, 3080 can use sm_86 to compile. What problem size or kernel does every line in your performance table use? Also, could you please run this problem size: ...
Is it also worth having store-only numbers in the above table?
The followup needs some exploration from yourself. You may not need to do anything, or you may need to make some small changes like the first one; I have not looked into it very deeply myself. Dgrad can be used as deconv (or transposed conv). @masahi contributed deconv in cutlass 2.9 (https://github.com/NVIDIA/cutlass/tree/main/examples/34_transposed_conv2d). We want the output of deconv to be non-packed. For example, the output problem size is 34x28x28x256, but the output tensor specified by the user can be 34x32x32x512; the user wants to have some bubbles (e.g. 0s) in the output data. This may already be supported, or it may not — that is the first thing that needs to be figured out. We have a similar requirement for the regular fprop conv: #1437 was written to meet this request. When we use packed output, fprop conv can reuse the same epilogue as gemm, which is https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/predicated_tile_iterator.h (ignore the scatter and permute code in this file). The main difference between the packed and non-packed code is what you are already familiar with:
it decomposes the row number into n, p, q and uses the strides to compute the new row number.
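The remapping described above might be sketched roughly as follows. This is a hypothetical illustration, not CUTLASS's actual iterator code; the struct and member names are made up, and the strides stand in for the user's (possibly padded) tensor layout:

```cpp
#include <cassert>

// Hypothetical sketch: decompose a packed linear row index into (n, p, q),
// then apply the user's tensor strides, which may be larger than the packed
// problem extents so the output has "bubbles".
struct RowRemap {
  int P, Q;       // packed output extents (problem size)
  int stride_n;   // user tensor strides, e.g. 32*32 and 32 when a
  int stride_p;   // 28x28 problem is padded into a 32x32 tensor

  int operator()(int row) const {
    int n = row / (P * Q);      // these divisions and modulos are exactly
    int rest = row % (P * Q);   // what fast divmod would replace
    int p = rest / Q;
    int q = rest % Q;
    return n * stride_n + p * stride_p + q;
  }
};
```

For the 34x28x28x256 problem padded into a 34x32x32x512 tensor mentioned above, such a remap would use P = Q = 28, stride_p = 32, and stride_n = 32*32 (with the channel dimension handled per column of the epilogue tile).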
This issue has been labeled |
https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/predicated_tile_iterator_strided_dgrad.h#L315-L318
This piece of code can be replaced with fast divmod. The same applies to the store function below it.
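For reference, the fast divmod idea is a precomputed multiply-and-shift that replaces hardware integer division when the divisor is a runtime constant. The sketch below is in the spirit of cutlass::FastDivmod (include/cutlass/fast_math.h) but is a simplified illustration, not CUTLASS's exact implementation; the member names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Minimal round-up magic-number divmod: precompute a multiplier and shift
// once, then compute quotient/remainder with multiplies and shifts only.
// Assumes divisor >= 1 and a non-negative 32-bit dividend.
struct FastDivmod {
  int divisor;
  uint64_t multiplier;  // ceil(2^(32 + shift) / divisor)
  int shift;

  explicit FastDivmod(int d) : divisor(d), multiplier(0), shift(0) {
    // Smallest shift with 2^shift >= d.
    while ((1u << shift) < static_cast<uint32_t>(d)) {
      ++shift;
    }
    multiplier = ((uint64_t(1) << (32 + shift)) + d - 1) / d;
  }

  // One multiply and one shift instead of a divide; the remainder costs
  // one extra multiply-subtract.
  void operator()(int &quotient, int &remainder, int dividend) const {
    quotient = static_cast<int>(
        (static_cast<uint64_t>(dividend) * multiplier) >> (32 + shift));
    remainder = dividend - quotient * divisor;
  }
};
```

The payoff on GPU is that integer division lowers to a long multi-instruction sequence in SASS, while the multiply-shift form is a couple of instructions, which is why the epilogue hot loop benefits from precomputing the divisor state on the host.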