
Refactor to use FastDivmod for predicated strided dgrad iterators. #1453

Open · wants to merge 2 commits into base: main

Conversation


@ZelboK ZelboK commented Apr 3, 2024

On my 3080:

BEFORE

The line:
int n = npq_offset / (p_ * q_);
translates to the SASS in before_first_line_sass.txt, and the line:
int residual = npq_offset % (p_ * q_);
translates to before_second_line_sass.txt.
(I'll omit the assembly for the other two lines for brevity for now.)
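For context, these divisions recover the (n, p, q) output coordinates from the flat npq_offset; a minimal standalone sketch of that index math (the names N, P, Q and the free function here are hypothetical stand-ins for the iterator's member variables):

```cpp
#include <cassert>

// Standalone sketch of the index math the iterator performs: npq_offset
// linearly indexes a flattened (N, P, Q) output tensor, and two div/mod
// pairs recover the coordinates. P, Q stand in for the members p_, q_.
void decompose_npq(int npq_offset, int P, int Q, int& n, int& p, int& q) {
  n = npq_offset / (P * Q);            // image index
  int residual = npq_offset % (P * Q); // offset within one image
  p = residual / Q;                    // output row
  q = residual % Q;                    // output column
}
```

It is exactly these four division/modulo operations that the refactor replaces with multiply-shift sequences.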
AFTER
this code:

params_.divmod(n, residual, npq_offset);
params_.divmod_two(p, q, residual);

leads to

2651	0000000f 00c699a0	      ISETP.NE.AND P4, PT, R9, 0x1, PT 	133	0	0									


2720	0000000f 00c69df0	      ISETP.NE.AND P0, PT, R42, 0x1, PT 	149	0	0									
2721	0000000f 00c69e00	      IMAD.MOV.U32 R40, RZ, RZ, R11 	150	0	0									
2722	0000000f 00c69e10	@P0   IMAD.HI.U32 R2, R40, R2, RZ 	149	0	0									
2723	0000000f 00c69e20	      MOV R11, R7 	150	0	0									
2724	0000000f 00c69e30	      IMAD.MOV.U32 R7, RZ, RZ, R0 	150	0	0									
2725	0000000f 00c69e40	      IMAD.MOV.U32 R0, RZ, RZ, R40 	150	0	0									
2726	0000000f 00c69e50	@P0   SHF.R.U32.HI R0, RZ, R43, R2 	150	0	0

assembly

The last three columns are: Live Registers, Warp Stall Sampling, Instructions Executed.

The FastDivmod objects were constructed like this:

    params_.divmod = FastDivmod(p_*q_);
    params_.divmod_two = FastDivmod(params_.problem_size.Q);

All tests pass. @hwu36

Here are the benchmarks from cutlass_profiler, from running:

./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_f16_s16816dgrad_optimized_f16_* --n=34 --h=28 --w=28 --c=512 --k=1024 --r=1 --s=1 --pad_h=0 --pad_w=0 --stride_h=2 --stride_w=2 --dilation_h=1 --dilation_w=1 --output=load_store_k1024.csv

GFLOPS (normal = baseline; Load, Store, Load_Store = FastDivmod applied in the load iterator, the store iterator, or both)

Operation,normal,Load_Store,Load,Store
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x128_32x3_nhwc_align8,36093.7,35460.4,34779.9,35521.1
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x128_32x3_nhwc_align4,38092.7,35749.1,33229.5,33216.6
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x128_32x3_nhwc_align2,37974.5,28202.3,26924.2,38842.2
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x256_32x3_nhwc_align8,46247.2,45844.8,46530.2,46529.9
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x256_32x3_nhwc_align4,45534,44948.4,45967.9,46057.5
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x256_32x3_nhwc_align2,42966.6,41820.4,41601.9,43779.3
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x64_32x3_nhwc_align8,32767,27551.1,31058.6,31742.2
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x64_32x3_nhwc_align4,27299.7,20540.9,24321.3,26288.8
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x64_32x3_nhwc_align2,3102.06,3124.94,3107.72,2785.73
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x64_32x4_nhwc_align8,32597,26568.1,29991.2,30956.1
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x64_32x4_nhwc_align4,21633.2,18499.1,21158.8,21662
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x64_32x4_nhwc_align2,3086.17,3099.21,3102.85,2779.92
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x256_32x4_nhwc_align8,46073,44842.5,44539.3,44795.6
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x256_32x4_nhwc_align4,44521.9,43695.6,43267.6,43731.8
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x256_32x4_nhwc_align2,35434.7,33203.7,33205.3,35853.8
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x3_nhwc_align8,43928.1,43993,43341.7,44614.7
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x3_nhwc_align4,40625,39896.3,39928.8,40649.1
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x3_nhwc_align2,37330.9,29725.7,30495.9,29953.2
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x4_nhwc_align8,40452.3,44186,40779.8,44001.5
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x4_nhwc_align4,36466.5,38333.6,37859,37978.7
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x4_nhwc_align2,28703.2,23201.9,28286,23159.5
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x5_nhwc_align8,44606,43841,43317.1,43667.7
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x5_nhwc_align4,38175.5,37856.9,37391.8,38196.8
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x5_nhwc_align2,25967.9,23727.5,26534.2,27765.5
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x64_32x6_nhwc_align8,32081.1,30121.8,29639.6,28661.2
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x64_32x6_nhwc_align4,30862.7,28736.9,27380.5,28620.5
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x64_32x6_nhwc_align2,25429.1,26662.2,23412,26904.1
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x128_32x6_nhwc_align8,43407.6,38710.5,37886.3,37235.5
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x128_32x6_nhwc_align4,40653.9,36616.6,35717.4,36235
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x128_32x6_nhwc_align2,38451.9,34824.1,33954.2,34735.8
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x64_32x10_nhwc_align8,27577.8,23630.1,23171.3,23665.9
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x64_32x10_nhwc_align4,25456.5,22367.1,21874.3,21462.4
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x64_32x10_nhwc_align2,23168.2,20486.2,20086.8,19702
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_64x3_nhwc_align8,44608.9,39704.4,43358.2,43835.1
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_64x3_nhwc_align4,37134.6,25871.8,33237.6,32667
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_64x3_nhwc_align2,5909.46,5473.34,5598.99,5362.67
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x64_64x3_nhwc_align8,32847.1,30383.6,29649.8,30323.1
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x64_64x3_nhwc_align4,30918.4,28999.7,27692.6,28686.4
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x64_64x3_nhwc_align2,7214.04,6697.29,7109.96,6556.47
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x128_64x3_nhwc_align8,43945,39028,38636.2,39030.1
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x128_64x3_nhwc_align4,40536,36048.2,34889.7,37002.2
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x128_64x3_nhwc_align2,33131.5,32488.8,28507.3,32475
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x64_64x5_nhwc_align8,28371.5,22888.7,23848.9,23171.3
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x64_64x5_nhwc_align4,27334.6,23543.2,23209.8,23260.3
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x64_64x5_nhwc_align2,22061.7,21158.2,19387,21181.7

load_store_k1024.conv2d.csv
loadk1024.conv2d.csv
normal_k1024.conv2d.csv
store_k1024.conv2d.csv
the_four.csv

ZelboK commented Apr 3, 2024

@manishucsd

@ZelboK ZelboK marked this pull request as draft April 3, 2024 17:11
@ZelboK ZelboK marked this pull request as ready for review April 3, 2024 18:10
ZelboK commented Apr 5, 2024

@hwu36

comparison_hgrad.csv

Not seeing benefits from this one either.

Ran:

./tools/profiler/cutlass_profiler  --kernels=cutlass_tensorop_h16816dgrad_optimized_* --n=34 --h=28 --w=28 --c=512 --k=1024 --r=1 --s=1 --pad_h=0 --pad_w=0 --stride_h=2 --stride_w=2 --dilation_h=1 --dilation_w=1

manishucsd (Contributor) commented Apr 5, 2024

@ZelboK, you can compile and run only the align8 kernels for this shape. Use the string "cutlass_tensorop_h16816dgrad_optimized*align8" for cmake and for running the cutlass_profiler.

Are the results in comparison_hgrad.csv with fast_divmod in both loads and stores?
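A sketch of what that filtered build and run could look like. CUTLASS_LIBRARY_KERNELS is the standard CMake option for filtering which kernels get instantiated; the architecture value (86 for the RTX 3080) and build directory layout are assumptions for this setup:

```shell
# Configure CUTLASS to instantiate only the align8 hgrad kernels,
# then build and profile them (sketch; adjust paths/arch to your setup).
cmake .. -DCUTLASS_NVCC_ARCHS=86 \
         -DCUTLASS_LIBRARY_KERNELS="cutlass_tensorop_h16816dgrad_optimized*align8"
make cutlass_profiler -j

./tools/profiler/cutlass_profiler \
  --kernels="cutlass_tensorop_h16816dgrad_optimized*align8" \
  --n=34 --h=28 --w=28 --c=512 --k=1024 --r=1 --s=1 \
  --pad_h=0 --pad_w=0 --stride_h=2 --stride_w=2 \
  --dilation_h=1 --dilation_w=1
```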

ZelboK commented Apr 5, 2024

@manishucsd Sorry, that file isn't complete; please ignore it. I'll paste the complete one here (also run with align8 only). This one will have load, store, load-and-store, and normal GFLOPS benchmarks.
I'm using a 3080. Could we test this on an A100? I don't have access to one; I'm hoping the CI pipeline can.
hgrad.csv

manishucsd (Contributor)

Thanks @ZelboK for the work on this and the analysis. hgrad.csv presents one problem size run with different tile configurations. Looking at the data in hgrad.csv, the FastDivMod refactoring in both load and store gives a significant speedup for the fastest tile.

@hwu36, are you working on this profiling further with more problem sizes on A100 and potentially merging this?


This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.
