How to overlap the share2register (shared-to-register) load and the computing process? #14

Open
YijiaZhao opened this issue May 18, 2022 · 6 comments
Labels
question Further information is requested

Comments

YijiaZhao commented May 18, 2022

I have another question about MMult_cuda_12.cu.
Honestly, I don't see how the share2register load and the computation overlap. Is it the asm (PTX) that makes them run in parallel? The instructions are issued sequentially, so how can these two parts of the code hide each other's latency?
part1: loading shared memory into the register panels
lds128(panelA[pp][0], panelA[pp][1], panelA[pp][2], panelA[pp][3],
       aptr_base + ((subk + 1) % 8) * SMEM_LDA * sizeof(float));
lds128(panelA[pp][4], panelA[pp][5], panelA[pp][6], panelA[pp][7],
       aptr_base + (((subk + 1) % 8) * SMEM_LDA + 64) * sizeof(float));
lds128(panelB[pp][0], panelB[pp][1], panelB[pp][2], panelB[pp][3],
       bptr_base + ((subk + 1) % 8) * SMEM_LDB * sizeof(float));
lds128(panelB[pp][4], panelB[pp][5], panelB[pp][6], panelB[pp][7],
       bptr_base + (((subk + 1) % 8) * SMEM_LDB + 64) * sizeof(float));

part2: computing with the panel data
#pragma unroll
for (int i = 0; i < 8; ++i) {
#pragma unroll
  for (int j = 0; j < 8; ++j) {
    sum[i][j] += panelA[subk % 2][i] * panelB[subk % 2][j];
  }
}

tpoisonooo (Owner) commented May 22, 2022

This gets a bit roundabout, so bear with me.

The prerequisite for ping-pong is having two mutually independent agents. In computer architecture the ALU does the arithmetic and the MMU moves the data; those are two independent agents, so one can compute while the other moves data.

Literally, issuing the command itself takes only 1 cycle, while the whole data-moving action takes 100 cycles.

A concrete example: you have a henchman. You order the henchman to go sell ice, while you yourself make the ice. In pseudocode, your work is:

make_ice(0)                   // 100 cycles
sell_ice(henchman, ptr_ice0)  // 1 cycle to issue the command
make_ice(1)                   // 100 cycles
sell_ice(henchman, ptr_ice1)  // 1 cycle to issue the command

The henchman's work:

recv_sell_ice_cmd(ptr_ice0)
do_sell_ice(ptr_ice0)         // 100 cycles

recv_sell_ice_cmd(ptr_ice1)
do_sell_ice(ptr_ice1)         // 100 cycles

Now the work is parallelized: the whole task finishes after 302 (202 + 100) cycles, while the two agents together did 402 cycles' worth of work.

Back to the original question: part1 and part2 are sequential in the code, but they are executed by different hardware units.
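
To tie that back to the kernel, here is a minimal sketch of the register ping-pong (illustrative names and shared-memory indexing, not the exact code of MMult_cuda_12.cu): the part1 loads of iteration subk refill panel[(subk + 1) % 2], while the part2 FFMAs read panel[subk % 2]. Nothing in part2 depends on the registers part1 just started loading, so the load/store unit and the FMA pipeline run concurrently; the dependency only bites one iteration later, when the freshly loaded panel is consumed.

// Sketch of the register double buffering (assumed simplification; variable
// names and indexing are illustrative, not the repo's exact code).
__device__ void compute_thread_tile(const float *smemA, const float *smemB,
                                    int SMEM_LDA, int SMEM_LDB,
                                    int row, int col, float sum[8][8]) {
  float panelA[2][8], panelB[2][8];

  // prologue: fill panel 0 with k-slice 0 before the first compute step
  for (int i = 0; i < 8; ++i) {
    panelA[0][i] = smemA[row + i];
    panelB[0][i] = smemB[col + i];
  }

#pragma unroll
  for (int subk = 0; subk < 8; ++subk) {
    const int cur = subk % 2;        // panel read by the FFMAs below
    const int nxt = (subk + 1) % 2;  // panel being refilled for the next step

    if (subk + 1 < 8) {
      // "part1": issue the loads for k-slice subk + 1; no FFMA below reads
      // these registers, so their latency hides behind "part2"
      for (int i = 0; i < 8; ++i) {
        panelA[nxt][i] = smemA[(subk + 1) * SMEM_LDA + row + i];
        panelB[nxt][i] = smemB[(subk + 1) * SMEM_LDB + col + i];
      }
    }

    // "part2": 8x8 FFMAs on the panel that finished loading one step earlier
    for (int i = 0; i < 8; ++i)
      for (int j = 0; j < 8; ++j)
        sum[i][j] += panelA[cur][i] * panelB[cur][j];
  }
}

The real kernel does the same thing, except each 8-float panel is fetched with two lds128 calls (two float4 loads 64 floats apart), and the (subk + 1) % 8 indexing keeps the pipeline running across main-loop iterations instead of stopping at subk == 7.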

tpoisonooo (Owner):

The "ice" above == bingfen (冰粉), a Chengdu specialty dessert.

YijiaZhao (Author):

Thank you for your reply. There is no sync between part1 and part2, so I thought they would run sequentially. I asked my colleague and he said that part1 and part2 run in parallel in the hardware, and it is the register dependency that ensures the s2r load has finished before the compute uses it. His explanation is the same as yours.

YijiaZhao (Author):

I asked him using the cutlass code, which has the same pipeline as yours. I also want to know why you use PTX: what is the advantage of the asm code?

tpoisonooo (Owner):

PTX on CUDA is not as powerful as __asm__ on the CPU.
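
For context, lds128 here is presumably just a thin wrapper around a single PTX instruction; a common way to write it (my sketch of the usual pattern, not necessarily the exact definition in this repo) is:

#include <cstdint>

// Typical lds128-style wrapper (assumed pattern; may differ from the actual
// definition): a single 128-bit vectorized load from shared memory into four
// float registers, where smem_addr is a 32-bit shared-space address.
__device__ __forceinline__ void lds128(float &r0, float &r1, float &r2, float &r3,
                                       uint32_t smem_addr) {
  asm volatile("ld.shared.v4.f32 {%0, %1, %2, %3}, [%4];\n"
               : "=f"(r0), "=f"(r1), "=f"(r2), "=f"(r3)
               : "r"(smem_addr));
}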

tpoisonooo (Owner):

You can just use plain C code; the GFLOPS should be the same.
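
For example, the PTX wrapper could be replaced by a plain-C float4 read (a sketch with assumed names): with a 16-byte-aligned shared-memory pointer, nvcc normally emits the same LDS.128 instruction.

// Plain CUDA C alternative to a PTX lds128 (illustrative sketch): a float4
// read from shared memory; with a 16-byte-aligned pointer the compiler emits
// a single 128-bit shared-memory load (LDS.128).
__device__ __forceinline__ void lds128_c(float &r0, float &r1, float &r2, float &r3,
                                         const float *smem_ptr) {
  float4 v = *reinterpret_cast<const float4 *>(smem_ptr);
  r0 = v.x; r1 = v.y; r2 = v.z; r3 = v.w;
}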

tpoisonooo added the "question" label on Nov 22, 2022