Support RVV x32-packw #6277

Merged: 3 commits merged into google:master on May 13, 2024

Conversation

@bhbruce (Contributor) commented Apr 12, 2024

Goal

Enable x32-packw to speed up dynamic fully connected layers for LLM models.

Background

The GEMM u-kernel uses the input and packed_weight (weight and bias) to compute the output values.
Our GEMM implementations use LMUL & VLEN to determine the NR size; this PR provides RVV x32-packw implementations to speed up packing.

XNNPACK originally provided xnn_pack_f32_gemm_goi_w & xnn_pack_f32_gemm_gio_w to preprocess static weights offline. However, language models usually use GEMM with dynamic weights, so XNNPACK provides x32-packw u-kernels to speed up the packing process.

x32-packw packs the weights (col-major, or OI) & bias into packed_weight buffers.

Parameters

There are two parameters for x32-packw: NR & KBlock.
NR is determined by VLEN & LMUL: with 32-bit elements, NR = (VLEN / 32) × LMUL, so VLEN=512 & LMUL=4 gives NR = 64.
KBlock determines the maximum number of rows to transpose in a single iteration.
[Image: example packing layout for NR=8 & KBlock=2.]
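
To make the layout concrete, here is a minimal scalar reference sketch of the packed buffer: each NR-wide tile starts with NR bias values, followed by the weights transposed to k-major order, KBlock rows per inner step. This is my own illustration (the function name and the zero-padding of the tail tile are assumptions), not the PR's RVV kernel:

#include <stddef.h>
#include <stdint.h>

// Illustration only (hypothetical name, not the XNNPACK API): pack an OI
// weight matrix (nc rows of kc elements) plus bias into tiles of NR columns.
// Each tile holds NR bias values followed by kc x NR weights in k-major order.
static void pack_x32_reference(size_t nc, size_t kc, size_t nr, size_t kblock,
                               const uint32_t* weights, const uint32_t* bias,
                               uint32_t* packed) {
  for (size_t n = 0; n < nc; n += nr) {
    const size_t nr_block = (nc - n) < nr ? (nc - n) : nr;
    // 1. Bias for this tile, zero-padded up to NR.
    for (size_t i = 0; i < nr; i++) {
      *packed++ = (bias != NULL && i < nr_block) ? bias[n + i] : 0;
    }
    // 2. Weights, KBlock rows of the transpose at a time: for each k,
    //    emit one value per output channel in the tile.
    for (size_t k = 0; k < kc; k += kblock) {
      const size_t kb = (kc - k) < kblock ? (kc - k) : kblock;
      for (size_t kk = 0; kk < kb; kk++) {
        for (size_t i = 0; i < nr; i++) {
          *packed++ = (i < nr_block) ? weights[(n + i) * kc + (k + kk)] : 0;
        }
      }
    }
  }
}

In this sketch, KBlock only structures the inner loop; the output layout does not depend on it.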

X32-packw naming

RVV naming: x${LMUL}v_u${KBLOCK}
Other architectures: x${NR}_u${KBLOCK}
For example, an RVV kernel generated with LMUL=2 & KBLOCK=2 would be named x2v_u2.

@bhbruce (Contributor, Author) commented Apr 12, 2024

Hi @fbarchard @alankelly,
This PR adds RVV x32-packw support. If you have time, please help review.

@fbarchard (Contributor) left a comment

Nice diagram in the description. I know these pack functions are a little complicated... just wait till you see how 4-bit works :-)

Could you use a strided load to read each vector with a single instruction?

The names are weird but for now unavoidable: packw-x2v means a 4x2v GEMM kernel would use this packing.
For GEMM kernels we sometimes put MRx16 (in upper case) to make that clear, but that's the template; the generated function name and file have the actual value.

For float we don't usually have KR, which is typically for dot products; for 8-bit and sometimes 16-bit it is more common.
If you have fp16 or bf16 with a dot product that does 2 values, the actual packing would be the same as x32, and x8 with KR=4 would also be x32 packing. But the parameter kc is measured in elements.
I made a hacked x8-packw that calls x32 and I think it works, except that the x32 packw can't handle less than 4 bytes.
The correct solution is probably to branch the code, but the main loop will do a multiple of 4 bytes, so only the remainder code needs to handle KC of 1 to 3.

goi is the most common case, but gio is relatively simple to implement; in a model the author has the option of supplying transposed weights.
x16 should be identical aside from the datatype. x8 is different, depending on the format: there is a simple x8 used for fp32_qc8w, but that format is not commonly used.
qs8 and qd8 need 8-bit packing with the sum of the weights per NR folded into the bias. I haven't done NEON yet, but we could really use it... it's slow.
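
To make the KR point above concrete, here is a small self-contained sketch (my illustration, not XNNPACK code) showing that packing int8 weights with KR=4 produces the same bytes as packing the same data as x32 with KR=1, since four consecutive K values per output channel occupy exactly one 32-bit lane:

#include <assert.h>
#include <stdint.h>
#include <string.h>

// Illustration only: two output channels, kc = 4 int8 values each.
int main(void) {
  const int8_t w8[2][4] = { {1, -2, 3, -4}, {5, -6, 7, -8} };
  uint8_t packed_x8[8], packed_x32[8];

  // x8 packing with KR=4: for each channel, copy 4 consecutive K bytes.
  for (int n = 0; n < 2; n++) {
    memcpy(&packed_x8[n * 4], w8[n], 4);
  }
  // x32 packing with KR=1: reinterpret each channel's 4 bytes as one uint32.
  for (int n = 0; n < 2; n++) {
    uint32_t v;
    memcpy(&v, w8[n], sizeof(v));
    memcpy(&packed_x32[n * 4], &v, sizeof(v));
  }
  assert(memcmp(packed_x8, packed_x32, sizeof(packed_x8)) == 0);
  return 0;
}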

@bhbruce (Contributor, Author) commented Apr 26, 2024

Could you use a strided load to read each vector with a single instruction?

Using strided segment loads can give better performance.
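
For reference, a minimal sketch of the plain strided-load idea using v0.12-style RVV intrinsics; the helper name and parameters are my own illustration, not the PR's kernel:

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Illustration only: gather weights[n][k] for n = 0..nr-1 from an OI matrix
// with row stride kc, using one strided vector load, then store the column
// contiguously into the packed buffer.
static void gather_column_u32m4(const uint32_t* weights_col_k,  // &weights[0 * kc + k]
                                uint32_t* packed,
                                size_t kc,   // elements per row of the OI matrix
                                size_t nr)   // output channels in this tile
{
  size_t vl = __riscv_vsetvl_e32m4(nr);
  // Byte stride between consecutive rows of the OI matrix.
  vuint32m4_t col = __riscv_vlse32_v_u32m4(
      weights_col_k, (ptrdiff_t) (kc * sizeof(uint32_t)), vl);
  __riscv_vse32_v_u32m4(packed, col, vl);
}

The strided segment-load intrinsics (the vlsseg family) extend this by gathering several consecutive K values per output channel in one instruction, which is the "strided segment load" mentioned above.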

packw-x2v means a 4x2v gemm kernel would use this packing.

Yes, I am on the same page.

The correct solution is probably to branch the code, but the main loop will do a multiple of 4 bytes, so only remainder code needs to handle KC of 1 to 3.

Could you point me to where I can find it?

@fbarchard (Contributor) commented

For the x8-packw that calls x32-packw, I made a hack PR (#6356) where you can see the idea. But I think that instead of calling a common function, it will need a custom x8-packw that does 4 bytes at a time in the main loop but handles the KC remainder.

Also, most of the 8-bit packing functions need a per-channel sum... in packing.c it sums up the weights for each NR:
if (kc_idx < kc) {
  const int8_t kv = k[(nr_block_start + nr_block_offset) * kc + kc_idx];
  ksum += (uint32_t) kv;                             // accumulate the per-channel weight sum
  ((int8_t*) packed_weights)[kr_block_offset] = kv;  // store the packed weight byte
}
// Fold the sum into the packed bias: packed_b[n] -= ksum * input_zero_point.
unaligned_indexed_store_u32(packed_b, nr_block_offset,
    unaligned_indexed_load_u32(packed_b, nr_block_offset) - ksum * izp);
It then adjusts the bias by the sum times the input zero point (izp), which is a parameter.

The current x8-packw is for f32-qc8w-gemm, which doesn't need the sum, and I've only done scalar.

@alankelly (Collaborator) commented

Can you please rebase? I will land this first thing on Monday.

@bhbruce (Contributor, Author) commented May 11, 2024

Hi @fbarchard
I got the idea.

But I think instead of calling a common function, it will need a custom x8-packw that does 4 bytes at a time in the main loop, but handles KC remainder

Adding code to handle the tail part (KC remainder) could be a better approach.
Also, we need to be careful about unaligned memory access on some architectures.
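
A minimal sketch of the tail handling being discussed (my own illustration, not code from either PR): the main loop moves 4 bytes at a time through memcpy, which also sidesteps unaligned 32-bit accesses, and a byte-wise loop handles a KC remainder of 1 to 3:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Illustration only: copy kc bytes of one output channel's int8 weights into
// the packed buffer, one 32-bit word at a time, with a byte-wise tail for
// kc % 4 so we never read past the row or rely on unaligned 32-bit loads.
static void copy_row_with_tail(const int8_t* src, int8_t* dst, size_t kc) {
  size_t k = 0;
  for (; k + 4 <= kc; k += 4) {
    uint32_t word;
    memcpy(&word, src + k, sizeof(word));   // safe even if src is unaligned
    memcpy(dst + k, &word, sizeof(word));
  }
  for (; k < kc; k++) {                     // KC remainder of 1 to 3 bytes
    dst[k] = src[k];
  }
}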

Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Add src/x32-packw/Nv-rvv.in; NR depends on LMUL.
If LMUL=4, kblock=8 & VLEN=512, then NR=64 and each tile has 64 bias values & K × 64 weights.
kblock determines the largest block of weights to pack per iteration.

Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
@bhbruce (Contributor, Author) commented May 11, 2024

@alankelly I've rebased it.

@alankelly (Collaborator) commented

Thanks, this will land today.

@copybara-service bot merged commit e5b8377 into google:master on May 13, 2024. 3 checks passed.