Support RVV x32-packw #6277

Merged: 3 commits merged into google:master on May 13, 2024

Conversation

@bhbruce (Contributor) commented Apr 12, 2024

Goal

Enable x32-packw to speed up dynamic fully connected layers for LLM models.

Background

The GEMM u-kernel uses the input and packed_weight (weight and bias) to compute the output values.
Our GEMM implementations use LMUL & VLEN to determine the NR size; this PR provides RVV x32-packw implementations to speed up packing.

XNNPACK originally provided xnn_pack_f32_gemm_goi_w & xnn_pack_f32_gemm_gio_w to preprocess static weights offline. However, language models usually use GEMM with dynamic weights, so XNNPACK provides x32-packw u-kernels to speed up the packing process.

x32-packw packs the weights (col-major, or OI) & bias into packed_weight buffers.

Parameters

There are two parameters for x32-packw: NR & KBlock.
NR is determined by VLEN & LMUL: with 32-bit elements, NR = (VLEN / 32) × LMUL, so VLEN=512 & LMUL=4 gives NR = 64.
KBlock determines the maximum number of rows to transpose in a single iteration.
[Image: example packing layout for NR=8 & KBlock=2.]
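
To make the layout concrete, here is a minimal scalar reference sketch of the packed buffer: each NR-wide tile starts with NR bias values, followed by the weights transposed to k-major order, KBlock rows per inner step. This is my own illustration (the function name and the zero-padding of the tail tile are assumptions), not the PR's RVV kernel:

#include <stddef.h>
#include <stdint.h>

// Illustration only (hypothetical name, not the XNNPACK API): pack an OI
// weight matrix (nc rows of kc elements) plus bias into tiles of NR columns.
// Each tile holds NR bias values followed by kc x NR weights in k-major order.
static void pack_x32_reference(size_t nc, size_t kc, size_t nr, size_t kblock,
                               const uint32_t* weights, const uint32_t* bias,
                               uint32_t* packed) {
  for (size_t n = 0; n < nc; n += nr) {
    const size_t nr_block = (nc - n) < nr ? (nc - n) : nr;
    // 1. Bias for this tile, zero-padded up to NR.
    for (size_t i = 0; i < nr; i++) {
      *packed++ = (bias != NULL && i < nr_block) ? bias[n + i] : 0;
    }
    // 2. Weights, KBlock rows of the transpose at a time: for each k,
    //    emit one value per output channel in the tile.
    for (size_t k = 0; k < kc; k += kblock) {
      const size_t kb = (kc - k) < kblock ? (kc - k) : kblock;
      for (size_t kk = 0; kk < kb; kk++) {
        for (size_t i = 0; i < nr; i++) {
          *packed++ = (i < nr_block) ? weights[(n + i) * kc + (k + kk)] : 0;
        }
      }
    }
  }
}

In this sketch, KBlock only structures the inner loop; the output layout does not depend on it.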

X32-packw naming

RVV naming: x${LMUL}v_u${KBLOCK}
Other architectures: x${NR}_u${KBLOCK}
For example, an RVV kernel generated with LMUL=2 & KBLOCK=2 would be named x2v_u2.

@bhbruce (Contributor, Author) commented Apr 12, 2024

Hi @fbarchard @alankelly,
This PR adds RVV x32-packw support. If you have time, please help review.

@fbarchard (Contributor) left a comment

Nice diagram in the description. I know these pack functions are a little complicated... just wait till you see how 4-bit works :-)

Could you use a strided load to read each vector with a single instruction?

The names are weird but for now unavoidable: packw-x2v means a 4x2v GEMM kernel would use this packing.
For GEMM kernels we sometimes put MRx16 (in upper case) to make that clear, but that's the template; the generated function name and file have the actual value.

For float we don't usually have KR, which is typically for dot products; for 8-bit and sometimes 16-bit it is more common.
If you have fp16 or bf16 with a dot product that does 2 values, the actual packing would be the same as x32, and x8 with KR=4 would also be x32 packing. But the parameter kc is measured in elements.
I made a hacked x8-packw that calls x32 and I think it works, except that the x32 packw can't handle less than 4 bytes.
The correct solution is probably to branch the code, but the main loop will do a multiple of 4 bytes, so only the remainder code needs to handle KC of 1 to 3.

goi is the most common case, but gio is relatively simple to implement; in a model the author has the option of supplying transposed weights.
x16 should be identical aside from the datatype. x8 is different, depending on the format: there is a simple x8 used for fp32_qc8w, but that format is not commonly used.
qs8 and qd8 need 8-bit packing with the sum of the weights per NR folded into the bias. I haven't done NEON yet, but we could really use it... it's slow.
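
To make the KR point above concrete, here is a small self-contained sketch (my illustration, not XNNPACK code) showing that packing int8 weights with KR=4 produces the same bytes as packing the same data as x32 with KR=1, since four consecutive K values per output channel occupy exactly one 32-bit lane:

#include <assert.h>
#include <stdint.h>
#include <string.h>

// Illustration only: two output channels, kc = 4 int8 values each.
int main(void) {
  const int8_t w8[2][4] = { {1, -2, 3, -4}, {5, -6, 7, -8} };
  uint8_t packed_x8[8], packed_x32[8];

  // x8 packing with KR=4: for each channel, copy 4 consecutive K bytes.
  for (int n = 0; n < 2; n++) {
    memcpy(&packed_x8[n * 4], w8[n], 4);
  }
  // x32 packing with KR=1: reinterpret each channel's 4 bytes as one uint32.
  for (int n = 0; n < 2; n++) {
    uint32_t v;
    memcpy(&v, w8[n], sizeof(v));
    memcpy(&packed_x32[n * 4], &v, sizeof(v));
  }
  assert(memcmp(packed_x8, packed_x32, sizeof(packed_x8)) == 0);
  return 0;
}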

@bhbruce (Contributor, Author) commented Apr 26, 2024

Could you use a strided load to read each vector with a single instruction?

Using strided segment loads can give better performance.
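
For reference, a minimal sketch of the plain strided-load idea using v0.12-style RVV intrinsics; the helper name and parameters are my own illustration, not the PR's kernel:

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Illustration only: gather weights[n][k] for n = 0..nr-1 from an OI matrix
// with row stride kc, using one strided vector load, then store the column
// contiguously into the packed buffer.
static void gather_column_u32m4(const uint32_t* weights_col_k,  // &weights[0 * kc + k]
                                uint32_t* packed,
                                size_t kc,   // elements per row of the OI matrix
                                size_t nr)   // output channels in this tile
{
  size_t vl = __riscv_vsetvl_e32m4(nr);
  // Byte stride between consecutive rows of the OI matrix.
  vuint32m4_t col = __riscv_vlse32_v_u32m4(
      weights_col_k, (ptrdiff_t) (kc * sizeof(uint32_t)), vl);
  __riscv_vse32_v_u32m4(packed, col, vl);
}

The strided segment-load intrinsics (the vlsseg family) extend this by gathering several consecutive K values per output channel in one instruction, which is the "strided segment load" mentioned above.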

packw-x2v means a 4x2v gemm kernel would use this packing.

Yes, I am on the same page.

The correct solution is probably to branch the code, but the main loop will do a multiple of 4 bytes, so only remainder code needs to handle KC of 1 to 3.

Could you point me to where I can find it?

@fbarchard (Contributor) commented

For the x8-packw that calls x32-packw, I made a hack PR (#6356) where you can see the idea. But I think that instead of calling a common function, it will need a custom x8-packw that does 4 bytes at a time in the main loop but handles the KC remainder.

Also, most of the 8-bit packing functions need a per-channel sum... in packing.c it sums up the weights for each NR:
if (kc_idx < kc) {
  const int8_t kv = k[(nr_block_start + nr_block_offset) * kc + kc_idx];
  ksum += (uint32_t) kv;                             // accumulate the per-channel weight sum
  ((int8_t*) packed_weights)[kr_block_offset] = kv;  // store the packed weight byte
}
// Fold the sum into the packed bias: packed_b[n] -= ksum * input_zero_point.
unaligned_indexed_store_u32(packed_b, nr_block_offset,
    unaligned_indexed_load_u32(packed_b, nr_block_offset) - ksum * izp);
It then adjusts the bias by the sum times the input zero point (izp), which is a parameter.

The current x8-packw is for f32-qc8w-gemm, which doesn't need the sum, and I've only done scalar.

@alankelly (Collaborator) commented

Can you please rebase? I will land this first thing on Monday.

@bhbruce (Contributor, Author) commented May 11, 2024

Hi @fbarchard
I got the idea.

But I think instead of calling a common function, it will need a custom x8-packw that does 4 bytes at a time in the main loop, but handles KC remainder

Adding code to handle the tail part (KC remainder) could be a better approach.
Also, we need to be careful about unaligned memory access on some architectures.
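
A minimal sketch of the tail handling being discussed (my own illustration, not code from either PR): the main loop moves 4 bytes at a time through memcpy, which also sidesteps unaligned 32-bit accesses, and a byte-wise loop handles a KC remainder of 1 to 3:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Illustration only: copy kc bytes of one output channel's int8 weights into
// the packed buffer, one 32-bit word at a time, with a byte-wise tail for
// kc % 4 so we never read past the row or rely on unaligned 32-bit loads.
static void copy_row_with_tail(const int8_t* src, int8_t* dst, size_t kc) {
  size_t k = 0;
  for (; k + 4 <= kc; k += 4) {
    uint32_t word;
    memcpy(&word, src + k, sizeof(word));   // safe even if src is unaligned
    memcpy(dst + k, &word, sizeof(word));
  }
  for (; k < kc; k++) {                     // KC remainder of 1 to 3 bytes
    dst[k] = src[k];
  }
}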

Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Add src/x32-packw/Nv-rvv.in; NR depends on LMUL.
If LMUL=4, kblock=8 & VLEN=512, then NR=64 and each tile has 64 bias values & K × 64 weights.
kblock determines the largest block of weights to pack per iteration.

Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
@bhbruce (Contributor, Author) commented May 11, 2024

@alankelly I've rebased it.

@alankelly (Collaborator) commented

Thanks, this will land today.

@copybara-service bot merged commit e5b8377 into google:master on May 13, 2024. 3 checks passed.