Support RVV x32-packw #6277
Conversation
Hi @fbarchard @alankelly
Nice diagram in the description. I know these pack functions are a little complicated... just wait till you see how 4-bit works :-)
Could you use a strided load to read each vector with a single instruction?
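A minimal sketch of what that could look like with the RVV v1.0 intrinsics (a hypothetical helper, not this PR's kernel): since the NR values of a given column k in a goi weight tile sit kc elements apart, a single strided `vlse32` fills a whole vector.

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper, not this PR's kernel: fill one vector per vlse32 by
// loading the NR values of column k of a goi weight tile, which sit kc
// elements apart in memory.
static void pack_column_strided(size_t kc, size_t nr,
                                const uint32_t* w,   // goi weights, row stride == kc
                                uint32_t* packed) {
  const size_t vl = __riscv_vsetvl_e32m4(nr);
  for (size_t k = 0; k < kc; k++) {
    // Loads w[0*kc + k], w[1*kc + k], ..., w[(vl-1)*kc + k] in one instruction.
    vuint32m4_t v = __riscv_vlse32_v_u32m4(
        w + k, (ptrdiff_t) (kc * sizeof(uint32_t)), vl);
    __riscv_vse32_v_u32m4(packed, v, vl);
    packed += vl;
  }
}
```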
The names are weird but for now unavoidable: packw-x2v means a 4x2v gemm kernel would use this packing.
For gemm kernels we sometimes put MRx16 (in upper case) to make that clear, but that's the template; the function name and the generated file have the actual value.
For float we don't usually have KR, which is typically for dot product. For 8 bit, and sometimes 16 bit, it is more common.
If you have fp16 or bf16 with a dot product that does 2 values, the actual packing would be the same as x32, and x8 with KR=4 would also be x32 packing. But the parameter kc is measured in elements.
I made a hacked x8 packw that calls x32 and I think it works, except the x32 packw can't handle less than 4 bytes.
The correct solution is probably to branch the code, but the main loop will do a multiple of 4 bytes, so only the remainder code needs to handle KC of 1 to 3 (see the sketch below).
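A sketch of just the remainder idea (a hypothetical helper, and ignoring the NR interleaving of the real packed layout): the main loop consumes weights 4 bytes at a time as x32 elements, and a tail zero-pads the last KC % 4 bytes to a full element.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical sketch: pack one row of int8 weights through the x32 path.
// The main loop handles a multiple of 4 bytes; the tail handles KC of 1..3.
static void pack_x8_row_via_x32(size_t kc, const int8_t* w, int8_t* packed) {
  const size_t kc_main = kc & ~(size_t) 3;  // largest multiple of 4 bytes
  for (size_t k = 0; k < kc_main; k += 4) {
    memcpy(packed, &w[k], 4);               // one "x32" element = 4 int8 weights
    packed += 4;
  }
  if (kc_main != kc) {                      // KC remainder of 1..3
    const size_t rem = kc - kc_main;
    memcpy(packed, &w[kc_main], rem);
    memset(packed + rem, 0, 4 - rem);       // zero-pad to a full 4-byte element
  }
}
```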
goi is the most common case, but gio is relatively simple to implement. In a model the author has the option of supplying transposed weights.
x16 should be identical, aside from the data type. x8 is different, depending on the format: there is a simple x8 used for fp32_qc8w, but that format is not commonly used.
qs8 and qd8 need 8-bit packing with the sum of the weights per NR folded into the bias (see the sketch below). I haven't done NEON yet, but we could really use it... it's slow.
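A hypothetical sketch of that per-channel sum: for each output channel, accumulate its int8 weights over kc and fold the sum into the 32-bit bias slot. The exact sign/placement convention here follows the usual zero-point-correction identity and may differ in detail from XNNPACK's actual packing code.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical sketch of qs8/qd8 bias packing with per-channel weight sums.
static void pack_qs8_bias_with_sums(size_t kc, size_t nr,
                                    int32_t input_zero_point,
                                    const int8_t* w,       // [nr][kc], goi
                                    const int32_t* bias,   // [nr]
                                    int32_t* packed_bias)  // [nr]
{
  for (size_t n = 0; n < nr; n++) {
    int32_t ksum = 0;
    for (size_t k = 0; k < kc; k++) {
      ksum += (int32_t) w[n * kc + k];
    }
    // sum_k w*(x - izp) = sum_k w*x - izp*ksum, so the correction can be
    // precomputed into the bias at packing time (assumed convention).
    packed_bias[n] = bias[n] - input_zero_point * ksum;
  }
}
```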
Using a strided segment load can give better performance.
Yes, I am on the same page.
Could you point me to where I can find it?
For the x8-packw that calls x32-packw, I made a hack PR. Most of the 8-bit packing functions need a per-channel sum: in packing.c it sums up the weights for each NR. The current x8-packw is for f32-qc8w-gemm, which doesn't need the sum, and I've only done scalar.
Can you please rebase? I will land this first thing on Monday.
Hi @fbarchard
Adding code to handle the tail part (KC remainder) could be a better idea.
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Add src/x32-packw/Nv-rvv.in. NR depends on LMUL: if LMUL=4, kblock=8, and VLEN=512, then NR=64 and each tile has 64 biases & K x 64 weights. kblock determines the largest block in which to pack weights. Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
@alankelly I've rebased it.
Thanks, this will land today.
Goal
Enable x32-packw to speed up the dynamic fully connected layers used in LLM models.
Background
The GEMM u-kernel uses the input and packed_weight (weight and bias) buffers to calculate the output values.
Our GEMM implementations use LMUL & VLEN to determine the NR size. This PR provides RVV x32-packw implementations to speed up packing.
XNNPACK originally provided xnn_pack_f32_gemm_goi_w & xnn_pack_f32_gemm_gio_w to preprocess static weights offline. However, language models usually use GEMM with dynamic weights. To speed up the packing process, XNNPACK provides x32-packw u-kernels.
x32-packw packs the weights (col-major or OI) & bias into the packed_weight buffer, as sketched below.
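A minimal scalar sketch of the packed layout, as assumed from this description (not the RVV kernel itself): for one NR-wide tile, NR bias values come first, followed by the weights transposed so that the NR values for each k are contiguous.

```c
#include <stddef.h>
#include <stdint.h>

// Scalar reference for the assumed x32 goi packed layout of one NR-wide tile:
// [ NR biases | NR weights for k=0 | NR weights for k=1 | ... ].
static void x32_packw_goi_reference(size_t kc, size_t nr,
                                    const uint32_t* w,     // [nr][kc], goi
                                    const uint32_t* bias,  // [nr], may be NULL
                                    uint32_t* packed) {
  for (size_t n = 0; n < nr; n++) {
    *packed++ = (bias != NULL) ? bias[n] : 0;
  }
  for (size_t k = 0; k < kc; k++) {
    for (size_t n = 0; n < nr; n++) {
      *packed++ = w[n * kc + k];  // transpose: column k becomes contiguous
    }
  }
}
```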
Parameters
There are two parameters
NR
&KBlock
for x32-packw.NR
is determined by VLEN & LMUL. If VLEN=512 & LMUL=4, NR = 64.KBlock
is to determine the largest rows to transpose in a single iteration.The image above is an example of NR=8 & KBlock=2.
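For concreteness, under the reference layout sketched in the Background section, NR=8 & KBlock=2 would give a packed order like the following, where w[n][k] is the goi weight of output channel n; KBlock only blocks the transpose loop per iteration, so the final packed order itself is assumed unchanged.

```
packed = [ bias[0..7],              // NR = 8 bias values
           w[0..7][0], w[0..7][1],  // iteration 1: KBlock = 2 k-slices of 8
           w[0..7][2], w[0..7][3],  // iteration 2
           ... ]
```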
X32-packw naming
RVV naming:
x${LMUL}v_u${KBLOCK}
Other architectures' naming:
x${NR}_u${KBLOCK}