@ys-2020 @Sakits @kentang-mit

Thank you for making large models accessible across all devices!

I had a few questions regarding the v2 `gemm`/`gemv` kernels and the weight prepacking required for these kernels.

Can you explain the interleaving logic in the offline weight packer? The `int4` values need to be rearranged first so that the FasterTransformer numeric converter can be used. Is the subsequent interleaving such that `ldmatrix` can be used on these packed values, with each thread then holding the values it needs for `mma.sync`? Typically `ldmatrix` is used on `fp16` data, but in this case the weights are packed `int4`, hence the additional preprocessing.
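To make the first question concrete, here is a minimal numpy sketch of the kind of interleaved packing I am asking about: eight int4 values packed into one 32-bit word in a permuted element order. The permutation `[0, 2, 4, 6, 1, 3, 5, 7]` and the function names are my own assumptions for illustration, not necessarily what the v2 offline packer actually does:

```python
import numpy as np

# Assumed permutation for this sketch only; the real packer's order may differ.
INTERLEAVE_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]

def pack_int4_interleaved(q):
    """q: (N, 8) array of values in [0, 15]; returns (N,) uint32 words."""
    assert q.shape[-1] == 8 and q.min() >= 0 and q.max() <= 15
    packed = np.zeros(q.shape[0], dtype=np.uint32)
    for dst_nibble, src_elem in enumerate(INTERLEAVE_ORDER):
        # Write source element `src_elem` into nibble `dst_nibble` of the word.
        packed |= q[:, src_elem].astype(np.uint32) << np.uint32(4 * dst_nibble)
    return packed

def unpack_int4_interleaved(packed):
    """Inverse of pack_int4_interleaved, for round-trip checking."""
    q = np.zeros((packed.shape[0], 8), dtype=np.uint32)
    for dst_nibble, src_elem in enumerate(INTERLEAVE_ORDER):
        q[:, src_elem] = (packed >> np.uint32(4 * dst_nibble)) & np.uint32(0xF)
    return q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.integers(0, 16, size=(4, 8), dtype=np.uint32)
    assert np.array_equal(unpack_int4_interleaved(pack_int4_interleaved(q)), q)
    print(pack_int4_interleaved(q))
```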
Theoretically, if I were to preprocess the weights of a non-AWQ int4 model -- quantized through another method that yields groupwise scales / zeros -- into the required format, would I be able to use the `v2` kernels on such weights, scales, and zeros?
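For clarity, this is the kind of groupwise scheme I mean by "scales / zeros from another method": every `group_size` input channels share one scale and one zero point, so dequantization is `w = (q - z) * s`. The shapes, group size, and zero-point convention below are assumptions on my part, just to pin down the question:

```python
import numpy as np

def dequantize_groupwise(qweight, scales, zeros, group_size=128):
    """Assumed layout for illustration only:
       qweight: (in_features, out_features) ints in [0, 15]
       scales:  (in_features // group_size, out_features) floats
       zeros:   (in_features // group_size, out_features) ints in [0, 15]
    """
    # Map each input channel to its quantization group, then broadcast.
    groups = np.repeat(np.arange(qweight.shape[0] // group_size), group_size)
    return (qweight.astype(np.float32) - zeros[groups]) * scales[groups]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    in_f, out_f, g = 256, 64, 128
    qw = rng.integers(0, 16, size=(in_f, out_f))
    s = rng.random((in_f // g, out_f), dtype=np.float32) * 0.1
    z = rng.integers(0, 16, size=(in_f // g, out_f))
    w = dequantize_groupwise(qw, s, z, g)
    print(w.shape)
```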