
Weight Packing Format #169

Open
jeromeku opened this issue Mar 31, 2024 · 0 comments
jeromeku commented Mar 31, 2024

@ys-2020 @Sakits @kentang-mit

Thank you for making large models accessible across all devices!

I had a few questions about the v2 GEMM/GEMV kernels and the weight prepacking they require.

Can you explain the interleaving logic in the offline weight packer?

  • My understanding is that the individual int4 values first need to be rearranged so that the FasterTransformer fast numeric converter can be used on them.
  • Is the subsequent interleaving done so that ldmatrix can load the packed values, leaving each thread holding exactly the fragments it needs for mma.sync? Typically ldmatrix operates on fp16 data, but here the weights are packed int4, hence the additional preprocessing.
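To make the first bullet concrete, here is a minimal sketch of the kind of per-word permutation I mean. The specific order below (`[0, 2, 4, 6, 1, 3, 5, 7]`) is an assumption for illustration, matching the pattern commonly associated with FasterTransformer's fast int4-to-fp16 converter; it is not necessarily the exact layout the v2 kernels use.

```python
# Hypothetical nibble order inside one 32-bit word (assumption, for
# illustration only -- not confirmed to be the v2 kernels' layout).
INTERLEAVE_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]

def pack_int4_word(vals):
    """Pack 8 unsigned int4 values (0..15) into one 32-bit word,
    permuting them first so a bit-trick converter reads them back
    in logical order."""
    assert len(vals) == 8 and all(0 <= v < 16 for v in vals)
    word = 0
    for slot, src in enumerate(INTERLEAVE_ORDER):
        word |= vals[src] << (4 * slot)
    return word & 0xFFFFFFFF

def unpack_int4_word(word):
    """Inverse of pack_int4_word: recover the logical order."""
    out = [0] * 8
    for slot, src in enumerate(INTERLEAVE_ORDER):
        out[src] = (word >> (4 * slot)) & 0xF
    return out
```

For example, packing `[0, 1, 2, 3, 4, 5, 6, 7]` under this order yields the word `0x75316420`, and unpacking recovers the original sequence.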

Theoretically, if I were to preprocess the weights of a non-AWQ int4 model -- quantized through another method that yields groupwise scales / zeros -- into the required format, could I use the v2 kernels on those weights, scales, and zeros?
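For context, by "another method that yields groupwise scales / zeros" I mean the usual asymmetric per-group scheme along the input dimension, sketched below. This is my own illustrative code, not the repo's packer; whether its output could actually feed the v2 kernels depends on the group size and zero-point convention the kernels expect.

```python
import numpy as np

def quantize_groupwise(w, group_size=128):
    """Asymmetric groupwise int4 quantization (illustrative sketch).
    w: (out_features, in_features) float matrix.
    Returns uint8 codes in 0..15 plus per-group scales and zero points."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    # 4-bit code range is 0..15; guard against all-constant groups.
    scale = np.maximum((w_max - w_min) / 15.0, 1e-8)
    zero = np.round(-w_min / scale)  # zero point in code space
    q = np.clip(np.round(g / scale + zero), 0, 15).astype(np.uint8)
    return q.reshape(out_f, in_f), scale.squeeze(-1), zero.squeeze(-1)
```

Dequantization is then `(q - zero) * scale` per group, with per-element error bounded by roughly one quantization step.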
