@ys-2020 @Sakits @kentang-mit

Thank you for making large models accessible across all devices!

I had a few questions regarding the v2 `gemm`/`gemv` kernels and the weight prepacking required for these kernels.

Can you explain the interleaving logic in the offline weight packer? The `int4` values need to be rearranged first so that the FasterTransformer numeric converter can be used. Is the subsequent interleaving such that `ldmatrix` can be used on these packed values, with each thread then holding the values it needs for `mma.sync`? Typically `ldmatrix` is used on `fp16` data, but in this case the weights are packed `int4`, hence the additional preprocessing.
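To make the first question concrete, here is a minimal numpy sketch of the kind of interleaved packing I am asking about: eight int4 values packed into one 32-bit word in a permuted element order. The permutation `[0, 2, 4, 6, 1, 3, 5, 7]` and the function names are my own assumptions for illustration, not necessarily what the v2 offline packer actually does:

```python
import numpy as np

# Assumed permutation for this sketch only; the real packer's order may differ.
INTERLEAVE_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]

def pack_int4_interleaved(q):
    """q: (N, 8) array of values in [0, 15]; returns (N,) uint32 words."""
    assert q.shape[-1] == 8 and q.min() >= 0 and q.max() <= 15
    packed = np.zeros(q.shape[0], dtype=np.uint32)
    for dst_nibble, src_elem in enumerate(INTERLEAVE_ORDER):
        # Write source element `src_elem` into nibble `dst_nibble` of the word.
        packed |= q[:, src_elem].astype(np.uint32) << np.uint32(4 * dst_nibble)
    return packed

def unpack_int4_interleaved(packed):
    """Inverse of pack_int4_interleaved, for round-trip checking."""
    q = np.zeros((packed.shape[0], 8), dtype=np.uint32)
    for dst_nibble, src_elem in enumerate(INTERLEAVE_ORDER):
        q[:, src_elem] = (packed >> np.uint32(4 * dst_nibble)) & np.uint32(0xF)
    return q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.integers(0, 16, size=(4, 8), dtype=np.uint32)
    assert np.array_equal(unpack_int4_interleaved(pack_int4_interleaved(q)), q)
    print(pack_int4_interleaved(q))
```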
Theoretically, if I were to preprocess the weights of a non-AWQ int4 model -- quantized through another method that yields groupwise scales / zeros -- into the required format, would I be able to use the `v2` kernels on such weights, scales, and zeros?
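For clarity, this is the kind of groupwise scheme I mean by "scales / zeros from another method": every `group_size` input channels share one scale and one zero point, so dequantization is `w = (q - z) * s`. The shapes, group size, and zero-point convention below are assumptions on my part, just to pin down the question:

```python
import numpy as np

def dequantize_groupwise(qweight, scales, zeros, group_size=128):
    """Assumed layout for illustration only:
       qweight: (in_features, out_features) ints in [0, 15]
       scales:  (in_features // group_size, out_features) floats
       zeros:   (in_features // group_size, out_features) ints in [0, 15]
    """
    # Map each input channel to its quantization group, then broadcast.
    groups = np.repeat(np.arange(qweight.shape[0] // group_size), group_size)
    return (qweight.astype(np.float32) - zeros[groups]) * scales[groups]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    in_f, out_f, g = 256, 64, 128
    qw = rng.integers(0, 16, size=(in_f, out_f))
    s = rng.random((in_f // g, out_f), dtype=np.float32) * 0.1
    z = rng.integers(0, 16, size=(in_f // g, out_f))
    w = dequantize_groupwise(qw, s, z, g)
    print(w.shape)
```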