Add SmoothQuant for T5 (decoder only right now) #1366
Summary
Add SmoothQuant quantization for T5
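For context, SmoothQuant migrates quantization difficulty from activations into the weights via a per-input-channel scale. A minimal sketch of the scale computation (this is the idea from the SmoothQuant paper, not TRT-LLM's implementation; names are illustrative):

```python
# Minimal sketch of the SmoothQuant idea, not TRT-LLM's implementation.
# Each input channel is rescaled so activation outliers are partially
# migrated into the weights: Y = (X / s) @ (s[:, None] * W) is
# mathematically identical to X @ W but easier to quantize.
import torch

def smoothquant_scales(act_absmax: torch.Tensor,
                       weight: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """act_absmax: per-channel max |activation| from calibration, shape [in].
    weight: linear weight of shape [in, out]. alpha: migration strength."""
    w_absmax = weight.abs().amax(dim=1)                # per input channel, [in]
    s = act_absmax.pow(alpha) / w_absmax.pow(1.0 - alpha)
    return s.clamp(min=1e-5)

# The scales are folded into the layer once, offline:
x_absmax = torch.tensor([10.0, 0.5, 300.0, 2.0])       # toy calibration stats
w = torch.randn(4, 8)
s = smoothquant_scales(x_absmax, w)
w_smoothed = s[:, None] * w                            # absorbed into weights
# At runtime the preceding op divides activations by s (x / s).
```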
Test Plan
Built the engine with t5-small using the TRT-LLM conversion and build scripts (sketched below).
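The exact commands were not preserved in this description; the following is a hypothetical sketch only. The script paths, directory arguments, and the `--smoothquant` alpha flag are assumptions based on the TRT-LLM example layout, while `--per_token` and `--per_channel` are the flags actually under test:

```bash
# Hypothetical sketch -- script paths and all flags except
# --per_token/--per_channel are assumptions, not the exact commands used.
python examples/enc_dec/t5/convert.py \
    --model_dir t5-small \
    --output_dir ./c-model/t5-small \
    --smoothquant 0.5   # assumed SmoothQuant alpha flag

python examples/enc_dec/build.py \
    --model_dir ./c-model/t5-small \
    --output_dir ./engines/t5-small \
    --per_token \
    --per_channel
```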
Tried this with TP=4 as well, and with each combination of `--per_token` and `--per_channel` (i.e. only `--per_token`, only `--per_channel`, and both). Then measured the performance (a sketch of the measurement command follows):
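Again, the exact measurement command is not preserved here; a hypothetical sketch, assuming the enc_dec example's run script:

```bash
# Hypothetical sketch -- the script path and arguments are assumptions.
python examples/enc_dec/run.py \
    --engine_dir ./engines/t5-small \
    --max_new_tokens 64
```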
Results
- Hugging Face FP32
- TRT-LLM SmoothQuant (vanilla, no `--per_token` or `--per_channel`)
- TRT-LLM SmoothQuant with `--per_token` and `--per_channel`
NOTE: to get `--per_token` to work, I had to disable the `quantize_per_token` plugin, since the plugin produced incorrect results; disabling it, however, greatly slows down inference.
Remaining Issues
- There is no `SmoothQuantBertAttention` module yet, so this quantizes the decoder only. I can add that once we confirm the decoder logic is correct. The existing output seems quite poor compared to HF, so I wanted to double-check my implementation before adding more unknowns.
- When the `quantize_per_token` plugin is enabled, the output is nonsensical, so it has to be disabled for `--per_token` to run, but this makes the model much slower.