Add SmoothQuant for T5 (decoder only right now) #1366

Open
wants to merge 2 commits into main

Conversation

eycheung

Summary

Add SmoothQuant quantization for T5
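
For context, the core of SmoothQuant is a per-channel smoothing step that migrates activation outliers into the weights before both are quantized to int8. A minimal sketch of that transform, assuming PyTorch and illustrative names (not the exact code added in this PR):

import torch

def smooth_linear(act_amax, weight, alpha=0.5):
    # act_amax: per-channel max |activation| from calibration, shape [in_features]
    # weight:   linear layer weight, shape [out_features, in_features]
    # Smoothing factor s_j = act_amax_j**alpha / max|W[:, j]|**(1 - alpha)
    weight_amax = weight.abs().amax(dim=0).clamp(min=1e-5)
    scale = (act_amax.pow(alpha) / weight_amax.pow(1.0 - alpha)).clamp(min=1e-5)
    # Activations are divided by `scale` (typically folded into the preceding
    # LayerNorm), weights are multiplied by it, so X @ W.T is unchanged while
    # activation outliers shrink before int8 quantization.
    return scale, weight * scale.unsqueeze(0)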

Test Plan

Built the engine for t5-large using the following commands:

python t5/convert.py -i t5-large -o $CONVERTED_WEIGHT_DIR --weight_data_type float32 --inference_tensor_para_size 1
and then

python build.py --model_type t5 \
                --weight_dir ${CONVERTED_WEIGHT_DIR}/tp1 \
                -o $TRT_ENGINE_DIR \
                --engine_name t5 \
                --use_bert_attention_plugin  \
                --use_gpt_attention_plugin  \
                --use_gemm_plugin  \
                --dtype float16 \
                --max_beam_width 4 \
                --max_batch_size 32 \
                --max_output_len 512 \
                --use_weight_only \
                --world_size 1 \
                --tp_size 1 \
                --use_smooth_quant \
                --per_token \
                --per_channel

Tried this with TP=4 as well, and with each combination of --per_token and --per_channel (i.e. only --per_token, only --per_channel, and both).

Then measured the performance using:

python run.py --engine_dir $TRT_ENGINE_DIR --engine_name t5 --model_name t5-large --max_new_token=64 --num_beams=1 --compare_hf_fp32

Results

Hugging Face FP32

HF E2E time 2135.8585357666016ms
['Das Haus ist wunderbar.', 'i am a high-performance inference optimizer and runtime.', ", the Eiffel Tower was the world's tallest man-made structure.... During its construction, the Eiffel Tower surpassed the Washington Monument to become the world's tallest man-made structure......."]

TRT-LLM SmoothQuant (vanilla, no --per_token or --per_channel)

TRT-LLM E2E time 447.0314979553222ms
['Das Haus ist wunderbar.', 'a high-performance inference optimizer and runtime.', '.: i.g. : i.t.: i.t.: : : : : : : : : : : : : : : : : : : : ']

TRT-LLM SmoothQuant with --per_token and --per_channel
NOTE: to get --per_token to work, I had to disable the quantize_per_token plugin because the plugin gave incorrect results; disabling it greatly slows down inference.

TRT-LLM E2E time 1073.2574462890625ms
Precision: float16
['Das Haus ist wunderbar.', 'i am a high-performance inference optimizer and runtime.', 'the Eiffel Tower was built. The Eiffel Tower is the symbol of Paris and its people. The Eiffel Tower is the symbol of its people. The symbol of its people. The symbol of its people. The symbol of its people. The symbol of its people. The symbol of its people. The symbol']

Remaining Issues

  1. I did not add the SmoothQuantBertAttention module yet, so this quantizes the decoder only. I can add that once we confirm the decoder logic is correct. The existing output seems quite poor compared to HF, so I wanted to double check my implementation before adding more unknowns.
  2. Flan-T5 outputs look very bad, so I am not sure I quantized the gate projection correctly.
  3. When the quantize_per_token plugin is enabled, the output is nonsensical, so the plugin has to be disabled to get --per_token to run at all, which makes the model much slower (see the reference sketch after this list).
  4. SmoothQuant is currently slower than weight-only int8 quantization, even at larger batch sizes (batch_size = 32). Profiling suggests it is compute bound at this point, so my implementation is probably suboptimal in places.
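
For issue 3, a plain-PyTorch reference of what per-token int8 quantization should produce might help isolate the plugin problem; the quantize_per_token plugin output could be diffed against this on a small tensor (function name is illustrative):

import torch

def quantize_per_token_ref(x):
    # x: activations of shape [num_tokens, hidden]; one dynamic scale per row.
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)  # [num_tokens, 1]
    scale = amax / 127.0
    x_int8 = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    # Dequantization: x ≈ x_int8.float() * scale
    return x_int8, scale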

@eycheung (Author)

cc: @symphonylyh for some help in reviewing the current logic.

if self.quant_mode.has_act_and_weight_quant():
    if self.quant_mode.has_act_static_scaling():
        # Avoid quantization layers as it breaks int8 plugins
        encoder_output = quantize_tensor(

I am really uncertain if this is the right scaling factor for quantization, and even more uncertain about the factors for the dequantization step.

This also seems like it would be quite slow to do online, so I'm wondering if there is a better way of implementing this.
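
For comparison, the conventional static per-tensor int8 scheme derives a single scale from a calibrated activation max and reuses exactly the same scale for dequantization. A minimal sketch, assuming generic PyTorch (this is not the actual quantize_tensor signature):

import torch

def static_quantize(x, act_amax):
    # act_amax: calibrated max |activation| for this tensor (static scaling).
    scale = act_amax / 127.0
    x_int8 = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return x_int8, scale

def static_dequantize(x_int8, scale):
    # The dequantization factor must be the same per-tensor scale used above.
    return x_int8.float() * scale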
