Add SmoothQuant for T5 (decoder only right now) #1366

Open
wants to merge 2 commits into main

Conversation

eycheung

Summary

Add SmoothQuant quantization for T5
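
For context, the core of SmoothQuant is a per-channel smoothing step that migrates activation outliers into the weights before both are quantized to int8. A minimal sketch of that transform, assuming PyTorch and illustrative names (not the exact code added in this PR):

import torch

def smooth_linear(act_amax, weight, alpha=0.5):
    # act_amax: per-channel max |activation| from calibration, shape [in_features]
    # weight:   linear layer weight, shape [out_features, in_features]
    # Smoothing factor s_j = act_amax_j**alpha / max|W[:, j]|**(1 - alpha)
    weight_amax = weight.abs().amax(dim=0).clamp(min=1e-5)
    scale = (act_amax.pow(alpha) / weight_amax.pow(1.0 - alpha)).clamp(min=1e-5)
    # Activations are divided by `scale` (typically folded into the preceding
    # LayerNorm), weights are multiplied by it, so X @ W.T is unchanged while
    # activation outliers shrink before int8 quantization.
    return scale, weight * scale.unsqueeze(0)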

Test Plan

Built the engine for t5-large using the following commands:

python t5/convert.py -i t5-large -o $CONVERTED_WEIGHT_DIR --weight_data_type float32 --inference_tensor_para_size 1
and then

python build.py --model_type t5 \
                --weight_dir ${CONVERTED_WEIGHT_DIR}/tp1 \
                -o $TRT_ENGINE_DIR \
                --engine_name t5 \
                --use_bert_attention_plugin  \
                --use_gpt_attention_plugin  \
                --use_gemm_plugin  \
                --dtype float16 \
                --max_beam_width 4 \
                --max_batch_size 32 \
                --max_output_len 512 \
                --use_weight_only \
                --world_size 1 \
                --tp_size 1 \
                --use_smooth_quant \
                --per_token \
                --per_channel

Tried this with TP=4 as well, and with each combination of --per_token and --per_channel (i.e. only --per_token, only --per_channel, and both).

Then measured the performance using:

python run.py --engine_dir $TRT_ENGINE_DIR --engine_name t5 --model_name t5-large --max_new_token=64 --num_beams=1 --compare_hf_fp32

Results

Hugging Face FP32

HF E2E time 2135.8585357666016ms
['Das Haus ist wunderbar.', 'i am a high-performance inference optimizer and runtime.', ", the Eiffel Tower was the world's tallest man-made structure.... During its construction, the Eiffel Tower surpassed the Washington Monument to become the world's tallest man-made structure......."]

TRT-LLM SmoothQuant (vanilla, no --per_token or --per_channel)

TRT-LLM E2E time 447.0314979553222ms
['Das Haus ist wunderbar.', 'a high-performance inference optimizer and runtime.', '.: i.g. : i.t.: i.t.: : : : : : : : : : : : : : : : : : : : ']

TRT-LLM SmoothQuant with --per_token and --per_channel
NOTE: to get --per_token to work, I had to disable the quantize_per_token plugin because the plugin gave incorrect results; disabling it greatly slows down inference.

TRT-LLM E2E time 1073.2574462890625ms
Precision: float16
['Das Haus ist wunderbar.', 'i am a high-performance inference optimizer and runtime.', 'the Eiffel Tower was built. The Eiffel Tower is the symbol of Paris and its people. The Eiffel Tower is the symbol of its people. The symbol of its people. The symbol of its people. The symbol of its people. The symbol of its people. The symbol of its people. The symbol']

Remaining Issues

  1. I did not add the SmoothQuantBertAttention module yet, so this quantizes the decoder only. I can add that once we confirm the decoder logic is correct. The existing output seems quite poor compared to HF, so I wanted to double check my implementation before adding more unknowns.
  2. Flan-T5 outputs look very bad, so I am not sure I quantized the gate projection correctly.
  3. When the quantize_per_token plugin is enabled, the output is nonsensical, so the plugin has to be disabled to get --per_token to run at all, which makes the model much slower (see the reference sketch after this list).
  4. SmoothQuant is currently slower than weight-only int8 quantization, even at larger batch sizes (batch_size = 32). Profiling suggests it is compute bound at this point, so my implementation is probably suboptimal in places.
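
For issue 3, a plain-PyTorch reference of what per-token int8 quantization should produce might help isolate the plugin problem; the quantize_per_token plugin output could be diffed against this on a small tensor (function name is illustrative):

import torch

def quantize_per_token_ref(x):
    # x: activations of shape [num_tokens, hidden]; one dynamic scale per row.
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)  # [num_tokens, 1]
    scale = amax / 127.0
    x_int8 = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    # Dequantization: x ≈ x_int8.float() * scale
    return x_int8, scale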

@eycheung (Author)

cc: @symphonylyh for some help in reviewing the current logic.

if self.quant_mode.has_act_and_weight_quant():
    if self.quant_mode.has_act_static_scaling():
        # Avoid quantization layers as it breaks int8 plugins
        encoder_output = quantize_tensor(

I am really uncertain if this is the right scaling factor for quantization, and even more uncertain about the factors for the dequantization step.

This also seems like it would be quite slow to do online, so I'm wondering if there is a better way of implementing this.
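
For comparison, the conventional static per-tensor int8 scheme derives a single scale from a calibrated activation max and reuses exactly the same scale for dequantization. A minimal sketch, assuming generic PyTorch (this is not the actual quantize_tensor signature):

import torch

def static_quantize(x, act_amax):
    # act_amax: calibrated max |activation| for this tensor (static scaling).
    scale = act_amax / 127.0
    x_int8 = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return x_int8, scale

def static_dequantize(x_int8, scale):
    # The dequantization factor must be the same per-tensor scale used above.
    return x_int8.float() * scale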
