Prototype FP8Linear W8A8 runtime quantization #190

Draft · wants to merge 4 commits into main

Conversation

mgoin (Member) commented on Apr 15, 2024

Adds FP8 quantization at runtime for both weights and activations using torch.float8_e4m3fn.
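
For illustration, runtime per-tensor FP8 quantization boils down to scaling by the tensor's max magnitude and casting (a minimal sketch; per_tensor_quantize_fp8 and the per-tensor scaling scheme are assumptions for illustration, not necessarily this PR's exact code):

import torch

def per_tensor_quantize_fp8(t: torch.Tensor):
    # Illustrative sketch: quantize to float8_e4m3fn with one scale per tensor.
    finfo = torch.finfo(torch.float8_e4m3fn)
    # Pick the scale so the largest magnitude maps onto the FP8 range (+/-448).
    scale = t.abs().max().clamp(min=1e-12) / finfo.max
    qt = (t / scale).clamp(min=finfo.min, max=finfo.max).to(torch.float8_e4m3fn)
    return qt, scale.float()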

torch._scaled_mm provides a W8A8 linear kernel for FP8, but is only supported on CUDA devices with compute capability >= 9.0 for torch==2.2.1.

RuntimeError: torch._scaled_mm is only supported on devices with compute capability >= 9.0)

Support has been expanded to CUDA compute capability 8.9 and ROCm MI300+ on PyTorch main, but that won't land in a stable release for a while.

This means that on CUDA devices with compute capability < 9.0 (currently everything below Hopper), the weights will be dequantized back to higher precision, offering no compute savings.
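
For illustration, the two execution paths look roughly like this (a hedged sketch: fp8_linear_forward is a hypothetical name, torch._scaled_mm's exact signature and return type vary across torch versions, and torch==2.2 returns an (output, amax) tuple):

import torch

def fp8_linear_forward(x, qweight, weight_scale, out_dtype=torch.bfloat16):
    if torch.cuda.get_device_capability() >= (9, 0):
        # FP8 W8A8 path: quantize the activation at runtime, then call the
        # fused kernel. _scaled_mm wants its second operand column-major,
        # which a transposed row-major (out_features, in_features) weight satisfies.
        qx, x_scale = per_tensor_quantize_fp8(x)
        out = torch._scaled_mm(qx, qweight.t(),
                               scale_a=x_scale, scale_b=weight_scale,
                               out_dtype=out_dtype)
        # torch <= 2.3 returns (output, amax); newer releases return a tensor.
        return out[0] if isinstance(out, tuple) else out
    # Pre-Hopper fallback: up-convert the FP8 weight and run a normal matmul.
    # Weight memory is still halved at rest, but there are no compute savings.
    weight = qweight.to(out_dtype) * weight_scale
    return x @ weight.t()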

Original precision bfloat16:

from vllm import LLM, SamplingParams

model = LLM("teknium/OpenHermes-2.5-Mistral-7B", enforce_eager=True)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

"""
INFO 04-15 18:29:45 model_runner.py:166] Loading model weights took 13.4976 GB

10 years ago, I was a young, naive, and inexperienced 20-year-old. I had just graduated from college and was about to embark on my first job as a software engineer. I was excited, nervous, and scared all at the same time. I had no idea what to expect, but I was ready to take on the world.
"""

Quantized to FP8, specifically float8_e4m3fn:

from vllm import LLM, SamplingParams

model = LLM("teknium/OpenHermes-2.5-Mistral-7B", enforce_eager=True, quantization="fp8")
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

"""
INFO 04-15 18:34:48 model_runner.py:166] Loading model weights took 6.9976 GB
WARNING 04-15 18:34:48 fp8.py:20] FP8 hardware support doesn't exist for NVIDIA SM < 9.0. Up-conversion to original dtype will be used.

10 years ago, I was a young, naive, and inexperienced 20-year-old. I had just graduated from college and was about to embark on a new journey in my life. I was about to start my first job as a software engineer.

I was excited and nervous at the same time. I had never worked in a professional environment before, and I didn’t know what to expect. I had heard stories of long hours, difficult
"""
