Prototype FP8Linear W8A8 runtime quantization #190

Draft · wants to merge 4 commits into main

Conversation

mgoin (Member) commented on Apr 15, 2024

Adds FP8 quantization at runtime for both weights and activations using torch.float8_e4m3fn.
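
For illustration, runtime per-tensor FP8 quantization boils down to scaling by the tensor's max magnitude and casting (a minimal sketch; per_tensor_quantize_fp8 and the per-tensor scaling scheme are assumptions for illustration, not necessarily this PR's exact code):

import torch

def per_tensor_quantize_fp8(t: torch.Tensor):
    # Illustrative sketch: quantize to float8_e4m3fn with one scale per tensor.
    finfo = torch.finfo(torch.float8_e4m3fn)
    # Pick the scale so the largest magnitude maps onto the FP8 range (+/-448).
    scale = t.abs().max().clamp(min=1e-12) / finfo.max
    qt = (t / scale).clamp(min=finfo.min, max=finfo.max).to(torch.float8_e4m3fn)
    return qt, scale.float()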

torch._scaled_mm provides a W8A8 linear kernel for FP8, but is only supported on CUDA devices with compute capability >= 9.0 for torch==2.2.1.

RuntimeError: torch._scaled_mm is only supported on devices with compute capability >= 9.0)

Support has been expanded to CUDA compute capability 8.9 and ROCm MI300+ on PyTorch main, but that won't land in a stable release for a while.

This means that on CUDA devices with compute capability < 9.0 (currently everything below Hopper), the weights will be dequantized back to higher precision, offering no compute savings.
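
For illustration, the two execution paths look roughly like this (a hedged sketch: fp8_linear_forward is a hypothetical name, torch._scaled_mm's exact signature and return type vary across torch versions, and torch==2.2 returns an (output, amax) tuple):

import torch

def fp8_linear_forward(x, qweight, weight_scale, out_dtype=torch.bfloat16):
    if torch.cuda.get_device_capability() >= (9, 0):
        # FP8 W8A8 path: quantize the activation at runtime, then call the
        # fused kernel. _scaled_mm wants its second operand column-major,
        # which a transposed row-major (out_features, in_features) weight satisfies.
        qx, x_scale = per_tensor_quantize_fp8(x)
        out = torch._scaled_mm(qx, qweight.t(),
                               scale_a=x_scale, scale_b=weight_scale,
                               out_dtype=out_dtype)
        # torch <= 2.3 returns (output, amax); newer releases return a tensor.
        return out[0] if isinstance(out, tuple) else out
    # Pre-Hopper fallback: up-convert the FP8 weight and run a normal matmul.
    # Weight memory is still halved at rest, but there are no compute savings.
    weight = qweight.to(out_dtype) * weight_scale
    return x @ weight.t()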

Original precision bfloat16:

from vllm import LLM, SamplingParams

model = LLM("teknium/OpenHermes-2.5-Mistral-7B", enforce_eager=True)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

"""
INFO 04-15 18:29:45 model_runner.py:166] Loading model weights took 13.4976 GB

10 years ago, I was a young, naive, and inexperienced 20-year-old. I had just graduated from college and was about to embark on my first job as a software engineer. I was excited, nervous, and scared all at the same time. I had no idea what to expect, but I was ready to take on the world.
"""

Quantized to FP8, specifically float8_e4m3fn:

from vllm import LLM, SamplingParams

model = LLM("teknium/OpenHermes-2.5-Mistral-7B", enforce_eager=True, quantization="fp8")
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

"""
INFO 04-15 18:34:48 model_runner.py:166] Loading model weights took 6.9976 GB
WARNING 04-15 18:34:48 fp8.py:20] FP8 hardware support doesn't exist for NVIDIA SM < 9.0. Up-conversion to original dtype will be used.

10 years ago, I was a young, naive, and inexperienced 20-year-old. I had just graduated from college and was about to embark on a new journey in my life. I was about to start my first job as a software engineer.

I was excited and nervous at the same time. I had never worked in a professional environment before, and I didn’t know what to expect. I had heard stories of long hours, difficult
"""
