
Initial CompressedTensors config + Activation Quantization support for static W8A8 per tensor #219

Closed
wants to merge 49 commits

Conversation


@dsikka dsikka commented Apr 30, 2024

Summary

  • Initial implementation of CompressedTensors config support + activation quantization for static per-tensor W8A8 (a sketch of the underlying arithmetic follows this list)
  • Includes fused kernels added by @varun-sundar-rabindranath
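
For context, below is a minimal sketch of the arithmetic behind static per-tensor W8A8 (fake) quantization. It is illustrative only: the function names are not part of the vLLM API, and this PR implements the path with fused CUDA kernels rather than eager PyTorch ops.

```python
import torch

def quantize_static_per_tensor(x: torch.Tensor, scale: float) -> torch.Tensor:
    # "Static": the scale is precomputed offline (e.g. from calibration data),
    # not derived from x at runtime. "Per tensor": one scale for the whole tensor.
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

def dequantize_per_tensor(q: torch.Tensor, scale: float) -> torch.Tensor:
    return q.to(torch.float32) * scale

def fake_quant(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Fake quantization: round-trip through int8 so the rounding/clipping error
    # is applied while the tensor stays in floating point.
    return dequantize_per_tensor(quantize_static_per_tensor(x, scale), scale)
```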

Testing/Sample Script:

```python
from vllm import LLM, SamplingParams
import torch

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The US president is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.80, top_p=0.95)

# Create an LLM.
llm = LLM(
    model="nm-testing/tinyllama-one-shot-static-quant-test",
    enforce_eager=True,
    dtype=torch.float32,
    quantization="sparseml",
)

outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Next Steps:

  • Verification of the different inputs expected for `targets` and `ignore`
  • Further optimization of fake quant


dsikka and others added 2 commits April 30, 2024 18:50
Initial CompressedTensors config + Activation Quantization support for static W8A8 per tensor (#195)

- Depending on how we end up parsing `ignore` and `targets` (layer_name
vs layer_type), we may not need `layer_name` to be added to the
linear_method. Will experiment with using a compressed-tensors function
in a follow-up PR

- Initial implementation for Compressed Config support + Activation
Quantization for static per tensor w8a8
- Includes fused kernels added by @varun-sundar-rabindranath

```python
from vllm import LLM, SamplingParams
import torch

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The US president is",
    "The future of AI is"
]
sampling_params = SamplingParams(temperature=0.80, top_p=0.95)

llm = LLM(model="nm-testing/tinyllama-one-shot-static-quant-test", enforce_eager=True, dtype=torch.float32, quantization="sparseml")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

- Verification of the different inputs expected for `targets` and
`ignore`: use functions to parse the layer names which can be shared
by SparseML and vLLM; these would live in compressed-tensors
(https://github.com/neuralmagic/compressed-tensors/blob/67005d76107d4659787f1efd53fe7e6b1d192818/src/compressed_tensors/quantization/lifecycle/apply.py#L86).
A sketch of this kind of matching follows this commit message.
- Updates to further optimize fake quant

---------

Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
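
For illustration, here is a hypothetical sketch of the kind of `targets`/`ignore` layer-name parsing described above. The function names and matching rules are assumptions for exposition, not the actual compressed-tensors API:

```python
import re
from typing import Iterable, Optional

def match_layer(layer_name: str, layer_type: str,
                patterns: Iterable[str]) -> Optional[str]:
    # Hypothetical matcher: a pattern may name a layer type (e.g. "Linear"),
    # a layer-name suffix (e.g. "q_proj"), or a regex prefixed with "re:".
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], layer_name):
                return pattern
        elif layer_type == pattern or layer_name.endswith(pattern):
            return pattern
    return None

def should_quantize(layer_name: str, layer_type: str,
                    targets: Iterable[str], ignore: Iterable[str]) -> bool:
    # A layer is quantized when it matches `targets` and is not in `ignore`.
    return (match_layer(layer_name, layer_type, targets) is not None
            and match_layer(layer_name, layer_type, ignore) is None)
```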
@dsikka dsikka closed this May 24, 2024