
Initial CompressedTensors config + Activation Quantization support for static W8A8 per tensor #195

Merged

dsikka merged 30 commits into ds-quant from compression_config on Apr 30, 2024

Conversation


@dsikka dsikka commented Apr 18, 2024

Adding layer_name

  • Depending on how we end up parsing `ignore` and `targets` (layer_name vs layer_type), we may not need layer_name to be added to the linear_method. Will experiment with using a compressed-tensors function in a follow-up PR.

Summary

  • Initial implementation of CompressedTensors config support + Activation Quantization for static per-tensor W8A8 (see the sketch below)
  • Includes fused kernels added by @varun-sundar-rabindranath
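For readers new to the scheme, here is a minimal PyTorch sketch of static per-tensor W8A8: both weights and activations are quantized to int8 with scales calibrated offline. The scale values and the float emulation of the int8 GEMM below are illustrative only; the fused CUDA kernels this PR adds are not shown here.

```python
import torch

def quantize_per_tensor(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Static per-tensor quantization: one precomputed scale for the whole tensor.
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

# Scales are calibrated offline (e.g., by SparseML) and shipped in the checkpoint.
act_scale, weight_scale = 0.02, 0.01           # illustrative values
x = torch.randn(4, 16)                         # activations
w = torch.randn(8, 16) * 0.1                   # linear-layer weight

x_q = quantize_per_tensor(x, act_scale)        # A8: int8 activations
w_q = quantize_per_tensor(w, weight_scale)     # W8: int8 weights

# int8 GEMM accumulated in int32, then dequantized by the product of scales.
y = (x_q.to(torch.int32) @ w_q.t().to(torch.int32)).float() * (act_scale * weight_scale)
```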

Testing/Sample Script:

```python
from vllm import LLM, SamplingParams
import torch

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The US president is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.80, top_p=0.95)

# Create an LLM with the statically quantized test checkpoint.
llm = LLM(
    model="nm-testing/tinyllama-one-shot-static-quant-test",
    enforce_eager=True,
    dtype=torch.float32,
    quantization="sparseml",
)

outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Next Steps:

  • Verification of the different inputs expected for `targets` and `ignore` --> use functions to parse the layer names which can be shared by sparseml and vllm; would live in compressed-tensors (https://github.com/neuralmagic/compressed-tensors/blob/67005d76107d4659787f1efd53fe7e6b1d192818/src/compressed_tensors/quantization/lifecycle/apply.py#L86)
  • Updates to further optimize fake quant

@dsikka dsikka marked this pull request as ready for review April 24, 2024 16:57
@dsikka dsikka changed the title [WIP] sparseml compression config support [WIP] Initial CompressedTensors config support Apr 24, 2024
@dsikka dsikka changed the title [WIP] Initial CompressedTensors config support Initial CompressedTensors config + Activation Quantization support Apr 24, 2024
@dsikka dsikka changed the title Initial CompressedTensors config + Activation Quantization support Initial CompressedTensors config + Activation Quantization support - DO NOT MERGE Apr 24, 2024
@dsikka dsikka changed the base branch from upstream-main to ds-quant April 24, 2024 21:42
@dsikka dsikka changed the title Initial CompressedTensors config + Activation Quantization support - DO NOT MERGE Initial CompressedTensors config + Activation Quantization support Apr 24, 2024
@dsikka dsikka changed the title Initial CompressedTensors config + Activation Quantization support Initial CompressedTensors config + Activation Quantization support for static W8A8 per tensor Apr 25, 2024
Description:
Remove logging that triggers a device-to-host copy.

---------

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
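For context on that commit: reading a CUDA tensor's value on the host (e.g., via `.item()` or an f-string) forces a blocking device-to-host copy. A generic illustration of the pattern, not the exact code that was removed:

```python
import torch

x = torch.randn(1024, device="cuda")  # assumes a CUDA device is available
scale = x.abs().max() / 127.0         # stays on the GPU

# Anti-pattern: formatting the tensor's value pulls it to the host, which
# implicitly synchronizes the stream (a hidden cudaMemcpy D2H).
print(f"scale = {scale.item()}")

# Keeping the value on-device (and logging only when debugging) avoids the stall.
```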
@robertgshaw2-neuralmagic (Collaborator) left a comment
Minor nits. Generally looks good. Biggest things are:

  • Clean up the StaticW8A8Scheme
  • Remove unnecessary CUDA stuff

@varun-sundar-rabindranath can you do a quick audit of the kernels and let us know which can be removed?

Resolved review threads (outdated): csrc/pybind.cpp, csrc/attention/dtype_int8.cuh, csrc/attention/dtype_float32.cuh, csrc/reduction_utils.cuh, vllm/config.py
```diff
@@ -16,6 +18,7 @@
     "gptq": GPTQConfig,
     "squeezellm": SqueezeLLMConfig,
     "marlin": MarlinConfig,
+    "sparseml": CompressedTensorsConfig
```
Collaborator commented:
why sparseml and not compressed-tensors?

@dsikka (Author) commented Apr 26, 2024:
To comply with vLLM's input handling. The quantization method listed in the SparseML model config is sparseml, so we can change this if we change the value listed in the config.

Collaborator replied:
Can you coordinate with Sara on this?

I think it should be compressed-tensors in the HF config

Reply:
I'm fine with the update, but it is currently called "sparseml" in the compressed-tensors repo so it will need to be updated there too
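To make the naming dependency concrete, here is a hedged sketch of how the string written by SparseML into the model's HF config would be resolved against the mapping shown in the diff above. The dict and function names are illustrative, not vLLM's actual internals:

```python
from typing import Any, Dict

# Illustrative stand-in for the registry the diff above extends.
QUANTIZATION_REGISTRY: Dict[str, Any] = {
    "sparseml": "CompressedTensorsConfig",  # the config class in vLLM
}

def resolve_quant_config(hf_config: Dict[str, Any]) -> Any:
    # config.json carries quantization_config.quant_method; whatever string
    # SparseML writes there must match a registry key, which is why renaming
    # "sparseml" -> "compressed-tensors" has to happen on both sides.
    method = hf_config["quantization_config"]["quant_method"]
    if method not in QUANTIZATION_REGISTRY:
        raise ValueError(f"Unknown quantization method: {method!r}")
    return QUANTIZATION_REGISTRY[method]
```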

Resolved review thread (outdated): csrc/quantization/smoothquant/quant_utils.cuh
dsikka and others added 7 commits April 29, 2024 14:49
Description:

- Rename `csrc/quantization/smoothquant/fused_kernels.cu` -> `csrc/quantization/compressed_tensors/int8_quant_kernels.cu`
- Remove `csrc/attention/dtype_int8.cuh`
- Remove unused quant_per_token kernel; rename `ops.quant` to `ops.quant_per_tensor`
- Remove unused `quant_utils.cuh`
- Remove unused `blockReduceMax` code from `csrc/reduction_utils.cuh`

---------

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
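For contrast between the kernel that was kept and the one that was removed, a hedged PyTorch emulation of the two quantization granularities; the real ops are fused CUDA kernels, and these signatures are illustrative only:

```python
import torch

def quant_per_tensor(x: torch.Tensor, scale: float) -> torch.Tensor:
    # One scalar scale for the whole tensor (what ops.quant_per_tensor computes).
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

def quant_per_token(x: torch.Tensor) -> torch.Tensor:
    # One scale per row/token, computed on the fly (the kernel removed here).
    scales = x.abs().amax(dim=-1, keepdim=True) / 127.0
    return torch.clamp(torch.round(x / scales), -128, 127).to(torch.int8)
```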
@dsikka dsikka merged commit 78a3670 into ds-quant Apr 30, 2024
@dsikka dsikka deleted the compression_config branch April 30, 2024 17:10
dsikka added a commit that referenced this pull request Apr 30, 2024
Initial CompressedTensors config + Activation Quantization support for static W8A8 per tensor (#195)

- Depending on how we end up parsing `ignore` and `targets` (layer_name
vs layer_type), we may not need layer_name to be added to the
linear_method. Will experiment with using a compressed-tensors function
in a follow-up PR.

- Initial implementation of CompressedTensors config support + Activation
Quantization for static per-tensor W8A8
- Includes fused kernels added by @varun-sundar-rabindranath

```python
from vllm import LLM, SamplingParams
import torch

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The US president is",
    "The future of AI is"
]
sampling_params = SamplingParams(temperature=0.80, top_p=0.95)

llm = LLM(model="nm-testing/tinyllama-one-shot-static-quant-test", enforce_eager=True, dtype=torch.float32, quantization="sparseml")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

- Verification of the different inputs expected for `targets` and
`ignore` --> use functions to parse the layer names which can be shared
by sparseml and vllm; would live in compressed tensors
(https://github.com/neuralmagic/compressed-tensors/blob/67005d76107d4659787f1efd53fe7e6b1d192818/src/compressed_tensors/quantization/lifecycle/apply.py#L86)
- Updates to further optimize fake quant

---------

Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
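Since the last bullet mentions fake quant, a brief generic sketch of what fake (simulated) quantization computes: a quantize-dequantize round trip in floating point, so outputs carry the rounding and clamping error of int8 without running int8 kernels. This is a textbook illustration, not this PR's implementation:

```python
import torch

def fake_quant_per_tensor(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Quantize then immediately dequantize: the result stays float but models
    # the error that real int8 execution would introduce.
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q * scale
```

PyTorch also ships a fused equivalent, `torch.fake_quantize_per_tensor_affine`, which is a common target for this kind of optimization.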