
IPEX LLM serving example #3068

Merged · 19 commits · May 16, 2024
Conversation

@bbhattar (Contributor) commented Apr 3, 2024

This PR adds an example for deploying text generation with Large Language Models (LLMs) using IPEX. It can use (1) IPEX weight-only quantization to convert the model to INT8 precision, (2) IPEX SmoothQuant quantization, or (3) the default bfloat16 optimization.

Files:

README.md
llm_handler.py - custom handler for quantizing and deploying the model
model-config-llama2-7b-bf16.yaml - config file for bfloat16 optimizations
model-config-llama2-7b-int8-sq.yaml - config file for smooth-quant quantization
model-config-llama2-7b-int8-woq.yaml - config file for weight-only quantization
sample_text_0.txt - A sample prompt you can use to test the text generation model.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@lxning (Collaborator) left a comment:

Thanks for the contribution. Could you please add a test for this example? See the pytest example: https://github.com/pytorch/serve/blob/master/test/pytest/test_example_gpt_fast.py

examples/large_models/ipex_llm_int8/README.md (resolved review comments)
self_jit = torch.jit.trace(converted_model.eval(), example_inputs, strict=False, check_trace=False)
self_jit = torch.jit.freeze(self_jit.eval())

self_jit.save(self.quantized_model_path)

Collaborator:

Is it possible to add logic to check whether the quantized_model_path exists? If it exists, skip this step to reduce model loading latency.

Contributor Author (@bbhattar):

Added the logic. Users can choose to re-quantize through the clear_cache_dir flag in the config.
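
For reference, a minimal sketch of that caching logic, assuming the handler keeps clear_cache_dir and quantized_model_path as attributes as in the quoted snippet (the structure here is illustrative, not the exact code in the PR):

```python
import os

import torch

# Illustrative: only re-trace and re-quantize when the user asks for it or
# when no cached TorchScript artifact exists at the configured path.
if self.clear_cache_dir or not os.path.exists(self.quantized_model_path):
    self_jit = torch.jit.trace(
        converted_model.eval(), example_inputs, strict=False, check_trace=False
    )
    self_jit = torch.jit.freeze(self_jit.eval())
    self_jit.save(self.quantized_model_path)
else:
    # Reuse the previously saved model to reduce loading latency.
    self_jit = torch.jit.load(self.quantized_model_path)
```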

Comment on lines 566 to 567:

for i, x in enumerate(outputs):
    inferences.append(self.tokenizer.decode(outputs[i], skip_special_tokens=True))

Collaborator:

Usually batch_decode is faster.

Contributor Author (@bbhattar):

Changed to batch_decode.
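
For reference, the per-item loop from the quoted snippet collapses to a single call with the Hugging Face tokenizer API; a minimal sketch:

```python
# Decode all generated sequences in one call instead of looping.
inferences = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
```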

@lxning requested a review from @mreso on May 8, 2024 04:21

@mreso (Collaborator) left a comment:

Left some comments; please verify that the smooth quant branch is actually working.

torchserve --ncs --start --model-store model_store
```

4. From the client, set up batching parameters. I couldn't make it work by putting the max_batch_size and max_batch_delay in config.properties.

Collaborator:

Let's figure this out and update the README before merging.

Contributor Author (@bbhattar):

Fixed it in the new commit.
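
One common way to do this, shown here only as an illustrative sketch and not necessarily what the commit does, is to pass the batching parameters when registering the model through the TorchServe management API; the archive name and values below are assumptions:

```python
import requests

# Register the model with server-side batching parameters; the management
# API accepts batch_size and max_batch_delay as registration options.
resp = requests.post(
    "http://localhost:8081/models",
    params={
        "url": "ipex-llm.mar",   # illustrative archive name
        "initial_workers": 1,
        "batch_size": 4,         # max batch size
        "max_batch_delay": 100,  # milliseconds
    },
)
print(resp.status_code, resp.text)
```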

examples/large_models/ipex_llm_int8/llm_handler.py (resolved review comment)
try:
    import intel_extension_for_pytorch as ipex
    try:
        ipex._C.disable_jit_linear_repack()

Collaborator:

Is this LLM specific, or is this something we should also do in the IPEX integration in the base handler?

def initialize(self, ctx: Context):
    model_name = ctx.model_yaml_config["handler"]["model_name"]
    # path to quantized model, if we are quantizing on the fly, we'll use this path to save the model
    self.clear_cache_dir = ctx.model_yaml_config["handler"]["clear_cache_dir"]

Collaborator:

clear_cache_dir does not seem to be set in any of the example model config YAML files. Better to use .get("clear_cache_dir", DEFAULT_VALUE) here and replace DEFAULT_VALUE with whatever you think is appropriate.

Contributor Author (@bbhattar):

Added a default value for every parameter except the model name.
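
A minimal sketch of the suggested pattern; the default values here are illustrative, not the ones chosen in the PR:

```python
handler_config = ctx.model_yaml_config["handler"]

# model_name stays required; everything else falls back to a default
# when the key is absent from the model config YAML.
model_name = handler_config["model_name"]
self.clear_cache_dir = handler_config.get("clear_cache_dir", False)
self.quantized_model_path = handler_config.get(
    "quantized_model_path", "quantized_model.pt"  # illustrative default
)
```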

model_name = ctx.model_yaml_config["handler"]["model_name"]
# path to quantized model, if we are quantizing on the fly, we'll use this path to save the model
self.clear_cache_dir = ctx.model_yaml_config["handler"]["clear_cache_dir"]
self.quantized_model_path = ctx.model_yaml_config["handler"]["quantized_model_path"]

Collaborator:

Same here and below: it would be good to set a default value using .get() and remove the entry from the YAML file, to concentrate on the important settings there.

Contributor Author (@bbhattar):

Done!

if hasattr(self.user_model.config, n):
    return getattr(self.user_model.config, n)
logger.error(f"Not found target {names[0]}")
exit(0)

Collaborator:

Better to exit with 1 here, as this is an error condition.

Contributor Author (@bbhattar):

Thanks, changed!
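
For illustration, a self-contained sketch of the corrected error path (the helper name is hypothetical):

```python
import logging
import sys

logger = logging.getLogger(__name__)


def fail_missing_target(name: str) -> None:
    # Log the missing config attribute and exit non-zero so the failure
    # is reported as an error rather than a clean exit(0).
    logger.error(f"Not found target {name}")
    sys.exit(1)
```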

# need to recompute these
def _get_target_nums(names):
    for n in names:
        if hasattr(self.user_model.config, n):

Collaborator:

Is this code tested? self here seems to be Evaluator, which does not have user_model as an attribute.

Contributor Author (@bbhattar):

Added a test for the smooth-quant path and fixed the scope issue for user_model.

torch.zeros(1, 0, 0, 1, dtype=torch.long).contiguous(),
torch.zeros([1, n_heads, 1, head_dim]).contiguous(),
torch.zeros([1, n_heads, 1, head_dim]).contiguous(),
self.beam_idx_tmp,

Collaborator:

Same here: self (Evaluator) will not have beam_idx_tmp, which is part of IpexLLMHandler.

Contributor Author (@bbhattar):

Recomputed beam_idx_tmp here.


example_inputs = self.get_example_inputs()

with torch.no_grad(), torch.cpu.amp.autocast(

Collaborator:

nit: The following lines (tracing and saving the model) are the same for all three conditions and could be replaced by a single appearance after the if-else.

Contributor Author (@bbhattar):

Replaced by a single trace_and_export function.
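
A sketch of what that shared helper might look like, based on the trace/freeze/save snippet quoted earlier; the signature is an assumption:

```python
import torch


def trace_and_export(self, converted_model, example_inputs):
    # Shared tail of all three optimization branches: trace, freeze, and
    # save the converted model to the configured quantized_model_path.
    self_jit = torch.jit.trace(
        converted_model.eval(), example_inputs, strict=False, check_trace=False
    )
    self_jit = torch.jit.freeze(self_jit.eval())
    self_jit.save(self.quantized_model_path)
    return self_jit
```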

example_inputs = None
input_ids = torch.ones(32).to(torch.long)
attention_mask = torch.ones(len(input_ids))
if self.example_inputs_mode == "MASK_POS_KV":

Collaborator:

nit: this code could be shared with collate_batch by moving it into a utility function.

Contributor Author (@bbhattar):

Removed this part. The example input is now generated with the dataloader built from collate_batch.

@bbhattar requested reviews from @mreso and @lxning on May 15, 2024 15:29

@mreso (Collaborator) left a comment:

LGTM now. Please address the linting issue before merging, and since our CI worker does not have access to the llama weights, please skip execution of the test for now. Thanks!

## Model Config
In addition to usual torchserve configurations, you need to enable ipex specific optimization arguments.

In order to enable IPEX, ```ipex_enable=true``` in the ```config.parameters``` file. If not enabled it will run with default PyTorch with ```auto_mixed_precision``` if enabled. In order to enable ```auto_mixed_precision```, you need to set ```auto_mixed_precision: true``` in model-config file.

Collaborator:

config.properties?
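
For illustration only, the two settings the README paragraph describes might look like this, assuming the file is indeed config.properties as the comment suggests, and assuming auto_mixed_precision sits under the handler section of the model config YAML:

```properties
# config.properties
ipex_enable=true
```

```yaml
# model-config-llama2-7b-bf16.yaml (excerpt; placement is an assumption)
handler:
  auto_mixed_precision: true
```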

@lxning enabled auto-merge on May 15, 2024 19:48
auto-merge was automatically disabled on May 15, 2024 20:12 (head branch was pushed to by a user without write access)
@mreso enabled auto-merge on May 16, 2024 04:09
@mreso added this pull request to the merge queue on May 16, 2024
Merged via the queue into pytorch:master with commit 34bc370 on May 16, 2024
12 checks passed