
Support automatically calculating max_total_token_num #81

Open
wants to merge 5 commits into main
Conversation

singularity-s0
Contributor

In ApiServerArgs.md, an algorithm is described for calculating the optimal max_total_token_num argument. This process can be automated, and this PR introduces that feature.

The max_total_token_num argument now defaults to None. If it is not set, the API server automatically calculates an appropriate value from the total GPU memory and the model size, applying a ratio of 0.8 to keep enough memory in reserve for inference.
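For illustration, a minimal sketch of the new defaulting behavior (hypothetical helper name; not the exact code in this PR):

# Sketch: only estimate when the user did not pass --max_total_token_num explicitly.
if args.max_total_token_num is None:
    # estimate_max_total_token_num is a hypothetical helper implementing the
    # formula from ApiServerArgs.md, with a 0.8 safety ratio.
    args.max_total_token_num = estimate_max_total_token_num(args.model_dir, ratio=0.8)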

Docs have also been updated.

@XHPlus
Contributor

XHPlus commented Aug 17, 2023

Thanks for your great PR! We are refactoring part of our code and will merge your PR as soon as the refactored version is ready. Also, I would like to add you as a WeChat friend (hao95111).

@hiworldwzj
Collaborator

@singularity-s0 Hello, can this feature be modified to support all models? Different models may require different calculation methods (GQA models, for example, differ), so should the implementation of this feature be bound to each individual model instance?

@singularity-s0
Contributor Author

singularity-s0 commented Aug 21, 2023

Hi,

I'm not entirely sure how GQA or other implementations affect GPU memory usage; could you please elaborate?

Generally, according to the docs, the formula is max_total_token_num = (total_free_gpu_memory - model_parameter_size) * 0.8 / kv_cache_size, where:

  • total_free_gpu_memory is read using the PyTorch CUDA API. This should be the ideal implementation.
  • model_parameter_size is estimated from the size of the weight files on disk. This should mostly be accurate, unless some kind of compression I'm unaware of is used.
  • kv_cache_size is model-dependent. If config.json provides enough information to calculate this value for each model, then model-specific implementations are not required. However, I'm not sure this is always the case (maybe GQA somehow affects this?).
  • Some implementations may require additional memory (maybe GQA?). Either config.json tells us enough, or we need model-specific implementations. A rough sketch of the estimation follows this list.
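Concretely, under the assumptions of fp16 weights and KV cache and a plain multi-head attention layout (the helper name and details are illustrative, not this PR's actual code):

import glob
import json
import os
import torch

def estimate_max_total_token_num(model_dir, ratio=0.8):
    # Free GPU memory in bytes, via the PyTorch CUDA API.
    free_gpu_mem, _ = torch.cuda.mem_get_info()

    # Approximate model_parameter_size from the weight files on disk.
    weight_files = glob.glob(os.path.join(model_dir, "*.bin")) + \
                   glob.glob(os.path.join(model_dir, "*.safetensors"))
    model_parameter_size = sum(os.path.getsize(f) for f in weight_files)

    # Per-token KV cache: 2 (K and V) * layers * hidden_size * 2 bytes (fp16).
    with open(os.path.join(model_dir, "config.json")) as f:
        config = json.load(f)
    kv_cache_size = 2 * config["num_hidden_layers"] * config["hidden_size"] * 2

    return int((free_gpu_mem - model_parameter_size) * ratio / kv_cache_size)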

@hiworldwzj
Collaborator

@singularity-s0 kv_cache_size is different for models that use GQA. See "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints".

@singularity-s0
Contributor Author

From my understanding of the paper mentioned above, GQA reduces kv_cache_size by a factor of num_attention_heads / num_key_value_heads. Both values are available in config.json, so kv_cache_size can always be calculated.

The new formula will be
max_total_token_num = (total_free_gpu_memory - model_parameter_size) * 0.8 / (original_kv_cache_size * num_key_value_heads / num_attention_heads)

For models that do not use GQA, simply default num_key_value_heads to num_attention_heads. All current models would be supported this way.
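For concreteness, a sketch of the GQA-aware per-token KV cache size under this assumption (config.json key names as used in Hugging Face-style configs; the function name is illustrative):

def kv_cache_size_per_token(config, dtype_bytes=2):
    num_attention_heads = config["num_attention_heads"]
    # GQA models store fewer KV heads; fall back to MHA when the key is absent.
    num_key_value_heads = config.get("num_key_value_heads", num_attention_heads)
    head_dim = config["hidden_size"] // num_attention_heads
    # 2 accounts for K and V; the cache shrinks by num_attention_heads / num_key_value_heads.
    return 2 * config["num_hidden_layers"] * num_key_value_heads * head_dim * dtype_bytes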

Is my understanding correct?

@hiworldwzj
Collaborator

@singularity-s0 Yes, you are right.

@singularity-s0
Contributor Author

This PR has been updated with changes to how kv_cache_size is calculated. Please review.

with open(config_path, 'r') as f:
    config = json.load(f)
hidden_size = config['hidden_size']
layer_num = config['num_hidden_layers']
Collaborator

@singularity-s0 This code may not be very robust if the key names in config.json change.
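One possible way to harden it (a sketch only; the alias keys are just examples):

def get_config_value(config, *candidate_keys):
    # Try a few known key spellings before failing loudly.
    for key in candidate_keys:
        if key in config:
            return config[key]
    raise KeyError(f"none of {candidate_keys} found in config.json")

# hidden_size = get_config_value(config, "hidden_size", "n_embd")
# layer_num = get_config_value(config, "num_hidden_layers", "n_layer")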

    total_size = total_size / (1024 ** 3)
    return total_size

def get_kv_cache_size(model_dir):
Collaborator

"get_kv_cache_size and xxxx" is best implemented as a member function of TpPartBaseModel and should be inherited and implemented by its subclasses.

Contributor Author

It seems that max_total_token_num (and batch_max_tokens, which depends on it) gets passed to a lot of subsystems before the model is initialized, so we need this value to be ready early.

Is there any way to achieve that if this is implemented as a member function of TpPartBaseModel?

Collaborator

@singularity-s0 You can try to add a method in TpPartBaseModel, but it is not easy to get and set batch_max_tokens in TpPartBaseModel. Let me think about how to implement it elegantly. What are your suggestions?

Contributor Author

Ideally, since each instance of the LightLLM server is bound to a single model, the model configuration can (and should) be loaded before all other subsystems are initialized, because those subsystems may depend on it, as in the case of max_total_token_num. A refactor would be the most elegant way to address this.

Other parameters such as max_req_total_len and dtype (which is currently hardcoded to fp16) may also depend on the model's config.json and would benefit from this refactor.

However, I imagine such a refactor would not be easy. Hackier workarounds are also possible, but it is ultimately up to you to decide which approach is best.
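For what it's worth, a rough sketch of the startup ordering described above (all helper names are hypothetical):

def start_server(args):
    # Read config.json once, up front, before any subsystem starts.
    config = load_model_config(args.model_dir)
    if args.max_total_token_num is None:
        args.max_total_token_num = estimate_max_total_token_num(args.model_dir)
    # batch_max_tokens and friends can now be derived from the final value.
    # Only afterwards spin up the router, detokenizer, HTTP server, etc.
    start_subsystems(args, config)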

Collaborator

@singularity-s0 You could write a standalone recommendation program that generates a value for max_total_token_num; that would be more appropriate.
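Something along these lines, perhaps, reusing the estimate_max_total_token_num sketch from earlier in this thread (illustrative only):

# recommend_max_total_token_num.py (hypothetical standalone helper)
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_dir", required=True)
    args = parser.parse_args()
    # estimate_max_total_token_num: the hypothetical estimator sketched earlier.
    print("Recommended max_total_token_num:", estimate_max_total_token_num(args.model_dir))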

hiworldwzj self-requested a review on December 4, 2023.