Mistral7b takes 4 times its size in VRAM on A100 #1863

Closed

martinigoyanes opened this issue May 6, 2024 · 5 comments

Comments

@martinigoyanes
Contributor

Environment Setup

Runtime environment:

Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: c38a7d7
Docker label: sha-6c4496a
Kubernetes Cluster deployment

1 A100 GPU with 80 GB VRAM

12 CPUs with 32 GB RAM

TGI version: 2.0.0

TGI Parameters:
MAX_INPUT_LENGTH: "8000"
MAX_TOTAL_TOKENS: "8512"
MAX_CONCURRENT_REQUESTS: "128"
LOG_LEVEL: "INFO"
MAX_BATCH_TOTAL_TOKENS: "4294967295"
WAITING_SERVED_RATIO: "0.3"
MAX_WAITING_TOKENS: "0"
MAX_BATCH_PREFILL_TOKENS: "32768"

Question

I am trying to run Mistral7b on an A100 80GB, but I see it taking up 67 GB of VRAM.

Mon May  6 10:02:50 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:17:00.0 Off |                    0 |
| N/A   47C    P0             87W /  300W |   67815MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Why is that? Mistral7b is a 7B-parameter model, so at 16-bit precision the weights alone should take about 14 GB of VRAM, right?
Where are the extra ~53,815 MiB of VRAM coming from?

What seems to be the issue? Am I doing something wrong?
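
For reference, this is the back-of-the-envelope arithmetic I have in mind (just a sketch; the ~7.2B parameter count is the published size of Mistral-7B-v0.1 and is my assumption):

# Rough sketch: expected VRAM for the weights alone vs. what nvidia-smi reports.
params = 7.2e9                 # approximate parameter count of Mistral-7B-v0.1 (assumption)
bytes_per_param = 2            # float16 / bfloat16
weights_gib = params * bytes_per_param / 1024**3
print(f"weights only: ~{weights_gib:.1f} GiB")          # ~13.4 GiB

reported_mib = 67815           # from the nvidia-smi output above
unexplained_gib = reported_mib / 1024 - weights_gib
print(f"unaccounted for: ~{unexplained_gib:.1f} GiB")   # ~52.8 GiB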

@martinigoyanes
Contributor Author

Could you please explain why this happens? @OlivierDehaene @Narsil Thank you!

@Venkat2811

Venkat2811 commented May 11, 2024

@martinigoyanes Could you share the TGI startup logs for serving Mistral7b? Are you sure no other process is running and using the GPU? nvtop is super useful for seeing more details about the processes using the GPU.

Update: Apologies, I misread the question & assumed kv cache was already taken into account. See: #1863 (comment)

@martinigoyanes
Contributor Author

Hello @Venkat2811, I do not have access to the logs right now, but I am confident no other processes were running on that GPU; it was reserved for serving TGI.

I think I found what "the problem" is. Basically, when TGI warms up the model it allocates memory for the KV cache, so those 67 GB of VRAM come from the ~14 GB of Mistral7B weights, and the rest is the memory allocated for the KV cache.

This is the piece of code I refer to, from flash_causal_lm.py:

try:
    cache_manager = set_cache_manager(
        batch.blocks,
        self.num_layers,
        self.num_kv_heads,
        self.head_size,
        self.sliding_window is not None,
        self.dtype,
        self.device,
    )
    max_bt = batch.max_blocks
    max_s = max_bt * get_cache_manager().block_size
    _, batch, _ = self.generate_token(batch)
except torch.cuda.OutOfMemoryError as e:
    raise RuntimeError(
        f"Not enough memory to handle {len(batch.input_ids)} prefill tokens. "
        f"You need to decrease `--max-batch-prefill-tokens`"
    ) from e

...

# Inspired by the original implementation in [vllm](https://github.com/vllm-project/vllm)
# Calculate the number of blocks that can be allocated with the free memory
dtype_size = torch.tensor([], dtype=self.dtype).element_size()
cache_block_size = BLOCK_SIZE * self.num_kv_heads * self.head_size
total_cache_size = self.num_layers * cache_block_size * 2 * dtype_size

if IS_CUDA_SYSTEM or IS_ROCM_SYSTEM:
    total_free_memory, _ = torch.cuda.mem_get_info(self.device)
    total_gpu_memory = torch.cuda.get_device_properties(
        self.device
    ).total_memory

    free_memory = max(
        0, total_free_memory - (1 - MEMORY_FRACTION) * total_gpu_memory
    )
elif IS_XPU_SYSTEM:
    total_gpu_memory = torch.xpu.get_device_properties(self.device).total_memory
    free_memory = int(total_gpu_memory * 0.5)
else:
    raise NotImplementedError("FlashModel is only available on GPU")

num_blocks = (
    # Leave 5% for some wiggle room
    int((free_memory * 0.95) // total_cache_size)
    # Add batch.blocks as we allocated it above, so it is included in the peak memory.
    + cache_manager.num_blocks
)

del batch
del cache_manager

set_cache_manager(
    num_blocks,
    self.num_layers,
    self.num_kv_heads,
    self.head_size,
    self.sliding_window is not None,
    self.dtype,
    self.device,
)

if CUDA_GRAPHS:
    try:
        logger.info(f"Cuda Graphs are enabled for sizes {CUDA_GRAPHS}")
        # Warmup cuda graphs
        for bs in CUDA_GRAPHS:
            if self.speculate is None or self.speculate + 1 <= bs:
                self.cuda_graph_warmup(bs, max_s, max_bt)
    except torch.cuda.OutOfMemoryError:
        logger.exception(f"Decode cuda graph warmup failed")
else:
    logger.info(f"Cuda Graphs are disabled (CUDA_GRAPHS={CUDA_GRAPHS}).")

return int(num_blocks * BLOCK_SIZE)

However, I am missing a couple of things (a rough attempt at the arithmetic follows the list):

  • What is BLOCK_SIZE referring to? Why is it hardcoded to 16, and why is it used to scale all the calculations?
  • What exactly is total_cache_size?
  • Why is max_batch_total_tokens computed as num_blocks * BLOCK_SIZE? Is it because one block of memory is able to store 16 (block_size) tokens?
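
This is my rough attempt at the arithmetic, as a sketch only; the config values (32 layers, 8 KV heads, head_size 128) are what I assume Mistral 7B uses, and fp16 is assumed for the dtype:

# Sketch of the warmup arithmetic above, with assumed Mistral 7B config values.
BLOCK_SIZE = 16      # hardcoded in TGI; tokens per KV-cache block, if I read the code right
num_layers = 32      # assumption: Mistral 7B config
num_kv_heads = 8     # assumption: Mistral 7B uses grouped-query attention with 8 KV heads
head_size = 128      # assumption: Mistral 7B head dimension
dtype_size = 2       # bytes per element for float16

# Same formulas as in the snippet above
cache_block_size = BLOCK_SIZE * num_kv_heads * head_size
total_cache_size = num_layers * cache_block_size * 2 * dtype_size  # x2 for keys and values

print(total_cache_size)                # 2097152 bytes = 2 MiB per block
print(total_cache_size // BLOCK_SIZE)  # 131072 bytes = 128 KiB of KV cache per token

If that is right, 128 KiB per token over the ~53 GiB left after loading the weights comes out to roughly 400k cacheable tokens, which would explain the memory usage nvidia-smi reports.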

@Venkat2811

Venkat2811 commented May 15, 2024

Hey @martinigoyanes

Basically, when TGI warms up the model it allocates memory for the KV cache, so those 67 GB of VRAM come from the ~14 GB of Mistral7B weights, and the rest is the memory allocated for the KV cache.

Yes, as discussed in other threads, the KV cache is essential for inference; without it there is no inference. So there is no problem here.

Would you consider closing this issue, as it's no longer an issue?

However, I am missing a couple of things:

  • What is BLOCK_SIZE referring to? Why is it hardcoded to 16, and why is it used to scale all the calculations?
  • What exactly is total_cache_size?
  • Why is max_batch_total_tokens computed as num_blocks * BLOCK_SIZE? Is it because one block of memory is able to store 16 (block_size) tokens?

Regarding your questions, I don't know CUDA, but my high-level understanding is that it's related to efficient GPU computation (thread blocks, warps, threads). Depending on the underlying hardware architecture and the model architecture, BLOCK_SIZE needs to be tuned for efficient GPU memory bandwidth and compute utilization. Maybe move this to Discussions?
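
That said, reading the quoted snippet literally, the value returned by warmup does look like blocks × tokens-per-block, i.e. the total number of tokens the KV cache can hold, which would match your guess on the third question. A sketch with illustrative numbers only (the ~53 GiB free figure and the ~2 MiB per block come from the rough numbers above and are assumptions):

# Sketch of the last step of the quoted code, with illustrative numbers only.
BLOCK_SIZE = 16                  # tokens per block
total_cache_size = 2 * 1024**2   # ~2 MiB per block for Mistral 7B in fp16 (see sketch above)
free_memory = 53 * 1024**3       # assume ~53 GiB free after weights + prefill warmup

num_blocks = int((free_memory * 0.95) // total_cache_size)
max_batch_total_tokens = num_blocks * BLOCK_SIZE
print(num_blocks, max_batch_total_tokens)  # 25779 blocks -> 412464 tokens with these numbers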

@martinigoyanes
Contributor Author

Thank you so much for your reply! I will close this issue and follow up in discussion #1897.
