Mistral7b takes 4 times its size in VRAM on A100 #1863

Closed

martinigoyanes opened this issue May 6, 2024 · 5 comments

Comments

@martinigoyanes
Contributor

Environment Setup

Runtime environment:

Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: c38a7d7
Docker label: sha-6c4496a
Kubernetes Cluster deployment

1 A100 GPU with 80 GB VRAM

12 CPUs with 32 GB RAM

TGI version: 2.0.0

TGI Parameters:
MAX_INPUT_LENGTH: "8000"
MAX_TOTAL_TOKENS: "8512"
MAX_CONCURRENT_REQUESTS: "128"
LOG_LEVEL: "INFO"
MAX_BATCH_TOTAL_TOKENS: "4294967295"
WAITING_SERVED_RATIO: "0.3"
MAX_WAITING_TOKENS: "0"
MAX_BATCH_PREFILL_TOKENS: "32768"

Question

I am trying to run Mistral7b on an A100 80GB, but I see it taking up 67 GB of VRAM.

Mon May  6 10:02:50 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:17:00.0 Off |                    0 |
| N/A   47C    P0             87W /  300W |   67815MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Why is that? Mistral7b is a 7B-parameter model, so at 16-bit precision the weights alone should take about 14 GB of VRAM, right?
Where are the extra ~53,815 MiB of VRAM coming from?

What seems to be the issue? Am I doing something wrong?
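
For reference, this is the back-of-the-envelope arithmetic I have in mind (just a sketch; the ~7.2B parameter count is the published size of Mistral-7B-v0.1 and is my assumption):

# Rough sketch: expected VRAM for the weights alone vs. what nvidia-smi reports.
params = 7.2e9                 # approximate parameter count of Mistral-7B-v0.1 (assumption)
bytes_per_param = 2            # float16 / bfloat16
weights_gib = params * bytes_per_param / 1024**3
print(f"weights only: ~{weights_gib:.1f} GiB")          # ~13.4 GiB

reported_mib = 67815           # from the nvidia-smi output above
unexplained_gib = reported_mib / 1024 - weights_gib
print(f"unaccounted for: ~{unexplained_gib:.1f} GiB")   # ~52.8 GiB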

@martinigoyanes
Contributor Author

Could you please explain why this happens? @OlivierDehaene @Narsil Thank you!

@Venkat2811

Venkat2811 commented May 11, 2024

@martinigoyanes Could you share the TGI startup logs for serving Mistral7b? Are you sure no other process is running and using the GPU? nvtop is super useful for seeing more details about the processes using the GPU.

Update: Apologies, I misread the question & assumed kv cache was already taken into account. See: #1863 (comment)

@martinigoyanes
Contributor Author

Hello @Venkat2811, I do not have access to the logs right now, but I am confident no other processes were running on that GPU; it was reserved for serving TGI.

I think I found what "the problem" is. Basically, when TGI warms up the model it allocates memory for the KV cache, so those 67 GB of VRAM come from the ~14 GB of Mistral7B weights, and the rest is the memory allocated for the KV cache.

This is the piece of code I refer to, from flash_causal_lm.py:

try:
    cache_manager = set_cache_manager(
        batch.blocks,
        self.num_layers,
        self.num_kv_heads,
        self.head_size,
        self.sliding_window is not None,
        self.dtype,
        self.device,
    )
    max_bt = batch.max_blocks
    max_s = max_bt * get_cache_manager().block_size
    _, batch, _ = self.generate_token(batch)
except torch.cuda.OutOfMemoryError as e:
    raise RuntimeError(
        f"Not enough memory to handle {len(batch.input_ids)} prefill tokens. "
        f"You need to decrease `--max-batch-prefill-tokens`"
    ) from e

...

# Inspired by the original implementation in [vllm](https://github.com/vllm-project/vllm)
# Calculate the number of blocks that can be allocated with the free memory
dtype_size = torch.tensor([], dtype=self.dtype).element_size()
cache_block_size = BLOCK_SIZE * self.num_kv_heads * self.head_size
total_cache_size = self.num_layers * cache_block_size * 2 * dtype_size

if IS_CUDA_SYSTEM or IS_ROCM_SYSTEM:
    total_free_memory, _ = torch.cuda.mem_get_info(self.device)
    total_gpu_memory = torch.cuda.get_device_properties(
        self.device
    ).total_memory

    free_memory = max(
        0, total_free_memory - (1 - MEMORY_FRACTION) * total_gpu_memory
    )
elif IS_XPU_SYSTEM:
    total_gpu_memory = torch.xpu.get_device_properties(self.device).total_memory
    free_memory = int(total_gpu_memory * 0.5)
else:
    raise NotImplementedError("FlashModel is only available on GPU")

num_blocks = (
    # Leave 5% for some wiggle room
    int((free_memory * 0.95) // total_cache_size)
    # Add batch.blocks as we allocated it above, so it is included in the peak memory.
    + cache_manager.num_blocks
)

del batch
del cache_manager

set_cache_manager(
    num_blocks,
    self.num_layers,
    self.num_kv_heads,
    self.head_size,
    self.sliding_window is not None,
    self.dtype,
    self.device,
)

if CUDA_GRAPHS:
    try:
        logger.info(f"Cuda Graphs are enabled for sizes {CUDA_GRAPHS}")
        # Warmup cuda graphs
        for bs in CUDA_GRAPHS:
            if self.speculate is None or self.speculate + 1 <= bs:
                self.cuda_graph_warmup(bs, max_s, max_bt)
    except torch.cuda.OutOfMemoryError:
        logger.exception(f"Decode cuda graph warmup failed")
else:
    logger.info(f"Cuda Graphs are disabled (CUDA_GRAPHS={CUDA_GRAPHS}).")

return int(num_blocks * BLOCK_SIZE)

However, I am missing a couple of things (a rough attempt at the arithmetic follows the list):

  • What is BLOCK_SIZE referring to? Why is it hardcoded to 16, and why is it used to scale all the calculations?
  • What exactly is total_cache_size?
  • Why is max_batch_total_tokens computed as num_blocks * BLOCK_SIZE? Is it because one block of memory is able to store 16 (block_size) tokens?
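
This is my rough attempt at the arithmetic, as a sketch only; the config values (32 layers, 8 KV heads, head_size 128) are what I assume Mistral 7B uses, and fp16 is assumed for the dtype:

# Sketch of the warmup arithmetic above, with assumed Mistral 7B config values.
BLOCK_SIZE = 16      # hardcoded in TGI; tokens per KV-cache block, if I read the code right
num_layers = 32      # assumption: Mistral 7B config
num_kv_heads = 8     # assumption: Mistral 7B uses grouped-query attention with 8 KV heads
head_size = 128      # assumption: Mistral 7B head dimension
dtype_size = 2       # bytes per element for float16

# Same formulas as in the snippet above
cache_block_size = BLOCK_SIZE * num_kv_heads * head_size
total_cache_size = num_layers * cache_block_size * 2 * dtype_size  # x2 for keys and values

print(total_cache_size)                # 2097152 bytes = 2 MiB per block
print(total_cache_size // BLOCK_SIZE)  # 131072 bytes = 128 KiB of KV cache per token

If that is right, 128 KiB per token over the ~53 GiB left after loading the weights comes out to roughly 400k cacheable tokens, which would explain the memory usage nvidia-smi reports.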

@Venkat2811

Venkat2811 commented May 15, 2024

Hey @martinigoyanes

Basically, when TGI warms up the model it allocates memory for the KV cache, so those 67 GB of VRAM come from the ~14 GB of Mistral7B weights, and the rest is the memory allocated for the KV cache.

Yes, as discussed in other threads, the KV cache is essential for inference; without it there is no inference. So there is no problem here.

Would you consider closing this issue, as it's no longer an issue?

However, I am missing a couple of things:

  • What is BLOCK_SIZE referring to? Why is it hardcoded to 16, and why is it used to scale all the calculations?
  • What exactly is total_cache_size?
  • Why is max_batch_total_tokens computed as num_blocks * BLOCK_SIZE? Is it because one block of memory is able to store 16 (block_size) tokens?

Regarding your questions, I don't know CUDA, but my high-level understanding is that it's related to efficient GPU computation (thread blocks, warps, threads). Depending on the underlying hardware architecture and the model architecture, BLOCK_SIZE needs to be tuned for efficient GPU memory bandwidth and compute utilization. Maybe move this to Discussions?
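
That said, reading the quoted snippet literally, the value returned by warmup does look like blocks × tokens-per-block, i.e. the total number of tokens the KV cache can hold, which would match your guess on the third question. A sketch with illustrative numbers only (the ~53 GiB free figure and the ~2 MiB per block come from the rough numbers above and are assumptions):

# Sketch of the last step of the quoted code, with illustrative numbers only.
BLOCK_SIZE = 16                  # tokens per block
total_cache_size = 2 * 1024**2   # ~2 MiB per block for Mistral 7B in fp16 (see sketch above)
free_memory = 53 * 1024**3       # assume ~53 GiB free after weights + prefill warmup

num_blocks = int((free_memory * 0.95) // total_cache_size)
max_batch_total_tokens = num_blocks * BLOCK_SIZE
print(num_blocks, max_batch_total_tokens)  # 25779 blocks -> 412464 tokens with these numbers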

@martinigoyanes
Contributor Author

Thank you so much for your reply! I will close this issue and follow up in discussion #1897.
