Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)

Reproduction
I have fine-tuned the Gemma 7B model, but I get a "CUDA out of memory" error while running inference on 8 GPUs. According to the error trace, this happens while the PEFT weights are being loaded. While debugging, I observed that a process appears on the main GPU for each child GPU during the PEFT weight loading, and that is when the out-of-memory error occurs. If I reduce the number of GPUs to 2, it works fine.
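For reference, the script is launched with the Accelerate CLI across all 8 GPUs, roughly along these lines (the exact launch flags may differ slightly from my setup):

accelerate launch --num_processes 8 inference.py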
Script for loading model
import os

from accelerate import Accelerator
from peft import PeftModel
from transformers import AutoModelForCausalLM

accelerator = Accelerator()

# Each process loads the full quantized base model onto its own GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    token=os.environ["HF_TOKEN"],
    device_map={"": accelerator.process_index},
)

# Attach the fine-tuned adapter checkpoint on top of the base model.
model = PeftModel.from_pretrained(
    model,
    adapter_checkpoint_dir,
    is_trainable=False,
)
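A possible per-rank workaround I am considering (untested; it assumes that PeftModel.from_pretrained forwards extra keyword arguments to load_adapter, whose torch_device parameter appears in the traceback below) is to pin the adapter load to each process's own GPU instead of letting it default to GPU 0:

import torch

# Untested sketch: make bare "cuda" resolve to this rank's GPU before loading weights.
torch.cuda.set_device(accelerator.process_index)

model = PeftModel.from_pretrained(
    model,
    adapter_checkpoint_dir,
    is_trainable=False,
    # torch_device is what load_adapter passes to load_peft_weights (see traceback);
    # assumed here to be forwarded through **kwargs.
    torch_device=f"cuda:{accelerator.process_index}",
)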
Error trace
(The same traceback was printed by several ranks at once; it is shown once below.)

Traceback (most recent call last):
  File "/home/jupyter/llm-supervised-finetuning/inference.py", line 262, in <module>
    fire.Fire(main)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/jupyter/llm-supervised-finetuning/inference.py", line 233, in main
    model = get_model(model_name, adapter_checkpoint_dir, model_cfg, bnb_config,
  File "/home/jupyter/llm-supervised-finetuning/inference.py", line 140, in get_model
    model = PeftModel.from_pretrained(model
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 356, in from_pretrained
    model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 727, in load_adapter
    adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 326, in load_peft_weights
    adapters_weights = safe_load_file(filename, device=device)
  File "/opt/conda/lib/python3.10/site-packages/safetensors/torch.py", line 310, in load_file
    result[k] = f.get_tensor(k)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.98 GiB. GPU 0 has a total capacity of 21.96 GiB of which 1.06 GiB is free. Process 3491179 has 13.83 GiB memory in use. Process 3491186 has 3.16 GiB memory in use. Process 3491180 has 3.16 GiB memory in use. Including non-PyTorch memory, this process has 184.00 MiB memory in use. Process 3491183 has 184.00 MiB memory in use. Process 3491185 has 184.00 MiB memory in use. Process 3491182 has 184.00 MiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
nvidia-smi output while loading base model
nvidia-smi output with peft weights
Expected behavior
The inference script should run across all 8 GPUs without running out of memory while loading the PEFT weights.
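For context, the rest of the inference script follows the usual Accelerate pattern of splitting prompts between processes. A simplified sketch (not the actual inference.py; the tokenizer setup, prompt list, and generation settings below are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, token=os.environ["HF_TOKEN"])
prompts = ["placeholder prompt"] * 8  # stand-in inputs

# Each rank handles its own slice of the prompts and generates independently.
with accelerator.split_between_processes(prompts) as rank_prompts:
    for prompt in rank_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(accelerator.device)
        outputs = model.generate(**inputs, max_new_tokens=64)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))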