Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)

Reproduction
I have fine-tuned the Gemma 7B model, but I get a "CUDA out of memory" error while running inference on 8 GPUs. According to the error trace, this happens while the PEFT weights are being loaded. While debugging, I observed that a process appears on the main GPU for each child GPU during the PEFT weight loading, and that is when the out-of-memory error occurs. If I reduce the number of GPUs to 2, it works fine.
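For reference, the script is launched with the Accelerate CLI across all 8 GPUs, roughly along these lines (the exact launch flags may differ slightly from my setup):

accelerate launch --num_processes 8 inference.py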
Script for loading model
import os

from accelerate import Accelerator
from peft import PeftModel
from transformers import AutoModelForCausalLM

accelerator = Accelerator()

# Each process loads the full quantized base model onto its own GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    token=os.environ["HF_TOKEN"],
    device_map={"": accelerator.process_index},
)

# Attach the fine-tuned adapter checkpoint on top of the base model.
model = PeftModel.from_pretrained(
    model,
    adapter_checkpoint_dir,
    is_trainable=False,
)
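A possible per-rank workaround I am considering (untested; it assumes that PeftModel.from_pretrained forwards extra keyword arguments to load_adapter, whose torch_device parameter appears in the traceback below) is to pin the adapter load to each process's own GPU instead of letting it default to GPU 0:

import torch

# Untested sketch: make bare "cuda" resolve to this rank's GPU before loading weights.
torch.cuda.set_device(accelerator.process_index)

model = PeftModel.from_pretrained(
    model,
    adapter_checkpoint_dir,
    is_trainable=False,
    # torch_device is what load_adapter passes to load_peft_weights (see traceback);
    # assumed here to be forwarded through **kwargs.
    torch_device=f"cuda:{accelerator.process_index}",
)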
Error trace
(The same traceback was printed by several ranks at once; it is shown once below.)

Traceback (most recent call last):
  File "/home/jupyter/llm-supervised-finetuning/inference.py", line 262, in <module>
    fire.Fire(main)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/jupyter/llm-supervised-finetuning/inference.py", line 233, in main
    model = get_model(model_name, adapter_checkpoint_dir, model_cfg, bnb_config,
  File "/home/jupyter/llm-supervised-finetuning/inference.py", line 140, in get_model
    model = PeftModel.from_pretrained(model
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 356, in from_pretrained
    model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 727, in load_adapter
    adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 326, in load_peft_weights
    adapters_weights = safe_load_file(filename, device=device)
  File "/opt/conda/lib/python3.10/site-packages/safetensors/torch.py", line 310, in load_file
    result[k] = f.get_tensor(k)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.98 GiB. GPU 0 has a total capacity of 21.96 GiB of which 1.06 GiB is free. Process 3491179 has 13.83 GiB memory in use. Process 3491186 has 3.16 GiB memory in use. Process 3491180 has 3.16 GiB memory in use. Including non-PyTorch memory, this process has 184.00 MiB memory in use. Process 3491183 has 184.00 MiB memory in use. Process 3491185 has 184.00 MiB memory in use. Process 3491182 has 184.00 MiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
nvidia-smi output while loading base model
nvidia-smi output with peft weights
Expected behavior
The inference script should run across all 8 GPUs without running out of memory while loading the PEFT weights.
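For context, the rest of the inference script follows the usual Accelerate pattern of splitting prompts between processes. A simplified sketch (not the actual inference.py; the tokenizer setup, prompt list, and generation settings below are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, token=os.environ["HF_TOKEN"])
prompts = ["placeholder prompt"] * 8  # stand-in inputs

# Each rank handles its own slice of the prompts and generates independently.
with accelerator.split_between_processes(prompts) as rank_prompts:
    for prompt in rank_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(accelerator.device)
        outputs = model.generate(**inputs, max_new_tokens=64)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))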