Hi, I would like to run DPO training on my 2 A6000 (48GB) GPUs. Specifically, the policy model is trained with QLoRA and the reference model is a quantized model. I would like to use DeepSpeed ZeRO stage 3 to speed up training.
During training, I ran into errors when integrating the model and reference model with DeepSpeed. Below are the relevant code snippet and the error:
The model and reference model were both loaded with:
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

device_index = accelerator.local_process_index
device_map = {"": device_index}  # force data-parallel training

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    from_tf=bool(".ckpt" in model_name_or_path),
    config=config,
    load_in_8bit=True,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    use_flash_attention_2=True if args.use_flash_attn else False,
)

reference_model = model
# some code about converting model to a LoRA model...
reference_model = prepare_deepspeed(accelerator, reference_model)
File "/root/data1/tulu2/open-instruct/open-instruct-main/open_instruct/dpo_tune.py", line 692, in main
reference_model = prepare_deepspeed(accelerator, reference_model)
File "/root/data1/tulu2/open-instruct/open-instruct-main/open_instruct/dpo_tune.py", line 396, in prepare_deepspeed
model, *_ = deepspeed.initialize(model=model, config=config_kwargs)
File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 259, in __init__
self._configure_distributed_model(model)
File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1090, in _configure_distributed_model
self.module.to(self.device)
File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/accelerate/big_modeling.py", line 411, in wrapper
return fn(*args, **kwargs)
File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2176, in to
raise ValueError(
ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 20940) of binary: /conda/envs/tulu_dpo_env/bin/python
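As far as I can tell, the failure happens because deepspeed.initialize builds a DeepSpeedEngine that calls self.module.to(self.device), and transformers forbids .to() on bitsandbytes-quantized models. One workaround I was considering is a rough, untested sketch only: skip deepspeed.initialize for the quantized reference model and leave it on its GPU as-is (the is_loaded_in_8bit / is_loaded_in_4bit flags are set by transformers for bitsandbytes models; the DeepSpeed config values below are placeholders, not the real ones from my setup):

import deepspeed

def prepare_reference_model(accelerator, model):
    # accelerator kept only for signature parity with prepare_deepspeed (unused here)
    # transformers sets these flags on models loaded through bitsandbytes
    if getattr(model, "is_loaded_in_8bit", False) or getattr(model, "is_loaded_in_4bit", False):
        model.eval()
        return model  # already placed on the right GPU; leave it unwrapped

    config_kwargs = {
        "train_micro_batch_size_per_gpu": 1,  # dummy value; the reference model is frozen
        "zero_optimization": {"stage": 3},
        "bf16": {"enabled": True},
    }
    engine, *_ = deepspeed.initialize(model=model, config=config_kwargs)
    engine.module.eval()
    return engine

I'm not sure whether keeping the reference model unsharded like this is acceptable memory-wise, which is part of why I'm asking.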
Thanks for your help in advance!
grayground changed the title from "Seeking for Help: how to incroporate deepspeed zero stage 3 with quantized reference model?" to "Seeking for Help: how to work deepspeed zero stage 3 with quantized reference model?" on Jan 7, 2024.
Hi, could you provide the command you're using to run the script? I think there are issues with distributed training and quantized models, so you might have to use regular LoRA. If you share the command, I can try to reproduce the issue and fix it. Thanks!
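To illustrate what I mean by regular LoRA, here is a rough, untested sketch (model_name_or_path, accelerator, and prepare_deepspeed are taken from your snippet; the LoRA hyperparameters are placeholders): load both models in bf16 without bitsandbytes, apply LoRA only to the policy model, and the reference model can then go through ZeRO-3 without hitting the .to() restriction.

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# load both models in bf16, without any bitsandbytes quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
)
reference_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
)

# apply LoRA adapters only to the policy model (placeholder hyperparameters)
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)

# without quantization, DeepSpeed's .to(device) call inside deepspeed.initialize
# is no longer blocked, so the frozen reference model can be wrapped as before
reference_model = prepare_deepspeed(accelerator, reference_model)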