
Seeking for Help: how to work deepspeed zero stage 3 with quantized reference model? #98

Closed
grayground opened this issue Jan 7, 2024 · 1 comment

Comments

@grayground
Hi, I would like to run DPO training on my 2 A6000 (48GB) GPUs. Specifically, the policy model is trained with QLoRA, and the reference model is a quantized copy of the base model. I would like to use DeepSpeed ZeRO stage 3 to speed up training.

During training, I encountered errors when wrapping the model and reference model with DeepSpeed. Below are the relevant code snippet and the error:

Both the model and the reference model were loaded with:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)
device_index = accelerator.local_process_index
device_map = {"": device_index}  # force data-parallel training
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    from_tf=bool(".ckpt" in model_name_or_path),
    config=config,
    quantization_config=bnb_config,  # load_in_8bit is implied by the config
    device_map=device_map,
    torch_dtype=torch.bfloat16,
    use_flash_attention_2=args.use_flash_attn,
)
reference_model = model

# some code about converting model to a LoRA model...

reference_model = prepare_deepspeed(accelerator, reference_model)
File "/root/data1/tulu2/open-instruct/open-instruct-main/open_instruct/dpo_tune.py", line 692, in main                                                                                                                           
    reference_model = prepare_deepspeed(accelerator, reference_model)                                                                                                                                                              
  File "/root/data1/tulu2/open-instruct/open-instruct-main/open_instruct/dpo_tune.py", line 396, in prepare_deepspeed                                                                                                              
    model, *_ = deepspeed.initialize(model=model, config=config_kwargs)                                                                                                                                                            
  File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize                                                                                                                      
    engine = DeepSpeedEngine(args=args,                                                                                                                                                                                            
  File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 259, in __init__                                                                                                                  
    self._configure_distributed_model(model)                                                                                                                                                                                       
  File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1090, in _configure_distributed_model                                                                                             
    self.module.to(self.device)                                                                                                                                                                                                    
  File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/accelerate/big_modeling.py", line 411, in wrapper                                                                                                                    
    return fn(*args, **kwargs)                                                                                                                                                                                                     
  File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2176, in to                                                                                                                    
    raise ValueError(                                                                                                                                                                                                              
ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.                       
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 20940) of binary: /conda/envs/tulu_dpo_env/bin/python 
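For context, one workaround I am considering (an untested sketch, with a hypothetical helper name `prepare_reference_model`; the `is_loaded_in_8bit` / `is_loaded_in_4bit` flags are the attributes Transformers sets on bitsandbytes-quantized models): since the reference model is frozen, it may not need ZeRO-3 partitioning at all, so `deepspeed.initialize` (and its internal `.to()` call, which raises the error above) could simply be skipped for it:

```python
def prepare_reference_model(accelerator, model, config_kwargs=None):
    """Hypothetical workaround: bitsandbytes-quantized models raise ValueError
    on .to(), which deepspeed.initialize() calls internally. A frozen reference
    model needs no gradients or ZeRO-3 partitioning, so skip DeepSpeed for it
    and freeze it in place on the device it was loaded onto (device_map already
    pinned it to the local GPU)."""
    if getattr(model, "is_loaded_in_8bit", False) or getattr(model, "is_loaded_in_4bit", False):
        model.eval()
        for param in model.parameters():
            param.requires_grad = False  # reference model is inference-only
        return model
    # Non-quantized path: wrap with DeepSpeed as before (ZeRO-3 inference config).
    import deepspeed  # assumed available in the training environment
    engine, *_ = deepspeed.initialize(model=model, config=config_kwargs)
    engine.module.eval()
    return engine
```

I have not verified this keeps DPO log-probs numerically consistent across ranks, but it avoids the `.to()` call entirely for the quantized case.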

Thanks for your help in advance!

@grayground grayground changed the title Seeking for Help: how to incroporate deepspeed zero stage 3 with quantized reference model? Seeking for Help: how to work deepspeed zero stage 3 with quantized reference model? Jan 7, 2024
@hamishivi
Collaborator

Hi, could you provide the command you're using to run the script? I believe there are known issues with distributed training and quantized models, so you might have to use regular LoRA instead. If you share the command, I can try to reproduce and fix the issue. Thanks!
