Hi, I would like to run DPO training on my 2 A6000 (48GB) GPUs. Specifically, the policy model is trained with QLoRA and the reference model is a quantized model. I would like to use DeepSpeed ZeRO stage 3 to speed up training.
During training, I ran into errors when integrating the model and reference model with DeepSpeed. Below are the relevant code snippet and the error:
The model and reference model were both loaded with:
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

device_index = accelerator.local_process_index
device_map = {"": device_index}  # force data-parallel training

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    from_tf=bool(".ckpt" in model_name_or_path),
    config=config,
    load_in_8bit=True,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    use_flash_attention_2=True if args.use_flash_attn else False,
)

reference_model = model
# some code about converting model to a LoRA model...
reference_model = prepare_deepspeed(accelerator, reference_model)
File "/root/data1/tulu2/open-instruct/open-instruct-main/open_instruct/dpo_tune.py", line 692, in main
reference_model = prepare_deepspeed(accelerator, reference_model)
File "/root/data1/tulu2/open-instruct/open-instruct-main/open_instruct/dpo_tune.py", line 396, in prepare_deepspeed
model, *_ = deepspeed.initialize(model=model, config=config_kwargs)
File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 259, in __init__
self._configure_distributed_model(model)
File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1090, in _configure_distributed_model
self.module.to(self.device)
File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/accelerate/big_modeling.py", line 411, in wrapper
return fn(*args, **kwargs)
File "/conda/envs/tulu_dpo_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2176, in to
raise ValueError(
ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 20940) of binary: /conda/envs/tulu_dpo_env/bin/python
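As far as I can tell, the failure happens because deepspeed.initialize builds a DeepSpeedEngine that calls self.module.to(self.device), and transformers forbids .to() on bitsandbytes-quantized models. One workaround I was considering is a rough, untested sketch only: skip deepspeed.initialize for the quantized reference model and leave it on its GPU as-is (the is_loaded_in_8bit / is_loaded_in_4bit flags are set by transformers for bitsandbytes models; the DeepSpeed config values below are placeholders, not the real ones from my setup):

import deepspeed

def prepare_reference_model(accelerator, model):
    # accelerator kept only for signature parity with prepare_deepspeed (unused here)
    # transformers sets these flags on models loaded through bitsandbytes
    if getattr(model, "is_loaded_in_8bit", False) or getattr(model, "is_loaded_in_4bit", False):
        model.eval()
        return model  # already placed on the right GPU; leave it unwrapped

    config_kwargs = {
        "train_micro_batch_size_per_gpu": 1,  # dummy value; the reference model is frozen
        "zero_optimization": {"stage": 3},
        "bf16": {"enabled": True},
    }
    engine, *_ = deepspeed.initialize(model=model, config=config_kwargs)
    engine.module.eval()
    return engine

I'm not sure whether keeping the reference model unsharded like this is acceptable memory-wise, which is part of why I'm asking.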
Thanks for your help in advance!
grayground changed the title from "Seeking for Help: how to incroporate deepspeed zero stage 3 with quantized reference model?" to "Seeking for Help: how to work deepspeed zero stage 3 with quantized reference model?" on Jan 7, 2024.
Hi, could you provide the command you're using to run the script? I think there are issues with distributed training and quantized models, so you might have to use regular LoRA. If you share the command, I can try to reproduce the issue and fix it. Thanks!
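To illustrate what I mean by regular LoRA, here is a rough, untested sketch (model_name_or_path, accelerator, and prepare_deepspeed are taken from your snippet; the LoRA hyperparameters are placeholders): load both models in bf16 without bitsandbytes, apply LoRA only to the policy model, and the reference model can then go through ZeRO-3 without hitting the .to() restriction.

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# load both models in bf16, without any bitsandbytes quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
)
reference_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
)

# apply LoRA adapters only to the policy model (placeholder hyperparameters)
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)

# without quantization, DeepSpeed's .to(device) call inside deepspeed.initialize
# is no longer blocked, so the frozen reference model can be wrapped as before
reference_model = prepare_deepspeed(accelerator, reference_model)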