Add support for data parallel QLoRA training via DeepSpeed Zero stages 0, 1 and 2. #3728

Open · wants to merge 49 commits into master

Conversation

@arnavgarg1 (Contributor) commented Oct 13, 2023

This PR adds support for data parallel QLoRA training using DeepSpeed Stages 0, 1, and 2.

As a refresher, here is what each DeepSpeed ZeRO stage corresponds to (see the config sketch after this list):

  • Stage 0: Disabled, i.e., no partitioning of optimizer states, gradients, or model parameters. You can still perform optimizer and parameter offloading, as well as training with bf16 or fp16.
  • Stage 1: The optimizer states (e.g., for the Adam optimizer, the 32-bit weights and the first and second moment estimates) are partitioned across the processes, so that each process updates only its partition.
  • Stage 2: The reduced 32-bit gradients used to update the model weights are also partitioned, such that each process retains only the gradients corresponding to its portion of the optimizer states.
  • Stage 3: The 16-bit model parameters are partitioned across the processes. ZeRO-3 automatically collects and partitions them during the forward and backward passes.
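
For reference, the stage is selected through the stage key of the zero_optimization section that Ludwig forwards to DeepSpeed. A minimal Python sketch of that section (illustrative only, not taken from this PR's diff):

# Hedged sketch: the ZeRO stage is chosen via the "stage" key of the
# zero_optimization config that is passed through to DeepSpeed.
zero_optimization = {
    # 0 = disabled, 1 = partition optimizer states, 2 = also partition gradients,
    # 3 = also partition model parameters
    "stage": 2,
}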

With this PR, you can now perform QLoRA-based training on larger-than-memory datasets. You can find a full worked example in examples/llm_qlora_data_parallel.

For example, you can now use a config like the following:

model_type: llm
base_model: meta-llama/Llama-2-7b-hf

input_features:
  - name: instruction
    type: text

output_features:
  - name: output
    type: text

adapter:
  type: lora

quantization:
  bits: 4

backend:
  type: ray
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 2

trainer:
  type: finetune
  batch_size: 1
  gradient_accumulation_steps: 4

to:

  1. Create n single-GPU workers on a single multi-GPU node
  2. Load a 4-bit version of Llama-2-7b on each of the GPU training workers
  3. Use Ray Datasets to split the total dataset for training across the n workers
  4. Train using gradient partitioning and optimizer state partitioning across workers (not possible with traditional DDP)

In particular, since this uses DeepSpeed stage 2 with Ray, it lets you stream datasets into memory per worker at training time.
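
As a rough usage sketch (not part of this PR), the config above could be saved to config.yaml and launched through the standard Ludwig Python API; the dataset path below is a placeholder containing the instruction and output text columns:

# Hedged sketch using the standard Ludwig Python API; "config.yaml" holds the config shown
# above and "my_dataset.csv" is a placeholder dataset with "instruction" and "output" columns.
from ludwig.api import LudwigModel

model = LudwigModel(config="config.yaml")
results = model.train(dataset="my_dataset.csv")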

This is a snapshot of GPU utilization on a single-node, 4-GPU pod using 4x A5000s:

Mon Oct 16 07:53:11 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:01:00.0 Off |                  Off |
| 31%   61C    P2   141W / 230W |  10089MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:25:00.0 Off |                  Off |
| 30%   59C    P2   131W / 230W |   8509MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000    On   | 00000000:41:00.0 Off |                  Off |
| 31%   60C    P2   133W / 230W |   8905MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000    On   | 00000000:61:00.0 Off |                  Off |
| 30%   55C    P2   153W / 230W |   8625MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

arnavgarg1 changed the title from "[WIP] Add support for data parallel quantized training by supporting DeepSpeed stages 0, 1 and 2." to "Add support for data parallel quantized training by supporting DeepSpeed stages 0, 1 and 2." on Oct 16, 2023
arnavgarg1 marked this pull request as ready for review on Oct 16, 2023 at 13:30
arnavgarg1 changed the title to "Add support for data parallel quantized training by supporting DeepSpeed Zero stages 0, 1 and 2." on Oct 16, 2023
arnavgarg1 changed the title to "Add support for data parallel QLoRA training via DeepSpeed Zero stages 0, 1 and 2." on Oct 16, 2023
if self.model.trained_using_adapter:
    adapter_ref = ray.put(dist_strategy.extract_adapter_weights_for_serialization(self.model))

optimization_stage = _get_optimization_stage_from_trainer_config(self.trainer_kwargs)
@arnavgarg1 (Contributor, Author) commented:

hmm I wonder if this could be moved into the dist_strategy base class

@jeffkinnison (Contributor) left a comment:

Overall looks good, just a few nits. One question that didn't quite fit anywhere: are there any additional validation checks we should add for DS config? Bounding DS stages, DS/qlora, etc.

ludwig/distributed/base.py (resolved review thread)
ludwig/distributed/base.py (resolved review thread)
stages, we load the base model back. For LLMs, this recreates either the base model or the PEFT model, depending
on whether a PEFT adapter was specified.
"""
if self.zero_optimization_stage != 3:
A Contributor commented:

nit: This condition shows up in a few places, and it's either run as stage <= 2 or stage != 3. Do we want to pick one as the canonical form? Would it make sense to move the condition itself into a DeepSpeedStrategy property?

@arnavgarg1 (Contributor, Author) replied:

@jeffkinnison I think this is a good callout. One more option is to refactor so that DeepSpeedStrategy is a base class and DeepSpeedStage3 is its own subclass that overrides some of the methods, or something to that effect. Will take a look!
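
Not from this PR, but a minimal sketch of what such a property could look like; the constructor and property name here are hypothetical:

# Hypothetical sketch of the suggested property; names and placement are illustrative.
class DeepSpeedStrategy:
    def __init__(self, zero_optimization_stage: int = 3):
        self.zero_optimization_stage = zero_optimization_stage

    @property
    def partitions_model_parameters(self) -> bool:
        # True only for ZeRO stage 3, where model parameters are sharded across workers.
        return self.zero_optimization_stage == 3

Call sites could then check dist_strategy.partitions_model_parameters rather than mixing stage <= 2 and stage != 3.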

ludwig/distributed/deepspeed.py (resolved review thread)
ludwig/distributed/deepspeed.py (resolved review thread)
ludwig/models/llm.py (resolved review thread)
Comment on lines +94 to +102
"""
backend_type = _get_backend_type_from_config(config_obj)
deepspeed_optimization_strategy = _get_deepspeed_optimization_stage_from_config(config_obj)
if backend_type == "ray" and deepspeed_optimization_strategy is not None and deepspeed_optimization_strategy <= 2:
# If using deepspeed stage 0, 1 or 2, we only load the model into memory once we're actually inside
# of the training workers.
return False
# If using local backend or deepspeed stage 3, we load the model into memory upon class initialization.
return True
@arnavgarg1 (Contributor, Author) commented:

Going to push this into the backend class, and maybe consider passing the initialized self.backend object from the LudwigModel class into the create_model function so it gets propagated here; it shouldn't live here like this.
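
As a rough sketch of that direction (every name below is hypothetical, not from this PR), the decision could hang off the backend object instead of living in a standalone helper:

# Hypothetical sketch: let the backend decide whether to materialize the model at init time.
from typing import Optional

class LocalBackendSketch:
    @property
    def loads_model_on_init(self) -> bool:
        return True

class RayDeepSpeedBackendSketch:
    def __init__(self, zero_stage: Optional[int] = None):
        self.zero_stage = zero_stage

    @property
    def loads_model_on_init(self) -> bool:
        # With ZeRO stages 0-2, defer loading until we're inside the training workers.
        return not (self.zero_stage is not None and self.zero_stage <= 2)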

examples/llm_qlora_data_parallel/train.py (resolved review thread)
examples/llm_qlora_data_parallel/README.md (resolved review thread)
examples/llm_qlora_data_parallel/README.md (resolved review thread)
examples/llm_qlora_data_parallel/README.md (resolved review thread)
examples/llm_qlora_data_parallel/train.py (resolved review thread)
@@ -57,6 +59,7 @@ def __init__(

        super().__init__(**kwargs)
        self.zero_optimization = zero_optimization or DEFAULT_ZERO_OPTIMIZATION
        self.zero_optimization_stage = self.zero_optimization.get("stage", 3)
A Collaborator asked:

Why default to 3?

@arnavgarg1 (Contributor, Author) replied:

It's because, in the worst case, it makes sense to assume that the model does not fit into memory and we need to do model-parallel + data-parallel training, since the model will not fit on a single GPU. It also assumes no quantization-based fine-tuning, which I think is a fair assumption to make. All of this is to say that we want a config as simple as this to "just work":

model_type: llm
base_model: ...
input_features: ...
output_features: ...
trainer:
    type: finetune
backend:
    type: ray
    trainer:
        strategy:
            type: deepspeed

to have the highest chance of succeeding irrespective of LLM model size. This is the worst-case scenario: full fine-tuning without an adapter or quantization. Those options only reduce the memory footprint, so stage 3 is the most useful default when none of them are set in the config for LLM fine-tuning.

Let me know if this makes sense. I'll also add a comment in the DeepSpeed class explaining why we default to stage 3.

ludwig/models/llm.py (resolved review thread)
ludwig/models/llm.py (resolved review thread)
ludwig/utils/backend_utils.py (resolved review thread)
@tgaddair (Collaborator) left a comment:

Not a full review yet, but something I see coming up repeatedly is references to the optimization stage outside of the DeepSpeed strategy. This is a red flag that we should be moving this code into the DistributedStrategy interface rather than coupling these two different abstractions (backend and strategy) together.

model_ref = ray.put(dist_strategy.extract_model_for_serialization(self.model))
optimization_stage = _get_optimization_stage_from_trainer_config(self.trainer_kwargs)
model_ref = ray.put(
    dist_strategy.extract_model_for_serialization(self.model, optimization_stage=optimization_stage)
A Collaborator commented:

Shouldn't the dist_strategy already know the optimization stage? This is coupling the DistributedStrategy interface with the Deepspeed optimization stage, which is not desirable. Would be better to keep this internal to the strategy itself.
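
A rough sketch of keeping it internal (the method body is hypothetical; the point is the narrower signature):

# Hypothetical sketch: the strategy consults its own ZeRO stage, so call sites stay DeepSpeed-agnostic.
class DeepSpeedStrategySketch:
    def __init__(self, zero_optimization_stage: int = 3):
        self.zero_optimization_stage = zero_optimization_stage

    def extract_model_for_serialization(self, model):
        if self.zero_optimization_stage != 3:
            # Stages 0-2 keep full model parameters on every worker, so the model can be shipped as-is.
            return model
        raise NotImplementedError("Stage 3 gathering is omitted from this sketch")

The call site then stays model_ref = ray.put(dist_strategy.extract_model_for_serialization(self.model)) with no stage argument.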

dist_model = distributed.prepare_for_inference(model)
if adapter_ref and (distributed_optimization_stage and distributed_optimization_stage <= 2):
A Collaborator commented:

This can be pushed into the DistributedStrategy, again, so we don't need to couple everything to DeepSpeed.
