Add support for data parallel QLoRA training via DeepSpeed Zero stages 0, 1 and 2. #3728

Open · wants to merge 49 commits into master

Conversation

@arnavgarg1 (Contributor) commented Oct 13, 2023

This PR adds support for data parallel QLoRA training using DeepSpeed Stages 0, 1, and 2.

As a refresher, here is what each DeepSpeed ZeRO stage corresponds to (see the config sketch after this list):

  • Stage 0: Disabled, i.e., no partitioning of optimizer states, gradients, or model parameters. You can still perform optimizer and parameter offloading, as well as training with bf16 or fp16.
  • Stage 1: The optimizer states (e.g., for the Adam optimizer, the 32-bit weights and the first and second moment estimates) are partitioned across the processes, so that each process updates only its partition.
  • Stage 2: The reduced 32-bit gradients used to update the model weights are also partitioned, such that each process retains only the gradients corresponding to its portion of the optimizer states.
  • Stage 3: The 16-bit model parameters are partitioned across the processes. ZeRO-3 automatically collects and partitions them during the forward and backward passes.
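
For reference, the stage is selected through the stage key of the zero_optimization section that Ludwig forwards to DeepSpeed. A minimal Python sketch of that section (illustrative only, not taken from this PR's diff):

# Hedged sketch: the ZeRO stage is chosen via the "stage" key of the
# zero_optimization config that is passed through to DeepSpeed.
zero_optimization = {
    # 0 = disabled, 1 = partition optimizer states, 2 = also partition gradients,
    # 3 = also partition model parameters
    "stage": 2,
}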

With this PR, you can now perform QLoRA-based training on larger-than-memory datasets. You can find a full worked example in examples/llm_qlora_data_parallel.

For example, you can now use a config like the following:

model_type: llm
base_model: meta-llama/Llama-2-7b-hf

input_features:
  - name: instruction
    type: text

output_features:
  - name: output
    type: text

adapter:
  type: lora

quantization:
  bits: 4

backend:
  type: ray
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 2

trainer:
  type: finetune
  batch_size: 1
  gradient_accumulation_steps: 4

to:

  1. Create n single-GPU workers on a single multi-GPU node
  2. Load a 4-bit version of Llama-2-7b on each of the GPU training workers
  3. Use Ray Datasets to split the total dataset for training across the n workers
  4. Train using gradient partitioning and optimizer state partitioning across workers (not possible with traditional DDP)

In particular, since this uses DeepSpeed stage 2 with Ray, it lets you stream datasets into memory per worker at training time.
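
As a rough usage sketch (not part of this PR), the config above could be saved to config.yaml and launched through the standard Ludwig Python API; the dataset path below is a placeholder containing the instruction and output text columns:

# Hedged sketch using the standard Ludwig Python API; "config.yaml" holds the config shown
# above and "my_dataset.csv" is a placeholder dataset with "instruction" and "output" columns.
from ludwig.api import LudwigModel

model = LudwigModel(config="config.yaml")
results = model.train(dataset="my_dataset.csv")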

This is a snapshot of GPU utilization on a single-node, 4-GPU pod using 4x A5000s:

Mon Oct 16 07:53:11 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:01:00.0 Off |                  Off |
| 31%   61C    P2   141W / 230W |  10089MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:25:00.0 Off |                  Off |
| 30%   59C    P2   131W / 230W |   8509MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000    On   | 00000000:41:00.0 Off |                  Off |
| 31%   60C    P2   133W / 230W |   8905MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000    On   | 00000000:61:00.0 Off |                  Off |
| 30%   55C    P2   153W / 230W |   8625MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

arnavgarg1 changed the title from "[WIP] Add support for data parallel quantized training by supporting DeepSpeed stages 0, 1 and 2." to "Add support for data parallel quantized training by supporting DeepSpeed stages 0, 1 and 2." on Oct 16, 2023
arnavgarg1 marked this pull request as ready for review on Oct 16, 2023 at 13:30
arnavgarg1 changed the title to "Add support for data parallel quantized training by supporting DeepSpeed Zero stages 0, 1 and 2." on Oct 16, 2023
arnavgarg1 changed the title to "Add support for data parallel QLoRA training via DeepSpeed Zero stages 0, 1 and 2." on Oct 16, 2023
if self.model.trained_using_adapter:
    adapter_ref = ray.put(dist_strategy.extract_adapter_weights_for_serialization(self.model))

optimization_stage = _get_optimization_stage_from_trainer_config(self.trainer_kwargs)
@arnavgarg1 (Contributor, Author) commented:

hmm I wonder if this could be moved into the dist_strategy base class

@jeffkinnison (Contributor) left a comment:

Overall looks good, just a few nits. One question that didn't quite fit anywhere: are there any additional validation checks we should add for DS config? Bounding DS stages, DS/qlora, etc.

ludwig/distributed/base.py (resolved review thread)
ludwig/distributed/base.py (resolved review thread)
stages, we load the base model back. For LLMs, this recreates either the base model or the PEFT model, depending
on whether a PEFT adapter was specified.
"""
if self.zero_optimization_stage != 3:
A Contributor commented:

nit: This condition shows up in a few places, and it's either run as stage <= 2 or stage != 3. Do we want to pick one as the canonical form? Would it make sense to move the condition itself into a DeepSpeedStrategy property?

@arnavgarg1 (Contributor, Author) replied:

@jeffkinnison I think this is a good callout. One more option is to refactor so that DeepSpeedStrategy is a base class and DeepSpeedStage3 is its own subclass that overrides some of the methods, or something to that effect. Will take a look!
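
Not from this PR, but a minimal sketch of what such a property could look like; the constructor and property name here are hypothetical:

# Hypothetical sketch of the suggested property; names and placement are illustrative.
class DeepSpeedStrategy:
    def __init__(self, zero_optimization_stage: int = 3):
        self.zero_optimization_stage = zero_optimization_stage

    @property
    def partitions_model_parameters(self) -> bool:
        # True only for ZeRO stage 3, where model parameters are sharded across workers.
        return self.zero_optimization_stage == 3

Call sites could then check dist_strategy.partitions_model_parameters rather than mixing stage <= 2 and stage != 3.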

ludwig/distributed/deepspeed.py (resolved review thread)
ludwig/distributed/deepspeed.py (resolved review thread)
ludwig/models/llm.py (resolved review thread)
Comment on lines +94 to +102
"""
backend_type = _get_backend_type_from_config(config_obj)
deepspeed_optimization_strategy = _get_deepspeed_optimization_stage_from_config(config_obj)
if backend_type == "ray" and deepspeed_optimization_strategy is not None and deepspeed_optimization_strategy <= 2:
# If using deepspeed stage 0, 1 or 2, we only load the model into memory once we're actually inside
# of the training workers.
return False
# If using local backend or deepspeed stage 3, we load the model into memory upon class initialization.
return True
@arnavgarg1 (Contributor, Author) commented:

Going to push this into the backend class, and maybe consider passing the initialized self.backend object from the LudwigModel class into the create_model function so it gets propagated here; it shouldn't live here like this.
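
As a rough sketch of that direction (every name below is hypothetical, not from this PR), the decision could hang off the backend object instead of living in a standalone helper:

# Hypothetical sketch: let the backend decide whether to materialize the model at init time.
from typing import Optional

class LocalBackendSketch:
    @property
    def loads_model_on_init(self) -> bool:
        return True

class RayDeepSpeedBackendSketch:
    def __init__(self, zero_stage: Optional[int] = None):
        self.zero_stage = zero_stage

    @property
    def loads_model_on_init(self) -> bool:
        # With ZeRO stages 0-2, defer loading until we're inside the training workers.
        return not (self.zero_stage is not None and self.zero_stage <= 2)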

examples/llm_qlora_data_parallel/train.py (resolved review thread)
examples/llm_qlora_data_parallel/README.md (resolved review thread)
examples/llm_qlora_data_parallel/README.md (resolved review thread)
examples/llm_qlora_data_parallel/README.md (resolved review thread)
examples/llm_qlora_data_parallel/train.py (resolved review thread)
@@ -57,6 +59,7 @@ def __init__(

        super().__init__(**kwargs)
        self.zero_optimization = zero_optimization or DEFAULT_ZERO_OPTIMIZATION
        self.zero_optimization_stage = self.zero_optimization.get("stage", 3)
A Collaborator asked:

Why default to 3?

@arnavgarg1 (Contributor, Author) replied:

It's because, in the worst case, it makes sense to assume that the model does not fit into memory and we need to do model-parallel + data-parallel training, since the model will not fit on a single GPU. It also assumes no quantization-based fine-tuning, which I think is a fair assumption to make. All of this is to say that we want a config as simple as this to "just work":

model_type: llm
base_model: ...
input_features: ...
output_features: ...
trainer:
    type: finetune
backend:
    type: ray
    trainer:
        strategy:
            type: deepspeed

to have the highest chance of succeeding irrespective of LLM model size. This is the worst-case scenario: full fine-tuning without an adapter or quantization. Those options only reduce the memory footprint, so stage 3 is the most useful default when none of them are set in the config for LLM fine-tuning.

Let me know if this makes sense. I'll also add a comment in the DeepSpeed class explaining why we default to stage 3.

ludwig/models/llm.py (resolved review thread)
ludwig/models/llm.py (resolved review thread)
ludwig/utils/backend_utils.py (resolved review thread)
@tgaddair (Collaborator) left a comment:

Not a full review yet, but something I see coming up repeatedly is references to the optimization stage outside of the DeepSpeed strategy. This is a red flag that we should be moving this code into the DistributedStrategy interface rather than coupling these two different abstractions (backend and strategy) together.

model_ref = ray.put(dist_strategy.extract_model_for_serialization(self.model))
optimization_stage = _get_optimization_stage_from_trainer_config(self.trainer_kwargs)
model_ref = ray.put(
    dist_strategy.extract_model_for_serialization(self.model, optimization_stage=optimization_stage)
A Collaborator commented:

Shouldn't the dist_strategy already know the optimization stage? This is coupling the DistributedStrategy interface with the Deepspeed optimization stage, which is not desirable. Would be better to keep this internal to the strategy itself.
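
A rough sketch of keeping it internal (the method body is hypothetical; the point is the narrower signature):

# Hypothetical sketch: the strategy consults its own ZeRO stage, so call sites stay DeepSpeed-agnostic.
class DeepSpeedStrategySketch:
    def __init__(self, zero_optimization_stage: int = 3):
        self.zero_optimization_stage = zero_optimization_stage

    def extract_model_for_serialization(self, model):
        if self.zero_optimization_stage != 3:
            # Stages 0-2 keep full model parameters on every worker, so the model can be shipped as-is.
            return model
        raise NotImplementedError("Stage 3 gathering is omitted from this sketch")

The call site then stays model_ref = ray.put(dist_strategy.extract_model_for_serialization(self.model)) with no stage argument.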

dist_model = distributed.prepare_for_inference(model)
if adapter_ref and (distributed_optimization_stage and distributed_optimization_stage <= 2):
A Collaborator commented:

This can be pushed into the DistributedStrategy, again, so we don't need to couple everything to DeepSpeed.
