[BUG] Deepspeed memory allocation estimation different than real! #5484

Open · mmarouen opened this issue Apr 30, 2024 · 0 comments
Labels: bug (Something isn't working), training
mmarouen commented Apr 30, 2024

@tjruwase
Scenario:

  • LoRA fine-tuning of Llama 13B with gradient checkpointing activated (see the sketch after this list)
  • Total number of LoRA parameters: 26M
  • Using fp32 precision
  • Training on Azure: MPI + PyTorch
  • HW setup: 256 GB of GPU memory total = 8 GPUs x 32 GB VRAM on a single node
  • Training data: ~1M tokens
  • DeepSpeed ZeRO-3 without offload
  • Batch size per device: 32
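
For reference, a minimal sketch of the kind of setup described above, assuming the Hugging Face transformers and peft APIs; the checkpoint name and LoRA hyperparameters (r, alpha, target modules) are illustrative placeholders, not the exact values used in this job:

```python
# Minimal sketch of the scenario above: LoRA on a 13B Llama in fp32 with gradient
# checkpointing. Checkpoint name and LoRA hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",   # hypothetical 13B checkpoint
    torch_dtype=torch.float32,     # fp32 precision, as in the scenario
)
model.gradient_checkpointing_enable()

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # expected to report trainable params on the order of tens of millions
```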

Observed behavior

  • All GPUs are ~90% busy
  • Training lasts 20h
  • Batch size 64 per device leads to OOM
  • DeepSpeed reports a memory requirement of ~17 GB per GPU
  • Azure reports memory usage of ~30 GB per GPU

Expected behavior

  • Much faster training, since GPU capacity appears to far exceed the reported memory requirements
  • Ability to use a much larger batch size
  • Azure-reported GPU memory usage aligned with the DeepSpeed estimate

Am I missing something? Are the expectations wrong?

ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/ptca/lib/python3.8/site-packages/torch']
torch version .................... 2.1.2
deepspeed install path ........... ['/opt/conda/envs/ptca/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.11.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 2.00 GB

DeepSpeed configuration

{
  "bf16": {
    "enabled": false
  },
  "fp16": {
    "enabled": false,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 200000000,
    "allgather_bucket_size": 200000000,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "sub_group_size": 1000000000,
    "stage3_max_live_parameters": 1000000000,
    "stage3_max_reuse_distance": 1000000000,
    "stage3_gather_16bit_weights_on_model_save": false
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
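
For context, a rough sketch of how a config like this is typically wired into a training run via the Hugging Face Trainer's DeepSpeed integration (the config file name, output directory, and dataset are placeholders, and `model` is the LoRA-wrapped model from the sketch above):

```python
# Sketch of passing the ZeRO-3 JSON above to a Hugging Face Trainer run.
# output_dir, file name, and train_dataset are placeholders; the "auto" batch-size and
# gradient-accumulation fields in the JSON are resolved from these arguments.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                  # placeholder
    per_device_train_batch_size=32,    # batch size per device from the scenario
    gradient_checkpointing=True,
    fp16=False,
    bf16=False,                        # fp32 precision, matching the config
    deepspeed="ds_config.json",        # the JSON shown above (hypothetical file name)
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```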

Memory consumption estimate from estimate_zero3_model_states_mem_needs_all_live

SW: Model with 26M total params, 0M largest layer params.
  per CPU  |  per GPU |   Options
    0.11GB |   0.00GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
    0.15GB |   0.00GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
    0.10GB |   0.01GB | offload_param=none, offload_optimizer=cpu , zero_init=1
    0.15GB |   0.01GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    0.00GB |   0.07GB | offload_param=none, offload_optimizer=none, zero_init=1
    0.15GB |   0.07GB | offload_param=none, offload_optimizer=none, zero_init=0
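
The table above matches the output format of DeepSpeed's documented ZeRO-3 estimator; a minimal sketch of the call that presumably produced it, assuming `model` is the LoRA-wrapped model whose ~26M parameters are reported above:

```python
# Sketch of reproducing the estimate above with DeepSpeed's documented helper.
# Per the DeepSpeed docs, the estimator only accounts for model states (parameters,
# gradients, optimizer states) of the model object it is given.
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)
```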

Memory usage snapshots

[2024-04-30 01:43:14,449] [INFO] [utils.py:802:see_memory_usage] Stage 3 initialize beginning
[2024-04-30 01:43:14,449] [INFO] [utils.py:803:see_memory_usage] MA 6.32 GB         Max_MA 7.45 GB         CA 10.05 GB         Max_CA 22 GB 
[2024-04-30 01:43:14,450] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.23 GB, percent = 3.5%
[2024-04-30 01:43:14,454] [INFO] [stage3.py:126:__init__] Reduce bucket size 200000000
[2024-04-30 01:43:14,454] [INFO] [stage3.py:127:__init__] Prefetch bucket size 23592960
[2024-04-30 01:43:14,795] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-04-30 01:43:14,796] [INFO] [utils.py:803:see_memory_usage] MA 6.32 GB         Max_MA 6.32 GB         CA 10.05 GB         Max_CA 10 GB 
[2024-04-30 01:43:14,796] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.23 GB, percent = 3.5%
Parameter Offload: Total persistent parameters: 414720 in 81 params
[2024-04-30 01:43:15,204] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-04-30 01:43:15,205] [INFO] [utils.py:803:see_memory_usage] MA 6.24 GB         Max_MA 6.32 GB         CA 10.05 GB         Max_CA 10 GB 
[2024-04-30 01:43:15,205] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.23 GB, percent = 3.5%
[2024-04-30 01:43:15,507] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions
[2024-04-30 01:43:15,508] [INFO] [utils.py:803:see_memory_usage] MA 6.24 GB         Max_MA 6.24 GB         CA 10.05 GB         Max_CA 10 GB 
[2024-04-30 01:43:15,508] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.23 GB, percent = 3.5%
[2024-04-30 01:43:16,224] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 1
[2024-04-30 01:43:16,226] [INFO] [utils.py:803:see_memory_usage] MA 6.24 GB         Max_MA 6.24 GB         CA 9.94 GB         Max_CA 10 GB 
[2024-04-30 01:43:16,226] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.29 GB, percent = 3.5%
[2024-04-30 01:43:16,505] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions
[2024-04-30 01:43:16,505] [INFO] [utils.py:803:see_memory_usage] MA 6.24 GB         Max_MA 6.24 GB         CA 9.94 GB         Max_CA 10 GB 
[2024-04-30 01:43:16,505] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.29 GB, percent = 3.5%
[2024-04-30 01:43:16,814] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2024-04-30 01:43:16,815] [INFO] [utils.py:803:see_memory_usage] MA 6.25 GB         Max_MA 6.25 GB         CA 9.94 GB         Max_CA 10 GB 
[2024-04-30 01:43:16,815] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.29 GB, percent = 3.5%
[2024-04-30 01:43:17,151] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2024-04-30 01:43:17,151] [INFO] [utils.py:803:see_memory_usage] MA 6.25 GB         Max_MA 6.25 GB         CA 9.94 GB         Max_CA 10 GB 
[2024-04-30 01:43:17,152] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.25 GB, percent = 3.5%
[2024-04-30 01:43:17,502] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2024-04-30 01:43:17,502] [INFO] [utils.py:803:see_memory_usage] MA 6.27 GB         Max_MA 6.3 GB         CA 9.94 GB         Max_CA 10 GB 
[2024-04-30 01:43:17,503] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.25 GB, percent = 3.5%
[2024-04-30 01:43:17,503] [INFO] [stage3.py:459:_setup_for_real_optimizer] optimizer state initialized
[2024-04-30 01:43:18,065] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2024-04-30 01:43:18,066] [INFO] [utils.py:803:see_memory_usage] MA 7.03 GB         Max_MA 7.03 GB         CA 10.7 GB         Max_CA 11 GB 
[2024-04-30 01:43:18,066] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 24.42 GB, percent = 3.7%
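
For reference, the MA / Max_MA / CA columns in these snapshots correspond to PyTorch's allocated, peak-allocated, and reserved memory counters; a small sketch for cross-checking them against the Azure-reported usage at any point in the training script:

```python
# Cross-check sketch: print the raw counters behind the MA / Max_MA / CA columns above,
# plus an explicit snapshot via DeepSpeed's own logging utility.
import torch
from deepspeed.runtime.utils import see_memory_usage

GB = 2**30
print(f"MA     {torch.cuda.memory_allocated() / GB:.2f} GB")      # currently allocated tensors
print(f"Max_MA {torch.cuda.max_memory_allocated() / GB:.2f} GB")  # peak allocated
print(f"CA     {torch.cuda.memory_reserved() / GB:.2f} GB")       # reserved by the caching allocator

see_memory_usage("manual snapshot", force=True)  # prints in the same format as the log lines above
```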

Actual memory usage

(gpu_usage screenshot: Azure GPU metrics showing ~30 GB used per GPU)

mmarouen added the bug (Something isn't working) and training labels on Apr 30, 2024
mmarouen changed the title from "[BUG] Memory allocation exceeds expectations" to "[BUG] Deepspeed memory allocation estimation different than real!" on Apr 30, 2024