[BUG] Deepspeed memory allocation estimation different than real! #5484

Open · mmarouen opened this issue Apr 30, 2024 · 0 comments
Labels: bug (Something isn't working), training
mmarouen commented Apr 30, 2024

@tjruwase
Scenario:

  • LoRA fine-tuning of Llama 13B with gradient checkpointing activated (see the sketch after this list)
  • Total number of LoRA parameters: 26M
  • Using fp32 precision
  • Training on Azure: MPI + PyTorch
  • HW setup: 256 GB of GPU memory total = 8 GPUs x 32 GB VRAM on a single node
  • Training data: ~1M tokens
  • DeepSpeed ZeRO-3 without offload
  • Batch size per device: 32
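
For reference, a minimal sketch of the kind of setup described above, assuming the Hugging Face transformers and peft APIs; the checkpoint name and LoRA hyperparameters (r, alpha, target modules) are illustrative placeholders, not the exact values used in this job:

```python
# Minimal sketch of the scenario above: LoRA on a 13B Llama in fp32 with gradient
# checkpointing. Checkpoint name and LoRA hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",   # hypothetical 13B checkpoint
    torch_dtype=torch.float32,     # fp32 precision, as in the scenario
)
model.gradient_checkpointing_enable()

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # expected to report trainable params on the order of tens of millions
```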

Observed behavior

  • All GPUs are ~90% busy
  • Training lasts 20h
  • Batch size 64 per device leads to OOM
  • DeepSpeed reports a memory requirement of ~17 GB per GPU
  • Azure reports memory usage of ~30 GB per GPU

Expected behavior

  • Much faster training, since GPU capacity appears to far exceed the reported memory requirements
  • Ability to use a much larger batch size
  • Azure-reported GPU memory usage aligned with the DeepSpeed estimate

Am I missing something? Are the expectations wrong?

ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/ptca/lib/python3.8/site-packages/torch']
torch version .................... 2.1.2
deepspeed install path ........... ['/opt/conda/envs/ptca/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.11.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 2.00 GB

DeepSpeed configuration

{
  "bf16": {
    "enabled": false
  },
  "fp16": {
    "enabled": false,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 200000000,
    "allgather_bucket_size": 200000000,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "sub_group_size": 1000000000,
    "stage3_max_live_parameters": 1000000000,
    "stage3_max_reuse_distance": 1000000000,
    "stage3_gather_16bit_weights_on_model_save": false
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
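
For context, a rough sketch of how a config like this is typically wired into a training run via the Hugging Face Trainer's DeepSpeed integration (the config file name, output directory, and dataset are placeholders, and `model` is the LoRA-wrapped model from the sketch above):

```python
# Sketch of passing the ZeRO-3 JSON above to a Hugging Face Trainer run.
# output_dir, file name, and train_dataset are placeholders; the "auto" batch-size and
# gradient-accumulation fields in the JSON are resolved from these arguments.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                  # placeholder
    per_device_train_batch_size=32,    # batch size per device from the scenario
    gradient_checkpointing=True,
    fp16=False,
    bf16=False,                        # fp32 precision, matching the config
    deepspeed="ds_config.json",        # the JSON shown above (hypothetical file name)
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```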

Memory consumption estimate from estimate_zero3_model_states_mem_needs_all_live

SW: Model with 26M total params, 0M largest layer params.
  per CPU  |  per GPU |   Options
    0.11GB |   0.00GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
    0.15GB |   0.00GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
    0.10GB |   0.01GB | offload_param=none, offload_optimizer=cpu , zero_init=1
    0.15GB |   0.01GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    0.00GB |   0.07GB | offload_param=none, offload_optimizer=none, zero_init=1
    0.15GB |   0.07GB | offload_param=none, offload_optimizer=none, zero_init=0
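
The table above matches the output format of DeepSpeed's documented ZeRO-3 estimator; a minimal sketch of the call that presumably produced it, assuming `model` is the LoRA-wrapped model whose ~26M parameters are reported above:

```python
# Sketch of reproducing the estimate above with DeepSpeed's documented helper.
# Per the DeepSpeed docs, the estimator only accounts for model states (parameters,
# gradients, optimizer states) of the model object it is given.
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)
```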

Memory usage snapshots

[2024-04-30 01:43:14,449] [INFO] [utils.py:802:see_memory_usage] Stage 3 initialize beginning
[2024-04-30 01:43:14,449] [INFO] [utils.py:803:see_memory_usage] MA 6.32 GB         Max_MA 7.45 GB         CA 10.05 GB         Max_CA 22 GB 
[2024-04-30 01:43:14,450] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.23 GB, percent = 3.5%
[2024-04-30 01:43:14,454] [INFO] [stage3.py:126:__init__] Reduce bucket size 200000000
[2024-04-30 01:43:14,454] [INFO] [stage3.py:127:__init__] Prefetch bucket size 23592960
[2024-04-30 01:43:14,795] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-04-30 01:43:14,796] [INFO] [utils.py:803:see_memory_usage] MA 6.32 GB         Max_MA 6.32 GB         CA 10.05 GB         Max_CA 10 GB 
[2024-04-30 01:43:14,796] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.23 GB, percent = 3.5%
Parameter Offload: Total persistent parameters: 414720 in 81 params
[2024-04-30 01:43:15,204] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-04-30 01:43:15,205] [INFO] [utils.py:803:see_memory_usage] MA 6.24 GB         Max_MA 6.32 GB         CA 10.05 GB         Max_CA 10 GB 
[2024-04-30 01:43:15,205] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.23 GB, percent = 3.5%
[2024-04-30 01:43:15,507] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions
[2024-04-30 01:43:15,508] [INFO] [utils.py:803:see_memory_usage] MA 6.24 GB         Max_MA 6.24 GB         CA 10.05 GB         Max_CA 10 GB 
[2024-04-30 01:43:15,508] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.23 GB, percent = 3.5%
[2024-04-30 01:43:16,224] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 1
[2024-04-30 01:43:16,226] [INFO] [utils.py:803:see_memory_usage] MA 6.24 GB         Max_MA 6.24 GB         CA 9.94 GB         Max_CA 10 GB 
[2024-04-30 01:43:16,226] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.29 GB, percent = 3.5%
[2024-04-30 01:43:16,505] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions
[2024-04-30 01:43:16,505] [INFO] [utils.py:803:see_memory_usage] MA 6.24 GB         Max_MA 6.24 GB         CA 9.94 GB         Max_CA 10 GB 
[2024-04-30 01:43:16,505] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.29 GB, percent = 3.5%
[2024-04-30 01:43:16,814] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2024-04-30 01:43:16,815] [INFO] [utils.py:803:see_memory_usage] MA 6.25 GB         Max_MA 6.25 GB         CA 9.94 GB         Max_CA 10 GB 
[2024-04-30 01:43:16,815] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.29 GB, percent = 3.5%
[2024-04-30 01:43:17,151] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2024-04-30 01:43:17,151] [INFO] [utils.py:803:see_memory_usage] MA 6.25 GB         Max_MA 6.25 GB         CA 9.94 GB         Max_CA 10 GB 
[2024-04-30 01:43:17,152] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.25 GB, percent = 3.5%
[2024-04-30 01:43:17,502] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2024-04-30 01:43:17,502] [INFO] [utils.py:803:see_memory_usage] MA 6.27 GB         Max_MA 6.3 GB         CA 9.94 GB         Max_CA 10 GB 
[2024-04-30 01:43:17,503] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 23.25 GB, percent = 3.5%
[2024-04-30 01:43:17,503] [INFO] [stage3.py:459:_setup_for_real_optimizer] optimizer state initialized
[2024-04-30 01:43:18,065] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2024-04-30 01:43:18,066] [INFO] [utils.py:803:see_memory_usage] MA 7.03 GB         Max_MA 7.03 GB         CA 10.7 GB         Max_CA 11 GB 
[2024-04-30 01:43:18,066] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 24.42 GB, percent = 3.7%
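
For reference, the MA / Max_MA / CA columns in these snapshots correspond to PyTorch's allocated, peak-allocated, and reserved memory counters; a small sketch for cross-checking them against the Azure-reported usage at any point in the training script:

```python
# Cross-check sketch: print the raw counters behind the MA / Max_MA / CA columns above,
# plus an explicit snapshot via DeepSpeed's own logging utility.
import torch
from deepspeed.runtime.utils import see_memory_usage

GB = 2**30
print(f"MA     {torch.cuda.memory_allocated() / GB:.2f} GB")      # currently allocated tensors
print(f"Max_MA {torch.cuda.max_memory_allocated() / GB:.2f} GB")  # peak allocated
print(f"CA     {torch.cuda.memory_reserved() / GB:.2f} GB")       # reserved by the caching allocator

see_memory_usage("manual snapshot", force=True)  # prints in the same format as the log lines above
```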

Actual memory usage

(gpu_usage screenshot: Azure GPU metrics showing ~30 GB used per GPU)

mmarouen added the bug (Something isn't working) and training labels on Apr 30, 2024
mmarouen changed the title from "[BUG] Memory allocation exceeds expectations" to "[BUG] Deepspeed memory allocation estimation different than real!" on Apr 30, 2024