
[BUG] Memory Leak in Stage 2 Optimizer #5496

Open
chiragjn opened this issue May 2, 2024 · 0 comments
Labels: bug (Something isn't working), training

chiragjn commented May 2, 2024

Describe the bug
I am using the transformers Trainer + accelerate to tune language models. I noticed that, after training, when I call gc.collect() and torch.cuda.empty_cache(), the trainable layers stick around in GPU memory (on all ranks).

By attaching a debugger and weakrefs to the torch parameters, I was able to at least trace it down to _hp_mapping and the methods added on top of the params:

lp_param._hp_mapping = None
lp_param._dp_group = dp_group
lp_param.get_full_hp_param = types.MethodType(get_full_hp_param, lp_param)
lp_param.get_full_hp_grad = types.MethodType(get_full_hp_grad, lp_param)
lp_param.set_full_hp_param = types.MethodType(set_full_hp_param, lp_param)

[screenshot: debugger trace of the lingering references]

These references never drop to zero, even when the optimizer is completely destroyed.
I am a bit puzzled as to why, and I wonder whether the memory is leaking because torch.Tensor / torch.nn.Parameter are not pure Python implementations, so adding methods on top of them may cause unaccounted-for reference leaks.
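
For reference, the method binding by itself does create a reference cycle (param → bound method → param), which reference counting alone cannot reclaim; a minimal sketch with a plain Python object (a stand-in for lp_param, with a stub get_full_hp_param rather than the DeepSpeed implementation) behaves as expected once the cyclic collector runs:

import gc
import types
import weakref

def get_full_hp_param(self):
    # Stub standing in for the real DeepSpeed method; only the binding matters here.
    return None

class FakeParam:
    # Plain Python stand-in for lp_param (a torch.nn.Parameter in the real code).
    pass

p = FakeParam()
# Same pattern as above: bind a method onto the instance itself.
p.get_full_hp_param = types.MethodType(get_full_hp_param, p)

ref = weakref.ref(p)
del p

# The bound method stored in p.__dict__ still points back at p,
# so reference counting alone does not free it ...
print(ref() is None)  # False

# ... but the cyclic collector does, provided nothing else reachable
# (optimizer state, _hp_mapping, module-level caches) still holds a reference.
gc.collect()
print(ref() is None)  # True

So if the reference counts in my trace stay nonzero even after an explicit gc.collect(), it seems something still reachable is holding on to the params, rather than the binding itself.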

To Reproduce

  1. Train a model with a DeepSpeed ZeRO Stage 2 config (no offload):
{
  "fp16": {
    "enabled": false,
    "auto_cast": false,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 32,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": false
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "round_robin_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false,
  "zero_allow_untested_optimizer": true
}

Note: some of the "auto" values are filled in by the HF Trainer; I'll try to get the resolved values soon.

  2. Once the training function finishes and the wrapped model goes out of scope, call gc.collect() and torch.cuda.empty_cache() (a rough sketch of this check follows below).
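
Concretely, the check after step 2 looks roughly like this (build_model and build_trainer are placeholders for my own setup code, not DeepSpeed or HF APIs):

import gc
import weakref
import torch

def run_training():
    # Placeholder for my actual training function: it builds an HF Trainer
    # around the model and runs it under accelerate + the DeepSpeed config above.
    model = build_model()            # hypothetical helper
    trainer = build_trainer(model)   # hypothetical helper
    trainer.train()
    # Return a weakref to one trainable parameter so we can check whether it
    # survives after everything in this scope is gone.
    return weakref.ref(next(p for p in model.parameters() if p.requires_grad))

param_ref = run_training()

gc.collect()
torch.cuda.empty_cache()

# Expected: the weakref is dead and allocated GPU memory drops.
# Observed: the parameter is still alive and its GPU memory is never released.
print("param alive:", param_ref() is not None)
print("cuda allocated (bytes):", torch.cuda.memory_allocated())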

Expected behavior

  1. All parameters should be garbage collected and deallocated

ds_report output

[2024-05-02 16:42:04,970] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fp_quantizer ........... [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/data/v/ft/lib/python3.11/site-packages/torch']
torch version .................... 2.2.1+cu121
deepspeed install path ........... ['/data/v/ft/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.14.2+0866580c, 0866580c, fix-memory-leak
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 216.48 GB


System info (please complete the following information):

  • OS: Ubuntu 22.04
  • GPU count and types: 1 VM with 2 x A100 80GB
  • Python version: 3.11

Launcher context
Launch with accelerate launch --use_deepspeed

Docker context
N/A

Additional context
N/A
