@tjruwase

Scenario:
Observed behavior
Expected behavior
ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meets the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/ptca/lib/python3.8/site-packages/torch']
torch version .................... 2.1.2
deepspeed install path ........... ['/opt/conda/envs/ptca/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.11.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 2.00 GB
DeepSpeed configuration
{
  "bf16": { "enabled": false },
  "fp16": {
    "enabled": false,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 200000000,
    "allgather_bucket_size": 200000000,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "sub_group_size": 1000000000,
    "stage3_max_live_parameters": 1000000000,
    "stage3_max_reuse_distance": 1000000000,
    "stage3_gather_16bit_weights_on_model_save": false
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
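For context on the `"auto"` placeholders in this config: when such a config is driven through the HF Trainer integration, the ZeRO-3 `"auto"` fields are resolved from the model's hidden size before the dict reaches `deepspeed.initialize`. A minimal sketch of that resolution, using the heuristics the HF documentation describes (`0.9 * hidden**2` and `10 * hidden`); treat the exact formulas as an approximation of the integration's behavior, and `hidden_size=5120` as an inference, not a value stated in this issue:

```python
import json

# Abridged copy of the "auto" fields from the config in this issue.
cfg = json.loads("""
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto"
  }
}
""")

def resolve_auto(cfg, hidden_size):
    """Fill the ZeRO-3 'auto' placeholders using the heuristics the
    HF Trainer integration documents: prefetch bucket ~ 0.9 * hidden**2,
    persistence threshold ~ 10 * hidden. These are the documented
    approximations, not DeepSpeed internals."""
    zero = cfg["zero_optimization"]
    if zero["stage3_prefetch_bucket_size"] == "auto":
        zero["stage3_prefetch_bucket_size"] = int(0.9 * hidden_size**2)
    if zero["stage3_param_persistence_threshold"] == "auto":
        zero["stage3_param_persistence_threshold"] = 10 * hidden_size
    return cfg

# hidden_size=5120 happens to reproduce the "Prefetch bucket size 23592960"
# that appears in the memory snapshots (0.9 * 5120**2 == 23592960).
resolved = resolve_auto(cfg, hidden_size=5120)
```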
Memory consumption estimation (estimate_zero3_model_states_mem_needs_all_live)
SW: Model with 26M total params, 0M largest layer params.
  per CPU | per GPU | Options
   0.11GB |  0.00GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
   0.15GB |  0.00GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
   0.10GB |  0.01GB | offload_param=none, offload_optimizer=cpu , zero_init=1
   0.15GB |  0.01GB | offload_param=none, offload_optimizer=cpu , zero_init=0
   0.00GB |  0.07GB | offload_param=none, offload_optimizer=none, zero_init=1
   0.15GB |  0.07GB | offload_param=none, offload_optimizer=none, zero_init=0
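These estimates only cover *model states* (parameters, gradients, optimizer states), which ZeRO-3 shards across the data-parallel group. A back-of-the-envelope sketch of that accounting (the standard ZeRO-paper figures, not DeepSpeed's exact estimator formula):

```python
def zero3_model_state_bytes_per_gpu(num_params: int, world_size: int) -> int:
    """Approximate ZeRO-3 model-state memory per GPU.

    Mixed-precision Adam keeps roughly 16 bytes per parameter in total:
      2 B fp16 params + 2 B fp16 grads + 12 B fp32 optimizer states
    (fp32 master copy, momentum, variance). ZeRO-3 shards all of it
    across the data-parallel group.
    """
    bytes_per_param = 2 + 2 + 12
    return num_params * bytes_per_param // world_size

# The model in this issue: ~26M parameters.
total_bytes = zero3_model_state_bytes_per_gpu(26_000_000, world_size=1)
```

Even unsharded, ~26M parameters is only about 0.4 GB of model states, so the multi-GB allocations in the snapshots below would have to come from something other than model states (for example activations, framework buffers, or the 200M-element reduce/allgather buckets in the config); this is an observation about the arithmetic, not a diagnosis.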
Memory usage snapshots
[2024-04-30 01:43:14,449] [INFO] [utils.py:802:see_memory_usage] Stage 3 initialize beginning
[2024-04-30 01:43:14,449] [INFO] [utils.py:803:see_memory_usage] MA 6.32 GB Max_MA 7.45 GB CA 10.05 GB Max_CA 22 GB
[2024-04-30 01:43:14,450] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 23.23 GB, percent = 3.5%
[2024-04-30 01:43:14,454] [INFO] [stage3.py:126:__init__] Reduce bucket size 200000000
[2024-04-30 01:43:14,454] [INFO] [stage3.py:127:__init__] Prefetch bucket size 23592960
[2024-04-30 01:43:14,795] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-04-30 01:43:14,796] [INFO] [utils.py:803:see_memory_usage] MA 6.32 GB Max_MA 6.32 GB CA 10.05 GB Max_CA 10 GB
[2024-04-30 01:43:14,796] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 23.23 GB, percent = 3.5%
Parameter Offload: Total persistent parameters: 414720 in 81 params
[2024-04-30 01:43:15,204] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-04-30 01:43:15,205] [INFO] [utils.py:803:see_memory_usage] MA 6.24 GB Max_MA 6.32 GB CA 10.05 GB Max_CA 10 GB
[2024-04-30 01:43:15,205] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 23.23 GB, percent = 3.5%
[2024-04-30 01:43:15,507] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions
[2024-04-30 01:43:15,508] [INFO] [utils.py:803:see_memory_usage] MA 6.24 GB Max_MA 6.24 GB CA 10.05 GB Max_CA 10 GB
[2024-04-30 01:43:15,508] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 23.23 GB, percent = 3.5%
[2024-04-30 01:43:16,224] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 1
[2024-04-30 01:43:16,226] [INFO] [utils.py:803:see_memory_usage] MA 6.24 GB Max_MA 6.24 GB CA 9.94 GB Max_CA 10 GB
[2024-04-30 01:43:16,226] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 23.29 GB, percent = 3.5%
[2024-04-30 01:43:16,505] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions
[2024-04-30 01:43:16,505] [INFO] [utils.py:803:see_memory_usage] MA 6.24 GB Max_MA 6.24 GB CA 9.94 GB Max_CA 10 GB
[2024-04-30 01:43:16,505] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 23.29 GB, percent = 3.5%
[2024-04-30 01:43:16,814] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2024-04-30 01:43:16,815] [INFO] [utils.py:803:see_memory_usage] MA 6.25 GB Max_MA 6.25 GB CA 9.94 GB Max_CA 10 GB
[2024-04-30 01:43:16,815] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 23.29 GB, percent = 3.5%
[2024-04-30 01:43:17,151] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2024-04-30 01:43:17,151] [INFO] [utils.py:803:see_memory_usage] MA 6.25 GB Max_MA 6.25 GB CA 9.94 GB Max_CA 10 GB
[2024-04-30 01:43:17,152] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 23.25 GB, percent = 3.5%
[2024-04-30 01:43:17,502] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2024-04-30 01:43:17,502] [INFO] [utils.py:803:see_memory_usage] MA 6.27 GB Max_MA 6.3 GB CA 9.94 GB Max_CA 10 GB
[2024-04-30 01:43:17,503] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 23.25 GB, percent = 3.5%
[2024-04-30 01:43:17,503] [INFO] [stage3.py:459:_setup_for_real_optimizer] optimizer state initialized
[2024-04-30 01:43:18,065] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2024-04-30 01:43:18,066] [INFO] [utils.py:803:see_memory_usage] MA 7.03 GB Max_MA 7.03 GB CA 10.7 GB Max_CA 11 GB
[2024-04-30 01:43:18,066] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 24.42 GB, percent = 3.7%
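When reading these snapshots, note that MA is `torch.cuda.memory_allocated()` (tensors currently live) while CA is `torch.cuda.memory_reserved()` (the caching allocator's pool), so CA is always at least MA and the gap is cached-but-free memory, not a leak. A small helper to pull the numbers out of such a log line (an illustration, not DeepSpeed code):

```python
import re

# Pattern matching the summary lines emitted by see_memory_usage.
SNAPSHOT = re.compile(
    r"MA (?P<ma>[\d.]+) GB Max_MA (?P<max_ma>[\d.]+) GB "
    r"CA (?P<ca>[\d.]+) GB Max_CA (?P<max_ca>[\d.]+) GB"
)

def parse_snapshot(line: str) -> dict:
    """Extract allocated (MA) and reserved/cached (CA) GPU memory, in GB."""
    m = SNAPSHOT.search(line)
    if m is None:
        raise ValueError("not a see_memory_usage summary line")
    return {k: float(v) for k, v in m.groupdict().items()}

stats = parse_snapshot(
    "[2024-04-30 01:43:18,066] [INFO] [utils.py:803:see_memory_usage] "
    "MA 7.03 GB Max_MA 7.03 GB CA 10.7 GB Max_CA 11 GB"
)
```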
Actual memory usage
Am I missing something? Are the expectations wrong?