
partition_activations produces no activation memory improvement with zero3 #693

Open
andrasiani opened this issue May 31, 2023 · 1 comment



andrasiani commented May 31, 2023

Hi, I am trying to run a GPT-2 model with a block size of 2048, and I cannot use a batch size larger than 16 because activation memory becomes too large.
To reduce activation memory I already use DeepSpeed activation checkpointing on each transformer block, plus AMP.
I saw there is also an option to partition/shard the checkpointed activations (partition_activations), as advertised by Megatron, but when I try it I see no effect at all.
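For reference, here is a minimal sketch of how I understand the option is supposed to be wired up, going by the DeepSpeed activation-checkpointing docs. The Megatron-style mpu import and the config values are assumptions for illustration, not my exact setup:

```python
# Minimal sketch, assuming a Megatron-style model-parallel utility module
# (mpu) and a DeepSpeed config file at ds_config.json.
import deepspeed
from megatron import mpu  # model-parallel state (assumption)

# Relevant fragment of ds_config.json:
# {
#   "zero_optimization": { "stage": 3 },
#   "activation_checkpointing": {
#     "partition_activations": true,
#     "contiguous_memory_optimization": false,
#     "cpu_checkpointing": false
#   }
# }

# Hand DeepSpeed the model-parallel state so checkpointed activations can be
# partitioned across the model-parallel group.
deepspeed.checkpointing.configure(mpu, deepspeed_config="ds_config.json")

# Each transformer block is then wrapped with DeepSpeed's checkpoint function
# (a drop-in for torch.utils.checkpoint.checkpoint):
# hidden_states = deepspeed.checkpointing.checkpoint(block, hidden_states)
```

My understanding is that partition_activations shards the checkpointed activations across the model-parallel group, so with pure ZeRO-3 data parallelism (model-parallel size 1) there may be nothing to partition; is that expected?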


stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
