[FSDPv1] Only perform cat() during last microbatch backward() within FlattenParamsWrapper #1180

Draft: wants to merge 27 commits into base branch ngoyal_changes_for_pp_fp8

Conversation

@chrisxcai commented on Apr 29, 2024:

If optimize_backward_concat is set to True, the backward() pass is only propagated to FSDP.flat_params, which in turn
invokes FSDP._post_backward_hook() and the concat() op, when FSDP._require_backward_grad_sync
is True (e.g., on the last microbatch).
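
A minimal sketch of the intended flow (the helper name and arguments below are illustrative assumptions, not the actual FairScale code):

import torch

def accumulate_microbatch_grad(fp32_grads, param_index, grad, require_backward_grad_sync):
    # Illustrative sketch: during early microbatches, accumulate each parameter's
    # gradient into a per-parameter fp32 buffer instead of letting autograd reach
    # the flat parameter (so FSDP._post_backward_hook() and its cat() never fire).
    if fp32_grads[param_index] is None:
        fp32_grads[param_index] = grad.to(torch.float32)
    else:
        fp32_grads[param_index].add_(grad)
    if require_backward_grad_sync:
        # Last microbatch: a single cat() per FSDP module per step builds the flat
        # gradient, which then flows through the usual post-backward path.
        return torch.cat([g.flatten() for g in fp32_grads if g is not None])
    return None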

Trace comparison

Trace before the change (SplitWithSizesBackward triggered on every microbatch per FSDP module):
https://fburl.com/perfdoctor/qdt32ibh

Trace with the change applied (SplitWithSizesBackward triggered only on the last microbatch per FSDP module):
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/trace.229652302632210.json.gz&bucket=acadia

Numerics verification

Local runs with deterministic mode

TP=2, PP=2, num_layers_per_virtual_pipeline_stage=4, 8 GPUs, batch_size 2, DP = 2, fp8 (no 1F1B) (loss bitwise on par)

Baseline:

NVTE_DISABLE_NVRTC=1 CUDA_LAUNCH_BLOCKING=1 PYTHONPATH=~/benchmark/fairscale_repos/fairscale/ CRYPTOGRAPHY_OPENSSL_NO_LEGACY=1 torchrun --master_port 1024 --nproc_per_node=8 train.py --dump_dir /tmp/chriscai/xldumps --model_parallel_size 2 --pipeline_parallel_size 2 --num_layers_per_virtual_pipeline_stage=4  --seq_len=1024 --gpu_check_level=-1 --steps=10 --log_all_steps=True --profile_freq=10 --dump_profile_traces=True --profile_with_stack=True --model.n_layers=8 --reshard_after_forward=False --batch_size=4 --model.efficient_attn=cutlass --model.attn_bias_type=causal --model.layer_ckpt=none --model=small --model.sequence_parallel=True --mem_snapshot_stop_step 5 --log_all_steps=True --enable_deterministic_training=True --log_freq=1 --model.use_te_layers=True --optim.use_fp32_copy_optim=True --model.benchmark_perf=False --model.use_fp8=True --model.fp8_wgrad=True --optimize_backward_concat=False

https://www.internalfb.com/intern/paste/P1363180533/

Test:

NVTE_DISABLE_NVRTC=1 CUDA_LAUNCH_BLOCKING=1 PYTHONPATH=~/benchmark/fairscale_repos/fairscale/ CRYPTOGRAPHY_OPENSSL_NO_LEGACY=1 torchrun --master_port 1024 --nproc_per_node=8 train.py --dump_dir /tmp/chriscai/xldumps --model_parallel_size 2 --pipeline_parallel_size 2 --num_layers_per_virtual_pipeline_stage=4  --seq_len=1024 --gpu_check_level=-1 --steps=10 --log_all_steps=True --profile_freq=10 --dump_profile_traces=True --profile_with_stack=True --model.n_layers=8 --reshard_after_forward=False --batch_size=4 --model.efficient_attn=cutlass --model.attn_bias_type=causal --model.layer_ckpt=none --model=small --model.sequence_parallel=True --mem_snapshot_stop_step 5 --log_all_steps=True --enable_deterministic_training=True --log_freq=1 --model.use_te_layers=True --optim.use_fp32_copy_optim=True --model.benchmark_perf=False --model.use_fp8=True --model.fp8_wgrad=True --optimize_backward_concat=True

https://www.internalfb.com/intern/paste/P1363177870/

TP=2, GPU=8, DP = 4, BF16, non-PP microbatching (loss bitwise on par)

Baseline:
https://www.internalfb.com/intern/paste/P1322976356/
Test:
https://www.internalfb.com/intern/paste/P1322871976/

TP=2, PP=2, num_layers_per_virtual_pipeline_stage=4, 8 GPUs, batch_size 2, DP = 2, BF16 (no 1F1B) (loss bitwise on par)

Baseline:
https://www.internalfb.com/intern/paste/P1358660231/

Test:
https://www.internalfb.com/intern/paste/P1358659328/

TP=2, PP=2, num_layers_per_virtual_pipeline_stage=4, 8 GPUs, batch_size 4, DP = 2, BF16 (1F1B) (loss bitwise on par)

Baseline:
https://www.internalfb.com/intern/paste/P1358780690

Test:
https://www.internalfb.com/intern/paste/P1358786994/

E2E MAST tests:

model = small, TP = 2, PP = 2, DP = 2 (loss on par)

Baseline:
https://www.internalfb.com/mlhub/pipelines/runs/mast/conda-xlformers-tl66r0qd

Test:
https://www.internalfb.com/mlhub/pipelines/runs/mast/conda-xlformers-km46966

Loss curves (screenshot): baseline and test on par.

Perf evaluation

model = llama3_kv8_balance2_ffn12, n_layers = 1, non-PP microbatching, bs = 128, fp8, TP = 4, CP = 8

Baseline:
e2e TFLOPS/s: 339.53
comp TFLOPS/s: 625.64

https://www.internalfb.com/mlhub/pipelines/runs/mast/conda-xlformers-f7cdn9q
trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/trace.27299292624533.json.gz&bucket=acadia

Test:
e2e TFLOPS/s: 387.98 (~15% over baseline)
comp TFLOPS/s: 817.5 (~30% over baseline)

https://www.internalfb.com/mlhub/pipelines/runs/mast/conda-xlformers-t56xpf
trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/trace.71951644521316.json.gz&bucket=acadia
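
For reference, the ratios behind those percentages: 387.98 / 339.53 ≈ 1.14x end-to-end and 817.5 / 625.64 ≈ 1.31x compute throughput over the baseline.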

@facebook-github-bot added the CLA Signed label on Apr 29, 2024 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed).
@chrisxcai changed the title from "[WIP] Make FSDPv1 only perform cat() during last microbatch backward() within FlattenParamsWrapper" to "[FSDPv1] Only perform cat() during last microbatch backward() within FlattenParamsWrapper" on May 15, 2024.
@chrisxcai requested a review from @awgu on May 15, 2024 at 07:46.
@awgu left a comment:

This approach makes sense to me!

Review thread on the new optimize_backward_concat docstring:

If True, only let backward pass propagate to self.params, which will
invoke the _post_backward_hook() and concat() op, when self._require_backward_grad_sync
is True (e.g. last microbatch)
NOTE: this likely will incur more GPU memory usage

@awgu:
Could you explain why there will be more GPU memory usage?

@chrisxcai (author):

Hi @awgu, current test results show that the GPU memory overhead can be non-trivial (~20% of 80 GB); we will follow up on reducing the memory usage.
(memory snapshot screenshots)

Review thread on the fp32 gradient accumulation snippet:

if self.fp32_grads[param_index] is None:
    self.fp32_grads[param_index] = grad.to(torch.float32)
else:
    self.fp32_grads[param_index].add_(grad.data)

nit: I think grad.data can just be grad (saves one aten.detach call)
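
A sketch of the tweak the nit suggests (same fragment as above, just passing grad directly; assuming .data served no other purpose here):

if self.fp32_grads[param_index] is None:
    self.fp32_grads[param_index] = grad.to(torch.float32)
else:
    # Passing grad directly avoids the extra aten.detach that grad.data incurs.
    self.fp32_grads[param_index].add_(grad)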

@chrisxcai changed the base branch from ngoyal_changes_for_pp_fp8 to ngoyal_changes_for_pp_fp8_jiecaoyu_free_fp16_shard on May 15, 2024 at 21:51.
@awgu and others added 4 commits on May 15, 2024:
* Changed to only run reshard hook if all gradients computed

* Fix decreasing it/s with multi-grad hook
Co-authored-by: Jie Wang <jiewang@meta.com>
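
Those commit messages refer to PyTorch's multi-grad hook mechanism; a minimal, hypothetical illustration (not the actual commit) of running a callback such as a reshard only after all gradients have been computed:

import torch
from torch.autograd.graph import register_multi_grad_hook

w1 = torch.randn(4, 4, requires_grad=True)
w2 = torch.randn(4, 4, requires_grad=True)

# With the default mode="all", the hook fires only once gradients for every
# listed tensor exist in this backward pass, which is the point at which a
# reshard-style cleanup becomes safe to run.
handle = register_multi_grad_hook((w1, w2), lambda grads: print("all grads ready -> reshard"))

(w1.sum() + w2.sum()).backward()
handle.remove()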
@chrisxcai changed the base branch from ngoyal_changes_for_pp_fp8_jiecaoyu_free_fp16_shard back to ngoyal_changes_for_pp_fp8 on May 15, 2024 at 23:32.