Fp8 all gather hack #1136

Open · wants to merge 3 commits into base: ngoyal_added_zero2_shard_modelparams_multiple_gpus from fp8_all_gather

Conversation

@jspark1105 commented Sep 17, 2023

This is based on ngoyal_added_zero2_shard_modelparams_multiple_gpus and adds hacks to use fp8 all-gather with NVIDIA's Transformer Engine (see the latest commit for the changes on top of the ngoyal_added_zero2_shard_modelparams_multiple_gpus branch).

This depends on the Transformer Engine changes in https://github.com/facebookresearch/TransformerEngine/pull/20.
See https://github.com/fairinternal/xlformers/pull/1403 for an example of how to use it.
It also depends on the PyTorch changes in pytorch/pytorch#109654.

To use fp8 all-gather, set compute_dtype=torch.float8_e4m3fn and mixed_precision=True (see the sketch after this description).
We separate out precision-critical parameters, such as affine weights for norms, as non-flattened params and hard-code them to use bf16.
We update scale/scale_inv inside forward, before _rebuild_full_params calls _cast_fp32_param_shards_to_fp16, whereas the TE baseline updates scale/scale_inv in prepare_forward. This is because we need the fp8 quantization of weights earlier, before the all-gather. (One could consider doing this in post-backward, but that has a problem: the bwd amax update only happens after the backward of all layers has finished, which can be later than post-backward, so we would not be using the latest bwd amax info for the scale/scale_inv update.)
We hard-code special handling for a few TransformerEngine layers (Linear, LayerNormLinear, and LayerNormMLP) in _cast_fp32_param_shards_to_fp16 to access their fp8 metadata and quantize with the right scales. (TODO: we may want to extract this into user-customizable callback functions.)
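A minimal sketch of wiring these settings up (assumes this PR's fairscale branch plus a TransformerEngine-based module; the helper name `wrap_for_fp8_allgather` is illustrative, not part of the PR):

```python
import torch
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP


def wrap_for_fp8_allgather(block: nn.Module) -> FSDP:
    """Wrap a block so its flattened params are all-gathered in fp8.

    Call after torch.distributed is initialized. `block` would normally be a
    TE Linear / LayerNormLinear / LayerNormMLP; any nn.Module shows the shape
    of the call.
    """
    return FSDP(
        block,
        mixed_precision=True,               # keep fp32 master shards, cast copies for compute
        compute_dtype=torch.float8_e4m3fn,  # enables the fp8 all-gather path in this PR
        flatten_parameters=True,            # flatten params for the fused all-gather
    )
```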

CC @awgu @ngoyal2707 @vedanuj @jiecaoyu @yf225 @GD06

@facebook-github-bot added the CLA Signed label Sep 17, 2023
@jspark1105 jspark1105 marked this pull request as ready for review September 18, 2023 04:33
@jspark1105 jspark1105 changed the base branch from main to ngoyal_added_zero2_shard_modelparams_multiple_gpus October 4, 2023 23:05
@jspark1105 (Author) commented:
Will merge the main_grad-related changes with #1142.

@jspark1105 force-pushed the fp8_all_gather branch 2 times, most recently from bd70153 to af3d2d7 on October 5, 2023 03:49
                # Cast grad to FP32.
                grad_or_main_grad.data = grad_or_main_grad.to(param.dtype)
            elif self._is_fp8_dtype():
                # Use bf16 wgrad for fp8 weights (TODO: handle fp8 wgrad)
Reviewer (Member) commented:
Currently this is not working with the latest FP8 wgrad?

@jspark1105 (Author) replied:
This is meant to be future work for when we have fp8 reduce-scatter. I'll update the comment.
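For readers following the thread, a small sketch of the gradient-dtype decision this hunk makes (the helper and the fp8 dtype set are illustrative, not the PR's actual code):

```python
import torch

_FP8_DTYPES = (torch.float8_e4m3fn, torch.float8_e5m2)  # assumed fp8 dtype set


def wgrad_dtype(compute_dtype: torch.dtype, fp32_reduce_scatter: bool) -> torch.dtype:
    # Mirrors the branch above: cast grads to fp32 when fp32 reduce-scatter is
    # requested; otherwise fp8 weights keep bf16 weight grads until fp8
    # reduce-scatter is supported (the stated future work).
    if fp32_reduce_scatter:
        return torch.float32
    if compute_dtype in _FP8_DTYPES:
        return torch.bfloat16
    return compute_dtype
```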

@@ -1393,7 +1447,11 @@ def forward(self, *args: Any, **kwargs: Any) -> torch.Tensor:

         # For root and mixed precision, we convert the input to FP16 (no_grad is needed for
         # the conversion).
-        is_bf16 = self.compute_dtype == torch.bfloat16
+        is_bf16 = self.compute_dtype in [
Reviewer (Member) commented:
nit: is_bf16_or_fp8
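Following the nit, a sketch of what the renamed check could look like (the diff above is truncated, so the exact fp8 dtype list is an assumption):

```python
import torch


def is_bf16_or_fp8(compute_dtype: torch.dtype) -> bool:
    # Treat bf16 and the fp8 formats uniformly for the input-cast path.
    return compute_dtype in (
        torch.bfloat16,
        torch.float8_e4m3fn,  # the dtype the PR description recommends
        torch.float8_e5m2,    # assumed; may not be in the branch's actual list
    )
```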

@jspark1105 force-pushed the fp8_all_gather branch 2 times, most recently from b9b093b to a2b49d1 on October 7, 2023 03:10
@@ -2265,8 +2361,7 @@ def local_metadata_dict(self) -> Dict[str, Any]:
             backing_param_name = m.module.flat_param_names[i]
             names, shapes, numels = m.module.metadata(i)
         else:
-            assert len(m._param_name_groups[i]) == 1
-            backing_param_name = m._param_name_groups[i][0]
+            backing_param_name = m._param_name_groups[m._num_flatten_params][i - m._num_flatten_params]
@jspark1105 (Author) commented:
Need to make sure checkpointing works properly with this.
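A toy illustration of the indexing this hunk assumes (the group layout below is a guess made only to show the arithmetic; the names are hypothetical):

```python
num_flatten_params = 2  # hypothetical: two flattened param groups
param_name_groups = [
    ["layers.0.flat_param"],                           # flattened group 0
    ["layers.1.flat_param"],                           # flattened group 1
    ["layers.0.norm.weight", "layers.1.norm.weight"],  # non-flattened, precision-critical params
]
i = 3  # a backing-param index past the flattened groups
backing_param_name = param_name_groups[num_flatten_params][i - num_flatten_params]
assert backing_param_name == "layers.1.norm.weight"
```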

@jspark1105 force-pushed the fp8_all_gather branch 2 times, most recently from d92dc0f to 6a4d7f4 on October 15, 2023 18:56