enable yuan autotp & add conv tp #5428

Open · wants to merge 7 commits into master

Conversation

Yejing-Lai
Contributor

@Yejing-Lai Yejing-Lai commented Apr 17, 2024

This PR aims to enable Yuan model AutoTP and add conv TP.

The Yuan model uses shared QK.
For example:
q_linear_out = [q1, q2, q3, q4, q5, ... , q16]
k_linear_out = [k1, k2, k3, k4, k5, ... , k16]

After sharing QK:
TP=1:
q' = [q1,q2,q3,q4, q9,q10,q11,q12, k1,k2,k3,k4, k9,k10,k11,k12]
k' = [q5,q6,q7,q8, q13,q14,q15,q16, k5,k6,k7,k8, k13,k14,k15,k16]
v' = [v1,v2,v3,v4, v5,v6,v7,v8, v9,v10,v11,v12, v13,v14,v15,v16]

TP=2:
rank0:
q'_0 = [q1,q2,q3,q4, k1,k2,k3,k4]
k'_0 = [q5,q6,q7,q8, k5,k6,k7,k8]
v'_0 = [v1,v2,v3,v4, v5,v6,v7,v8] -> v'_0 is wrong! The expected value is: [v1,v2,v3,v4, v9,v10,v11,v12]
rank1:
q'_1 = [q9,q10,q11,q12, k9,k10,k11,k12]
k'_1 = [q13,q14,q15,q16, k13,k14,k15,k16]
v'_1 = [v9,v10,v11,v12, v13,v14,v15,v16] -> v'_1 is wrong! The expected value is: [v5,v6,v7,v8, v13,v14,v15,v16]

To avoid modifying the modeling code, we adjust the value and o_proj weights to fit this shared-QK layout (a sketch of the idea is shown below).
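The following is a minimal, hypothetical sketch of that re-sharding idea, not the PR's actual implementation: the helper name, the "unit" bookkeeping, and the assumed row layout of the value weight are all illustrative choices made to mirror the 16-unit example above.

```python
import torch

def shard_value_for_shared_qk(v_weight: torch.Tensor, unit_dim: int, rank: int, world_size: int):
    # Hypothetical helper: select the value "units" (v1..v16 in the example above)
    # that correspond to this rank's shared-QK slices, instead of taking one
    # contiguous chunk of rows. Assumes v_weight rows are laid out as
    # [unit_0, unit_1, ..., unit_{n-1}] blocks of size unit_dim.
    total_units = v_weight.shape[0] // unit_dim
    half = total_units // 2                  # e.g. 8 when there are 16 units
    per_rank = half // world_size            # e.g. 4 when world_size == 2

    # Rank r takes its block from the first half and the matching block from the
    # second half, so rank0 -> v1..v4 and v9..v12, rank1 -> v5..v8 and v13..v16.
    first = range(rank * per_rank, (rank + 1) * per_rank)
    second = range(half + rank * per_rank, half + (rank + 1) * per_rank)

    rows = [v_weight[u * unit_dim:(u + 1) * unit_dim] for u in list(first) + list(second)]
    return torch.cat(rows, dim=0)
```

With 16 units and world_size=2, this reproduces the expected TP=2 layout shown above instead of the naive contiguous split.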

We also added conv TP to support models that include heavy convolution computation. It is similar to the linear TP policy (a sketch follows the list below).
if not last_conv_layer:

    1. Divide the conv weight across ranks along the output channel dimension.
    2. Apply conv2d.

else:

    1. Divide the conv weight across ranks along the input channel dimension.
    2. Apply conv2d.
    3. Use allreduce to sum the outputs.
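As a rough illustration (not the PR's actual code), the two modes might look like the sketch below in plain PyTorch; the function names, and the assumptions that groups == 1 and that a process group is already initialized, are mine.

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist

def shard_conv2d_weights(conv: torch.nn.Conv2d, rank: int, world_size: int, last_conv_layer: bool):
    # Hypothetical sketch. Earlier conv layers are split along output channels;
    # the last conv layer is split along input channels (its input is already
    # channel-sharded by the previous layer). Assumes conv.groups == 1.
    w = conv.weight.data
    b = conv.bias.data if conv.bias is not None else None
    if not last_conv_layer:
        w = torch.chunk(w, world_size, dim=0)[rank]   # [OC/ws, IC, kH, kW]
        if b is not None:
            b = torch.chunk(b, world_size, dim=0)[rank]
    else:
        w = torch.chunk(w, world_size, dim=1)[rank]   # [OC, IC/ws, kH, kW]
        if b is not None and rank != 0:
            b = torch.zeros_like(b)                   # add the bias only once after the allreduce
    return w, b

def conv2d_tp_forward(x, w, b, conv: torch.nn.Conv2d, last_conv_layer: bool):
    # Each rank convolves with its shard; only the input-channel-sharded
    # (last) layer needs an allreduce to sum the partial outputs.
    out = F.conv2d(x, w, b, stride=conv.stride, padding=conv.padding, dilation=conv.dilation)
    if last_conv_layer:
        dist.all_reduce(out)
    return out
```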

@delock
Contributor

delock commented Apr 17, 2024

@@ -123,3 +123,54 @@ def _transpose_fused_qkvw(src, mp_size, fused_qkv_type=None, module=None):
warning_once(f"Unrecognized fusedkqv weight type, default to using bloom type,"
f"please check in prepare_tp_fused_qkvw() to avoid potential calculation errors")
return _bloom_type_transpose(src, mp_size)


def shard_value_with_share_qk(
Contributor

Comments are needed here (with an example) to help understand the functionality of shard_value_with_share_qk().

self.shard_by_oc = shard_by_oc
self.shard_weights(conv)

def shard_weights(self, conv):
Contributor

There should be some comments here to explain the sharding scheme, ideally with a simple example to aid understanding.

@@ -350,6 +372,9 @@ def set_lm_head(module):
pbar.update(1)
gc.collect()
replaced_module = set_lm_head(replaced_module)
# conv2d tp module replace
if 'Yuan' in str(replaced_module):
Contributor

Does this mean we apply conv sharding only to models where we know there is a conv layer?

Contributor Author

Yes. I added a comment to help explain this situation~

@delock
Contributor

delock commented Apr 23, 2024

Hi @tjruwase, we got a request to support Yuan model AutoTP (https://huggingface.co/IEITYuan/Yuan2-102B-hf). This model has a special QKV format and also has convolution layers, both of which need special treatment in tensor parallelism. This PR addresses both model features and supports them inside DeepSpeed AutoTP. Can this PR be reviewed? Thanks!

@loadams
Contributor

loadams commented May 15, 2024

Hi @delock - FYI could you resolve the merge conflicts on this PR so it can be reviewed/tests run?

@Yejing-Lai
Contributor Author

> Hi @delock - FYI could you resolve the merge conflicts on this PR so it can be reviewed/tests run?

Hi @loadams. The conflicts have been resolved. Please review~

@loadams
Contributor

loadams commented May 22, 2024

> Hi @delock - FYI could you resolve the merge conflicts on this PR so it can be reviewed/tests run?

> Hi @loadams. The conflicts have been resolved. Please review~

This looks fine to me, anything else you want to review @delock ?

@delock
Contributor

delock commented May 22, 2024

> Hi @delock - FYI could you resolve the merge conflicts on this PR so it can be reviewed/tests run?

> Hi @loadams. The conflicts have been resolved. Please review~

> This looks fine to me, anything else you want to review @delock ?

@loadams looks fine to me, thanks!
