
Sequence parallel is enabled even when I don't use it. #669

Open
amulil opened this issue May 9, 2024 · 4 comments

amulil (Contributor) commented May 9, 2024

version

05/09 21:16:21 - mmengine - INFO - 0.1.18

how to reproduce

CUDA_VISIBLE_DEVICES=4,5,6,7 NPROC_PER_NODE=4 xtuner train qwen1_5_0_5b_chat_qlora_alpaca_e3

log

I only changed batch_size to 4 in the config file qwen1_5_0_5b_chat_qlora_alpaca_e3, but sequence_parallel_world_size is changed to 4.
[Screenshot 2024-05-09 21:18:55]

[rank3]: Traceback (most recent call last):
[rank3]:   File "/data/home/xxx/examples/dpo/xtuner/xtuner/tools/train.py", line 360, in <module>
[rank3]:     main()
[rank3]:   File "/data/home/xxx/examples/dpo/xtuner/xtuner/tools/train.py", line 356, in main
[rank3]:     runner.train()
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/mmengine/runner/runner.py", line 1777, in train
[rank3]:     model = self.train_loop.run()  # type: ignore
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/mmengine/runner/loops.py", line 287, in run
[rank3]:     self.run_iter(data_batch)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/mmengine/runner/loops.py", line 311, in run_iter
[rank3]:     outputs = self.runner.model.train_step(
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
[rank3]:     losses = self._run_forward(data, mode='loss')
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
[rank3]:     results = self(**data, mode=mode)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank3]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
[rank3]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/data/home/xxx/examples/dpo/xtuner/xtuner/model/sft.py", line 228, in forward
[rank3]:     return self.compute_loss(data, data_samples)
[rank3]:   File "/data/home/xxx/examples/dpo/xtuner/xtuner/model/sft.py", line 277, in compute_loss
[rank3]:     return self._compute_sequence_parallel_loss(data)
[rank3]:   File "/data/home/xxx/examples/dpo/xtuner/xtuner/model/sft.py", line 262, in _compute_sequence_parallel_loss
[rank3]:     outputs = self.llm(**data)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/peft/peft_model.py", line 1129, in forward
[rank3]:     return self.base_model(
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
[rank3]:     return self.model.forward(*args, **kwargs)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
[rank3]:     output = module._old_forward(*args, **kwargs)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1169, in forward
[rank3]:     outputs = self.model(
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
[rank3]:     output = module._old_forward(*args, **kwargs)
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1015, in forward
[rank3]:     attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 389, in _prepare_4d_causal_attention_mask_for_sdpa
[rank3]:     expanded_4d_mask = attn_mask_converter.to_4d(
[rank3]:   File "/data/home/xxx/.conda/envs/xpo/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 137, in to_4d
[rank3]:     expanded_attn_mask = causal_4d_mask.masked_fill(expanded_attn_mask.bool(), torch.finfo(dtype).min)
[rank3]: RuntimeError: The size of tensor a (288) must match the size of tensor b (72) at non-singleton dimension 3

HIT-cwh (Collaborator) commented May 10, 2024

Hi @amulil!
Please provide the config or log file corresponding to this picture.
BTW, have you installed flash_attn?
[attached screenshot]
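
(For reference, flash-attn is usually installed with the command below. This is the common upstream install command, not something taken from this thread, and the exact flags can depend on your CUDA and PyTorch setup.)

pip install flash-attn --no-build-isolation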

amulil (Contributor, Author) commented May 11, 2024

@HIT-cwh
I used this config and just set batch_size=4: https://github.com/InternLM/xtuner/blob/193f614ffbb2463010808ebb2e689331a9c5e4f6/xtuner/configs/qwen/qwen1_5/qwen1_5_0_5b_chat/qwen1_5_0_5b_chat_qlora_alpaca_e3.py#L40C8-L40C8
Then I trained with the command CUDA_VISIBLE_DEVICES=4,5,6,7 NPROC_PER_NODE=4 xtuner train qwen1_5_0_5b_chat_qlora_alpaca_e3.

Thanks for your tip; I hadn't installed flash-attn. After installing it, the error is gone.

But the command I run shouldn't use sequence parallel at all: its sequence_parallel_world_size is changed to 4 when, in fact, it should be 1.
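
For reference, a minimal sketch of the relevant config values in xtuner's Python config style (the sequence_parallel_size field is taken from recent xtuner configs and may not appear under that exact name in this particular file, so treat this as illustrative rather than the file's actual contents):

# sketch of the relevant settings in qwen1_5_0_5b_chat_qlora_alpaca_e3.py
batch_size = 4              # per-device batch size, the only value changed here
sequence_parallel_size = 1  # expected default when sequence parallel is not used;
                            # the reported bug makes the effective
                            # sequence_parallel_world_size equal NPROC_PER_NODE (4)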

WencWu commented May 17, 2024


I ran into the same problem. Do you have a solution for it, bro?

HIT-cwh (Collaborator) commented Jun 7, 2024


Currently, there is a bug arising from sequence parallel when training without DeepSpeed. This PR will fix the bug and will be integrated soon. We apologize for any inconvenience this may have caused.

In addition, we recommend using DeepSpeed to optimize the training phase, e.g. by passing --deepspeed deepspeed_zero1.
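
For example, the reproduce command from the original report with DeepSpeed ZeRO-1 enabled would look like this (same GPUs and config as above; the --deepspeed flag is the one recommended here):

CUDA_VISIBLE_DEVICES=4,5,6,7 NPROC_PER_NODE=4 xtuner train qwen1_5_0_5b_chat_qlora_alpaca_e3 --deepspeed deepspeed_zero1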
