
Support Yi-1.5 series Chat models #698

Closed
thomas-yanxin opened this issue May 17, 2024 · 10 comments


@thomas-yanxin

thomas-yanxin commented May 17, 2024

I added the following to template.py:


yi_chat=dict(
    SYSTEM=('<|im_start|>system\n{system}<|im_end|>\n'),
    INSTRUCTION=('<|im_start|>user\n{input}<|im_end|>\n'
                 '<|im_start|>assistant\n'),
    SUFFIX='<|im_end|>',
    SUFFIX_AS_EOS=True,
    SEP='\n',
    STOP_WORDS=['<|im_end|>'])
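For context, a minimal sketch of how I would expect these fields to assemble into a single Yi-1.5 chat turn (my own illustration with a hypothetical render_turn helper, not xtuner's actual templating code):

# Minimal sketch: assemble one system/user/assistant turn from the
# SYSTEM / INSTRUCTION / SUFFIX fields defined above.
yi_chat = dict(
    SYSTEM='<|im_start|>system\n{system}<|im_end|>\n',
    INSTRUCTION=('<|im_start|>user\n{input}<|im_end|>\n'
                 '<|im_start|>assistant\n'),
    SUFFIX='<|im_end|>')

def render_turn(system, user, assistant):
    # SUFFIX closes the assistant turn and doubles as the EOS marker
    # (SUFFIX_AS_EOS=True in the template above).
    return (yi_chat['SYSTEM'].format(system=system)
            + yi_chat['INSTRUCTION'].format(input=user)
            + assistant + yi_chat['SUFFIX'])

print(render_turn('You are a helpful assistant.', 'Hello', 'Hi there!'))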

But it gets stuck during training:

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers (repeated 8 times)
05/17 18:33:46 - mmengine - INFO - Hooks will be executed in the following order:
……
05/17 18:34:02 - mmengine - INFO - xtuner_dataset_timeout = 2:00:00

@pppppM
Collaborator

pppppM commented May 17, 2024

It looks like it's stuck at the data-loading stage. Are you using an open-source dataset from the HF Hub? If so, it may be a network issue. @thomas-yanxin

@thomas-yanxin
Author

It looks like it's stuck at the data-loading stage. Are you using an open-source dataset from the HF Hub? If so, it may be a network issue. @thomas-yanxin

No, it's local data.

@pppppM
Collaborator

pppppM commented May 17, 2024

Then you may need to check whether something went wrong in the data-loading step. You can run xtuner log-dataset $CONFIG to see whether the dataset information is printed correctly.

In addition, you can try the alpaca dataset to check whether training works normally.
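
As a generic way to rule out a problem with the raw local data, independent of xtuner, one could also load it directly with the Hugging Face datasets library (a hedged sketch; the file path below is a placeholder):

# Hypothetical check (not xtuner-specific): confirm that the local JSON
# data loads and a sample can be read without hanging.
from datasets import load_dataset

data_files = 'data/llava_pretrain.json'  # placeholder; point to your local file
ds = load_dataset('json', data_files=data_files, split='train')
print(ds)      # number of rows and column names
print(ds[0])   # first sample, to eyeball the format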

@thomas-yanxin
Author

Then you may need to check whether something went wrong in the data-loading step. You can run xtuner log-dataset $CONFIG to see whether the dataset information is printed correctly.

In addition, you can try the alpaca dataset to check whether training works normally.

It shouldn't be a data issue: other models train fine with the same data.

@pppppM
Collaborator

pppppM commented May 17, 2024

Judging from the log, it is indeed stuck at the data-loading stage and hasn't even started loading the model yet, so it probably has little to do with the model.

Could it be that your dataset is too large?

@thomas-yanxin
Author

Judging from the log, it is indeed stuck at the data-loading stage and hasn't even started loading the model yet, so it probably has little to do with the model.

Could it be that your dataset is too large?

I don't think that's the issue.

Let me first explain my task: I'm running a LLaVA-Yi job, and in the pretrain stage I'm only using 200k samples. That amount should be fine.

@hhaAndroid
Collaborator

hhaAndroid commented May 17, 2024

@thomas-yanxin 200k is fine, not that much. You could consider preprocessing the dataset offline first, which should make it easier to tell whether the dataset is the problem. I've run Yi-1.5-34B + LLaVA with no problems; it trains normally.

@thomas-yanxin
Author

thomas-yanxin commented May 17, 2024

../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [17,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [17,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [17,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [17,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [17,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/yanxin/xtuner/xtuner/tools/train.py", line 360, in <module>
    main()
  File "/yanxin/xtuner/xtuner/tools/train.py", line 356, in main
    runner.train()
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1200, in train
    model = self.train_loop.run()  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/loops.py", line 286, in run
    self.run_iter(data_batch)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/loops.py", line 309, in run_iter
    outputs = self.runner.model.train_step(
  File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 133, in train_step
    losses = self._run_forward(data, mode='loss')
  File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 176, in _run_forward
    results = self.model(**data, mode=mode)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1833, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/yanxin/xtuner/xtuner/model/llava.py", line 280, in forward
    data = prepare_inputs_labels_for_multimodal(llm=self.llm, **data)
  File "/yanxin/xtuner/xtuner/model/utils.py", line 207, in prepare_inputs_labels_for_multimodal
    for i in range(num_images + 1):
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The error above occurred. @hhaAndroid

@pppppM
Collaborator

pppppM commented May 20, 2024

@thomas-yanxin This error is usually caused by a mismatch between the tokenizer used during data processing and the LLM's embeddings.
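
For reference, a hedged sketch of how one might check for that mismatch with the standard transformers API (the model path is a placeholder, and resizing the embeddings is only one possible fix):

# Hypothetical sanity check (not from this thread): compare the tokenizer's
# vocabulary size with the LLM's embedding table. Any token id >= the number
# of embedding rows triggers the `srcIndex < srcSelectDimSize` assert above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = '01-ai/Yi-1.5-9B-Chat'  # placeholder; use your local checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)

vocab_size = len(tokenizer)
emb_rows = model.get_input_embeddings().weight.shape[0]
print(f'tokenizer vocab: {vocab_size}, embedding rows: {emb_rows}')

if vocab_size > emb_rows:
    # If special tokens were added for the chat template, the embedding
    # table must be resized to match, otherwise indexing goes out of range.
    model.resize_token_embeddings(vocab_size)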

@thomas-yanxin
Author

Solved, thanks!
