
Support Yi-1.5 series Chat models #698

Closed
thomas-yanxin opened this issue May 17, 2024 · 10 comments


@thomas-yanxin

thomas-yanxin commented May 17, 2024

I added the following to template.py:


yi_chat=dict(
    SYSTEM=('<|im_start|>system\n{system}<|im_end|>\n'),
    INSTRUCTION=('<|im_start|>user\n{input}<|im_end|>\n'
                 '<|im_start|>assistant\n'),
    SUFFIX='<|im_end|>',
    SUFFIX_AS_EOS=True,
    SEP='\n',
    STOP_WORDS=['<|im_end|>'])
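For context, a minimal sketch of how I would expect these fields to assemble into a single Yi-1.5 chat turn (my own illustration with a hypothetical render_turn helper, not xtuner's actual templating code):

# Minimal sketch: assemble one system/user/assistant turn from the
# SYSTEM / INSTRUCTION / SUFFIX fields defined above.
yi_chat = dict(
    SYSTEM='<|im_start|>system\n{system}<|im_end|>\n',
    INSTRUCTION=('<|im_start|>user\n{input}<|im_end|>\n'
                 '<|im_start|>assistant\n'),
    SUFFIX='<|im_end|>')

def render_turn(system, user, assistant):
    # SUFFIX closes the assistant turn and doubles as the EOS marker
    # (SUFFIX_AS_EOS=True in the template above).
    return (yi_chat['SYSTEM'].format(system=system)
            + yi_chat['INSTRUCTION'].format(input=user)
            + assistant + yi_chat['SUFFIX'])

print(render_turn('You are a helpful assistant.', 'Hello', 'Hi there!'))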

But it gets stuck during training:

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers (repeated 8 times)
05/17 18:33:46 - mmengine - INFO - Hooks will be executed in the following order:
……
05/17 18:34:02 - mmengine - INFO - xtuner_dataset_timeout = 2:00:00

@pppppM
Collaborator

pppppM commented May 17, 2024

It looks like it's stuck at the data-loading stage. Are you using an open-source dataset from the HF Hub? If so, it may be a network issue. @thomas-yanxin

@thomas-yanxin
Author

It looks like it's stuck at the data-loading stage. Are you using an open-source dataset from the HF Hub? If so, it may be a network issue. @thomas-yanxin

No, it's local data.

@pppppM
Collaborator

pppppM commented May 17, 2024

Then you may need to check whether something went wrong in the data-loading step. You can run xtuner log-dataset $CONFIG to see whether the dataset information is printed correctly.

In addition, you can try the alpaca dataset to check whether training works normally.
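
As a generic way to rule out a problem with the raw local data, independent of xtuner, one could also load it directly with the Hugging Face datasets library (a hedged sketch; the file path below is a placeholder):

# Hypothetical check (not xtuner-specific): confirm that the local JSON
# data loads and a sample can be read without hanging.
from datasets import load_dataset

data_files = 'data/llava_pretrain.json'  # placeholder; point to your local file
ds = load_dataset('json', data_files=data_files, split='train')
print(ds)      # number of rows and column names
print(ds[0])   # first sample, to eyeball the format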

@thomas-yanxin
Author

Then you may need to check whether something went wrong in the data-loading step. You can run xtuner log-dataset $CONFIG to see whether the dataset information is printed correctly.

In addition, you can try the alpaca dataset to check whether training works normally.

It shouldn't be a data issue: other models train fine with the same data.

@pppppM
Collaborator

pppppM commented May 17, 2024

Judging from the log, it is indeed stuck at the data-loading stage and hasn't even started loading the model yet, so it probably has little to do with the model.

Could it be that your dataset is too large?

@thomas-yanxin
Author

Judging from the log, it is indeed stuck at the data-loading stage and hasn't even started loading the model yet, so it probably has little to do with the model.

Could it be that your dataset is too large?

I don't think that's the issue.

Let me first explain my task: I'm running a LLaVA-Yi job, and in the pretrain stage I'm only using 200k samples. That amount should be fine.

@hhaAndroid
Collaborator

hhaAndroid commented May 17, 2024

@thomas-yanxin 200k is fine, not that much. You could consider preprocessing the dataset offline first, which should make it easier to tell whether the dataset is the problem. I've run Yi-1.5-34B + LLaVA with no problems; it trains normally.

@thomas-yanxin
Author

thomas-yanxin commented May 17, 2024

../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [17,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [17,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [17,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [17,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [17,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/yanxin/xtuner/xtuner/tools/train.py", line 360, in <module>
    main()
  File "/yanxin/xtuner/xtuner/tools/train.py", line 356, in main
    runner.train()
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1200, in train
    model = self.train_loop.run()  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/loops.py", line 286, in run
    self.run_iter(data_batch)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/loops.py", line 309, in run_iter
    outputs = self.runner.model.train_step(
  File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 133, in train_step
    losses = self._run_forward(data, mode='loss')
  File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 176, in _run_forward
    results = self.model(**data, mode=mode)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1833, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/yanxin/xtuner/xtuner/model/llava.py", line 280, in forward
    data = prepare_inputs_labels_for_multimodal(llm=self.llm, **data)
  File "/yanxin/xtuner/xtuner/model/utils.py", line 207, in prepare_inputs_labels_for_multimodal
    for i in range(num_images + 1):
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The error above occurred. @hhaAndroid

@pppppM
Collaborator

pppppM commented May 20, 2024

@thomas-yanxin This error is usually caused by a mismatch between the tokenizer used during data processing and the LLM's embeddings.
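
For reference, a hedged sketch of how one might check for that mismatch with the standard transformers API (the model path is a placeholder, and resizing the embeddings is only one possible fix):

# Hypothetical sanity check (not from this thread): compare the tokenizer's
# vocabulary size with the LLM's embedding table. Any token id >= the number
# of embedding rows triggers the `srcIndex < srcSelectDimSize` assert above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = '01-ai/Yi-1.5-9B-Chat'  # placeholder; use your local checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)

vocab_size = len(tokenizer)
emb_rows = model.get_input_embeddings().weight.shape[0]
print(f'tokenizer vocab: {vocab_size}, embedding rows: {emb_rows}')

if vocab_size > emb_rows:
    # If special tokens were added for the chat template, the embedding
    # table must be resized to match, otherwise indexing goes out of range.
    model.resize_token_embeddings(vocab_size)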

@thomas-yanxin
Author

Solved, thanks!
