
Running the official chatglm3-6b finetune command fails with an error saying the kernel needs an update #843

Open
alexhmyang opened this issue Apr 25, 2024 · 4 comments

@alexhmyang

https://modelscope.cn/models/ZhipuAI/chatglm3-6b/summary
I pulled the official chatglm3-6b model and code twice, but running finetune still fails with an error saying the kernel needs an update.

!CUDA_VISIBLE_DEVICES=0 NCCL_P2P_DISABLE="1" NCCL_IB_DISABLE="1" python finetune_hf.py data/AdvertiseGen_fix /mnt/workspace/chatglm3-6b configs/lora.yaml

Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
The log is as follows:

2024-04-25 14:56:59.070864: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-04-25 14:56:59.073797: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-25 14:56:59.105362: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-25 14:56:59.105394: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-25 14:56:59.105413: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-25 14:56:59.111071: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-25 14:56:59.111270: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-25 14:57:00.457944: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|██████████████████| 7/7 [00:35<00:00, 5.10s/it]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model

--> model has 1.949696M params

Map (num_proc=16): 100%|██████| 114599/114599 [00:04<00:00, 24420.30 examples/s]
train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 114599
})
Map (num_proc=16): 100%|███████████| 1070/1070 [00:00<00:00, 1333.83 examples/s]
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
Map (num_proc=16): 100%|███████████| 1070/1070 [00:00<00:00, 1387.08 examples/s]
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
--> Sanity check
'[gMASK]': 64790 -> -100
'sop': 64792 -> -100
'<|user|>': 64795 -> -100
'': 30910 -> -100
'\n': 13 -> -100
'': 30910 -> -100
'类型': 33467 -> -100
'#': 31010 -> -100
'裤': 56532 -> -100
'*': 30998 -> -100
'版': 55090 -> -100
'型': 54888 -> -100
'#': 31010 -> -100
'宽松': 40833 -> -100
'*': 30998 -> -100
'风格': 32799 -> -100
'#': 31010 -> -100
'性感': 40589 -> -100
'*': 30998 -> -100
'图案': 37505 -> -100
'#': 31010 -> -100
'线条': 37216 -> -100
'*': 30998 -> -100
'裤': 56532 -> -100
'型': 54888 -> -100
'#': 31010 -> -100
'阔': 56529 -> -100
'腿': 56158 -> -100
'裤': 56532 -> -100
'<|assistant|>': 64796 -> -100
'': 30910 -> 30910
'\n': 13 -> 13
'': 30910 -> 30910
'宽松': 40833 -> 40833
'的': 54530 -> 54530
'阔': 56529 -> 56529
'腿': 56158 -> 56158
'裤': 56532 -> 56532
'这': 54551 -> 54551
'两年': 33808 -> 33808
'真的': 32041 -> 32041
'吸': 55360 -> 55360
'粉': 55486 -> 55486
'不少': 32138 -> 32138
',': 31123 -> 31123
'明星': 32943 -> 32943
'时尚': 33481 -> 33481
'达': 54880 -> 54880
'人的': 31664 -> 31664
'心头': 46565 -> 46565
'爱': 54799 -> 54799
'。': 31155 -> 31155
'毕竟': 33051 -> 33051
'好': 54591 -> 54591
'穿': 55432 -> 55432
'时尚': 33481 -> 33481
',': 31123 -> 31123
'谁': 55622 -> 55622
'都能': 32904 -> 32904
'穿': 55432 -> 55432
'出': 54557 -> 54557
'腿': 56158 -> 56158
'长': 54625 -> 54625
'2': 30943 -> 30943
'米': 55055 -> 55055
'的效果': 35590 -> 35590
'宽松': 40833 -> 40833
'的': 54530 -> 54530
'裤': 56532 -> 56532
'腿': 56158 -> 56158
',': 31123 -> 31123
'当然是': 48466 -> 48466
'遮': 57148 -> 57148
'肉': 55343 -> 55343
'小': 54603 -> 54603
'能手': 49355 -> 49355
'啊': 55674 -> 55674
'。': 31155 -> 31155
'上身': 51605 -> 51605
'随': 55119 -> 55119
'性': 54642 -> 54642
'自然': 31799 -> 31799
'不': 54535 -> 54535
'拘': 57036 -> 57036
'束': 55625 -> 55625
',': 31123 -> 31123
'面料': 46839 -> 46839
'亲': 55113 -> 55113
'肤': 56089 -> 56089
'舒适': 33894 -> 33894
'贴': 55778 -> 55778
'身体': 31902 -> 31902
'验': 55017 -> 55017
'感': 54706 -> 54706
'棒': 56382 -> 56382
'棒': 56382 -> 56382
'哒': 59230 -> 59230
'。': 31155 -> 31155
'系': 54712 -> 54712
'带': 54882 -> 54882
'部分': 31726 -> 31726
'增加': 31917 -> 31917
'设计': 31735 -> 31735
'看点': 45032 -> 45032
',': 31123 -> 31123
'还': 54656 -> 54656
'让': 54772 -> 54772
'单品': 46539 -> 46539
'的设计': 34481 -> 34481
'感': 54706 -> 54706
'更强': 43084 -> 43084
'。': 31155 -> 31155
'腿部': 46799 -> 46799
'线条': 37216 -> 37216
'若': 55351 -> 55351
'隐': 55733 -> 55733
'若': 55351 -> 55351
'现': 54600 -> 54600
'的': 54530 -> 54530
',': 31123 -> 31123
'性感': 40589 -> 40589
'撩': 58521 -> 58521
'人': 54533 -> 54533
'。': 31155 -> 31155
'颜色': 33692 -> 33692
'敲': 57004 -> 57004
'温柔': 34678 -> 34678
'的': 54530 -> 54530
',': 31123 -> 31123
'与': 54619 -> 54619
'裤子': 44722 -> 44722
'本身': 32754 -> 32754
'所': 54626 -> 54626
'呈现': 33169 -> 33169
'的风格': 48084 -> 48084
'有点': 33149 -> 33149
'反': 54955 -> 54955
'差': 55342 -> 55342
'萌': 56842 -> 56842
'。': 31155 -> 31155
'': 2 -> 2
Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /mnt/workspace/finetune_hf.py:517 in main │
│ │
│ 514 │ model.gradient_checkpointing_enable() │
│ 515 │ model.enable_input_require_grads() │
│ 516 │ │
│ ❱ 517 │ trainer = Seq2SeqTrainer( │
│ 518 │ │ model=model, │
│ 519 │ │ args=ft_config.training_args, │
│ 520 │ │ data_collator=DataCollatorForSeq2Seq( │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer_seq2seq.py:57 │
│ in __init__ │
│ │
│ 54 │ │ optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_schedu │
│ 55 │ │ preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor │
│ 56 │ ): │
│ ❱ 57 │ │ super().__init__( │
│ 58 │ │ │ model=model, │
│ 59 │ │ │ args=args, │
│ 60 │ │ │ data_collator=data_collator, │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:514 in │
__init__
│ │
│ 511 │ │ │ self.place_model_on_device │
│ 512 │ │ │ and not getattr(model, "quantization_method", None) == Qu │
│ 513 │ │ ): │
│ ❱ 514 │ │ │ self._move_model_to_device(model, args.device) │
│ 515 │ │ │
│ 516 │ │ # Force n_gpu to 1 to avoid DataParallel as MP will manage th │
│ 517 │ │ if self.is_model_parallel: │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:757 in │
│ _move_model_to_device │
│ │
│ 754 │ │ self.callback_handler.remove_callback(callback) │
│ 755 │ │
│ 756 │ def _move_model_to_device(self, model, device): │
│ ❱ 757 │ │ model = model.to(device) │
│ 758 │ │ # Moving a model to an XLA device disconnects the tied weight │
│ 759 │ │ if self.args.parallel_mode == ParallelMode.TPU and hasattr(mo │
│ 760 │ │ │ model.tie_weights() │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1160 in │
│ to │
│ │
│ 1157 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_fo │
│ 1158 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.i │
│ 1159 │ │ │
│ ❱ 1160 │ │ return self._apply(convert) │
│ 1161 │ │
│ 1162 │ def register_full_backward_pre_hook( │
│ 1163 │ │ self, │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:833 in │
│ _apply │
│ │
│ 830 │ │ │ # track autograd history of param_applied, so we have t │
│ 831 │ │ │ # with torch.no_grad():
│ 832 │ │ │ with torch.no_grad(): │
│ ❱ 833 │ │ │ │ param_applied = fn(param) │
│ 834 │ │ │ should_use_set_data = compute_should_use_set_data(param, │
│ 835 │ │ │ if should_use_set_data: │
│ 836 │ │ │ │ param.data = param_applied │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1158 in │
│ convert │
│ │
│ 1155 │ │ │ if convert_to_format is not None and t.dim() in (4, 5): │
│ 1156 │ │ │ │ return t.to(device, dtype if t.is_floating_point() or │
│ 1157 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_fo │
│ ❱ 1158 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.i │
│ 1159 │ │ │
│ 1160 │ │ return self._apply(convert) │
│ 1161 │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so
the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
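
For reference, a minimal check like the one below (a sketch, assuming torch is installed in the same conda environment that runs finetune_hf.py; the file name cuda_check.py is just a placeholder) shows whether PyTorch sees a GPU at all, and reproduces the same .to(device) call that fails in the traceback:

# cuda_check.py -- run inside the same environment before launching finetune_hf.py
import torch

print("torch version:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count:", torch.cuda.device_count())
    print("device 0:", torch.cuda.get_device_name(0))
    # Same operation that fails in Trainer._move_model_to_device -> model.to(device)
    x = torch.zeros(1).to("cuda:0")
    print("tensor moved to:", x.device)
else:
    print("PyTorch cannot see a CUDA device in this environment")

If the final .to("cuda:0") call fails with the same "busy or unavailable" error, the problem is in the container or driver setup rather than in the finetune script itself.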


https://github.com/THUDM/ChatGLM3/blob/main/finetune_demo/lora_finetune.ipynb
This is the tutorial I followed.


@wenmengzhou
Collaborator

Could not find cuda drivers on your machine, GPU will not be used. CUDA was not found in your environment.
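
Note that the "Could not find cuda drivers on your machine" lines in the log are printed by TensorFlow during import, so they do not by themselves prove what PyTorch sees. A quick driver-level check (a sketch, assuming nvidia-smi is on the PATH inside the container) would be:

import subprocess

# Ask the driver which GPUs it sees; if this errors out or prints nothing,
# the container really does not expose a usable CUDA device.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,memory.used,memory.total", "--format=csv"],
    capture_output=True, text=True,
)
print(result.stdout or result.stderr)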

@alexhmyang
Author

Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

@alexhmyang
Author

This is not a problem with the CUDA environment. I used your GPU CUDA image directly, so how could CUDA not be found? Either that, or your 3090s keep going down.
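
For what it's worth, "CUDA-capable device(s) is/are busy or unavailable" often means the GPU is visible but is held by another process or is set to exclusive compute mode, rather than that the driver is missing. A sketch (again assuming nvidia-smi is available in the image) to check both:

import subprocess

# Show each GPU's compute mode (Default vs. Exclusive_Process).
print(subprocess.run(["nvidia-smi", "-q", "-d", "COMPUTE"],
                     capture_output=True, text=True).stdout)

# List processes currently holding GPU memory.
print(subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True, text=True,
).stdout)

If another job shows up here, the 3090 in the shared image is simply occupied at that moment.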


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 27, 2024