
Running the official chatglm3-6b finetune command fails with an error saying the kernel needs an update #843

Open
alexhmyang opened this issue Apr 25, 2024 · 4 comments

@alexhmyang

https://modelscope.cn/models/ZhipuAI/chatglm3-6b/summary
I pulled the official chatglm3-6b model and code twice, but running finetune still fails with an error saying the kernel needs an update.

!CUDA_VISIBLE_DEVICES=0 NCCL_P2P_DISABLE="1" NCCL_IB_DISABLE="1" python finetune_hf.py data/AdvertiseGen_fix /mnt/workspace/chatglm3-6b configs/lora.yaml

Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
The log is as follows:

2024-04-25 14:56:59.070864: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-04-25 14:56:59.073797: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-25 14:56:59.105362: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-25 14:56:59.105394: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-25 14:56:59.105413: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-25 14:56:59.111071: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-25 14:56:59.111270: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-25 14:57:00.457944: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|██████████████████| 7/7 [00:35<00:00, 5.10s/it]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model

--> model has 1.949696M params

Map (num_proc=16): 100%|██████| 114599/114599 [00:04<00:00, 24420.30 examples/s]
train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 114599
})
Map (num_proc=16): 100%|███████████| 1070/1070 [00:00<00:00, 1333.83 examples/s]
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
Map (num_proc=16): 100%|███████████| 1070/1070 [00:00<00:00, 1387.08 examples/s]
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
--> Sanity check
'[gMASK]': 64790 -> -100
'sop': 64792 -> -100
'<|user|>': 64795 -> -100
'': 30910 -> -100
'\n': 13 -> -100
'': 30910 -> -100
'类型': 33467 -> -100
'#': 31010 -> -100
'裤': 56532 -> -100
'*': 30998 -> -100
'版': 55090 -> -100
'型': 54888 -> -100
'#': 31010 -> -100
'宽松': 40833 -> -100
'*': 30998 -> -100
'风格': 32799 -> -100
'#': 31010 -> -100
'性感': 40589 -> -100
'*': 30998 -> -100
'图案': 37505 -> -100
'#': 31010 -> -100
'线条': 37216 -> -100
'*': 30998 -> -100
'裤': 56532 -> -100
'型': 54888 -> -100
'#': 31010 -> -100
'阔': 56529 -> -100
'腿': 56158 -> -100
'裤': 56532 -> -100
'<|assistant|>': 64796 -> -100
'': 30910 -> 30910
'\n': 13 -> 13
'': 30910 -> 30910
'宽松': 40833 -> 40833
'的': 54530 -> 54530
'阔': 56529 -> 56529
'腿': 56158 -> 56158
'裤': 56532 -> 56532
'这': 54551 -> 54551
'两年': 33808 -> 33808
'真的': 32041 -> 32041
'吸': 55360 -> 55360
'粉': 55486 -> 55486
'不少': 32138 -> 32138
',': 31123 -> 31123
'明星': 32943 -> 32943
'时尚': 33481 -> 33481
'达': 54880 -> 54880
'人的': 31664 -> 31664
'心头': 46565 -> 46565
'爱': 54799 -> 54799
'。': 31155 -> 31155
'毕竟': 33051 -> 33051
'好': 54591 -> 54591
'穿': 55432 -> 55432
'时尚': 33481 -> 33481
',': 31123 -> 31123
'谁': 55622 -> 55622
'都能': 32904 -> 32904
'穿': 55432 -> 55432
'出': 54557 -> 54557
'腿': 56158 -> 56158
'长': 54625 -> 54625
'2': 30943 -> 30943
'米': 55055 -> 55055
'的效果': 35590 -> 35590
'宽松': 40833 -> 40833
'的': 54530 -> 54530
'裤': 56532 -> 56532
'腿': 56158 -> 56158
',': 31123 -> 31123
'当然是': 48466 -> 48466
'遮': 57148 -> 57148
'肉': 55343 -> 55343
'小': 54603 -> 54603
'能手': 49355 -> 49355
'啊': 55674 -> 55674
'。': 31155 -> 31155
'上身': 51605 -> 51605
'随': 55119 -> 55119
'性': 54642 -> 54642
'自然': 31799 -> 31799
'不': 54535 -> 54535
'拘': 57036 -> 57036
'束': 55625 -> 55625
',': 31123 -> 31123
'面料': 46839 -> 46839
'亲': 55113 -> 55113
'肤': 56089 -> 56089
'舒适': 33894 -> 33894
'贴': 55778 -> 55778
'身体': 31902 -> 31902
'验': 55017 -> 55017
'感': 54706 -> 54706
'棒': 56382 -> 56382
'棒': 56382 -> 56382
'哒': 59230 -> 59230
'。': 31155 -> 31155
'系': 54712 -> 54712
'带': 54882 -> 54882
'部分': 31726 -> 31726
'增加': 31917 -> 31917
'设计': 31735 -> 31735
'看点': 45032 -> 45032
',': 31123 -> 31123
'还': 54656 -> 54656
'让': 54772 -> 54772
'单品': 46539 -> 46539
'的设计': 34481 -> 34481
'感': 54706 -> 54706
'更强': 43084 -> 43084
'。': 31155 -> 31155
'腿部': 46799 -> 46799
'线条': 37216 -> 37216
'若': 55351 -> 55351
'隐': 55733 -> 55733
'若': 55351 -> 55351
'现': 54600 -> 54600
'的': 54530 -> 54530
',': 31123 -> 31123
'性感': 40589 -> 40589
'撩': 58521 -> 58521
'人': 54533 -> 54533
'。': 31155 -> 31155
'颜色': 33692 -> 33692
'敲': 57004 -> 57004
'温柔': 34678 -> 34678
'的': 54530 -> 54530
',': 31123 -> 31123
'与': 54619 -> 54619
'裤子': 44722 -> 44722
'本身': 32754 -> 32754
'所': 54626 -> 54626
'呈现': 33169 -> 33169
'的风格': 48084 -> 48084
'有点': 33149 -> 33149
'反': 54955 -> 54955
'差': 55342 -> 55342
'萌': 56842 -> 56842
'。': 31155 -> 31155
'': 2 -> 2
Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /mnt/workspace/finetune_hf.py:517 in main │
│ │
│ 514 │ model.gradient_checkpointing_enable() │
│ 515 │ model.enable_input_require_grads() │
│ 516 │ │
│ ❱ 517 │ trainer = Seq2SeqTrainer( │
│ 518 │ │ model=model, │
│ 519 │ │ args=ft_config.training_args, │
│ 520 │ │ data_collator=DataCollatorForSeq2Seq( │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer_seq2seq.py:57 │
│ in __init__ │
│ │
│ 54 │ │ optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_schedu │
│ 55 │ │ preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor │
│ 56 │ ): │
│ ❱ 57 │ │ super().__init__( │
│ 58 │ │ │ model=model, │
│ 59 │ │ │ args=args, │
│ 60 │ │ │ data_collator=data_collator, │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:514 in │
__init__
│ │
│ 511 │ │ │ self.place_model_on_device │
│ 512 │ │ │ and not getattr(model, "quantization_method", None) == Qu │
│ 513 │ │ ): │
│ ❱ 514 │ │ │ self._move_model_to_device(model, args.device) │
│ 515 │ │ │
│ 516 │ │ # Force n_gpu to 1 to avoid DataParallel as MP will manage th │
│ 517 │ │ if self.is_model_parallel: │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:757 in │
│ _move_model_to_device │
│ │
│ 754 │ │ self.callback_handler.remove_callback(callback) │
│ 755 │ │
│ 756 │ def _move_model_to_device(self, model, device): │
│ ❱ 757 │ │ model = model.to(device) │
│ 758 │ │ # Moving a model to an XLA device disconnects the tied weight │
│ 759 │ │ if self.args.parallel_mode == ParallelMode.TPU and hasattr(mo │
│ 760 │ │ │ model.tie_weights() │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1160 in │
│ to │
│ │
│ 1157 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_fo │
│ 1158 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.i │
│ 1159 │ │ │
│ ❱ 1160 │ │ return self._apply(convert) │
│ 1161 │ │
│ 1162 │ def register_full_backward_pre_hook( │
│ 1163 │ │ self, │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:833 in │
│ _apply │
│ │
│ 830 │ │ │ # track autograd history of param_applied, so we have t │
│ 831 │ │ │ # with torch.no_grad():
│ 832 │ │ │ with torch.no_grad(): │
│ ❱ 833 │ │ │ │ param_applied = fn(param) │
│ 834 │ │ │ should_use_set_data = compute_should_use_set_data(param, │
│ 835 │ │ │ if should_use_set_data: │
│ 836 │ │ │ │ param.data = param_applied │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1158 in │
│ convert │
│ │
│ 1155 │ │ │ if convert_to_format is not None and t.dim() in (4, 5): │
│ 1156 │ │ │ │ return t.to(device, dtype if t.is_floating_point() or │
│ 1157 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_fo │
│ ❱ 1158 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.i │
│ 1159 │ │ │
│ 1160 │ │ return self._apply(convert) │
│ 1161 │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so
the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
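
For reference, a minimal check like the one below (a sketch, assuming torch is installed in the same conda environment that runs finetune_hf.py; the file name cuda_check.py is just a placeholder) shows whether PyTorch sees a GPU at all, and reproduces the same .to(device) call that fails in the traceback:

# cuda_check.py -- run inside the same environment before launching finetune_hf.py
import torch

print("torch version:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count:", torch.cuda.device_count())
    print("device 0:", torch.cuda.get_device_name(0))
    # Same operation that fails in Trainer._move_model_to_device -> model.to(device)
    x = torch.zeros(1).to("cuda:0")
    print("tensor moved to:", x.device)
else:
    print("PyTorch cannot see a CUDA device in this environment")

If the final .to("cuda:0") call fails with the same "busy or unavailable" error, the problem is in the container or driver setup rather than in the finetune script itself.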


https://github.com/THUDM/ChatGLM3/blob/main/finetune_demo/lora_finetune.ipynb
This is the tutorial I followed.


@wenmengzhou
Collaborator

Could not find cuda drivers on your machine, GPU will not be used. CUDA was not found in your environment.
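
Note that the "Could not find cuda drivers on your machine" lines in the log are printed by TensorFlow during import, so they do not by themselves prove what PyTorch sees. A quick driver-level check (a sketch, assuming nvidia-smi is on the PATH inside the container) would be:

import subprocess

# Ask the driver which GPUs it sees; if this errors out or prints nothing,
# the container really does not expose a usable CUDA device.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,memory.used,memory.total", "--format=csv"],
    capture_output=True, text=True,
)
print(result.stdout or result.stderr)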

@alexhmyang
Author

Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

@alexhmyang
Author

This is not a problem with the CUDA environment. I used your GPU CUDA image directly, so how could CUDA not be found? Either that, or your 3090s keep going down.
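
For what it's worth, "CUDA-capable device(s) is/are busy or unavailable" often means the GPU is visible but is held by another process or is set to exclusive compute mode, rather than that the driver is missing. A sketch (again assuming nvidia-smi is available in the image) to check both:

import subprocess

# Show each GPU's compute mode (Default vs. Exclusive_Process).
print(subprocess.run(["nvidia-smi", "-q", "-d", "COMPUTE"],
                     capture_output=True, text=True).stdout)

# List processes currently holding GPU memory.
print(subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True, text=True,
).stdout)

If another job shows up here, the 3090 in the shared image is simply occupied at that moment.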


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 27, 2024