https://modelscope.cn/models/ZhipuAI/chatglm3-6b/summary
I pulled both the official chatglm3-6b model and the code twice, but running finetune still fails, with a warning that the kernel needs to be updated:
!CUDA_VISIBLE_DEVICES=0 NCCL_P2P_DISABLE="1" NCCL_IB_DISABLE="1" python finetune_hf.py data/AdvertiseGen_fix /mnt/workspace/chatglm3-6b configs/lora.yaml
Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
The logs are as follows:
2024-04-25 14:56:59.070864: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-04-25 14:56:59.073797: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-25 14:56:59.105362: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-25 14:56:59.105394: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-25 14:56:59.105413: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-25 14:56:59.111071: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-25 14:56:59.111270: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-25 14:57:00.457944: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|██████████████████| 7/7 [00:35<00:00, 5.10s/it]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model
--> model has 1.949696M params
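(The trainable% in the summary line above is simply trainable params divided by all params, times 100; a quick check:)

```python
# Reproduce the LoRA parameter summary printed above.
trainable, total = 1_949_696, 6_245_533_696
pct = trainable / total * 100
print(f"trainable%: {pct}")
assert abs(pct - 0.031217444255383614) < 1e-12
```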
Map (num_proc=16): 100%|██████| 114599/114599 [00:04<00:00, 24420.30 examples/s]
train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 114599
})
Map (num_proc=16): 100%|███████████| 1070/1070 [00:00<00:00, 1333.83 examples/s]
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
Map (num_proc=16): 100%|███████████| 1070/1070 [00:00<00:00, 1387.08 examples/s]
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
--> Sanity check
'[gMASK]': 64790 -> -100
'sop': 64792 -> -100
'<|user|>': 64795 -> -100
'': 30910 -> -100
'\n': 13 -> -100
'': 30910 -> -100
'类型': 33467 -> -100
'#': 31010 -> -100
'裤': 56532 -> -100
'': 30998 -> -100
'版': 55090 -> -100
'型': 54888 -> -100
'#': 31010 -> -100
'宽松': 40833 -> -100
'': 30998 -> -100
'风格': 32799 -> -100
'#': 31010 -> -100
'性感': 40589 -> -100
'': 30998 -> -100
'图案': 37505 -> -100
'#': 31010 -> -100
'线条': 37216 -> -100
'': 30998 -> -100
'裤': 56532 -> -100
'型': 54888 -> -100
'#': 31010 -> -100
'阔': 56529 -> -100
'腿': 56158 -> -100
'裤': 56532 -> -100
'<|assistant|>': 64796 -> -100
'': 30910 -> 30910
'\n': 13 -> 13
'': 30910 -> 30910
'宽松': 40833 -> 40833
'的': 54530 -> 54530
'阔': 56529 -> 56529
'腿': 56158 -> 56158
'裤': 56532 -> 56532
'这': 54551 -> 54551
'两年': 33808 -> 33808
'真的': 32041 -> 32041
'吸': 55360 -> 55360
'粉': 55486 -> 55486
'不少': 32138 -> 32138
',': 31123 -> 31123
'明星': 32943 -> 32943
'时尚': 33481 -> 33481
'达': 54880 -> 54880
'人的': 31664 -> 31664
'心头': 46565 -> 46565
'爱': 54799 -> 54799
'。': 31155 -> 31155
'毕竟': 33051 -> 33051
'好': 54591 -> 54591
'穿': 55432 -> 55432
'时尚': 33481 -> 33481
',': 31123 -> 31123
'谁': 55622 -> 55622
'都能': 32904 -> 32904
'穿': 55432 -> 55432
'出': 54557 -> 54557
'腿': 56158 -> 56158
'长': 54625 -> 54625
'2': 30943 -> 30943
'米': 55055 -> 55055
'的效果': 35590 -> 35590
'宽松': 40833 -> 40833
'的': 54530 -> 54530
'裤': 56532 -> 56532
'腿': 56158 -> 56158
',': 31123 -> 31123
'当然是': 48466 -> 48466
'遮': 57148 -> 57148
'肉': 55343 -> 55343
'小': 54603 -> 54603
'能手': 49355 -> 49355
'啊': 55674 -> 55674
'。': 31155 -> 31155
'上身': 51605 -> 51605
'随': 55119 -> 55119
'性': 54642 -> 54642
'自然': 31799 -> 31799
'不': 54535 -> 54535
'拘': 57036 -> 57036
'束': 55625 -> 55625
',': 31123 -> 31123
'面料': 46839 -> 46839
'亲': 55113 -> 55113
'肤': 56089 -> 56089
'舒适': 33894 -> 33894
'贴': 55778 -> 55778
'身体': 31902 -> 31902
'验': 55017 -> 55017
'感': 54706 -> 54706
'棒': 56382 -> 56382
'棒': 56382 -> 56382
'哒': 59230 -> 59230
'。': 31155 -> 31155
'系': 54712 -> 54712
'带': 54882 -> 54882
'部分': 31726 -> 31726
'增加': 31917 -> 31917
'设计': 31735 -> 31735
'看点': 45032 -> 45032
',': 31123 -> 31123
'还': 54656 -> 54656
'让': 54772 -> 54772
'单品': 46539 -> 46539
'的设计': 34481 -> 34481
'感': 54706 -> 54706
'更强': 43084 -> 43084
'。': 31155 -> 31155
'腿部': 46799 -> 46799
'线条': 37216 -> 37216
'若': 55351 -> 55351
'隐': 55733 -> 55733
'若': 55351 -> 55351
'现': 54600 -> 54600
'的': 54530 -> 54530
',': 31123 -> 31123
'性感': 40589 -> 40589
'撩': 58521 -> 58521
'人': 54533 -> 54533
'。': 31155 -> 31155
'颜色': 33692 -> 33692
'敲': 57004 -> 57004
'温柔': 34678 -> 34678
'的': 54530 -> 54530
',': 31123 -> 31123
'与': 54619 -> 54619
'裤子': 44722 -> 44722
'本身': 32754 -> 32754
'所': 54626 -> 54626
'呈现': 33169 -> 33169
'的风格': 48084 -> 48084
'有点': 33149 -> 33149
'反': 54955 -> 54955
'差': 55342 -> 55342
'萌': 56842 -> 56842
'。': 31155 -> 31155
'': 2 -> 2
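(The sanity check above shows standard causal-LM loss masking: every prompt token is mapped to label -100, which cross-entropy ignores, while response tokens keep their own ids so only they contribute to the loss. A minimal sketch of the mechanism, with toy shapes not taken from the run:)

```python
import torch
import torch.nn.functional as F

# Toy logits for a 5-token sequence over a 10-token vocab.
logits = torch.randn(5, 10)
# First 3 positions are "prompt" (masked with -100), last 2 are "response".
labels = torch.tensor([-100, -100, -100, 4, 7])

# F.cross_entropy skips positions whose label equals ignore_index (default -100),
# so the loss is averaged over the response tokens only.
loss = F.cross_entropy(logits, labels, ignore_index=-100)
manual = F.cross_entropy(logits[3:], labels[3:])
assert torch.isclose(loss, manual)
```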
Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /mnt/workspace/finetune_hf.py:517 in main │
│ │
│ 514 │ model.gradient_checkpointing_enable() │
│ 515 │ model.enable_input_require_grads() │
│ 516 │ │
│ ❱ 517 │ trainer = Seq2SeqTrainer( │
│ 518 │ │ model=model, │
│ 519 │ │ args=ft_config.training_args, │
│ 520 │ │ data_collator=DataCollatorForSeq2Seq( │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer_seq2seq.py:57 │
│ in init │
│ │
│ 54 │ │ optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_schedu │
│ 55 │ │ preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor │
│ 56 │ ): │
│ ❱ 57 │ │ super().init( │
│ 58 │ │ │ model=model, │
│ 59 │ │ │ args=args, │
│ 60 │ │ │ data_collator=data_collator, │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:514 in │
│ init │
│ │
│ 511 │ │ │ self.place_model_on_device │
│ 512 │ │ │ and not getattr(model, "quantization_method", None) == Qu │
│ 513 │ │ ): │
│ ❱ 514 │ │ │ self._move_model_to_device(model, args.device) │
│ 515 │ │ │
│ 516 │ │ # Force n_gpu to 1 to avoid DataParallel as MP will manage th │
│ 517 │ │ if self.is_model_parallel: │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:757 in │
│ _move_model_to_device │
│ │
│ 754 │ │ self.callback_handler.remove_callback(callback) │
│ 755 │ │
│ 756 │ def _move_model_to_device(self, model, device): │
│ ❱ 757 │ │ model = model.to(device) │
│ 758 │ │ # Moving a model to an XLA device disconnects the tied weight │
│ 759 │ │ if self.args.parallel_mode == ParallelMode.TPU and hasattr(mo │
│ 760 │ │ │ model.tie_weights() │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1160 in │
│ to │
│ │
│ 1157 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_fo │
│ 1158 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.i │
│ 1159 │ │ │
│ ❱ 1160 │ │ return self._apply(convert) │
│ 1161 │ │
│ 1162 │ def register_full_backward_pre_hook( │
│ 1163 │ │ self, │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:810 in │
│ _apply │
│ │
│ 807 │ def _apply(self, fn, recurse=True): │
│ 808 │ │ if recurse: │
│ 809 │ │ │ for module in self.children(): │
│ ❱ 810 │ │ │ │ module._apply(fn) │
│ 811 │ │ │
│ 812 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 813 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:833 in │
│ _apply │
│ │
│ 830 │ │ │ # track autograd history of `param_applied`, so we have t │
│ 831 │ │ │ # `with torch.no_grad():`                                  │
│ 832 │ │ │ with torch.no_grad():                                      │
│ ❱ 833 │ │ │ │ param_applied = fn(param) │
│ 834 │ │ │ should_use_set_data = compute_should_use_set_data(param, │
│ 835 │ │ │ if should_use_set_data: │
│ 836 │ │ │ │ param.data = param_applied │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1158 in │
│ convert │
│ │
│ 1155 │ │ │ if convert_to_format is not None and t.dim() in (4, 5): │
│ 1156 │ │ │ │ return t.to(device, dtype if t.is_floating_point() or │
│ 1157 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_fo │
│ ❱ 1158 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.i │
│ 1159 │ │ │
│ 1160 │ │ return self._apply(convert) │
│ 1161 │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so
the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I followed this tutorial: https://github.com/THUDM/ChatGLM3/blob/main/finetune_demo/lora_finetune.ipynb
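(Not from the original thread: "CUDA-capable device(s) is/are busy or unavailable" usually means the driver sees the GPU but the device is held by another process or is in an exclusive compute mode. A quick probe, assuming PyTorch is installed, that narrows down whether the device is visible and accepts an allocation:)

```python
import torch

# Is the CUDA driver/runtime usable at all?
print("cuda available:", torch.cuda.is_available())
# How many devices are visible (0 -> check CUDA_VISIBLE_DEVICES)?
print("visible devices:", torch.cuda.device_count())

if torch.cuda.is_available():
    # A tiny allocation fails immediately with the same "busy or unavailable"
    # error if another process holds the device.
    torch.ones(1, device="cuda:0")
    print("device 0 accepted an allocation")
```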