Training fails with OSError: (External) CUDA error(719), unspecified launch failure. #3137

q465414859 · 2024-05-08T11:30:43Z

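Welcome to PaddleClas, and thank you for reporting issues and contributing to the project!
When opening an issue, please provide the following information so that we can locate and resolve your problem quickly: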

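PaddleClas and PaddlePaddle versions: please give the version number or branch you are using, e.g. PaddleClas release/2.2 and PaddlePaddle 2.1.0
Versions of other related products: if you use other products alongside PaddleClas, such as PaddleServing or PaddleInference, please give their version numbers
Training environment information: 2.4
a. Operating system: Windows 10
b. Python version: Python 3.8
c. CUDA/cuDNN version: CUDA 11.7
Complete code (the parts changed relative to the repo), detailed error messages, and related logs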

Error: C:\home\workspace\Paddle\paddle\phi\kernels\gpu\cross_entropy_kernel.cu:1010 Assertion false failed. The value of label expected >= 0 and < 7, or == -100, but got 29. Please check label value.
Error: C:\home\workspace\Paddle\paddle\phi\kernels\gpu\cross_entropy_kernel.cu:1010 Assertion false failed. The value of label expected >= 0 and < 7, or == -100, but got 29. Please check label value.
Error: C:\home\workspace\Paddle\paddle\phi\kernels\gpu\cross_entropy_kernel.cu:1010 Assertion false failed. The value of label expected >= 0 and < 7, or == -100, but got 29. Please check label value.
Traceback (most recent call last):
File "tools/train.py", line 32, in <module>
engine.train()
File "F:\code\PaddleClas\ppcls\engine\engine.py", line 339, in train
self.train_epoch_func(self, epoch_id, print_batch_step)
File "F:\code\PaddleClas\ppcls\engine\train\train.py", line 54, in train_epoch
loss_dict = engine.train_loss_func(out, batch[1])
File "F:\code\PaddleClas\ppcls\loss\__init__.py", line 58, in __call__
loss = self.loss_func[0](input, batch)
File "D:\anaconda\envs\PaddleClas\lib\site-packages\paddle\nn\layer\layers.py", line 1254, in __call__
return self.forward(*inputs, **kwargs)
File "F:\code\PaddleClas\ppcls\loss\celoss.py", line 57, in forward
loss = F.cross_entropy(x, label=label, soft_label=soft_label)
File "D:\anaconda\envs\PaddleClas\lib\site-packages\paddle\nn\functional\loss.py", line 2790, in cross_entropy
if paddle.count_nonzero(is_ignore) > 0: # ignore label
File "D:\anaconda\envs\PaddleClas\lib\site-packages\paddle\fluid\dygraph\tensor_patch_methods.py", line 673, in __bool__
return self.__nonzero__()
File "D:\anaconda\envs\PaddleClas\lib\site-packages\paddle\fluid\dygraph\tensor_patch_methods.py", line 670, in __nonzero__
return bool(np.array(self) > 0)
File "D:\anaconda\envs\PaddleClas\lib\site-packages\paddle\fluid\dygraph\tensor_patch_methods.py", line 696, in __array__
array = self.numpy(False)
OSError: (External) CUDA error(719), unspecified launch failure.
[Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointerand accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases canbe found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work willreturn the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:267)
········
The error output is shown above.
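
One possible reading of the traceback above, offered as a sketch rather than a diagnosis: CUDA kernels are launched asynchronously, so the Python stack may show the point where the host next synchronized with the device (here the `numpy()` call) rather than the kernel that actually failed. The CUDA runtime's `CUDA_LAUNCH_BLOCKING` environment variable makes launches synchronous, which usually moves the reported error closer to the failing call. Placing the assignment at the top of a launcher script is an assumption about how training is started.

```python
import os

# Assumption: this runs before the framework initializes CUDA,
# i.e. before `import paddle` anywhere in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Import the framework and start training only after this point, e.g.:
# import paddle
```

The variable can also be set in the shell before launching `tools/train.py`; either way it slows training down and is meant for debugging only.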

My CUDA environment is fine; I can train OCR models on it without problems. Below are the config file and the class-label file.
······
class_gt.txt
PPLCNet_x1_0_search.txt


q465414859 · 2024-05-08T11:31:50Z

The data generated in class_gt.txt had some problems, but I have already fixed them.

q465414859 · 2024-05-08T11:33:14Z

(PaddleClas) F:\code\PaddleClas>pip list
Package Version

anyio 4.3.0
astor 0.8.1
Babel 2.14.0
bce-python-sdk 0.9.7
blinker 1.8.1
cachetools 5.3.3
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
contourpy 1.1.1
cssselect 1.2.0
cssutils 2.10.2
cycler 0.12.1
Cython 3.0.10
decorator 5.1.1
easydict 1.13
et-xmlfile 1.1.0
exceptiongroup 1.2.1
faiss-cpu 1.8.0
Flask 3.0.3
flask-babel 4.0.0
fonttools 4.51.0
future 1.0.0
gast 0.3.3
h11 0.14.0
httpcore 1.0.5
httpx 0.27.0
idna 3.7
imageio 2.34.1
imgaug 0.4.0
importlib_metadata 7.1.0
importlib_resources 6.4.0
itsdangerous 2.2.0
Jinja2 3.1.3
joblib 1.4.2
kiwisolver 1.4.5
lazy_loader 0.4
Levenshtein 0.25.1
lmdb 1.4.1
lxml 5.2.1
MarkupSafe 2.1.5
matplotlib 3.7.5
networkx 3.1
numpy 1.24.4
opencv-contrib-python 4.4.0.46
opencv-python 4.6.0.66
openpyxl 3.1.2
opt-einsum 3.3.0
packaging 24.0
paddleclas 2.5.2
paddlepaddle-gpu 2.5.2
pandas 2.0.3
pillow 10.3.0
pip 24.0
premailer 3.10.0
prettytable 3.10.0
protobuf 3.20.2
psutil 5.9.8
pyclipper 1.3.0.post5
pycryptodome 3.20.0
pyparsing 3.1.2
python-dateutil 2.9.0.post0
python-Levenshtein 0.25.1
pytz 2024.1
PyWavelets 1.4.1
PyYAML 6.0.1
rapidfuzz 3.9.0
rarfile 4.2
requests 2.31.0
scikit-image 0.21.0
scikit-learn 1.3.2
scipy 1.10.1
setuptools 69.5.1
shapely 2.0.4
six 1.16.0
sniffio 1.3.1
threadpoolctl 3.5.0
tifffile 2023.7.10
tqdm 4.66.4
typing_extensions 4.11.0
tzdata 2024.1
ujson 5.9.0
urllib3 2.2.1
visualdl 2.5.3
wcwidth 0.2.13
Werkzeug 3.0.2
wheel 0.43.0
zipp 3.18.1

···························
This is the output of pip list.

q465414859 · 2024-05-12T08:44:28Z

@Sunting78 could you help with this?

cuicheng01 · 2024-05-13T06:35:55Z

What error do you get after fixing class_gt.txt?

q465414859 · 2024-05-13T09:42:56Z

> What error do you get after fixing class_gt.txt?

The same as before. This problem has nothing to do with the class_gt error.
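
Since the kernel assertion in the log reports a label of 29 where values in [0, 7) are expected, one thing worth re-checking is the label column of the list file that the config's dataset section points to. The sketch below assumes each line has the form `image_path label` separated by a single space, and that the class count is 7 as in the assertion; the function name and the sample lines are hypothetical.

```python
def find_bad_labels(lines, class_num, delimiter=" "):
    """Return (line_number, line, reason) for every entry with an invalid label."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last delimiter so paths containing spaces still work.
        parts = line.rsplit(delimiter, 1)
        if len(parts) != 2:
            problems.append((lineno, line, "missing label field"))
            continue
        try:
            label = int(parts[1])
        except ValueError:
            problems.append((lineno, line, "label is not an integer"))
            continue
        if not 0 <= label < class_num:
            problems.append(
                (lineno, line, f"label {label} outside [0, {class_num})")
            )
    return problems


# Hypothetical sample lines; class_num=7 is taken from the assertion message.
sample = ["images/a.jpg 0", "images/b.jpg 6", "images/c.jpg 29", "images/d.jpg"]
bad = find_bad_labels(sample, class_num=7)
for lineno, line, reason in bad:
    print(f"line {lineno}: {reason}: {line}")
```

For a real file, the lines could be read with `open(path, encoding="utf-8")` and passed in directly. If the check comes back clean but the assertion still fires, the config may be pointing at a different list file, or its class count may not match the labels.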

cuicheng01 · 2024-05-15T08:17:22Z

How much shared memory does your machine currently have?

q465414859 · 2024-05-15T08:20:19Z

> How much shared memory does your machine currently have?

paddle-bot bot assigned Sunting78 May 8, 2024
