Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

在云上运行FCN网络的时候使用GPU进行训练会报这个错:FileNotFoundError: [Errno 2] No such file or directory srun: error: gpu03: task 0: Exited with exit code 1 #801

Open
nanwang-crea opened this issue Apr 25, 2024 · 0 comments

Comments

@nanwang-crea
Copy link

这是完整的报错,网上搜了,很多讲的是进程之间通信的问题,这个问题要怎么解决呀?应该在代码中修改哪些位置?
Epoch: [0] [ 0/366] eta: 0:31:04 lr: 0.000000 loss: 2.1887 (2.1887) time: 5.0952 data: 0.7384
Epoch: [0] [ 10/366] eta: 0:15:24 lr: 0.000003 loss: 0.5890 (2.3867) time: 2.5974 data: 0.0681
Epoch: [0] [ 20/366] eta: 0:14:01 lr: 0.000006 loss: 0.2813 (1.7838) time: 2.2994 data: 0.0011
Epoch: [0] [ 30/366] eta: 0:13:21 lr: 0.000009 loss: 2.2992 (1.4588) time: 2.2671 data: 0.0010
Epoch: [0] [ 40/366] eta: 0:12:44 lr: 0.000011 loss: 1.2415 (1.4418) time: 2.2521 data: 0.0010
Epoch: [0] [ 50/366] eta: 0:12:14 lr: 0.000014 loss: 1.4934 (1.4652) time: 2.2295 data: 0.0010
Epoch: [0] [ 60/366] eta: 0:11:49 lr: 0.000017 loss: 0.5944 (1.4093) time: 2.2702 data: 0.0010
Epoch: [0] [ 70/366] eta: 0:11:23 lr: 0.000019 loss: 0.6704 (1.4132) time: 2.2722 data: 0.0010
Epoch: [0] [ 80/366] eta: 0:11:04 lr: 0.000022 loss: 0.3548 (1.3494) time: 2.3282 data: 0.0010
Epoch: [0] [ 90/366] eta: 0:10:39 lr: 0.000025 loss: 0.3015 (1.2649) time: 2.3509 data: 0.0011
Epoch: [0] [100/366] eta: 0:10:14 lr: 0.000028 loss: 0.6640 (1.2471) time: 2.2596 data: 0.0011
Epoch: [0] [110/366] eta: 0:09:51 lr: 0.000030 loss: 2.1179 (1.2050) time: 2.2716 data: 0.0010
Epoch: [0] [120/366] eta: 0:09:27 lr: 0.000033 loss: 2.0124 (1.2004) time: 2.3035 data: 0.0010
Epoch: [0] [130/366] eta: 0:09:04 lr: 0.000036 loss: 1.1753 (1.1981) time: 2.2837 data: 0.0010
Epoch: [0] [140/366] eta: 0:08:39 lr: 0.000039 loss: 2.3567 (1.2141) time: 2.2321 data: 0.0010
Epoch: [0] [150/366] eta: 0:08:18 lr: 0.000041 loss: 0.5729 (1.1973) time: 2.3115 data: 0.0010
Epoch: [0] [160/366] eta: 0:07:54 lr: 0.000044 loss: 0.4893 (1.2001) time: 2.3283 data: 0.0011
Epoch: [0] [170/366] eta: 0:07:30 lr: 0.000047 loss: 0.7241 (1.1839) time: 2.2304 data: 0.0011
Epoch: [0] [180/366] eta: 0:07:06 lr: 0.000050 loss: 1.3635 (1.1723) time: 2.2145 data: 0.0010
Traceback (most recent call last):
File "/public/home/2023020919/FCN/train.py", line 206, in
main(args)
File "/public/home/2023020919/FCN/train.py", line 141, in main
mean_loss, lr = train_one_epoch(model, optimizer, train_loader, device, epoch,
File "/public/home/2023020919/FCN/train_utils/train_and_evals.py", line 42, in train_one_epoch
for image, target in metric_logger.log_every(data_loader, print_freq, header):
File "/public/home/2023020919/FCN/train_utils/distrributed_utils.py", line 189, in log_every
for obj in iterable:
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 631, in next
data = self._next_data()
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
idx, data = self._get_data()
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
success, data = self._try_get_data()
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 495, in rebuild_storage_fd
fd = df.detach()
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/connection.py", line 502, in Client
c = SocketClient(address)
File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
srun: error: gpu03: task 0: Exited with exit code 1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant