
Training doesn't start - I'm getting an error with the data loader #168

Open
rsamvelyan opened this issue Dec 8, 2021 · 4 comments

@rsamvelyan

Hello

I am trying to run it on my Windows machine. My dataset seems to be correct.
When I start the training I get an error.

Here is the call with the arguments:
python train_ssd.py --dataset_type open_images --datasets C:/Users/rsamv/Documents/data/open_images_datasets/apples --net mb1-ssd --pretrained_ssd C:/Users/rsamv/Documents/pytorch-ssd/models/mb1-ssd/mobilenet-v1-ssd-mp-0_675.pth --scheduler cosine --lr 0.01 --t_max 100 --validation_epochs 5 --num_epochs 100 --base_net_lr 0.01 --batch_size 5

Here is the error I get:
(base) PS C:\Users\rsamv> cd C:\Users\rsamv\Documents\pytorch-ssd
(base) PS C:\Users\rsamv\Documents\pytorch-ssd> python train_ssd.py --dataset_type open_images --datasets C:/Users/rsamv/Documents/data/open_images_datasets/apples --net mb1-ssd --pretrained_ssd C:/Users/rsamv/Documents/pytorch-ssd/models/mb1-ssd/mobilenet-v1-ssd-mp-0_675.pth --scheduler cosine --lr 0.01 --t_max 100 --validation_epochs 5 --num_epochs 100 --base_net_lr 0.01 --batch_size 5
2021-12-07 23:11:37,702 - root - INFO - Use Cuda.
2021-12-07 23:11:37,703 - root - INFO - Namespace(dataset_type='open_images', datasets=['C:/Users/rsamv/Documents/data/open_images_datasets/apples'], validation_dataset=None, balance_data=False, net='mb1-ssd', freeze_base_net=False, freeze_net=False, mb2_width_mult=1.0, lr=0.01, momentum=0.9, weight_decay=0.0005, gamma=0.1, base_net_lr=0.01, extra_layers_lr=None, base_net=None, pretrained_ssd='C:/Users/rsamv/Documents/pytorch-ssd/models/mb1-ssd/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', milestones='80,100', t_max=100.0, batch_size=5, num_epochs=100, num_workers=4, validation_epochs=5, debug_steps=100, use_cuda=True, checkpoint_folder='models/')
2021-12-07 23:11:37,703 - root - INFO - Prepare training datasets.
2021-12-07 23:11:38,263 - root - INFO - Dataset Summary:Number of Images: 1344
Minimum Number of Images for a Class: -1
Label Distribution:
apple: 5376
2021-12-07 23:11:38,277 - root - INFO - Stored labels into file models/open-images-model-labels.txt.
2021-12-07 23:11:38,278 - root - INFO - Train dataset size: 1344
2021-12-07 23:11:38,279 - root - INFO - Prepare Validation datasets.
2021-12-07 23:11:38,472 - root - INFO - Dataset Summary:Number of Images: 480
Minimum Number of Images for a Class: -1
Label Distribution:
apple: 1920
2021-12-07 23:11:38,476 - root - INFO - validation dataset size: 480
2021-12-07 23:11:38,477 - root - INFO - Build network.
2021-12-07 23:11:38,537 - root - INFO - Init from pretrained ssd C:/Users/rsamv/Documents/pytorch-ssd/models/mb1-ssd/mobilenet-v1-ssd-mp-0_675.pth
2021-12-07 23:11:38,583 - root - INFO - Took 0.05 seconds to load the model.
2021-12-07 23:11:38,996 - root - INFO - Learning rate: 0.01, Base net learning rate: 0.01, Extra Layers learning rate: 0.01.
2021-12-07 23:11:38,997 - root - INFO - Uses CosineAnnealingLR scheduler.
2021-12-07 23:11:38,997 - root - INFO - Start training from epoch 0.
Traceback (most recent call last):
  File "C:\Users\rsamv\Documents\pytorch-ssd\train_ssd.py", line 325, in <module>
    train(train_loader, net, criterion, optimizer,
  File "C:\Users\rsamv\Documents\pytorch-ssd\train_ssd.py", line 116, in train
    for i, data in enumerate(loader):
  File "C:\Users\rsamv\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "C:\Users\rsamv\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\rsamv\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 918, in __init__
    w.start()
  File "C:\Users\rsamv\anaconda3\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\rsamv\anaconda3\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\rsamv\anaconda3\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\rsamv\anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\rsamv\anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'TrainAugmentation.__init__.<locals>.<lambda>'
(base) PS C:\Users\rsamv\Documents\pytorch-ssd> 2021-12-07 23:11:40,772 - root - INFO - Use Cuda.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\rsamv\anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\rsamv\anaconda3\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

It seems that the loader variable has a problem. I wonder if it's caused by some incompatibility with Windows, for instance at the Path level?

Any ideas?

Thanks a lot!
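For reference, the crash above can be reproduced without PyTorch at all: Windows uses the spawn start method, so each DataLoader worker process has to pickle the dataset, and an instance attribute holding a lambda (as the traceback suggests TrainAugmentation.__init__ does) is not picklable. A minimal sketch with hypothetical class names, not the repo's actual code:

```python
import pickle

# Hypothetical stand-in for TrainAugmentation: an attribute holding a
# lambda defined inside __init__. Such a "local object" cannot be pickled,
# so spawn-based DataLoader workers (the default on Windows) fail exactly
# as in the traceback above.
class AugmentWithLambda:
    def __init__(self, mean):
        self.transform = lambda x: x - mean  # local lambda: not picklable


# A picklable alternative: keep the state on the instance and do the work
# in __call__ (or in a module-level function) instead of a lambda.
class AugmentPicklable:
    def __init__(self, mean):
        self.mean = mean

    def __call__(self, x):
        return x - self.mean


try:
    pickle.dumps(AugmentWithLambda(0.5))
except AttributeError as err:
    print(err)  # Can't pickle local object 'AugmentWithLambda.__init__.<locals>.<lambda>'

restored = pickle.loads(pickle.dumps(AugmentPicklable(0.5)))
print(restored(2.0))  # 1.5
```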

@gururaj-bhat

I am also facing similar issues, but on Ubuntu.
It is stuck at the point below while training:
https://github.com/qfgaohao/pytorch-ssd/blob/master/train_ssd.py#L116

I guess this is because of the PyTorch version; I am using the latest 1.10, and probably we should strictly use 1.0.0 only.

@jyan-R

jyan-R commented Mar 14, 2022

> I am trying to run it on my Windows machine. My dataset seems to be correct. When I start the training I get an error. […] AttributeError: Can't pickle local object 'TrainAugmentation.__init__.<locals>.<lambda>' […] Any ideas?

The same issue here. Have you got any ideas to solve it? Thanks.

@jyan-R

jyan-R commented Mar 14, 2022

> I am also facing similar issues, but on Ubuntu. It is stuck at the point below while training: https://github.com/qfgaohao/pytorch-ssd/blob/master/train_ssd.py#L116
>
> I guess this is because of the PyTorch version; I am using the latest 1.10, and probably we should strictly use 1.0.0 only.

Using 1.0.0 raises a new problem:
module 'torch.jit' has no attribute 'unused'

@Biswajit-Banerjee

> I am trying to run it on my Windows machine. My dataset seems to be correct. When I start the training I get an error. […] AttributeError: Can't pickle local object 'TrainAugmentation.__init__.<locals>.<lambda>' […] Any ideas?

I am also getting the same issue.

From what I could gather, the problem is the pickling of a lambda function during multiprocess data loading, so disabling multiprocessing in the data loader worked for me.
You can do so either by passing --num_workers 0 on the command line, or by making 0 the default value of num_workers in train_ssd.py.
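In code terms, the workaround only changes the default of the existing num_workers flag (the name is taken from the Namespace logged in the original post); a minimal argparse sketch, not the full parser from train_ssd.py:

```python
import argparse

# Sketch of the relevant part of train_ssd.py's argument parsing. Defaulting
# num_workers to 0 keeps data loading in the main process, so the transform
# objects are never pickled and the spawn-related crash cannot occur.
parser = argparse.ArgumentParser()
parser.add_argument('--num_workers', default=0, type=int,
                    help='DataLoader worker processes (0 = load in main process)')

print(parser.parse_args([]).num_workers)                       # 0 (new default)
print(parser.parse_args(['--num_workers', '4']).num_workers)   # 4 (opt back in)
```

Note that workers = 0 trades the crash for slower data loading, since augmentation now runs in the training process.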
