Error in Resume training module of 4_efficientdet, getting after completing 5 epoch. #56

Open
waghts95 opened this issue Aug 21, 2020 · 30 comments
Labels
bug (Something isn't working), solution added (Solution added to the raised issue)

Comments

@waghts95

waghts95 commented Aug 21, 2020

I am using torch 1.6.0, efficientnet-pytorch 0.6.3, and tensorboardX 2.1.

This is my code

from train_detector import Detector
gtf = Detector()
#directs the model towards file structure
root_dir = "./"
coco_dir = "cellphone"
img_dir = "./"
set_dir = "Images"
#smells like some free compute from Colab, nice
gtf.Train_Dataset(root_dir, coco_dir, img_dir, set_dir, batch_size=8, image_size=32, use_gpu=True)
gtf.Model(model_name="efficientnet-b0",load_pretrained_model_from="/content/trained/signatrix_efficientdet_coco.pth")

gtf.Set_Hyperparams(lr=0.0001, val_interval=1, es_min_delta=0.0, es_patience=0)
gtf.Train(num_epochs=50, model_output_dir="trained/");

My error is

Epoch: 1/50. Iteration: 910/910. Cls loss: 0.12021. Reg loss: 0.26245. Batch loss: 0.38265 Total loss: 0.50293
100% 910/910 [24:24<00:00, 1.58s/it]

/content/Monk_Object_Detection/4_efficientdet/lib/src/model.py:251: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if len(inputs) == 2:
/content/Monk_Object_Detection/4_efficientdet/lib/src/utils.py:84: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
image_shape = np.array(image_shape)
/content/Monk_Object_Detection/4_efficientdet/lib/src/utils.py:96: TracerWarning: torch.from_numpy results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
anchors = torch.from_numpy(all_anchors.astype(np.float32))
/content/Monk_Object_Detection/4_efficientdet/lib/src/model.py:282: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if scores_over_thresh.sum() == 0:
Epoch: 2/50. Iteration: 910/910. Cls loss: 0.17044. Reg loss: 0.19580. Batch loss: 0.36624 Total loss: 0.48137
100% 910/910 [24:31<00:00, 1.57s/it]

Epoch: 3/50. Iteration: 910/910. Cls loss: 0.22575. Reg loss: 0.32424. Batch loss: 0.54999 Total loss: 0.46841
100% 910/910 [24:36<00:00, 1.60s/it]

Epoch: 4/50. Iteration: 910/910. Cls loss: 0.13469. Reg loss: 0.25157. Batch loss: 0.38626 Total loss: 0.45206
100% 910/910 [24:40<00:00, 1.57s/it]

Epoch: 5/50. Iteration: 910/910. Cls loss: 0.24624. Reg loss: 0.34335. Batch loss: 0.58959 Total loss: 0.44057
100% 910/910 [23:59<00:00, 1.54s/it]

Epoch: 6/50. Iteration: 910/910. Cls loss: 0.20909. Reg loss: 0.26789. Batch loss: 0.47698 Total loss: 0.42917
100% 910/910 [23:53<00:00, 1.52s/it]

/usr/local/lib/python3.6/dist-packages/torch/onnx/symbolic_helper.py:253: UserWarning: You are trying to export the model with onnx:Upsample for ONNX opset version 9. This operator might cause results to not match the expected results by PyTorch.
ONNX's Upsample/Resize operator did not match Pytorch's Interpolation until opset 11. Attributes to determine how to transform the input were added in onnx:Resize in opset 11 to support Pytorch's behavior (like coordinate_transformation_mode and nearest_mode).
We recommend using opset 11 and above for models using this operator.
"" + str(_export_onnx_opset_version) + ". "

RuntimeError Traceback (most recent call last)
in ()
1 gtf.Set_Hyperparams(lr=0.0001, val_interval=1, es_min_delta=0.0, es_patience=0)
----> 2 gtf.Train(num_epochs=50, model_output_dir="trained/");

9 frames
/usr/local/lib/python3.6/dist-packages/torch/onnx/symbolic_helper.py in _onnx_opset_unsupported(op_name, current_opset, supported_opset)
184 def _onnx_opset_unsupported(op_name, current_opset, supported_opset):
185 raise RuntimeError('Unsupported: ONNX export of {} in '
--> 186 'opset {}. Please try opset version {}.'.format(op_name, current_opset, supported_opset))
187
188

RuntimeError: Unsupported: ONNX export of index_put in opset 9. Please try opset version 11.

@abhi-kumar
Contributor

Thank you for pointing out the issue. We will try to resolve it as soon as possible. In the meantime, please try downgrading PyTorch to version 1.4 on your end.

@abhi-kumar added the bug (Something isn't working) label Aug 21, 2020
@waghts95
Author

waghts95 commented Aug 21, 2020 via email

@abhi-kumar
Contributor

Did the version downgrade help in your case?

@waghts95
Author

waghts95 commented Aug 24, 2020 via email

@abhi-kumar
Contributor

We are unable to reproduce that error with PyTorch v1.4. Please check and let us know.

@waghts95
Author

waghts95 commented Aug 24, 2020 via email

@abhi-kumar
Contributor

The error occurs because the ONNX export is not yet compatible with torch 1.6; downgrading to torch 1.4 and torchvision 0.5 will resolve it. The requirements files have been updated accordingly.

@abhi-kumar added the solution added (Solution added to the raised issue) label Aug 24, 2020
@waghts95
Author

waghts95 commented Aug 24, 2020 via email

@waghts95
Author

waghts95 commented Aug 25, 2020

When I use torch 1.4 and torchvision 0.5, I get the following:

loading annotations into memory...
Done (t=0.13s)
creating index...
index created!

RuntimeError Traceback (most recent call last)
in ()
8 #smells like some free compute from Colab, nice
9 gtf.Train_Dataset(root_dir, coco_dir, img_dir, set_dir, batch_size=8, image_size=32, use_gpu=True)
---> 10 gtf.Model(model_name="efficientnet-b0",load_pretrained_model_from="/content/trained/signatrix_efficientdet_coco.pth")

2 frames
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in __init__(self, name_or_buffer)
222 class _open_zipfile_reader(_opener):
223 def __init__(self, name_or_buffer):
--> 224 super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
225
226

RuntimeError: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f5933aff193 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x7f5936c879eb in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x7f5936c88c04 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x6c53a6 (0x7f597ebb83a6 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: + 0x2961c4 (0x7f597e7891c4 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: _PyCFunction_FastCallDict + 0x35c (0x566ddc in /usr/bin/python3)
frame #6: /usr/bin/python3() [0x594b71]
frame #7: /usr/bin/python3() [0x54a325]
frame #8: /usr/bin/python3() [0x5517c1]
frame #9: _PyObject_FastCallKeywords + 0x19c (0x5a9eec in /usr/bin/python3)
frame #10: /usr/bin/python3() [0x50a783]
frame #11: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #12: /usr/bin/python3() [0x507f24]
frame #13: _PyFunction_FastCallDict + 0x2e2 (0x509202 in /usr/bin/python3)
frame #14: /usr/bin/python3() [0x594b01]
frame #15: /usr/bin/python3() [0x54a17f]
frame #16: /usr/bin/python3() [0x5517c1]
frame #17: _PyObject_FastCallKeywords + 0x19c (0x5a9eec in /usr/bin/python3)
frame #18: /usr/bin/python3() [0x50a783]
frame #19: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #20: /usr/bin/python3() [0x507f24]
frame #21: /usr/bin/python3() [0x509c50]
frame #22: /usr/bin/python3() [0x50a64d]
frame #23: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #24: /usr/bin/python3() [0x507f24]
frame #25: /usr/bin/python3() [0x509c50]
frame #26: /usr/bin/python3() [0x50a64d]
frame #27: _PyEval_EvalFrameDefault + 0x1226 (0x50cfd6 in /usr/bin/python3)
frame #28: /usr/bin/python3() [0x507f24]
frame #29: /usr/bin/python3() [0x5165a5]
frame #30: /usr/bin/python3() [0x50a47f]
frame #31: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #32: /usr/bin/python3() [0x507f24]
frame #33: /usr/bin/python3() [0x509c50]
frame #34: /usr/bin/python3() [0x50a64d]
frame #35: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #36: /usr/bin/python3() [0x507f24]
frame #37: /usr/bin/python3() [0x509c50]
frame #38: /usr/bin/python3() [0x50a64d]
frame #39: _PyEval_EvalFrameDefault + 0x1226 (0x50cfd6 in /usr/bin/python3)
frame #40: /usr/bin/python3() [0x507f24]
frame #41: _PyFunction_FastCallDict + 0x2e2 (0x509202 in /usr/bin/python3)
frame #42: /usr/bin/python3() [0x594b01]
frame #43: PyObject_Call + 0x3e (0x59fe1e in /usr/bin/python3)
frame #44: _PyEval_EvalFrameDefault + 0x17e6 (0x50d596 in /usr/bin/python3)
frame #45: /usr/bin/python3() [0x507f24]
frame #46: /usr/bin/python3() [0x509c50]
frame #47: /usr/bin/python3() [0x50a64d]
frame #48: _PyEval_EvalFrameDefault + 0x1226 (0x50cfd6 in /usr/bin/python3)
frame #49: /usr/bin/python3() [0x507f24]
frame #50: /usr/bin/python3() [0x509c50]
frame #51: /usr/bin/python3() [0x50a64d]
frame #52: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #53: /usr/bin/python3() [0x509918]
frame #54: /usr/bin/python3() [0x50a64d]
frame #55: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #56: /usr/bin/python3() [0x509918]
frame #57: /usr/bin/python3() [0x50a64d]
frame #58: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #59: /usr/bin/python3() [0x507f24]
frame #60: /usr/bin/python3() [0x588e91]
frame #61: PyObject_Call + 0x3e (0x59fe1e in /usr/bin/python3)
frame #62: _PyEval_EvalFrameDefault + 0x17e6 (0x50d596 in /usr/bin/python3)
frame #63: /usr/bin/python3() [0x507f24]

@waghts95
Author

Earlier I was able to reach epoch 5, or sometimes 13. Now training starts, but after a minute I get the output below. (I am not using torch==1.4 and torchvision==0.5, because with those versions training does not start and fails immediately with the error above.)

100%
910/910 [01:55<00:00, 7.89it/s]
The size of tensor a (2) must match the size of tensor b (8) at non-singleton dimension 3
(the same message is repeated 32 more times)

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
/content/Monk_Object_Detection/4_efficientdet/lib/src/model.py:297: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if len(inputs) == 2:
/content/Monk_Object_Detection/4_efficientdet/lib/src/utils.py:84: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
image_shape = np.array(image_shape)
/content/Monk_Object_Detection/4_efficientdet/lib/src/utils.py:96: TracerWarning: torch.from_numpy results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
anchors = torch.from_numpy(all_anchors.astype(np.float32))
/content/Monk_Object_Detection/4_efficientdet/lib/src/model.py:328: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if scores_over_thresh.sum() == 0:
/usr/local/lib/python3.6/dist-packages/torch/onnx/symbolic_helper.py:253: UserWarning: You are trying to export the model with onnx:Upsample for ONNX opset version 9. This operator might cause results to not match the expected results by PyTorch.
ONNX's Upsample/Resize operator did not match Pytorch's Interpolation until opset 11. Attributes to determine how to transform the input were added in onnx:Resize in opset 11 to support Pytorch's behavior (like coordinate_transformation_mode and nearest_mode).
We recommend using opset 11 and above for models using this operator.
"" + str(_export_onnx_opset_version) + ". "

RuntimeError Traceback (most recent call last)
in ()
1 gtf.Set_Hyperparams(lr=0.0001, val_interval=1, es_min_delta=0.0, es_patience=0)
----> 2 gtf.Train(num_epochs=50, model_output_dir="trained1/")

9 frames
/usr/local/lib/python3.6/dist-packages/torch/onnx/symbolic_helper.py in _onnx_opset_unsupported(op_name, current_opset, supported_opset)
184 def _onnx_opset_unsupported(op_name, current_opset, supported_opset):
185 raise RuntimeError('Unsupported: ONNX export of {} in '
--> 186 'opset {}. Please try opset version {}.'.format(op_name, current_opset, supported_opset))
187
188

RuntimeError: Unsupported: ONNX export of index_put in opset 9. Please try opset version 11.

@abhi-kumar
Contributor

Don't mix versions when resuming training. Keep every training run on PyTorch 1.4 and torchvision 0.5, starting from the very first run. A checkpoint saved with version 1.5 or 1.6 may not be loadable in version 1.4.
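
As a rough way to check this before resuming, the sketch below inspects a checkpoint file without loading it. It relies on an assumption that is not stated in this thread: that checkpoints written by newer PyTorch releases are zip archives containing a `version` record, which is what the "file with version 3" message above appears to refer to. The function name `checkpoint_format_version` is hypothetical.

```python
import zipfile


def checkpoint_format_version(path):
    """Return the zip-archive format version of a checkpoint, or None.

    None means the file is not a zip archive (for example, an older
    pickle-based checkpoint) or that no version record was found.
    """
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            # Assumed layout: "<archive_name>/version" holding an integer.
            if name == "version" or name.endswith("/version"):
                return int(archive.read(name).decode("ascii").strip())
    return None
```

If this returns 3 for a `.pth` file, the error above suggests torch 1.4 will refuse to read it. One possible alternative, untested in this thread, is to re-save the checkpoint from torch 1.6 with `torch.save(..., _use_new_zipfile_serialization=False)` so that it is written in the older format.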

@waghts95
Author

Please let me know how I can deal with this error.

@abhi-kumar
Contributor

WAY 1:

a) Switch to torch==1.4, torchvision==0.5 and efficientnet_pytorch==0.6.3
b) Train your first detector
c) Then resume or reload training from this checkpoint.

WAY 2:

After cloning the library, comment out lines 393-396 and 400-403 in the file Monk_Object_Detection/4_efficientdet/lib/train_detector.py

That is, these lines:

 torch.onnx.export(self.system_dict["local"]["model"].module, dummy_input,
                                              os.path.join(self.system_dict["output"]["saved_path"], "signatrix_efficientdet_coco.onnx"),
                                              verbose=False)

and

torch.onnx.export(self.system_dict["local"]["model"], dummy_input,
                                              os.path.join(self.system_dict["output"]["saved_path"], "signatrix_efficientdet_coco.onnx"),
                                              verbose=False)
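
For Way 1, a quick pre-flight check of the installed versions can catch a mixed environment before training starts. The sketch below is only an illustration: the helper name `find_version_mismatches` is hypothetical, the pins are the ones listed in Way 1, and the `installed` dictionary is example data that you would fill in yourself (for instance from `torch.__version__` in the notebook).

```python
def find_version_mismatches(installed, pins):
    """Return {package: (installed_version, pinned_prefix)} for every mismatch."""
    mismatches = {}
    for package, pinned in pins.items():
        found = installed.get(package)
        # Drop local build tags such as "+cu101" before comparing.
        base = found.split("+")[0] if found else None
        if base is None or not (base == pinned or base.startswith(pinned + ".")):
            mismatches[package] = (found, pinned)
    return mismatches


# Pins taken from Way 1 above.
PINS = {"torch": "1.4", "torchvision": "0.5", "efficientnet_pytorch": "0.6.3"}

# Example data only; replace with the versions reported by your environment.
installed = {
    "torch": "1.6.0+cu101",
    "torchvision": "0.5.0",
    "efficientnet_pytorch": "0.6.3",
}
print(find_version_mismatches(installed, PINS))
# {'torch': ('1.6.0+cu101', '1.4')}
```

An empty dictionary means every pinned package matches. Prefix matching is used so that a pin of 1.4 accepts 1.4.0 but not 1.40.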

@waghts95
Author

waghts95 commented Aug 27, 2020

WAY 2 did not work.
For WAY 1:
a) Switch to torch==1.4, torchvision==0.5 and efficientnet_pytorch==0.6.3: done.
b) Train your first detector: training runs, but it continuously prints
'The size of tensor a (2) must match the size of tensor b (8) at non-singleton dimension 3'
and does not show any training status such as epoch or loss details.

@abhi-kumar
Contributor

Please share your code.

@waghts95
Author

waghts95 commented Aug 27, 2020

Shared.

@abhi-kumar
Contributor

abhi-kumar commented Aug 27, 2020

Is the image size 32? For efficientnet-b0 the image size should be 512. See this example: https://github.com/Tessellate-Imaging/Monk_Object_Detection/blob/master/example_notebooks/4_efficientdet/train%20-%20with%20validation%20dataset.ipynb

@waghts95
Author

waghts95 commented Aug 27, 2020 via email

@abhi-kumar
Contributor

Previously, if the image shapes were inconsistent, the library automatically switched to default shapes. Since the latest efficientnet_pytorch upgrade requires the shapes to be supplied manually, we have made the argument required, and inconsistent values are no longer accepted.

@abhi-kumar
Contributor

Keep the image size at 512 with the B0 version; the training engine will scale the annotations accordingly.

@waghts95
Author

Regarding 'The size of tensor a (2) must match the size of tensor b (8) at non-singleton dimension 3' and the training status (epoch details, loss details, etc.) not being shown:

This error is gone.
Thank you very much.

@waghts95
Author

I used Way 1 and could train the model successfully, and resuming training also worked fine. Today, when I tried to resume training again, I got the error in the attached text file.
resume_training_error.txt

@abhi-kumar
Contributor

Since you are using Colab, make sure the installed versions are correct.

Also comment out the two torch.onnx.export calls mentioned in Way 2.

@waghts95
Author

The versions are as per your colab_requirement.txt, and commenting out the lines did not help.

@alsheabi

alsheabi commented Dec 8, 2020

Try adding opset_version=11 to the export calls in Way 2. After the change, the call looks like this:
torch.onnx.export(self.system_dict["local"]["model"].module, dummy_input, os.path.join(self.system_dict["output"]["saved_path"], "signatrix_efficientdet_coco.onnx"), verbose=False, opset_version=11)
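
Another option, sketched here as an assumption rather than as something the library does, is to combine `opset_version=11` with a guard so that a failed ONNX export is reported but does not abort a long training run. The helper `export_onnx_safely` is hypothetical; the export function is passed in as an argument so the helper can be tried without PyTorch installed, and in real use you would pass `torch.onnx.export`.

```python
def export_onnx_safely(export_fn, model, dummy_input, path, opset_version=11):
    """Try an ONNX export and report a failure instead of raising.

    export_fn is expected to behave like torch.onnx.export.
    Returns True on success and False if the export raised RuntimeError.
    """
    try:
        export_fn(model, dummy_input, path, verbose=False, opset_version=opset_version)
        return True
    except RuntimeError as err:
        # The errors in this thread are RuntimeError, so only that type is caught.
        print("ONNX export skipped: {}".format(err))
        return False
```

In train_detector.py this would be called as `export_onnx_safely(torch.onnx.export, model, dummy_input, onnx_path)`, where `model`, `dummy_input` and `onnx_path` stand for the values already used in the calls above. Whether opset 11 is enough for this particular model has not been confirmed in this thread.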

@alsheabi

alsheabi commented Dec 9, 2020

Keep image shape as 512 with B0 version and the training engine will scale annotations accordingly.

Hello @abhi-kumar, I used 786 for B2 but got the same error. Any suggestions?
The size of tensor a (49) must match the size of tensor b (48) at non-singleton dimension 3.

@aritzLizoain

Keep image shape as 512 with B0 version and the training engine will scale annotations accordingly.

Hello @abhi-kumar I used 786 for B2 but I got the same error. Any suggestion.
The size of tensor a (49) must match the size of tensor b (48) at non-singleton dimension 3.

I obtain the same error. It only disappears when I use image_size = 512, regardless of the chosen model version. E.g. image_size = 786 and model version B2 fails, while image_size = 512 and model version B2 works.

I tried modifying dummy_input from torch.rand(1, 3, 512, 512) to torch.rand(1, 3, image_size, image_size) in lines 387 and 452 of train_detector.py, but nothing changed.
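
One possible explanation for the 49 vs 48 numbers, offered only as a hypothesis and not confirmed by the maintainers: in a feature pyramid, each coarser map is upsampled by 2 and added to the next finer map, so the sizes must match exactly. The sketch below assumes that the map at stride s has size image_size // s and that the pyramid uses strides 8 to 128. The helper `first_fusion_mismatch` is hypothetical and does not reflect the library's actual code.

```python
def first_fusion_mismatch(image_size, strides=(8, 16, 32, 64, 128)):
    """Return (finer_size, upsampled_coarser_size) for the first mismatch, or None.

    Assumes a feature map at stride s has size image_size // s and that
    top-down fusion doubles the coarser map before adding it to the finer one.
    """
    sizes = [image_size // s for s in strides]
    # Walk from the coarsest pair of levels towards the finest pair.
    for finer, coarser in reversed(list(zip(sizes, sizes[1:]))):
        if coarser * 2 != finer:
            return finer, coarser * 2
    return None


print(first_fusion_mismatch(786))  # (49, 48)
print(first_fusion_mismatch(768))  # None
print(first_fusion_mismatch(512))  # None
```

Under these assumptions, 786 is not a multiple of 128, so the stride-16 map (49) and the upsampled stride-32 map (48) disagree, while 512 and 768 are consistent at every level. This toy model reproduces the 786 message only; it is not meant to explain the 2 vs 8 message reported earlier for image_size=32, and it does not explain why only 512 works in the library.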

@abhi-kumar
Contributor

Thank you for mentioning the issue.

We will look into it soon (most probably after Christmas).

@srihari12345

@abhi-kumar
I have finished 200 epochs using '7_yolov3' with train_detector.py.
Now I need to train for 200 more epochs from the saved weights. How can I resume training?

@alsheabi

alsheabi commented Feb 6, 2021

@abhi-kumar Any update for the issue?
