Error in resume-training module of 4_efficientdet after completing 5 epochs #56

waghts95 · 2020-08-21T06:39:58Z

I am using torch 1.6.0, efficientnet-pytorch 0.6.3, and tensorboardX 2.1.

This is my code

`from train_detector import Detector
gtf = Detector()
#directs the model towards file structure
root_dir = "./"
coco_dir = "cellphone"
img_dir = "./"
set_dir = "Images"
#smells like some free compute from Colab, nice
gtf.Train_Dataset(root_dir, coco_dir, img_dir, set_dir, batch_size=8, image_size=32, use_gpu=True)
gtf.Model(model_name="efficientnet-b0",load_pretrained_model_from="/content/trained/signatrix_efficientdet_coco.pth")

gtf.Set_Hyperparams(lr=0.0001, val_interval=1, es_min_delta=0.0, es_patience=0)
gtf.Train(num_epochs=50, model_output_dir="trained/");`

My error is

Epoch: 1/50. Iteration: 910/910. Cls loss: 0.12021. Reg loss: 0.26245. Batch loss: 0.38265 Total loss: 0.50293
100% 910/910 [24:24<00:00, 1.58s/it]

/content/Monk_Object_Detection/4_efficientdet/lib/src/model.py:251: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if len(inputs) == 2:
/content/Monk_Object_Detection/4_efficientdet/lib/src/utils.py:84: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
image_shape = np.array(image_shape)
/content/Monk_Object_Detection/4_efficientdet/lib/src/utils.py:96: TracerWarning: torch.from_numpy results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
anchors = torch.from_numpy(all_anchors.astype(np.float32))
/content/Monk_Object_Detection/4_efficientdet/lib/src/model.py:282: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if scores_over_thresh.sum() == 0:
Epoch: 2/50. Iteration: 910/910. Cls loss: 0.17044. Reg loss: 0.19580. Batch loss: 0.36624 Total loss: 0.48137
100% 910/910 [24:31<00:00, 1.57s/it]

Epoch: 3/50. Iteration: 910/910. Cls loss: 0.22575. Reg loss: 0.32424. Batch loss: 0.54999 Total loss: 0.46841
100% 910/910 [24:36<00:00, 1.60s/it]

Epoch: 4/50. Iteration: 910/910. Cls loss: 0.13469. Reg loss: 0.25157. Batch loss: 0.38626 Total loss: 0.45206
100% 910/910 [24:40<00:00, 1.57s/it]

Epoch: 5/50. Iteration: 910/910. Cls loss: 0.24624. Reg loss: 0.34335. Batch loss: 0.58959 Total loss: 0.44057
100% 910/910 [23:59<00:00, 1.54s/it]

Epoch: 6/50. Iteration: 910/910. Cls loss: 0.20909. Reg loss: 0.26789. Batch loss: 0.47698 Total loss: 0.42917
100% 910/910 [23:53<00:00, 1.52s/it]

/usr/local/lib/python3.6/dist-packages/torch/onnx/symbolic_helper.py:253: UserWarning: You are trying to export the model with onnx:Upsample for ONNX opset version 9. This operator might cause results to not match the expected results by PyTorch.
ONNX's Upsample/Resize operator did not match Pytorch's Interpolation until opset 11. Attributes to determine how to transform the input were added in onnx:Resize in opset 11 to support Pytorch's behavior (like coordinate_transformation_mode and nearest_mode).
We recommend using opset 11 and above for models using this operator.
"" + str(_export_onnx_opset_version) + ". "

RuntimeError Traceback (most recent call last)
in ()
1 gtf.Set_Hyperparams(lr=0.0001, val_interval=1, es_min_delta=0.0, es_patience=0)
----> 2 gtf.Train(num_epochs=50, model_output_dir="trained/");

9 frames
/usr/local/lib/python3.6/dist-packages/torch/onnx/symbolic_helper.py in _onnx_opset_unsupported(op_name, current_opset, supported_opset)
184 def _onnx_opset_unsupported(op_name, current_opset, supported_opset):
185 raise RuntimeError('Unsupported: ONNX export of {} in '
--> 186 'opset {}. Please try opset version {}.'.format(op_name, current_opset, supported_opset))
187
188

RuntimeError: Unsupported: ONNX export of index_put in opset 9. Please try opset version 11.

abhi-kumar · 2020-08-21T16:35:48Z

Thank you for pointing out the issue. We will try to resolve it as soon as possible. In the meantime, please try downgrading PyTorch to version 1.4 on your end.

waghts95 · 2020-08-21T16:38:47Z

Okay.

abhi-kumar · 2020-08-24T10:39:37Z

Did a version downgrade help your case?

waghts95 · 2020-08-24T11:01:34Z

Not tried yet.

abhi-kumar · 2020-08-24T11:04:30Z

We are unable to reproduce that error with PyTorch v1.4. Please check and let us know.

waghts95 · 2020-08-24T11:06:29Z

Okay. Sometimes I get the error at epoch 5 and sometimes at epoch 12.

abhi-kumar · 2020-08-24T16:00:02Z

The error occurs because ONNX export is still incompatible with torch 1.6; downgrading to torch 1.4 and torchvision 0.5 will resolve it. The requirements files have been updated accordingly.
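A quick way to confirm that an environment matches the versions recommended here before starting a long run is sketched below. The helper names `major_minor` and `find_mismatches` are hypothetical and not part of Monk_Object_Detection; the sketch only compares the major.minor parts of version strings.

```python
import re

# Versions recommended in this thread; adjust if the requirements files change.
REQUIRED = {"torch": "1.4", "torchvision": "0.5"}


def major_minor(version):
    """Return (major, minor) from strings such as '1.6.0' or '1.4.0+cu100'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def find_mismatches(installed, required=REQUIRED):
    """Return {package: (installed, required)} where major.minor differs."""
    mismatches = {}
    for name, wanted in required.items():
        found = installed.get(name)
        if found is None or major_minor(found) != major_minor(wanted):
            mismatches[name] = (found, wanted)
    return mismatches
```

In a notebook, the `installed` dictionary would typically be built from `torch.__version__` and `torchvision.__version__`; an empty result means both packages match the recommended versions.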

waghts95 · 2020-08-24T16:01:46Z

Thanks.

waghts95 · 2020-08-25T07:28:57Z

When I use torch 1.4 and torchvision 0.5, I am getting

loading annotations into memory...
Done (t=0.13s)
creating index...
index created!

RuntimeError Traceback (most recent call last)
in ()
8 #smells like some free compute from Colab, nice
9 gtf.Train_Dataset(root_dir, coco_dir, img_dir, set_dir, batch_size=8, image_size=32, use_gpu=True)
---> 10 gtf.Model(model_name="efficientnet-b0",load_pretrained_model_from="/content/trained/signatrix_efficientdet_coco.pth")

2 frames
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in __init__(self, name_or_buffer)
222 class _open_zipfile_reader(_opener):
223 def __init__(self, name_or_buffer):
--> 224 super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
225
226

RuntimeError: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f5933aff193 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x7f5936c879eb in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x7f5936c88c04 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x6c53a6 (0x7f597ebb83a6 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: + 0x2961c4 (0x7f597e7891c4 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: _PyCFunction_FastCallDict + 0x35c (0x566ddc in /usr/bin/python3)
frame #6: /usr/bin/python3() [0x594b71]
frame #7: /usr/bin/python3() [0x54a325]
frame #8: /usr/bin/python3() [0x5517c1]
frame #9: _PyObject_FastCallKeywords + 0x19c (0x5a9eec in /usr/bin/python3)
frame #10: /usr/bin/python3() [0x50a783]
frame #11: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #12: /usr/bin/python3() [0x507f24]
frame #13: _PyFunction_FastCallDict + 0x2e2 (0x509202 in /usr/bin/python3)
frame #14: /usr/bin/python3() [0x594b01]
frame #15: /usr/bin/python3() [0x54a17f]
frame #16: /usr/bin/python3() [0x5517c1]
frame #17: _PyObject_FastCallKeywords + 0x19c (0x5a9eec in /usr/bin/python3)
frame #18: /usr/bin/python3() [0x50a783]
frame #19: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #20: /usr/bin/python3() [0x507f24]
frame #21: /usr/bin/python3() [0x509c50]
frame #22: /usr/bin/python3() [0x50a64d]
frame #23: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #24: /usr/bin/python3() [0x507f24]
frame #25: /usr/bin/python3() [0x509c50]
frame #26: /usr/bin/python3() [0x50a64d]
frame #27: _PyEval_EvalFrameDefault + 0x1226 (0x50cfd6 in /usr/bin/python3)
frame #28: /usr/bin/python3() [0x507f24]
frame #29: /usr/bin/python3() [0x5165a5]
frame #30: /usr/bin/python3() [0x50a47f]
frame #31: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #32: /usr/bin/python3() [0x507f24]
frame #33: /usr/bin/python3() [0x509c50]
frame #34: /usr/bin/python3() [0x50a64d]
frame #35: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #36: /usr/bin/python3() [0x507f24]
frame #37: /usr/bin/python3() [0x509c50]
frame #38: /usr/bin/python3() [0x50a64d]
frame #39: _PyEval_EvalFrameDefault + 0x1226 (0x50cfd6 in /usr/bin/python3)
frame #40: /usr/bin/python3() [0x507f24]
frame #41: _PyFunction_FastCallDict + 0x2e2 (0x509202 in /usr/bin/python3)
frame #42: /usr/bin/python3() [0x594b01]
frame #43: PyObject_Call + 0x3e (0x59fe1e in /usr/bin/python3)
frame #44: _PyEval_EvalFrameDefault + 0x17e6 (0x50d596 in /usr/bin/python3)
frame #45: /usr/bin/python3() [0x507f24]
frame #46: /usr/bin/python3() [0x509c50]
frame #47: /usr/bin/python3() [0x50a64d]
frame #48: _PyEval_EvalFrameDefault + 0x1226 (0x50cfd6 in /usr/bin/python3)
frame #49: /usr/bin/python3() [0x507f24]
frame #50: /usr/bin/python3() [0x509c50]
frame #51: /usr/bin/python3() [0x50a64d]
frame #52: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #53: /usr/bin/python3() [0x509918]
frame #54: /usr/bin/python3() [0x50a64d]
frame #55: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #56: /usr/bin/python3() [0x509918]
frame #57: /usr/bin/python3() [0x50a64d]
frame #58: _PyEval_EvalFrameDefault + 0x444 (0x50c1f4 in /usr/bin/python3)
frame #59: /usr/bin/python3() [0x507f24]
frame #60: /usr/bin/python3() [0x588e91]
frame #61: PyObject_Call + 0x3e (0x59fe1e in /usr/bin/python3)
frame #62: _PyEval_EvalFrameDefault + 0x17e6 (0x50d596 in /usr/bin/python3)
frame #63: /usr/bin/python3() [0x507f24]

waghts95 · 2020-08-25T09:46:00Z

Earlier I was able to reach epoch 5, or sometimes 13. Now training starts, but after a minute I get the output below. (I am not using torch==1.4 and torchvision==0.5, because with those versions training does not start and fails immediately with the error above.)

100%
910/910 [01:55<00:00, 7.89it/s]
The size of tensor a (2) must match the size of tensor b (8) at non-singleton dimension 3
(same message repeated 32 more times)

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
/content/Monk_Object_Detection/4_efficientdet/lib/src/model.py:297: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if len(inputs) == 2:
/content/Monk_Object_Detection/4_efficientdet/lib/src/utils.py:84: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
image_shape = np.array(image_shape)
/content/Monk_Object_Detection/4_efficientdet/lib/src/utils.py:96: TracerWarning: torch.from_numpy results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
anchors = torch.from_numpy(all_anchors.astype(np.float32))
/content/Monk_Object_Detection/4_efficientdet/lib/src/model.py:328: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if scores_over_thresh.sum() == 0:
/usr/local/lib/python3.6/dist-packages/torch/onnx/symbolic_helper.py:253: UserWarning: You are trying to export the model with onnx:Upsample for ONNX opset version 9. This operator might cause results to not match the expected results by PyTorch.
ONNX's Upsample/Resize operator did not match Pytorch's Interpolation until opset 11. Attributes to determine how to transform the input were added in onnx:Resize in opset 11 to support Pytorch's behavior (like coordinate_transformation_mode and nearest_mode).
We recommend using opset 11 and above for models using this operator.
"" + str(_export_onnx_opset_version) + ". "

RuntimeError Traceback (most recent call last)
in ()
1 gtf.Set_Hyperparams(lr=0.0001, val_interval=1, es_min_delta=0.0, es_patience=0)
----> 2 gtf.Train(num_epochs=50, model_output_dir="trained1/")

9 frames
/usr/local/lib/python3.6/dist-packages/torch/onnx/symbolic_helper.py in _onnx_opset_unsupported(op_name, current_opset, supported_opset)
184 def _onnx_opset_unsupported(op_name, current_opset, supported_opset):
185 raise RuntimeError('Unsupported: ONNX export of {} in '
--> 186 'opset {}. Please try opset version {}.'.format(op_name, current_opset, supported_opset))
187
188

RuntimeError: Unsupported: ONNX export of index_put in opset 9. Please try opset version 11.

abhi-kumar · 2020-08-25T11:35:13Z

"When I use torch 1.4 and torchvision 0.5, I am getting [...] Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2."

Don't mix up versions when resuming training. Keep every training run on PyTorch 1.4 and torchvision 0.5, starting from the very first run. A checkpoint saved with version 1.5 or 1.6 may not be loadable in version 1.4.
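One rough way to tell which kind of checkpoint is on disk before resuming is sketched below. It rests on an assumption: to my knowledge, torch 1.6 writes checkpoints as zip archives by default, while older releases default to a legacy non-zip format. The function name `saved_with_zip_serialization` is hypothetical, and the check uses only the standard library.

```python
import zipfile


def saved_with_zip_serialization(checkpoint_path):
    """Return True if the checkpoint file is a zip archive.

    Assumption: a zip archive suggests the file was written by a newer
    torch release and may not load under torch 1.4.
    """
    return zipfile.is_zipfile(checkpoint_path)
```

If this returns True for a `.pth` file and the environment is pinned to torch 1.4, retraining from scratch under 1.4 is likely the safer route than trying to load that file.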

waghts95 · 2020-08-26T10:04:38Z

"Earlier I was able to reach till epoch 5 or sometimes 13. But now training starts but after a minute I get this [...] RuntimeError: Unsupported: ONNX export of index_put in opset 9. Please try opset version 11."

Please let me know how I can deal with this error.

abhi-kumar · 2020-08-26T10:11:54Z

WAY 1:

a) Switch to torch==1.4, torchvision==0.5 and efficientnet_pytorch==0.6.3
b) Train your first detector
c) Then resume or reload training from this checkpoint.

WAY 2:

After cloning the library, comment out lines 393-396 and 400-403 in the file Monk_Object_Detection/4_efficientdet/lib/train_detector.py

These lines:

    torch.onnx.export(self.system_dict["local"]["model"].module, dummy_input,
                      os.path.join(self.system_dict["output"]["saved_path"], "signatrix_efficientdet_coco.onnx"),
                      verbose=False)

and

    torch.onnx.export(self.system_dict["local"]["model"], dummy_input,
                      os.path.join(self.system_dict["output"]["saved_path"], "signatrix_efficientdet_coco.onnx"),
                      verbose=False)
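As an alternative to commenting the export calls out, they could be wrapped so that a failed ONNX export does not abort training. This is only a sketch: `export_or_skip` is a hypothetical helper, not part of the library. The `opset_version=11` argument in the usage comment follows the hint in the error message, and whether this model exports cleanly under opset 11 has not been verified in this thread.

```python
def export_or_skip(export_fn, *args, **kwargs):
    """Call export_fn; on RuntimeError, report it and return False.

    Returning instead of re-raising lets the caller continue training
    even when the ONNX export step fails.
    """
    try:
        export_fn(*args, **kwargs)
    except RuntimeError as err:
        print("ONNX export skipped:", err)
        return False
    return True


# Hypothetical use in place of the direct call (model, dummy_input and
# onnx_path stand in for the values train_detector.py already has):
#   import torch
#   export_or_skip(torch.onnx.export, model, dummy_input, onnx_path,
#                  verbose=False, opset_version=11)
```

The PyTorch checkpoint is unaffected by this wrapper; only the ONNX file is skipped when the export raises.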

waghts95 · 2020-08-27T06:01:56Z

WAY 2 did not work.
For WAY 1:
a) Switch to torch==1.4, torchvision==0.5 and efficientnet_pytorch==0.6.3: done.
b) Train your first detector: training runs, but it continuously prints
'The size of tensor a (2) must match the size of tensor b (8) at non-singleton dimension 3'
and does not show training status such as epoch or loss details.

abhi-kumar · 2020-08-27T07:17:01Z

Please share your code.

waghts95 · 2020-08-27T07:23:34Z

Shared.

abhi-kumar · 2020-08-27T07:47:58Z

Is the image size 32? For EfficientNet-b0 the image size should be 512. See this example: https://github.com/Tessellate-Imaging/Monk_Object_Detection/blob/master/example_notebooks/4_efficientdet/train%20-%20with%20validation%20dataset.ipynb

waghts95 · 2020-08-27T07:51:52Z

How was it working earlier?

abhi-kumar · 2020-08-27T07:55:15Z

Earlier, if the image shapes were inconsistent, the library automatically switched to default shapes. Since the latest efficientnet_pytorch upgrade requires the shapes to be passed in manually, we have made the argument required, and it can no longer accept inconsistent values.

abhi-kumar · 2020-08-27T07:55:51Z

Keep image shape as 512 with B0 version and the training engine will scale annotations accordingly.

waghts95 · 2020-08-27T08:46:40Z

'The size of tensor a (2) must match the size of tensor b (8) at non-singleton dimension 3'
and not showing training status like epoch details, loss details, etc.

This error is gone.
Thank you very much.

waghts95 · 2020-08-28T08:53:39Z

I used WAY 1 and could train the model successfully, and resuming training also worked fine. Today, when I tried to resume training again, I got the error in the attached text file.
resume_training_error.txt

abhi-kumar · 2020-08-28T09:08:00Z

Since you are using Colab, make sure the installed versions are correct.

Also comment out the two torch.onnx.export calls mentioned in WAY 2.
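One way to confirm which versions a Colab runtime actually has is sketched below. It is only an illustration, not part of the library: the helper names (`installed_version`, `matches`, `version_mismatches`) are made up for this example, the pins are the ones quoted in WAY 1, and it assumes Python 3.8 or newer for `importlib.metadata` and that the pip distribution names match the pinned names.

```python
from importlib import metadata

# Pins quoted in WAY 1 of this thread.
PINNED = {
    "torch": "1.4",
    "torchvision": "0.5",
    "efficientnet_pytorch": "0.6.3",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches(found, wanted):
    """True if `found` starts with the components of `wanted`.

    A local build suffix such as "+cu101" is ignored, so "1.4.0+cu100"
    matches the pin "1.4", while "1.40.0" does not.
    """
    found_parts = found.split("+")[0].split(".")
    wanted_parts = wanted.split(".")
    return found_parts[:len(wanted_parts)] == wanted_parts


def version_mismatches(pins, lookup=installed_version):
    """Map each package that violates its pin to the version actually found.

    A value of None means the package is not installed at all.
    """
    problems = {}
    for package, wanted in pins.items():
        found = lookup(package)
        if found is None or not matches(found, wanted):
            problems[package] = found
    return problems


if __name__ == "__main__":
    for package, found in version_mismatches(PINNED).items():
        print(f"{package}: wanted {PINNED[package]}, found {found}")
```

Running this in a fresh cell before training prints nothing when all three pins are satisfied. Colab can silently reinstall its default torch when the runtime restarts, so it may be worth re-running after every restart.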

waghts95 · 2020-08-28T11:18:47Z

The versions match your colab_requirement.txt, and commenting out the lines did not help either.

alsheabi · 2020-12-08T22:43:57Z

Try adding opset_version=11 to the export calls in WAY 2. After the change, the call looks like this:
torch.onnx.export(self.system_dict["local"]["model"].module, dummy_input, os.path.join(self.system_dict["output"]["saved_path"], "signatrix_efficientdet_coco.onnx"), verbose=False, opset_version=11)
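A middle ground between commenting the export out (WAY 2) and letting it abort training is to guard the call. The sketch below is an assumption, not code from train_detector.py: `export_onnx_safely` is a hypothetical helper, and the export function is passed in as an argument so the helper itself does not import torch.

```python
import os


def export_onnx_safely(export_fn, model, dummy_input, out_dir,
                       filename="signatrix_efficientdet_coco.onnx",
                       **export_kwargs):
    """Try to export the model; on failure, report it and carry on.

    `export_fn` is expected to behave like torch.onnx.export, i.e. take
    (model, args, f, **kwargs). Returns the output path on success and
    None if the export raised.
    """
    path = os.path.join(out_dir, filename)
    try:
        export_fn(model, dummy_input, path, verbose=False, **export_kwargs)
    except Exception as exc:  # the checkpoint matters more than the export
        print(f"ONNX export skipped: {exc}")
        return None
    return path
```

With torch installed, the call inside the training loop would look like `export_onnx_safely(torch.onnx.export, model, dummy_input, saved_path, opset_version=11)`. A failed export then only prints a message and the .pth checkpoint is still written.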

alsheabi · 2020-12-09T22:27:33Z

Keep image shape as 512 with B0 version and the training engine will scale annotations accordingly.

Hello @abhi-kumar, I used 786 for B2 but I got the same error. Any suggestions?
The size of tensor a (49) must match the size of tensor b (48) at non-singleton dimension 3.

aritzLizoain · 2020-12-21T10:49:01Z

I obtain the same error. It only disappears when I use image_size = 512, regardless of the chosen model version. E.g. image_size = 786 and model version B2 fails, while image_size = 512 and model version B2 works.

I tried modifying dummy_input from torch.rand(1, 3, 512, 512) to torch.rand(1, 3, image_size, image_size) in lines 387 and 452 of train_detector.py, but nothing changed.
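One possible explanation for the 49 vs 48 mismatch, which nobody in this thread has confirmed, is that 786 is not divisible by 128. The sketch below assumes the feature pyramid uses strides 8, 16, 32, 64 and 128, that each level's side length is the image size floor-divided by the stride, and that coarser levels are upsampled by exactly 2 before being merged with the next finer level. The function names are made up for this example.

```python
def pyramid_sizes(image_size, strides=(8, 16, 32, 64, 128)):
    """Side length of the feature map at each pyramid level (floor division)."""
    return [image_size // stride for stride in strides]


def upsample_mismatches(image_size, strides=(8, 16, 32, 64, 128)):
    """Return (finer_size, upsampled_coarser_size) pairs that do not agree.

    An empty list means every coarser level, doubled, lines up with the
    level below it, so the element-wise merge has matching shapes.
    """
    sizes = pyramid_sizes(image_size, strides)
    mismatches = []
    for finer, coarser in zip(sizes, sizes[1:]):
        if coarser * 2 != finer:
            mismatches.append((finer, coarser * 2))
    return mismatches


if __name__ == "__main__":
    for size in (512, 768, 786):
        print(size, pyramid_sizes(size), upsample_mismatches(size))
```

Under these assumptions, 512 and 768 produce no mismatches, while 786 produces the pair (49, 48), the same two numbers as in the error message. If the assumption holds, any image_size that is a multiple of 128 should avoid the error, which would also be consistent with changing dummy_input alone making no difference.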

abhi-kumar · 2020-12-22T02:38:35Z

Thank you for mentioning the issue.

The issue will be taken into consideration very soon (most probably post Christmas).

srihari12345 · 2021-02-06T03:36:13Z

@abhi-kumar
I have finished 200 epochs using '7_yolov3' with train_detector.py.
Now I need to train for 200 more epochs starting from the saved weights. How can I resume training?

alsheabi · 2021-02-06T14:44:58Z

@abhi-kumar Any update on this issue?

abhi-kumar added the bug Something isn't working label Aug 21, 2020

abhi-kumar added a commit that referenced this issue Aug 24, 2020

Updated requirements file to resolve issue #56

eca41cd

abhi-kumar added the solution added Solution added to the raised issue label Aug 24, 2020
