fine tune DBNET problem #1599

SalehBM · 2024-05-16T13:56:58Z

Hey everyone!

Today I was trying to fine-tune the DBNet detector, but I ran into a problem that I couldn't solve:

Namespace(train_path='./train', val_path='./val', arch='db_resnet50', name=None, epochs=5, batch_size=2, device=None, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=None, resume=None, test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.03153s (2 samples in 1 batches)
Train set loaded in 0.02547s (30 samples in 15 batches)

  0%|                                                    | 0/15 [00:00<?, ?it/s]
Training loss: 3.39883:   0%|                            | 0/15 [00:04<?, ?it/s]
Training loss: 3.39883:   7%|█▎                  | 1/15 [00:04<01:00,  4.32s/it]
Training loss: 2.69255:   7%|█▎                  | 1/15 [00:04<01:00,  4.32s/it]
Training loss: 2.69255:  13%|██▋                 | 2/15 [00:04<00:26,  2.05s/it]
Training loss: 2.09078:  13%|██▋                 | 2/15 [00:05<00:26,  2.05s/it]
Training loss: 2.09078:  20%|████                | 3/15 [00:05<00:21,  1.80s/it]
Traceback (most recent call last):
  File "/tf/doctr/references/detection/train_pytorch.py", line 473, in <module>
    main(args)
  File "/tf/doctr/references/detection/train_pytorch.py", line 380, in main
    fit_one_epoch(model, train_loader, batch_transforms, optimizer, scheduler, amp=args.amp)
  File "/tf/doctr/references/detection/train_pytorch.py", line 109, in fit_one_epoch
    for images, targets in pbar:
  File "/usr/local/lib/python3.11/dist-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.11/dist-packages/torch/_utils.py", line 705, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/usr/local/lib/python3.11/dist-packages/doctr/datasets/datasets/base.py", line 67, in __getitem__
    img_transformed, target[class_name] = self.sample_transforms(img, bboxes)
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/doctr/transforms/modules/base.py", line 56, in __call__
    x, target = t(x, target)
                ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/doctr/transforms/modules/pytorch.py", line 168, in forward
    _target["boxes"][:, ::2] = 1 - target["boxes"][:, [2, 0]]
                                   ~~~~~~^^^^^^^^^
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices


felixdittrich92 · 2024-05-16T18:01:26Z

Hi @SalehBM 👋,
Looks like you have modified the augmentations, right?

Could you please update to the latest changes from the main branch? It's already fixed :)

doctr/doctr/transforms/modules/pytorch.py

Line 169 in 45c2df3

_target[:, ::2] = 1 - target[:, [2, 0]]
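
For readers hitting the same traceback, here is a minimal NumPy sketch of what seems to be going on. It assumes the targets reach the flip transform as a plain array of relative `(xmin, ymin, xmax, ymax)` boxes, which is what the error message suggests. The helper name `hflip_boxes` and the sample values are made up for illustration and are not part of doctr.

```python
import numpy as np


def hflip_boxes(boxes: np.ndarray) -> np.ndarray:
    """Mirror relative (xmin, ymin, xmax, ymax) boxes horizontally.

    Hypothetical helper: same indexing as the fixed line quoted above,
    applied directly to the array instead of to a dict entry.
    """
    flipped = boxes.copy()
    # New xmin = 1 - old xmax, new xmax = 1 - old xmin; y values are untouched.
    flipped[:, ::2] = 1 - boxes[:, [2, 0]]
    return flipped


boxes = np.array([[0.1, 0.2, 0.4, 0.6]])

# Indexing a plain ndarray with a string key fails in the same way
# as the older line `target["boxes"][:, [2, 0]]` in the traceback.
try:
    boxes["boxes"]
    error_message = ""
except IndexError as exc:
    error_message = str(exc)

flipped = hflip_boxes(boxes)
```

If the installed doctr still has the dict-style line while the training script passes arrays, the two are out of sync, which would match the suggestion to update the package.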

SalehBM · 2024-05-17T11:35:36Z

Hey @felixdittrich92
You're right, I deleted some data augmentation lines because they caused an error stating that T does not have the attribute "RandomResize". To resolve this, I decided to remove all lines related to it.

doctr/references/detection/train_tensorflow.py

Lines 238 to 241 in 45c2df3

    
    T.OneOf([
        T.RandomApply(T.RandomCrop(ratio=(0.6, 1.33)), 0.25),
        T.RandomResize(scale_range=(0.4, 0.9), preserve_aspect_ratio=0.5, symmetric_pad=0.5, p=0.25),
    ]),

doctr/references/detection/train_tensorflow.py

Lines 247 to 250 in 45c2df3

    
    T.OneOf([
        T.RandomApply(T.RandomCrop(ratio=(0.6, 1.33)), 0.25),
        T.RandomResize(scale_range=(0.4, 0.9), preserve_aspect_ratio=0.5, symmetric_pad=0.5, p=0.25),
    ]),

felixdittrich92 · 2024-05-17T12:46:48Z

Hey @SalehBM 👋,

But the transformations are available and correct. As mentioned, it looks like your train script is up to date (with the main branch) but the installed doctr code isn't ^^

In the end it's your decision which augmentations you want to apply, but if you want to try everything as it is in the current train script, it should work.

You only need to check out the main branch -> git pull -> pip install -e .

One point about using TensorFlow training on GPU:
We use threading under the hood, but that doesn't work well with TF, so please disable it before starting the training.

example:

DOCTR_MULTIPROCESSING_DISABLE=TRUE USE_TF=1 python doctr/references/detection/train_tensorflow.py ...
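
If the training is started from Python rather than from a shell, the same two variables can be passed through the process environment. This is only a sketch: the helper names are hypothetical, the script path is the one from the command above, and the remaining script arguments (the `...` part) are left for the caller to supply.

```python
import os
import sys


def build_training_env() -> dict:
    """Copy the current environment and add the two variables from the command above."""
    env = os.environ.copy()
    env["DOCTR_MULTIPROCESSING_DISABLE"] = "TRUE"
    env["USE_TF"] = "1"
    return env


def build_training_command(script_args: list) -> list:
    """Assemble the interpreter call; script_args stands in for the elided arguments."""
    script = "doctr/references/detection/train_tensorflow.py"
    return [sys.executable, script, *script_args]


env = build_training_env()
command = build_training_command([])

# To launch the run (not executed in this sketch):
# import subprocess
# subprocess.run(command, env=env, check=True)
```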

SalehBM · 2024-05-19T18:56:08Z

Looks great, problem solved!
Thank you, @felixdittrich92!

SalehBM closed this as completed May 19, 2024
