fine tune DBNET problem #1599

Closed
SalehBM opened this issue May 16, 2024 · 4 comments

SalehBM commented May 16, 2024

Hey everyone!

Today I was trying to fine-tune the DBNet detector, but I ran into a problem that I couldn't solve:

Namespace(train_path='./train', val_path='./val', arch='db_resnet50', name=None, epochs=5, batch_size=2, device=None, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=None, resume=None, test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.03153s (2 samples in 1 batches)
Train set loaded in 0.02547s (30 samples in 15 batches)

  0%|                                                    | 0/15 [00:00<?, ?it/s]
Training loss: 3.39883:   0%|                            | 0/15 [00:04<?, ?it/s]
Training loss: 3.39883:   7%|█▎                  | 1/15 [00:04<01:00,  4.32s/it]
Training loss: 2.69255:   7%|█▎                  | 1/15 [00:04<01:00,  4.32s/it]
Training loss: 2.69255:  13%|██▋                 | 2/15 [00:04<00:26,  2.05s/it]
Training loss: 2.09078:  13%|██▋                 | 2/15 [00:05<00:26,  2.05s/it]
Training loss: 2.09078:  20%|████                | 3/15 [00:05<00:21,  1.80s/it]
Traceback (most recent call last):
  File "/tf/doctr/references/detection/train_pytorch.py", line 473, in <module>
    main(args)
  File "/tf/doctr/references/detection/train_pytorch.py", line 380, in main
    fit_one_epoch(model, train_loader, batch_transforms, optimizer, scheduler, amp=args.amp)
  File "/tf/doctr/references/detection/train_pytorch.py", line 109, in fit_one_epoch
    for images, targets in pbar:
  File "/usr/local/lib/python3.11/dist-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.11/dist-packages/torch/_utils.py", line 705, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/usr/local/lib/python3.11/dist-packages/doctr/datasets/datasets/base.py", line 67, in __getitem__
    img_transformed, target[class_name] = self.sample_transforms(img, bboxes)
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/doctr/transforms/modules/base.py", line 56, in __call__
    x, target = t(x, target)
                ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/doctr/transforms/modules/pytorch.py", line 168, in forward
    _target["boxes"][:, ::2] = 1 - target["boxes"][:, [2, 0]]
                                   ~~~~~~^^^^^^^^^
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

@felixdittrich92
Contributor

Hi @SalehBM 👋,
Looks like you have modified the augmentations, right?

Could you please update to the latest changes from the main branch? It's already fixed :)

_target[:, ::2] = 1 - target[:, [2, 0]]
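
For readers hitting the same traceback, here is a minimal sketch of what the two versions of that line do. It assumes the boxes are an (N, 4) NumPy array of relative (xmin, ymin, xmax, ymax) coordinates; the helper name hflip_boxes is made up for illustration and is not part of doctr. The older line indexes the target with the string "boxes", which only works when the target is a dict, so it would fail in this way if the train script passes a plain array.

```python
import numpy as np


def hflip_boxes(boxes):
    """Mirror relative (xmin, ymin, xmax, ymax) boxes horizontally.

    Hypothetical helper for illustration only, not a doctr function.
    """
    flipped = boxes.copy()
    # Columns 0 and 2 are xmin and xmax: new xmin = 1 - old xmax,
    # new xmax = 1 - old xmin. The y columns are left untouched.
    flipped[:, ::2] = 1 - boxes[:, [2, 0]]
    return flipped


boxes = np.array([[0.1, 0.2, 0.4, 0.5]], dtype=np.float32)
flipped = hflip_boxes(boxes)

# Indexing a plain array with a string key reproduces the error
# message from the traceback.
try:
    boxes["boxes"]
except IndexError as err:
    message = str(err)
```

If the target is a dict in your installed version, the dict-style indexing is the one that matches, which is why the script and the library need to come from the same revision.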


SalehBM commented May 17, 2024

Hey @felixdittrich92
You're right, I deleted some data augmentation lines because they caused an error stating that T does not have the attribute "RandomResize". To resolve this, I decided to remove all lines related to it.

T.OneOf([
    T.RandomApply(T.RandomCrop(ratio=(0.6, 1.33)), 0.25),
    T.RandomResize(scale_range=(0.4, 0.9), preserve_aspect_ratio=0.5, symmetric_pad=0.5, p=0.25),
]),



felixdittrich92 commented May 17, 2024

Hey @SalehBM 👋,

But the transformations are available and correct. As mentioned, it looks like your train script is up to date (with the main branch) but the installed doctr code isn't ^^

In the end it's your decision which augmentations you want to apply, but if you want to try everything as is in the current train script, it should work.

You only need to check out the main branch -> git pull -> pip install -e .
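
A quick way to confirm that the installed package matches the train script is to check whether the transforms the script uses are actually defined. The sketch below is only an illustration: missing_attributes is a hypothetical helper, and the names to check are the ones quoted earlier in this thread.

```python
import importlib


def missing_attributes(module_name, names):
    """Return the names that the given module does not define.

    Hypothetical helper for illustration; raises ImportError if the
    module itself cannot be imported.
    """
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# Example usage against the installed doctr (requires doctr):
# missing_attributes(
#     "doctr.transforms",
#     ["OneOf", "RandomApply", "RandomCrop", "RandomResize"],
# )
# An empty list means every transform named above is available.
```

If the list is not empty, the installed doctr is probably older than the script, and reinstalling from the main branch as described above should bring them back in sync.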

One point about TensorFlow training on GPU:
We use threading under the hood, but that doesn't work well with TF, so please disable it before you start training.

example:

DOCTR_MULTIPROCESSING_DISABLE=TRUE USE_TF=1 python doctr/references/detection/train_tensorflow.py ...
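
If the training is launched from Python rather than from a shell, the same two variables can be passed through the child process environment. This is a minimal sketch assuming a subprocess launch; training_env is a hypothetical helper, and the elided script arguments are left elided.

```python
import os


def training_env(base=None):
    """Copy an environment mapping and add the two variables from the command above.

    Hypothetical helper for illustration, not part of doctr.
    """
    env = dict(os.environ if base is None else base)
    env["DOCTR_MULTIPROCESSING_DISABLE"] = "TRUE"
    env["USE_TF"] = "1"
    return env


# Example launch (script arguments elided as in the command above):
# import subprocess, sys
# subprocess.run(
#     [sys.executable, "doctr/references/detection/train_tensorflow.py", ...],
#     env=training_env(),
# )
```

Copying the environment instead of mutating os.environ keeps the setting scoped to the training process.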


SalehBM commented May 19, 2024

Looks great, problem solved!
Thank you, @felixdittrich92!

SalehBM closed this as completed May 19, 2024