
[Bug] Fix prepare icdar2013 dataset with lmdb flag #1822

Open · wants to merge 2 commits into base: dev-1.x

Conversation

@hugotong6425 (Contributor) commented Mar 29, 2023

Motivation

When I run

python tools/dataset_converters/prepare_dataset.py \
    icdar2013 \
    --task textrecog \
    --nproc 8 \
    --lmdb

with

# mmocr/dataset_zoo/icdar2013/textrecog.py

# This configuration prepares the ICDAR13 857 and 1015
# version, and uses ICDAR13 1015 version by default.
# You may uncomment the lines if you want to use the original version,
# which contains 1095 samples.
# You can check out the generated base config and use the 857
# version by using its corresponding config variables in your model.

data_root = 'data/icdar2013'
cache_path = 'data/cache'

train_preparer = dict(
    obtainer=dict(
        type='NaiveDataObtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='https://rrc.cvc.uab.es/downloads/'
                'Challenge2_Training_Task3_Images_GT.zip',
                save_name='ic13_textrecog_train_img_gt.zip',
                md5='6f0dbc823645968030878df7543f40a4',
                content=['image'],
                mapping=[
                    # ['ic13_textrecog_train_img_gt/gt.txt',
                    # 'annotations/train.txt'],
                    ['ic13_textrecog_train_img_gt', 'textrecog_imgs/train']
                ]),
            dict(
                url='https://download.openmmlab.com/mmocr/data/1.x/recog/'
                'icdar_2013/train_labels.json',
                save_name='ic13_train_labels.json',
                md5='008fcd0056e72c4cf3064fb4d1fce81b',
                content=['annotation'],
                mapping=[['ic13_train_labels.json', 'textrecog_train.json']]),
        ]))

# Note that we offer two versions of test set annotations as follows. Please
# choose one of them to download and comment out the other. By default, we use
# the second one.
# 1. The original official annotation, which contains 1095 test
# samples.

# Uncomment the test_preparer if you want to use the original 1095 version.

test_preparer = dict(
    obtainer=dict(
        type='NaiveDataObtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='https://rrc.cvc.uab.es/downloads/'
                'Challenge2_Test_Task3_Images.zip',
                save_name='ic13_textrecog_test_img.zip',
                md5='3206778eebb3a5c5cc15c249010bf77f',
                split=['test'],
                content=['image'],
                mapping=[['ic13_textrecog_test_img',
                          'textrecog_imgs/test']]),
            dict(
                url='https://rrc.cvc.uab.es/downloads/'
                'Challenge2_Test_Task3_GT.txt',
                save_name='ic13_textrecog_test_gt.txt',
                md5='2634060ed8fe6e7a4a9b8d68785835a1',
                split=['test'],
                content=['annotation'],
                mapping=[[
                    'ic13_textrecog_test_gt.txt', 'annotations/test.txt'
                ]]),  # noqa
            # The 857 version further pruned words shorter than 3 characters.
            dict(
                url='https://download.openmmlab.com/mmocr/data/1.x/recog/'
                'icdar_2013/textrecog_test_857.json',
                save_name='textrecog_test_857.json',
                md5='3bed3985b0c51a989ad4006f6de8352b',
                split=['test'],
                content=['annotation'],
            ),
        ]),
    gatherer=dict(type='MonoGatherer', ann_name='test.txt'),
    parser=dict(
        type='ICDARTxtTextRecogAnnParser', separator=', ',
        format='img, text'),  # noqa
    packer=dict(type='TextRecogPacker'),
    dumper=dict(type='JsonDumper'),
)

I got this error:

Traceback (most recent call last):
  File "tools/dataset_converters/prepare_dataset.py", line 153, in <module>
    main()
  File "tools/dataset_converters/prepare_dataset.py", line 147, in main
    cfg = force_lmdb(cfg)
  File "tools/dataset_converters/prepare_dataset.py", line 96, in force_lmdb
    raise ValueError(
ValueError: train split does not come with a dumper, so most likely the annotations are MMOCR-ready and do not need any adaptation, and it cannot be dumped in LMDB format.

Modification

Fix the above error.

Checklist

Before PR:

  • I have read and followed the workflow indicated in the CONTRIBUTING.md to create this PR.
  • Pre-commit or linting tools indicated in CONTRIBUTING.md are used to fix the potential lint issues.
  • Bug fixes are covered by unit tests, the case that causes the bug should be added in the unit tests.
  • New functionalities are covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  • The documentation has been modified accordingly, including docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with some of those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

…g prepare_dataset.py on test set option 1 when lmdb flag
codecov bot commented Mar 29, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (afe58a4) 89.19% compared to head (fd0baa6) 89.19%.

❗ Current head fd0baa6 differs from pull request most recent head df79a95. Consider uploading reports for the commit df79a95 to get more accurate results

Additional details and impacted files
@@           Coverage Diff            @@
##           dev-1.x    #1822   +/-   ##
========================================
  Coverage    89.19%   89.19%           
========================================
  Files          193      193           
  Lines        11311    11311           
  Branches      1607     1607           
========================================
  Hits         10089    10089           
  Misses         902      902           
  Partials       320      320           
Flag Coverage Δ
unittests 89.19% <ø> (ø)

Flags with carried forward coverage won't be shown.


@gaotongxiao (Collaborator) left a comment

In IC13, there are some wrong labels in the original annotation and they are manually fixed by researchers. Therefore, these arguments are intentionally left empty, as we just use the corrected labels (L27-L33) instead of the generated ones.

That is to say, you should be aware of a few wrong labels in your lmdb now, and this PR needs to be revised to support such a case.

@hugotong6425 (Contributor, author)

Thanks for the reminder. Are there any known label issues with other datasets besides ICDAR 2013, so that I can exclude them from training?

@gaotongxiao (Collaborator) commented Mar 30, 2023

@hugotong6425 Currently not. But it's also fine to mix lmdb and json datasets for training & testing, so you can probably just work with IC13 json annotations.
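If it helps, mixing the two backends might look like the config sketch below. The data roots and file names are placeholders I made up, and I'm assuming MMOCR 1.x's OCRDataset (json-backed), RecogLMDBDataset (lmdb-backed), and ConcatDataset types:

```python
# Sketch only: paths below are hypothetical placeholders.
# OCRDataset reads MMOCR-style json; RecogLMDBDataset reads lmdb.
ic13_json_dataset = dict(
    type='OCRDataset',
    data_root='data/icdar2013',
    ann_file='textrecog_train.json',
    pipeline=[])

other_lmdb_dataset = dict(
    type='RecogLMDBDataset',
    data_root='data/other_dataset',
    ann_file='textrecog_train.lmdb',
    pipeline=[])

# ConcatDataset lets one dataloader draw from both backends at once
train_dataset = dict(
    type='ConcatDataset',
    datasets=[ic13_json_dataset, other_lmdb_dataset])
```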

Back to this PR: do you still want to go ahead and enhance the lmdb flag so that it can handle this case? Essentially, just doing a conversion from textrecog_train.json to textrecog_train.lmdb should work. If you are not able to work on it at the moment, we can leave it to future work and close this PR.

@hugotong6425 (Contributor, author)

I can help handle this case, but I am not very familiar with the framework. Should I create a new Parser class to replace ICDARTxtTextRecogAnnParser, keeping everything else the same, to handle this task?

# original code
gatherer=dict(type='MonoGatherer', ann_name='train.txt'),
parser=dict(
    type='ICDARTxtTextRecogAnnParser', separator=', ',
    format='img, text'),  # noqa
packer=dict(type='TextRecogPacker'),
dumper=dict(type='JsonDumper'),

@gaotongxiao (Collaborator)

Thanks. Actually, it can be done in a simpler way.

Now dumping textrecog datasets in lmdb is done in force_lmdb, which replaces any dumper with TextRecogLMDBDumper:

def force_lmdb(cfg):
    """Force the dataset to be dumped in lmdb format.

    Args:
        cfg (Config): Config object.

    Returns:
        Config: Config object.
    """
    for split in ['train', 'val', 'test']:
        preparer_cfg = cfg.get(f'{split}_preparer')
        if preparer_cfg:
            if preparer_cfg.get('dumper') is None:
                raise ValueError(
                    f'{split} split does not come with a dumper, '
                    'so most likely the annotations are MMOCR-ready and do '
                    'not need any adaptation, and it '
                    'cannot be dumped in LMDB format.')
            preparer_cfg.dumper['type'] = 'TextRecogLMDBDumper'

However, it fails when we don't want the annotation to be converted, because the dumper is never executed. In this case, we need to post-convert the annotations & images into lmdb format (right after preparer = DatasetPreparer.from_file(cfg)) and name the result textrecog_{train/test}.lmdb. We also have a function recog2lmdb for reference, but note that its input annotation format is slightly different from MMOCR's.
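The annotation-format gap mentioned here could be bridged by a small adapter. The sketch below flattens an MMOCR-ready textrecog json into (img_path, text) pairs; the field names (data_list, img_path, instances[].text) assume MMOCR 1.x's textrecog annotation format, and the function name is mine, not part of the codebase:

```python
import json


def mmocr_json_to_pairs(json_path):
    """Flatten an MMOCR-ready textrecog annotation file (e.g.
    textrecog_train.json) into (img_path, text) pairs, which an lmdb
    writer such as recog2lmdb could then consume after adapting to its
    expected label format. Sketch only, not merged code."""
    with open(json_path) as f:
        ann = json.load(f)
    pairs = []
    for item in ann['data_list']:
        # one entry per image; recognition sets carry one text
        # instance each, but iterate defensively
        for inst in item['instances']:
            pairs.append((item['img_path'], inst['text']))
    return pairs
```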

Also, if you are interested, you can join MMSIG group (Wechat: OpenMMLabwx) and earn some points for your contributions :)

…n running prepare_dataset.py on test set option 1 when lmdb flag"

This reverts commit 0fcac7e.
@hugotong6425 (Contributor, author) commented Mar 31, 2023

  cfg.dataset_name = dataset
  if args.lmdb:
      cfg = force_lmdb(cfg)
  preparer = DatasetPreparer.from_file(cfg)

  # convert annotations & images into lmdb format
  img_root = get_img_root(cfg)
  label_path = get_label_path(cfg)
  output = get_output_dir(cfg)
  func_similar_to_recog2lmdb(img_root, label_path, output)

  preparer.run(args.splits)

I am not sure if I understand you correctly, but if we post-convert the annotations & images into lmdb format right after preparer = DatasetPreparer.from_file(cfg), as in the code above, there are 2 points I am not sure how to handle:

  1. The annotations and images are not yet downloaded at this point, since they are downloaded in preparer.run(args.splits), so there is nothing to convert to lmdb yet.
  2. If the user adds the --lmdb flag when calling prepare_dataset.py, the program will still throw the ValueError: train split does not come with a dumper, so most likely the annotations are MMOCR-ready and do not need any adaptation, and it cannot be dumped in LMDB format. error on the cfg = force_lmdb(cfg) line, because it asserts that there must be a dumper while creating the DatasetPreparer.

Please let me know if I have misunderstood. Thanks!

@gaotongxiao (Collaborator)

My mistake. The conversion should be done after L149 (preparer.run(args.splits)), and force_lmdb should be adapted to skip the dumper replacement when the dumper configuration does not exist at all.
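For reference, that adaptation might look like the sketch below (dict-style access is used for brevity, which mmengine's Config also supports; the function name is mine and this is not the merged code):

```python
def force_lmdb_skipping_missing_dumpers(cfg):
    """Variant of force_lmdb that skips splits without a dumper instead
    of raising: those splits ship MMOCR-ready json annotations, which a
    json -> lmdb post-conversion step run after preparer.run() can
    handle. Sketch of the discussed change only."""
    for split in ['train', 'val', 'test']:
        preparer_cfg = cfg.get(f'{split}_preparer')
        if not preparer_cfg:
            continue
        if preparer_cfg.get('dumper') is None:
            # no dumper to replace; leave this split for the
            # post-conversion step after preparer.run(args.splits)
            continue
        preparer_cfg['dumper']['type'] = 'TextRecogLMDBDumper'
    return cfg
```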

@gaotongxiao (Collaborator)

Hi @hugotong6425 , would you still be available to work on this PR?

@hugotong6425 (Contributor, author)

Yes I can, but I am currently on vacation, so I will come back to it next week. If others want to handle this issue, please feel free to close this PR. Happy Easter, by the way!

@gaotongxiao (Collaborator)

Gotcha. I just wanted to know if you are still interested, so let's keep this PR open then. Happy Easter too!
