
[Bug] Fix prepare icdar2013 dataset with lmdb flag #1822

Open · wants to merge 2 commits into base: dev-1.x

Conversation

@hugotong6425 (Contributor) commented Mar 29, 2023

Motivation

When I run

python tools/dataset_converters/prepare_dataset.py \
    icdar2013 \
    --task textrecog \
    --nproc 8 \
    --lmdb

with

# mmocr/dataset_zoo/icdar2013/textrecog.py

# This configuration prepares the ICDAR13 857 and 1015
# version, and uses ICDAR13 1015 version by default.
# You may uncomment the lines if you want to use the original version,
# which contains 1095 samples.
# You can check out the generated base config and use the 857
# version by using its corresponding config variables in your model.

data_root = 'data/icdar2013'
cache_path = 'data/cache'

train_preparer = dict(
    obtainer=dict(
        type='NaiveDataObtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='https://rrc.cvc.uab.es/downloads/'
                'Challenge2_Training_Task3_Images_GT.zip',
                save_name='ic13_textrecog_train_img_gt.zip',
                md5='6f0dbc823645968030878df7543f40a4',
                content=['image'],
                mapping=[
                    # ['ic13_textrecog_train_img_gt/gt.txt',
                    # 'annotations/train.txt'],
                    ['ic13_textrecog_train_img_gt', 'textrecog_imgs/train']
                ]),
            dict(
                url='https://download.openmmlab.com/mmocr/data/1.x/recog/'
                'icdar_2013/train_labels.json',
                save_name='ic13_train_labels.json',
                md5='008fcd0056e72c4cf3064fb4d1fce81b',
                content=['annotation'],
                mapping=[['ic13_train_labels.json', 'textrecog_train.json']]),
        ]))

# Note that we offer two versions of test set annotations as follows. Please
# choose one of them to download and comment out the other. By default, we use
# the second one.
# 1. The original official annotation, which contains 1095 test
# samples.

# Uncomment the test_preparer if you want to use the original 1095 version.

test_preparer = dict(
    obtainer=dict(
        type='NaiveDataObtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='https://rrc.cvc.uab.es/downloads/'
                'Challenge2_Test_Task3_Images.zip',
                save_name='ic13_textrecog_test_img.zip',
                md5='3206778eebb3a5c5cc15c249010bf77f',
                split=['test'],
                content=['image'],
                mapping=[['ic13_textrecog_test_img',
                          'textrecog_imgs/test']]),
            dict(
                url='https://rrc.cvc.uab.es/downloads/'
                'Challenge2_Test_Task3_GT.txt',
                save_name='ic13_textrecog_test_gt.txt',
                md5='2634060ed8fe6e7a4a9b8d68785835a1',
                split=['test'],
                content=['annotation'],
                mapping=[[
                    'ic13_textrecog_test_gt.txt', 'annotations/test.txt'
                ]]),  # noqa
            # The 857 version further pruned words shorter than 3 characters.
            dict(
                url='https://download.openmmlab.com/mmocr/data/1.x/recog/'
                'icdar_2013/textrecog_test_857.json',
                save_name='textrecog_test_857.json',
                md5='3bed3985b0c51a989ad4006f6de8352b',
                split=['test'],
                content=['annotation'],
            ),
        ]),
    gatherer=dict(type='MonoGatherer', ann_name='test.txt'),
    parser=dict(
        type='ICDARTxtTextRecogAnnParser', separator=', ',
        format='img, text'),  # noqa
    packer=dict(type='TextRecogPacker'),
    dumper=dict(type='JsonDumper'),
)

I got this error:

Traceback (most recent call last):
  File "tools/dataset_converters/prepare_dataset.py", line 153, in <module>
    main()
  File "tools/dataset_converters/prepare_dataset.py", line 147, in main
    cfg = force_lmdb(cfg)
  File "tools/dataset_converters/prepare_dataset.py", line 96, in force_lmdb
    raise ValueError(
ValueError: train split does not come with a dumper, so most likely the annotations are MMOCR-ready and do not need any adaptation, and it cannot be dumped in LMDB format.

Modification

Fix the above error.

Checklist

Before PR:

  • I have read and followed the workflow indicated in the CONTRIBUTING.md to create this PR.
  • Pre-commit or linting tools indicated in CONTRIBUTING.md are used to fix the potential lint issues.
  • Bug fixes are covered by unit tests, the case that causes the bug should be added in the unit tests.
  • New functionalities are covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  • The documentation has been modified accordingly, including docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with some of those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

…g prepare_dataset.py on test set option 1 when lmdb flag
codecov bot commented Mar 29, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (afe58a4) 89.19% compared to head (fd0baa6) 89.19%.

❗ Current head fd0baa6 differs from pull request most recent head df79a95. Consider uploading reports for the commit df79a95 to get more accurate results

Additional details and impacted files
@@           Coverage Diff            @@
##           dev-1.x    #1822   +/-   ##
========================================
  Coverage    89.19%   89.19%           
========================================
  Files          193      193           
  Lines        11311    11311           
  Branches      1607     1607           
========================================
  Hits         10089    10089           
  Misses         902      902           
  Partials       320      320           
Flag Coverage Δ
unittests 89.19% <ø> (ø)

Flags with carried forward coverage won't be shown.


@gaotongxiao (Collaborator) left a comment

In IC13, there are some wrong labels in the original annotation and they are manually fixed by researchers. Therefore, these arguments are intentionally left empty, as we just use the corrected labels (L27-L33) instead of the generated ones.

That is to say, you should be aware of a few wrong labels in your lmdb now, and this PR needs to be revised to support such a case.

@hugotong6425 (Contributor, author)

Thanks for the reminder. Are there any known label issues with other datasets besides ICDAR 2013, so that I can exclude them from training?

@gaotongxiao (Collaborator) commented Mar 30, 2023

@hugotong6425 Currently not. But it's also fine to mix lmdb and json datasets for training & testing, so you can probably just work with IC13 json annotations.
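If it helps, mixing the two backends might look like the config sketch below. The data roots and file names are placeholders I made up, and I'm assuming MMOCR 1.x's OCRDataset (json-backed), RecogLMDBDataset (lmdb-backed), and ConcatDataset types:

```python
# Sketch only: paths below are hypothetical placeholders.
# OCRDataset reads MMOCR-style json; RecogLMDBDataset reads lmdb.
ic13_json_dataset = dict(
    type='OCRDataset',
    data_root='data/icdar2013',
    ann_file='textrecog_train.json',
    pipeline=[])

other_lmdb_dataset = dict(
    type='RecogLMDBDataset',
    data_root='data/other_dataset',
    ann_file='textrecog_train.lmdb',
    pipeline=[])

# ConcatDataset lets one dataloader draw from both backends at once
train_dataset = dict(
    type='ConcatDataset',
    datasets=[ic13_json_dataset, other_lmdb_dataset])
```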

Back to this PR: do you still want to go ahead and enhance the lmdb flag so that it can handle this case? Essentially, just doing a conversion from textrecog_train.json to textrecog_train.lmdb should work. If you are not able to work on it at the moment, we can leave it to future work and close this PR.

@hugotong6425 (Contributor, author)

I can help handle this case, but I am not very familiar with the framework. Should I create a new Parser class to replace ICDARTxtTextRecogAnnParser, keeping everything else the same, to handle this task?

# original code
gatherer=dict(type='MonoGatherer', ann_name='train.txt'),
parser=dict(
    type='ICDARTxtTextRecogAnnParser', separator=', ',
    format='img, text'),  # noqa
packer=dict(type='TextRecogPacker'),
dumper=dict(type='JsonDumper'),

@gaotongxiao (Collaborator)

Thanks. Actually, it can be done in a simpler way.

Now dumping textrecog datasets in lmdb is done in force_lmdb, which replaces any dumper with TextRecogLMDBDumper:

def force_lmdb(cfg):
    """Force the dataset to be dumped in lmdb format.

    Args:
        cfg (Config): Config object.

    Returns:
        Config: Config object.
    """
    for split in ['train', 'val', 'test']:
        preparer_cfg = cfg.get(f'{split}_preparer')
        if preparer_cfg:
            if preparer_cfg.get('dumper') is None:
                raise ValueError(
                    f'{split} split does not come with a dumper, '
                    'so most likely the annotations are MMOCR-ready and do '
                    'not need any adaptation, and it '
                    'cannot be dumped in LMDB format.')
            preparer_cfg.dumper['type'] = 'TextRecogLMDBDumper'

However, it fails when we don't want the annotation to be converted, because the dumper is never executed. In this case, we need to post-convert the annotations & images into lmdb format (right after preparer = DatasetPreparer.from_file(cfg)) and name the result textrecog_{train/test}.lmdb. We also have a function recog2lmdb for reference, but note that its input annotation format is slightly different from MMOCR's.
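The annotation-format gap mentioned here could be bridged by a small adapter. The sketch below flattens an MMOCR-ready textrecog json into (img_path, text) pairs; the field names (data_list, img_path, instances[].text) assume MMOCR 1.x's textrecog annotation format, and the function name is mine, not part of the codebase:

```python
import json


def mmocr_json_to_pairs(json_path):
    """Flatten an MMOCR-ready textrecog annotation file (e.g.
    textrecog_train.json) into (img_path, text) pairs, which an lmdb
    writer such as recog2lmdb could then consume after adapting to its
    expected label format. Sketch only, not merged code."""
    with open(json_path) as f:
        ann = json.load(f)
    pairs = []
    for item in ann['data_list']:
        # one entry per image; recognition sets carry one text
        # instance each, but iterate defensively
        for inst in item['instances']:
            pairs.append((item['img_path'], inst['text']))
    return pairs
```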

Also, if you are interested, you can join MMSIG group (Wechat: OpenMMLabwx) and earn some points for your contributions :)

…n running prepare_dataset.py on test set option 1 when lmdb flag"

This reverts commit 0fcac7e.
@hugotong6425 (Contributor, author) commented Mar 31, 2023

  cfg.dataset_name = dataset
  if args.lmdb:
      cfg = force_lmdb(cfg)
  preparer = DatasetPreparer.from_file(cfg)

  # convert annotations & images into lmdb format
  img_root = get_img_root(cfg)
  label_path = get_label_path(cfg)
  output = get_output_dir(cfg)
  func_similar_to_recog2lmdb(img_root, label_path, output)

  preparer.run(args.splits)

I am not sure if I understand you correctly, but if we post-convert the annotations & images into lmdb format right after preparer = DatasetPreparer.from_file(cfg), as in the code above, there are 2 points I am not sure how to handle:

  1. The annotations and images are not yet downloaded at this point, since they are downloaded in preparer.run(args.splits), so there is nothing to convert to lmdb yet.
  2. If the user adds the --lmdb flag when calling prepare_dataset.py, the program will still throw the ValueError: train split does not come with a dumper, so most likely the annotations are MMOCR-ready and do not need any adaptation, and it cannot be dumped in LMDB format. error on the cfg = force_lmdb(cfg) line, because it asserts that there must be a dumper while creating the DatasetPreparer.

Please let me know if I have misunderstood. Thanks!

@gaotongxiao (Collaborator)

My mistake. The conversion should be done after L149 (preparer.run(args.splits)), and force_lmdb should be adapted to skip the dumper replacement when the dumper configuration does not exist at all.
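For reference, that adaptation might look like the sketch below (dict-style access is used for brevity, which mmengine's Config also supports; the function name is mine and this is not the merged code):

```python
def force_lmdb_skipping_missing_dumpers(cfg):
    """Variant of force_lmdb that skips splits without a dumper instead
    of raising: those splits ship MMOCR-ready json annotations, which a
    json -> lmdb post-conversion step run after preparer.run() can
    handle. Sketch of the discussed change only."""
    for split in ['train', 'val', 'test']:
        preparer_cfg = cfg.get(f'{split}_preparer')
        if not preparer_cfg:
            continue
        if preparer_cfg.get('dumper') is None:
            # no dumper to replace; leave this split for the
            # post-conversion step after preparer.run(args.splits)
            continue
        preparer_cfg['dumper']['type'] = 'TextRecogLMDBDumper'
    return cfg
```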

@gaotongxiao (Collaborator)

Hi @hugotong6425 , would you still be available to work on this PR?

@hugotong6425 (Contributor, author)

Yes I can, but I am currently on vacation, so I will come back to it next week. If others want to handle this issue, please feel free to close this PR. Happy Easter, by the way!

@gaotongxiao (Collaborator)

Gotcha. I just wanted to know if you are still interested, so let's keep this PR open then. Happy Easter too!
