
Issue 1392: Vietnamese dictionaries #9

Open · wants to merge 2 commits into main
Conversation

jimregan (Contributor)
https://code.google.com/p/tesseract-ocr/issues/detail?id=1392

What steps will reproduce the problem?

  1. Unpack vie.traineddata downloaded from Tesseract repository
  2. Run dawg2wordlist on vie.freq-dawg & vie.word-dawg to recover original lists
  3. Examine the content
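
For reference, steps 1 and 2 can be scripted. Below is a minimal sketch (not part of the original report) using the Tesseract training tools combine_tessdata and dawg2wordlist; the recovered-wordlist file names are placeholders chosen for illustration:

```python
import subprocess

# 1. Unpack vie.traineddata into its individual components
#    (produces vie.unicharset, vie.freq-dawg, vie.word-dawg, ...).
subprocess.run(["combine_tessdata", "-u", "vie.traineddata", "vie."], check=True)

# 2. Convert the DAWGs back into plain word lists; dawg2wordlist needs the
#    unicharset the DAWGs were built against.
for dawg, wordlist in [("vie.freq-dawg", "vie.freq.recovered.txt"),
                       ("vie.word-dawg", "vie.word.recovered.txt")]:
    subprocess.run(["dawg2wordlist", "vie.unicharset", dawg, wordlist], check=True)

# 3. Examine the content, e.g. count entries and eyeball the first few.
with open("vie.word.recovered.txt", encoding="utf-8", errors="replace") as f:
    words = [w.strip() for w in f if w.strip()]
print(f"{len(words)} words recovered; first few: {words[:10]}")
```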

What is the expected output? What do you see instead?

The recovered word lists are found to be incomplete and contain many erroneous entries.

Please use the included dictionaries as training data for the Vietnamese language.

Apr 16, 2015
#1 zdenop

Can you have a look at and review the Vietnamese dictionaries in the langdata repository?

https://code.google.com/p/tesseract-ocr/source/browse/vie/?repo=langdata&name=master

Apr 19, 2015
#2 nguyenq87
vie.wordlist.clean would need to be scrapped entirely: it contains many misspelled Vietnamese and English words, words missing diacritical marks, and words run together (Vietnamese words are mostly monosyllabic).

The provided vie.words_list is composed of several lists commonly used among Vietnamese-language application developers, including those from http://www.informatik.uni-leipzig.de/~duc/software/misc/wordlist.html.

The fourth column in vie.unicharambigs contains many characters that are not Vietnamese, e.g., üûñËÄ. Those characters should not be used as match targets.

http://vietunicode.sourceforge.net/charset/vietalphabet.html
(http://web.archive.org/web/20150413012244/https://code.google.com/p/tesseract-ocr/issues/detail?id=1392)
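
To make the character restriction concrete, here is a rough Python sketch (my own approximation of the alphabet from the vietalphabet page above, not part of the original report) that flags characters which cannot occur in Vietnamese words, such as ü, û, ñ, Ë, Ä:

```python
import unicodedata

# Approximate model of the Vietnamese alphabet: base letters (no f, j, w, z),
# the shape diacritics breve/circumflex/horn restricted to the vowels they
# actually combine with, and the five tone marks on vowels only.
BASE_LETTERS = set("abcdeghiklmnopqrstuvxyđ")
VOWELS = set("aeiouy")
SHAPE_MARKS = {"\u0302": set("aeo"),   # circumflex: â ê ô
               "\u0306": set("a"),     # breve: ă
               "\u031b": set("ou")}    # horn: ơ ư
TONE_MARKS = {"\u0301", "\u0300", "\u0309", "\u0303", "\u0323"}

def is_vietnamese_char(ch):
    decomposed = unicodedata.normalize("NFD", ch.lower())
    base, marks = decomposed[0], decomposed[1:]
    if base not in BASE_LETTERS:
        return False
    for mark in marks:
        if mark in TONE_MARKS:
            if base not in VOWELS:
                return False
        elif mark in SHAPE_MARKS:
            if base not in SHAPE_MARKS[mark]:
                return False
        else:
            return False
    return True

print([ch for ch in "üûñËÄ" if not is_vietnamese_char(ch)])   # all five flagged
print([ch for ch in "người" if not is_vietnamese_char(ch)])   # [] -- valid Vietnamese
```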

@jimregan (Contributor, Author)

@nguyenq - does the new Vietnamese language pack fix this issue?

@nguyenq commented Jul 16, 2015

I haven't tested the updated vie.traineddata; however, the vie.wordlist file found in https://github.com/tesseract-ocr/langdata/tree/master/vie is still plagued with errors. Many of the words appear to be corrupted (broken UTF-8 encoding), as evidenced by the block (replacement) characters. Many words contain punctuation marks (.,":?!). Far too many are misspelled.

Why was the provided dictionary word list not accepted?
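
As an illustration of the kind of checks described above, the following sketch (mine, not from the thread; the input file name is a placeholder) flags word-list entries with broken UTF-8 or embedded punctuation:

```python
# Flag entries that contain the Unicode replacement character (a sign of
# broken UTF-8 decoding) or stray punctuation marks.
PUNCT = set('.,":?!;()')

corrupted, punctuated = [], []
with open("vie.wordlist", encoding="utf-8", errors="replace") as f:
    for line in f:
        word = line.strip()
        if not word:
            continue
        if "\ufffd" in word:            # U+FFFD marks bytes that failed to decode
            corrupted.append(word)
        elif any(ch in PUNCT for ch in word):
            punctuated.append(word)

print(f"{len(corrupted)} entries with broken encoding, "
      f"{len(punctuated)} entries containing punctuation")
```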

@Fadi0950

Urdu words
