
Wordlists and training texts contain lots of errors #1

Open
stweil opened this issue Aug 11, 2018 · 16 comments

Comments

@stweil
Contributor

stweil commented Aug 11, 2018

A short test with codespell (which only finds the most common typos for English) found more than 1000 errors in eng.wordlist.

The German wordlist deu.wordlist contains the well-known B / ß confusion and also other errors.

The training texts also contain similar errors. In addition, I noticed many foreign (Turkish?) words in the German text.

Are such errors critical for the trained model which is based on that data?
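As an illustration of the kind of check that finds the B / ß confusion, here is a minimal sketch; it assumes deu.wordlist is a plain UTF-8 file with one word per line, and the heuristic regex is mine, not part of codespell.

```python
# Sketch: flag likely ß -> B OCR confusions in deu.wordlist.
# Assumes one word per line, UTF-8 encoded.
import re

# A capital "B" between lowercase letters (as in "drauBen") is a strong
# hint that an original "ß" was misrecognized.
B_FOR_SS = re.compile(r"(?<=[a-zäöü])B(?=[a-zäöü])")

def suspicious_words(path="deu.wordlist"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if B_FOR_SS.search(word):
                # Yield the suspect together with the likely intended spelling.
                yield word, B_FOR_SS.sub("ß", word)

if __name__ == "__main__":
    for word, guess in suspicious_words():
        print(f"{word} -> {guess}")
```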

@stweil stweil changed the title Wordlists contain lots of errors Wordlists and training texts contain lots of errors Aug 11, 2018
@amitdo

amitdo commented Sep 14, 2018

The word lists and training texts were generated using a web crawler.
Some filtering was done as a post-processing step.

So the undesirable effects you mentioned are to be expected.

@stweil
Contributor Author

stweil commented Sep 15, 2018

Using a web crawler on German texts will normally not find words like "drauBen" (instead of "draußen"), unless you crawl OCR results which were made with English language settings. It looks like Ray crawled Google Books. What happens if Google learns from Google? At some point there will be lots of evidence that "drauBen" is correct. :-) Searching for "drauBen" (with Google Search, of course) already finds texts outside of Google Books, but those were maybe generated by Google Translate.

So using a web crawler is fine as long as it only crawls more reliable content (German text corpora, German Wikipedia, German newspapers, German books from Wikisource or Project Gutenberg, ...).

@amitdo

amitdo commented Sep 15, 2018

tesseract-ocr/tesseract#654 (comment)

theraysmith commented on Jan 23, 2017

The text corpus is from all the www, taken several years ago, plus more
recent data from wiki-something.
The text is divided by language automatically, so there is a separate
stream for each of the Devanagari-based languages (as there is for the
Latin-based languages) and clipped to 1GB for each language.
For each language, the text is frequency counted and cleaned by multiple
methods, and sometimes this cleaning is too stringent automatically, or not
stringent enough, so forbidden_characters and desired_characters are used
as a guide in the cleanup process. There are other lang-specific numbers
like a 1-in-n discard ratio for the frequency.
For some languages, the amount of data produced at the end is very thin.

The unicharset is extracted from what remains, and the wordlist that is
published in langdata.
For the LSTM training, I resorted to using Google's parallel infrastructure
to render enough text in all the languages.
However much or little corpus text there is, the rendering process makes
50000 chunks of 50 words to render in a different combination of font and
random degradation, which results in 400000-800000 rendered textlines.
The words are chosen to approximately echo the real frequency of conjunct
clusters (characters in most languages) in the source text, while also
using the most frequent words.

This process is all done without significant manual intervention, but
counts of the number of generated textlines indicates when it has gone
badly, usually due to a lack of fonts, or a lack of corpus text.
I recently stopped training chr, iku, khm, mya after discovering that I
have no rendered textlines that contain anything other than digits and
punctuation.

Community input is therefore extremely useful, and usually results in edits
to forbidden_characters and desired_characters, which in turn guides the
filtration process.
Community-provided corpus text would be useful for languages that have very
little or no training data, given appropriate copyright/licensing clearance.
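The frequency-weighted chunking step Ray describes could look roughly like the sketch below. This is only an illustration of the idea (draw 50-word chunks so that their word frequencies echo the corpus), not the actual Google pipeline; the corpus file name is a placeholder.

```python
# Sketch of frequency-weighted chunking: draw 50-word chunks so that word
# frequencies in the chunks roughly echo the frequencies in the source corpus.
import random
from collections import Counter

CHUNKS = 50_000      # number of chunks mentioned in the quote
WORDS_PER_CHUNK = 50

def build_chunks(corpus_tokens, n_chunks=CHUNKS, chunk_len=WORDS_PER_CHUNK, seed=0):
    freq = Counter(corpus_tokens)
    words = list(freq)
    weights = [freq[w] for w in words]
    rng = random.Random(seed)             # seeded, so the result is reproducible
    for _ in range(n_chunks):
        yield rng.choices(words, weights=weights, k=chunk_len)

if __name__ == "__main__":
    tokens = open("corpus.txt", encoding="utf-8").read().split()  # placeholder corpus file
    for chunk in build_chunks(tokens, n_chunks=3):
        print(" ".join(chunk))
```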

@amitdo

amitdo commented Sep 15, 2018

wiki-something

Wikipedia? Other Wikimedia wikis?

@wrznr

wrznr commented Apr 16, 2019

Community-provided corpus text would be useful for languages

Let's say we provide corpus text. Is there only the slightest chance that retraining *.tessdata files is going to happen? Does anyone even know the necessary commands for rebuilding the models provided in the tessdata repos?

@zdenop
Contributor

zdenop commented Apr 17, 2019

IMO (I have not tried it yet) it should be possible at least for LSTM: see the wiki page training-from-scratch.
Experience from training the legacy engine (tesseract 3.x) was that nobody was able to match the results of Google's trained data for standard fonts, so I would not invest time in retraining the legacy part (unless you have a very specific font where the current data gives bad results).

@wrznr

wrznr commented Apr 17, 2019

Thanks for your assessment. I guess reproducing the current models would be very useful before trying to improve them. I'll give it a try. And yes, I am only interested in LSTM training.

@stweil
Contributor Author

stweil commented Apr 17, 2019

My own experience with legacy training is different. It was quite easy to train a usable Fraktur model (frk.traineddata), but so far I have not succeeded in training a similar LSTM model from scratch.

Legacy training only requires a selection of good fonts and a short training text which includes all glyphs, so it is sufficient to make an artificial text listing those glyphs.
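For illustration, such an artificial training text could be generated like the sketch below; the glyph list, filler words and file name are placeholder examples, not the actual frk training setup.

```python
# Sketch: build a short artificial training text that contains every glyph
# at least once, padded with a few real words. The glyph list below is an
# illustrative subset, not the full character set of a real model.
GLYPHS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZäöüÄÖÜß0123456789.,;:!?()-"
SAMPLE_WORDS = ["draußen", "Straße", "Bücher", "Qualität", "zwölf"]  # placeholder filler words

def artificial_text(glyphs=GLYPHS, words=SAMPLE_WORDS, per_line=20):
    lines = []
    # Group the glyphs into short space-separated lines ...
    for i in range(0, len(glyphs), per_line):
        lines.append(" ".join(glyphs[i:i + per_line]))
    # ... and add a few natural words so the trainer also sees realistic shapes.
    lines.append(" ".join(words))
    return "\n".join(lines)

if __name__ == "__main__":
    with open("artificial.training_text", "w", encoding="utf-8") as f:  # placeholder file name
        f.write(artificial_text())
```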

@wrznr

wrznr commented Apr 17, 2019

Just to make sure: by reproducing, I mean more or less exactly reproducing the current state of the stack models.

@stweil
Contributor Author

stweil commented Apr 17, 2019

I am afraid that reproducing the current models won't be possible, maybe not even with Google-internal information. If the text used for training was extracted from Internet sources (it looks like that), then that extraction cannot be reproduced. One would also need the original extracted text, how it was distributed over the trained fonts, and which parameters were used for text2image. If the distribution was random, it can only be reproduced if it used pseudo-randomness and if the random sequence is reproducible.

Most of the current models have known deficits, so maybe it is not a great loss if they cannot be reproduced exactly. The important thing is finding a way to get new models from scratch without those deficits, but with comparable or better quality, and with a 100 % defined training process.
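A small illustration of the reproducibility point: if the assignment of text lines to fonts is drawn from a seeded pseudo-random generator, the same assignment comes out on every run. The font list and text lines below are placeholders, not the ones Google used.

```python
# Illustration: distributing text lines over fonts with a *seeded*
# pseudo-random generator gives the same assignment on every run;
# an unseeded (wall-clock seeded) run would not.
import random

FONTS = ["Latin Modern Roman", "DejaVu Serif", "UnifrakturMaguntia"]  # placeholder font list

def assign_fonts(lines, fonts=FONTS, seed=42):
    rng = random.Random(seed)          # fixed seed -> reproducible sequence
    return [(line, rng.choice(fonts)) for line in lines]

if __name__ == "__main__":
    lines = ["Erste Zeile", "Zweite Zeile", "Dritte Zeile"]  # placeholder training lines
    # Running this twice yields the identical pairing.
    assert assign_fonts(lines) == assign_fonts(lines)
    for line, font in assign_fonts(lines):
        print(f"{font}: {line}")
```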

@zdenop
Contributor

zdenop commented Apr 17, 2019

Just to be clear regarding my statement about the legacy engine: Fraktur fonts count as special fonts.

@amitdo

amitdo commented Apr 17, 2019

Another issue is that some of the fonts they used for training are not open source fonts and cost some $$.

@stweil
Contributor Author

stweil commented Apr 17, 2019

@wrznr, I think that Ray's statement is the best piece of information which we currently have on the training done by Google.

The text corpus is from all the www, taken several years ago, plus more
recent data from wiki-something.
The text is divided by language automatically, so there is a separate
stream for each of the Devanagari-based languages (as there is for the
Latin-based languages) and clipped to 1GB for each language.

A 1 GB text file for a single language which was taken from "all the www" is not only too large to be easily handled, but will also contain lots of copyrighted text. That might be a major reason why such files could not be shared.

@wrznr

wrznr commented Apr 17, 2019

@stweil I missed that piece of information. Thanks. I always thought that the training texts would be part of the data repos. If this is not the case, I really think we should make an effort and come up with re-trainable alternatives. Wikipedia could be a good source for the texts.

@stweil
Contributor Author

stweil commented Apr 17, 2019

The small training texts in the data repos were sufficient for the legacy model. I have no idea how the larger training texts in langdata_lstm were used at Google, but they are obviously much smaller than a gigabyte.

Wikipedia can contribute training text, but those texts use modern language and are not formatted like printed books. Wikisource offers older texts, and other projects (like Project Gutenberg) also offer the typical book layout. I expect higher quality from those sources than from a more random www sample. Maybe we can also use other large existing text corpora.
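As a rough sketch of how such sources could be turned into training text, the snippet below streams a Wikipedia or Wikisource XML dump and strips the wiki markup with the third-party mwparserfromhell package. The dump and output file names are placeholders, and this is only one possible approach, not an established langdata workflow.

```python
# Rough sketch: extract plain text from a Wikipedia/Wikisource XML dump.
# Requires the third-party package mwparserfromhell (pip install mwparserfromhell).
# The dump file name is a placeholder; real dumps are bz2-compressed XML.
import bz2
import xml.etree.ElementTree as ET
import mwparserfromhell

DUMP = "dewikisource-latest-pages-articles.xml.bz2"  # placeholder dump name

def plain_texts(dump_path=DUMP):
    with bz2.open(dump_path, "rb") as f:
        # Stream the dump so the whole file never has to fit into memory.
        for _, elem in ET.iterparse(f):
            if elem.tag.endswith("}text") and elem.text:
                yield mwparserfromhell.parse(elem.text).strip_code()
                elem.clear()  # release the processed element's text buffer

if __name__ == "__main__":
    with open("wikisource.training_text", "w", encoding="utf-8") as out:  # placeholder output
        for text in plain_texts():
            out.write(text + "\n")
```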

@wollmers

Just my 2 cents as a comment on what the basic language models should be:

  1. Modern language: let's define it for German as 1950 or later.

Personally, I gave up the idea of distinguishing orthographies at the boundaries of 1750, 1830, 1875, 1901 and 1996. Now I just divide my corpora into periods of 50 years like 1800-1849, 1850-1899, etc. It's always possible to combine them into longer periods.

Modern, because I assume the majority of users need modern language. Archives and libraries have other requirements and can help themselves.

  2. Training text

Of all the available corpora I know, https://wortschatz.uni-leipzig.de/de/download provides random "proper" sentences in different sizes and domains (e.g. news, web, wikipedia). For German there are up to 300 M sentences, which is IMHO not very handy to process. The license is friendly:

All corpora provided for download are licensed under CC BY.

Some numbers showing what 1 M sentences mean:

deu-at_web-public_2019_1M


                           TOTAL         UNIQUE
words:                  18180427         900958 # tokens, i.e. including punctuation tokens
chars:                 100837015            636 # graphemes, but only a few with more than 1 codepoint
bigrams:                84990490           9719
trigrams:               70289490          91441
word size avg.:             5.55

Thus 1 M sentences need ~100 MB. The average of ~18 tokens per sentence is very consistent across the German corpora, as one would expect from Zipf's law. The size of the alphabet (unique chars/graphemes) varies widely, because some corpora also include non-Latin scripts like Greek, Arabic, Hebrew, Cyrillic, Chinese, and emoticons.
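For reference, statistics of this kind can be computed with a short script like the sketch below; it assumes the common Leipzig format of one sentence per line prefixed by a numeric id and a tab, and its naive whitespace tokenization will not reproduce the numbers above exactly.

```python
# Sketch: compute corpus statistics of the kind shown above from a
# Leipzig "...-sentences.txt" file (assumed format: "<id>\t<sentence>").
from collections import Counter

def corpus_stats(path):
    words, chars = Counter(), Counter()
    bigrams, trigrams = Counter(), Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            sentence = line.rstrip("\n").split("\t", 1)[-1]
            for token in sentence.split():   # naive whitespace tokenization
                words[token] += 1
            chars.update(sentence)
            bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
            trigrams.update(sentence[i:i + 3] for i in range(len(sentence) - 2))
    avg_len = sum(len(w) * n for w, n in words.items()) / sum(words.values())
    return {
        "words": (sum(words.values()), len(words)),
        "chars": (sum(chars.values()), len(chars)),
        "bigrams": (sum(bigrams.values()), len(bigrams)),
        "trigrams": (sum(trigrams.values()), len(trigrams)),
        "word size avg.": round(avg_len, 2),
    }

if __name__ == "__main__":
    for name, value in corpus_stats("deu-at_web-public_2019_1M-sentences.txt").items():
        print(name, value)
```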

BTW: none of the corpora I know is free of spelling errors. Even the DTA still has errors like Dundes- -> Bundes-, -uug -> -ung and many long-s/f mismatches. A crawled corpus would contain even more errors.

I am not sure whether size matters for training, i.e. whether there would be a gain in accuracy using 1 GB of text versus 100 MB, or whether it would degrade. Other works using CTC/(B)LSTM show stagnation as dictionary sizes grow up to 90 K words (morphemes or surface forms). HMMs degrade early, which was exactly the reason to use CTC.

  3. Character set

IMHO the current character set of deu.traineddata is too small. See #45 for the missing bullet character. Some of the more frequent characters, like EM DASH, should be included, and also letters outside the official alphabet a-z äöüß A-Z ÄÖÜ to allow foreign words or names, if they appear in the training texts and are part of Latin-1 or Latin-2.
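A quick way to see which additional characters would be worth including is to count everything in the training text that falls outside a baseline German set; the baseline set, file name and threshold below are illustrative choices, not the actual unicharset logic.

```python
# Sketch: count which characters in a training text fall outside a baseline
# German set, to see what might be worth adding (e.g. EM DASH, accented
# Latin letters).
import unicodedata
from collections import Counter

BASELINE = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZäöüÄÖÜß 0123456789.,;:!?()-\"'")

def extra_characters(path, min_count=100):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(ch for ch in line.rstrip("\n") if ch not in BASELINE)
    for ch, n in counts.most_common():
        if n < min_count:
            break
        yield ch, n, unicodedata.name(ch, "UNKNOWN")

if __name__ == "__main__":
    for ch, n, name in extra_characters("deu.training_text"):  # placeholder file name
        print(f"{n:8d}  U+{ord(ch):04X}  {name}  {ch!r}")
```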
