Update deu.unicharset #43

Open: 1 commit to be merged into base: main

OttoKerner

The character `ı` is not part of the German alphabet and is not commonly used in German texts.
All it does is frequently corrupt OCR results, because it is mistakenly recognized in place of an `i`.

@stweil
Contributor

stweil commented Jul 26, 2021

By now that character is common even in German texts (especially in names); see the file deu.training_text. Updating deu.unicharset won't help as long as the training text adds that character again.

I am afraid your change has to wait until deu is retrained with a different training text. At that point deu.unicharset will be generated automatically, so any manual changes would be overwritten anyway.
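The point about manual changes being overwritten can be sketched in a few lines. This is only an illustration, assuming the character set is roughly the set of distinct characters in the training text; the real unicharset generation in the Tesseract training tools does more than this. The helper `chars_in_text` and the sample sentence are hypothetical and not taken from deu.training_text.

```python
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I


def chars_in_text(text: str) -> set:
    """Return the distinct non-whitespace characters of a training text."""
    return {ch for ch in text if not ch.isspace()}


# Made-up sample line containing a Turkish name.
sample = "Yıldız wohnt in Mannheim"

# A manual edit removes the character from the derived set ...
manually_edited = chars_in_text(sample) - {DOTLESS_I}

# ... but deriving the set again from the unchanged text restores it.
regenerated = chars_in_text(sample)
```

Under this assumption the character only disappears for good once the training text itself no longer contains it.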

I wonder why the unicharset files are included in langdata_lstm at all. Maybe we should remove all of them.

@OttoKerner
Author

Is there any documentation on how these training texts are generated? Even a cursory glance at the file tells me that Turkish words are clearly over-represented in it.
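One way to put a number on that impression is to count how often `ı` actually occurs. The following is a minimal sketch, assuming a UTF-8 copy of deu.training_text is available locally; the function names `dotless_i_stats` and `stats_for_file` are made up for illustration.

```python
from pathlib import Path

DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I


def dotless_i_stats(lines):
    """Return (lines containing the character, total lines, total occurrences)."""
    hits = total = occurrences = 0
    for line in lines:
        total += 1
        n = line.count(DOTLESS_I)
        occurrences += n
        if n:
            hits += 1
    return hits, total, occurrences


def stats_for_file(path):
    """Compute the same statistics for a UTF-8 text file (path is an assumption)."""
    with Path(path).open(encoding="utf-8") as f:
        return dotless_i_stats(f)
```

Calling `stats_for_file("deu.training_text")` in a checkout of the repository would then show what share of the training lines contain the character.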

@stweil
Contributor

stweil commented Jul 27, 2021

No, sorry, we don't know the details of the training, which was done by Google. It looks like many training texts were extracted from web pages. Here in Mannheim, Turkish words are very common in my neighborhood.
