ful #154

Open
tukulor opened this issue Dec 10, 2020 · 8 comments
Comments

tukulor commented Dec 10, 2020

ful.traineddata.txt

ful.zip

For the Tukulor or Ful language, with an estimated 80 million speakers, this is a ful.traineddata file (please remove the .txt extension). It was produced from scratch with Tesseract 4 from the enclosed image and box files, with about 1 minute of training.

When I put this into my /usr/share/tesseract-ocr/4.00/ folder, it works fine.
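
A minimal sketch of how such a model could be tried from Python, assuming the `tesseract` command-line program is installed and that `ful.traineddata` sits in the folder passed with `--tessdata-dir`. The helper name `build_ocr_command` and the file names in the comments are hypothetical placeholders, not part of the uploaded data.

```python
def build_ocr_command(image, output_base, lang="ful", tessdata_dir=None):
    """Build the argument list for one Tesseract command-line run.

    image        -- path of the scanned page (hypothetical, e.g. "page.jpg")
    output_base  -- output name without extension; Tesseract appends ".txt"
    lang         -- language code, matching the <lang>.traineddata file name
    tessdata_dir -- optional folder that contains <lang>.traineddata
    """
    cmd = ["tesseract", image, output_base, "-l", lang]
    if tessdata_dir is not None:
        # Only needed when the model is not in the default tessdata folder.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


# Possible use (not executed here, because it needs Tesseract and an image):
#   import subprocess
#   subprocess.run(build_ocr_command("page.jpg", "page"), check=True)
```

Keeping the command as a list avoids shell quoting problems with file names that contain spaces.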

Author

tukulor commented Dec 10, 2020

Making a dictionary for Tesseract is still too complicated.

Tesseract should be arranged so that it does this automatically, based on a file containing the words of the language, optionally followed by their meaning in one or more other languages, or by example phrases (see below).

Please implement this in Tesseract, so that it builds a dictionary from an input file of words, or parts of words, like the following. A future version of Tesseract should also offer a --dictionary option for OCR, so that when OCR-ing a printed dictionary, the program itself rearranges the result into such a word-list / dictionary file, which can then serve as input for a better traineddata file (including a dictionary) for Tesseract itself.


wulundu Katze/deu
sondu Vogel/deu
rawaandu Hund/deu "Mi gattino rawaandu ndu" "Ich habe den Hund gebissen"/deu
galle haus/deu house/eng
"Mi danyaani galle" "Ich habe kein Haus"/deu "Eu não tenho casa"/por
jullere Stuhl/deu
kuriire Küche/deu
laawol Weg/deu
lukujaderre Eidechse/deu lizard/eng
danki Bett/deu bed/eng
kogol Garten/deu
julirde Moschee/deu mesquita/por
yahde gehen/deu
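
The proposed input format can already be reduced to a plain word list (one word per line) outside of Tesseract. The sketch below is one possible reading of the format, not an existing Tesseract feature. It assumes that a token ending in `/xxx` is a translation tagged with a language code, that every untagged token (a single word or a quoted phrase) is text in the language itself, and that a stray leading slash before a translation can be ignored. The function names `parse_entries` and `to_wordlist` are hypothetical.

```python
import re

# A token is either a quoted phrase (optionally followed by "/lang")
# or a run of non-space characters.
TOKEN = re.compile(r'"[^"]*"\S*|\S+')


def parse_entries(text):
    """Split the word/translation file into (words, translations).

    words        -- untagged words in order of first appearance, no duplicates
    translations -- dict: first untagged token of a line -> [(text, lang), ...]
    """
    words = []
    translations = {}
    for line in text.splitlines():
        headword = None
        for token in TOKEN.findall(line):
            body, sep, lang = token.rpartition("/")
            body = body.strip('"/')
            if sep and lang.isalpha() and body:
                # Tagged token: a translation into another language.
                if headword is not None:
                    translations.setdefault(headword, []).append((body, lang))
                continue
            # Untagged token: a word or example phrase in the language itself.
            plain = token.strip('"')
            if headword is None:
                headword = plain
            for word in plain.split():
                if word not in words:
                    words.append(word)
    return words, translations


def to_wordlist(words):
    """Join the words one per line, the usual layout of a plain word list."""
    return "\n".join(words) + "\n"
```

The resulting text file is the kind of input that Tesseract's `wordlist2dawg` training tool converts into a dictionary component. Whether words taken from example phrases belong in the list is a design choice; this sketch includes them.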

Contributor

stweil commented Dec 10, 2020

Is this the same as the Pulaar language?

Or is it the Fula language? Several scripts seem to exist for it (based on Latin, Arabic or other scripts).

Contributor

stweil commented Dec 10, 2020

The data you provided is not sufficient for the current Tesseract; it was made for the old Tesseract 3 recognizer.

There is also no training text. The included word lists are empty.

Author

tukulor commented Dec 11, 2020

ful.traineddata.zip
ful.daten.zip

I have now trained it with Tesseract 5. The file ful.traineddata.zip is not actually zipped; one can simply remove the .zip suffix. The files used to produce it are in the ful.daten.zip file. I have not made a dictionary yet.

The language is spread over a large area and has many names, such as Tukulor, Peulle, Pulaar, Fula, Fulfulde, bolle Fulbe, ...

Author

tukulor commented Dec 13, 2020

When I produce new box files from another text for training Tesseract, I want to know whether, after the first run, the relevant information from the earlier data is already "embedded" in the traineddata file, so that for further training I don't need to use that data again (only the new data), or whether I have to keep all accumulated data in the folder and add the next data to it.

And, in the first case, whether it is possible to later "remove" some training data that was added before.

With Tesseract 5, I have the problem that after accumulating a lot of data / box files from new texts and running everything together, the program crashes during training with some matrix error. If I add only the new data (not including the previously used data again), the problem does not occur.

Author

tukulor commented Dec 14, 2020

ful.traineddata.zip

Enclosed is an updated traineddata file.

Author

tukulor commented Dec 16, 2020

neue-daten.zip

ful.frequent_words_list.zip
ful.words_list.zip
ful.traineddata.zip

Here are further box and jpg files, word lists, and a new ful.traineddata for the Fula / Tukölör / Pulaar language. All but one of the files are not actually in zip format; they were only renamed to .zip because the upload does not work otherwise (i.e. rename them to drop the .zip).
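
Since the attachments mix one real archive with files that were merely renamed, the original names could be restored with a short script. This is a minimal sketch: it assumes the downloads sit in one folder and that none of the renamed files happens to be a valid ZIP archive itself. The function name `restore_names` is hypothetical.

```python
import zipfile
from pathlib import Path


def restore_names(folder):
    """Drop the fake ".zip" suffix from files that are not ZIP archives.

    Real archives are left untouched so they can be unpacked normally.
    Returns the list of new paths.
    """
    renamed = []
    for path in sorted(Path(folder).glob("*.zip")):
        if zipfile.is_zipfile(path):
            continue  # a genuine archive, keep its name
        target = path.with_suffix("")  # "ful.traineddata.zip" -> "ful.traineddata"
        if target.exists():
            continue  # never overwrite an existing file
        path.rename(target)
        renamed.append(target)
    return renamed
```

Calling `restore_names(".")` in the download folder would rename only the files that fail the ZIP check.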

Author

tukulor commented Dec 18, 2020

From now on I am putting all further files and improvements to ful.traineddata at:

https://github.com/tukulor/ful.traineddata
