kur_ara does not have Arabic unicharset. #14

Open
Shreeshrii opened this issue Mar 23, 2018 · 12 comments

Comments

@Shreeshrii
Contributor

Please see details at

tesseract-ocr/tessdata#88 (comment)

tesseract-ocr/langdata#116

tesseract-ocr/tessdata_best#23

@jbreiden @AlexanderP - FYI - regarding problem with packaged traineddata for kur_ara.

@AlexanderP

@Shreeshrii Did I understand correctly?
Does the traineddata need to be swapped between the packages?
tesseract-ocr-kur-ara -> tesseract-ocr-kur
tesseract-ocr-kur -> tesseract-ocr-kur-ara

@Shreeshrii
Contributor Author

Shreeshrii commented Mar 24, 2018 via email

@Shreeshrii
Contributor Author

@AlexanderP

tesseract-ocr-kur-ara -> tesseract-ocr-kur

Yes, the above change can be made. Currently kur_ara has Latin text only.

tesseract-ocr-kur -> tesseract-ocr-kur-ara

This cannot be done since there is no kur traineddata in tessdata_fast.
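The claim that kur_ara contains only Latin text can be checked against an unpacked unicharset file. The following is a minimal sketch, not part of the original thread. It assumes the usual text layout of a unicharset (an entry count on the first line, then one entry per line beginning with the unichar), and it classifies entries by Unicode character name rather than by the script field in the file. The function name `script_counts` and the list of special entries to skip are assumptions made for illustration.

```python
import unicodedata
from collections import Counter

# Special (non-text) entries that commonly appear in a unicharset.
# Assumption: these are the only ones that need skipping.
SPECIAL_ENTRIES = ("NULL", "Joined")


def script_counts(unicharset_text: str) -> Counter:
    """Count unicharset entries by rough script, based on the first character."""
    counts = Counter()
    lines = unicharset_text.splitlines()
    # Line 0 is the entry count; every later line starts with the unichar.
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue
        unichar = fields[0]
        if unichar in SPECIAL_ENTRIES or unichar.startswith("|Broken"):
            continue
        name = unicodedata.name(unichar[0], "")
        if name.startswith("ARABIC"):
            counts["Arabic"] += 1
        elif name.startswith("LATIN"):
            counts["Latin"] += 1
        else:
            counts["Other"] += 1
    return counts
```

A result with a zero Arabic count for a file named kur_ara would be consistent with the problem described in this issue.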

@Shreeshrii
Contributor Author

@jbreiden @theraysmith

Should I build kur_ara from ara.traineddata, e.g. by replacing the wordlist?

Or is there an updated set of Arabic script traineddatas that can be uploaded before 4.0.0 release?

ref: tesseract-ocr/langdata#83 (comment)

I was going to push until I discovered a bug with the RTL word lists.
Then I also need to integrate this issues list, that I haven't looked at in a while, and rerun training.

@amitdo

amitdo commented Mar 24, 2018

Maybe it should be 'kur_lat'.

@AlexanderP

There is no traineddata for kur in tessdata_fast.
I will unpack and convert the dawgs to word list and see if it is possible
to correct kur_ara files.
Please do not make any change yet.

ok
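The unpack-and-convert step mentioned above can be sketched as follows. This is an illustrative outline, not the commands actually used in the thread. It assumes the Tesseract training tools `combine_tessdata` and `dawg2wordlist` are installed, and that the LSTM components unpack to files ending in `lstm-unicharset` and `lstm-word-dawg`. The helper name `unpack_commands`, the `unpacked` directory, and the `wordlist.txt` output name are hypothetical. The function only builds the command lines, so it can be inspected without running anything.

```python
from pathlib import PurePosixPath
from typing import List


def unpack_commands(traineddata: str, workdir: str = "unpacked") -> List[List[str]]:
    """Build the commands to unpack a traineddata file and dump its word dawg."""
    lang = PurePosixPath(traineddata).stem  # e.g. "kur_ara"
    prefix = str(PurePosixPath(workdir) / lang) + "."
    return [
        # Unpack all components to files named <prefix><component>.
        ["combine_tessdata", "-u", traineddata, prefix],
        # Convert the word dawg back to a plain word list:
        # dawg2wordlist <unicharset> <dawg file> <output word list>
        [
            "dawg2wordlist",
            prefix + "lstm-unicharset",
            prefix + "lstm-word-dawg",
            prefix + "wordlist.txt",
        ],
    ]


# To run the commands (the work directory must already exist):
#   import subprocess
#   for cmd in unpack_commands("kur_ara.traineddata"):
#       subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to review the file names before touching any traineddata.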

@stweil
Contributor

stweil commented Dec 17, 2019

Was this issue solved by the renaming?

@Shreeshrii
Contributor Author

kmr is Kurdish in Latin script. Renaming has fixed that issue.

kur was Kurdish in Arabic script in Tesseract 3. We have still not restored kur or kur_ara.

@stweil
Contributor

stweil commented Dec 19, 2019

So you suggest restoring https://github.com/tesseract-ocr/tessdata/blob/3.04.00/kur.traineddata to the master branch of tessdata?

@Shreeshrii
Contributor Author

Shreeshrii commented Dec 19, 2019 via email

@Shreeshrii
Contributor Author

Shreeshrii commented Dec 20, 2019 via email

@Shreeshrii
Contributor Author

Shreeshrii commented Jan 29, 2020 via email


4 participants