fine tuning arabic traineddata to solve extended words issue #362

Open
sifdinNh opened this issue Nov 28, 2023 · 2 comments

@sifdinNh

I want to fine-tune ara.traineddata from the tessdata_best repo to handle extended (elongated) words like this:

[image: sample_9]

To do that, I made a list of text lines in the same format, like this:

.............
الســــــــيد العضـــــو د. عــــلي العتيبــــــي:
الســــــــيد العضـــــو جــــمال الحــــربي:
الســــــــيد العضـــــو د. خالــــد الفيصـــــل:
الســـــــــيد العضـــــو تركـــــي المطيــــري:
..............
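A list like the one above could also be generated programmatically. The following is a minimal sketch, assuming the elongation is produced by inserting the tatweel character (U+0640) after a letter that joins to the next one. The names `elongate`, `elongate_line`, `DUAL_JOINING`, and `ARABIC_LETTERS` are illustrative and not part of the original workflow, and the letter sets cover only the basic Arabic alphabet.

```python
import random

TATWEEL = "\u0640"  # Arabic tatweel (kashida), used to stretch words

# Assumption: basic letters that connect to the following letter.
DUAL_JOINING = set("بتثجحخسشصضطظعغفقكلمنهيئ")
# Letters that never connect forward; tatweel may precede them but not follow.
RIGHT_JOINING = set("اأإآدذرزوؤةى")
ARABIC_LETTERS = DUAL_JOINING | RIGHT_JOINING
ALEF_FORMS = set("اأإآ")


def elongate(word: str, rng: random.Random, max_len: int = 6) -> str:
    """Insert one run of tatweel at a random valid joining position."""
    positions = [
        i
        for i in range(len(word) - 1)
        if word[i] in DUAL_JOINING
        and word[i + 1] in ARABIC_LETTERS
        # Skip lam + alef so the lam-alef ligature is not broken up.
        and not (word[i] == "ل" and word[i + 1] in ALEF_FORMS)
    ]
    if not positions:
        return word  # nothing to stretch, e.g. "د."
    i = rng.choice(positions)
    run = TATWEEL * rng.randint(2, max_len)
    return word[: i + 1] + run + word[i + 1 :]


def elongate_line(line: str, rng: random.Random) -> str:
    """Elongate every space-separated word of a line independently."""
    return " ".join(elongate(w, rng) for w in line.split(" "))


rng = random.Random(0)  # fixed seed so the generated list is reproducible
for name in ["علي العتيبي", "جمال الحربي"]:  # names taken from the list above
    print(elongate_line("السيد العضو " + name + ":", rng))
```

Varying the run length and position per word gives the training text more variety than a hand-typed list, though whether that variety helps the fine-tuned model is something to verify by evaluation.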

I started by generating ground-truth files: .tif images with matching .box files.

Then I started training with this:

make training MODEL_NAME=ara_new TESSDATA=../tesseract/tessdata START_MODEL=ara MAX_ITERATIONS=10000 LANG_TYPE=RTL

Training started at 99% BCER, and I stopped when it reached 24% BCER.

When I tested the new traineddata file and compared it against the best ara.traineddata, I got a poor result.

This is the result with the best traineddata for Arabic:
[image: sample_5]
It gives me almost 90% accuracy.

But when I tested the newly trained file, this is the result:
[image: sample_5]

It looks like it doesn't recognize anything, even though the main reason I started this was to fine-tune the model for better accuracy.
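To compare the two models with a number rather than by eye, one option is to compute a character error rate (CER) for each model's output against the same ground-truth line. The sketch below uses plain Levenshtein distance. It is not the code Tesseract uses to report BCER during training, and `levenshtein` and `cer` are illustrative names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute ca with cb (or match)
                )
            )
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

Running `cer` over the same evaluation lines for both `ara` and `ara_new` gives directly comparable numbers. Note that a tatweel in the ground truth counts as a character here, so both outputs should be scored against the same transcription convention.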

@sifdinNh
Author

@zdenop

@AhmadHakami

I'm not sure whether the issue arises because the model was trained on multi-line TIFF images, but have you tried fine-tuning with single-line text images? If not, give it a try and share the results with us.
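As a minimal sketch of the single-line setup suggested here, the following writes each training line to its own `.gt.txt` file so that each can be paired with a single-line image of the same base name. The target directory would follow tesstrain's `data/MODEL_NAME-ground-truth` convention. The helper name `write_line_ground_truth` and the file-naming pattern are assumptions, and rendering the matching line images is left out.

```python
from pathlib import Path


def write_line_ground_truth(lines, out_dir, prefix="ara_new"):
    """Write one UTF-8 .gt.txt file per non-empty line; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    stripped = (ln.strip() for ln in lines)
    for n, text in enumerate(t for t in stripped if t):
        # Hypothetical naming scheme: <prefix>_<index>.gt.txt
        path = out / f"{prefix}_{n:05d}.gt.txt"
        path.write_text(text + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For example, `write_line_ground_truth(open("lines.txt", encoding="utf-8"), "data/ara_new-ground-truth")` would produce one transcription file per line of a hypothetical `lines.txt`; each then needs a line image with the same base name before training.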
