Improved Tamil and Sinhala traineddata #65

Open. Chaarangan wants to merge 1 commit into base: main

@Chaarangan Chaarangan commented Sep 1, 2021

Improved by training on custom fonts.
Tesseract Version: 4.1.1

Trained Fonts

            Tamil
======================================
Step 01:
        "TAMu_Kadambri" 
        "TAMu_Kalyani" 
        "TAMu_Maduram" 
        "Lohit Tamil" 
        "Droid Sans Tamil Bold" 
        "Droid Sans Tamil" 
        "Karla Tamil Inclined Bold Italic" 
        "Karla Tamil Inclined Italic" 
        "Karla Tamil Upright Bold" 
        "Karla Tamil Upright" 
        "Noto Sans Tamil Bold" 
        "Noto Sans Tamil" 
        "Noto Sans Tamil UI Bold" 
        "Lohit Tamil Classical" 
        "Akshar Unicode" 
        "Arial Unicode MS" 
        "Arima Madurai" 
        "Arima Madurai Bold" 
        "Catamaran" 
        "Catamaran Bold" 
        "Catamaran Heavy" 
        "Catamaran Light" 
        "Catamaran Medium" 
        "Catamaran Ultra-Bold" 
        "Coiny Regular" 
        "Droid Sans Tamil" 
        "Droid Sans Tamil Bold" 
        "GIST-TMOTAbhirami Bold" 
        "GIST-TMOTAbhirami Ultra-Heavy Italic" 
        "GIST-TMOTAmala Bold" 
        "GIST-TMOTAmala Ultra-Heavy Italic" 
        "GIST-TMOTAppar Bold" 
        "GIST-TMOTAppar Ultra-Heavy Italic" 
        "GIST-TMOTChanakya" 
        "GIST-TMOTChanakya Bold" 
        "GIST-TMOTChanakya Italic" 
        "GIST-TMOTChanakya Ultra-Heavy Italic" 
        "GIST-TMOTIlango" 
        "GIST-TMOTIlango Bold" 
        "GIST-TMOTKalyani Bold" 
        "GIST-TMOTKalyani Ultra-Heavy Italic" 
        "GIST-TMOTKamal" 
        "GIST-TMOTKamal Bold" 
        "GIST-TMOTKamal Italic" 
        "GIST-TMOTKamal Ultra-Heavy Italic" 
        "GIST-TMOTKannadasan" 
        "GIST-TMOTKannadasan Italic" 
        "GIST-TMOTKannagi Bold" 
        "GIST-TMOTKannagi Ultra-Heavy Italic" 
        "GIST-TMOTKomala Bold" 
        "GIST-TMOTKomala Ultra-Heavy Italic" 
        "GIST-TMOTKrishnan Bold" 
        "GIST-TMOTKumudam" 
        "GIST-TMOTLalitha" 
        "GIST-TMOTLalitha Bold" 
        "GIST-TMOTLalitha Italic" 
        "GIST-TMOTLalitha Ultra-Heavy Italic" 
        "GIST-TMOTMadhura Bold" 
        "GIST-TMOTMina Bold" 
        "GIST-TMOTNambi" 
        "GIST-TMOTNambi Bold" 
        "GIST-TMOTNambi Italic" 
        "GIST-TMOTNambi Ultra-Heavy Italic" 
        "GIST-TMOTPadma" 
        "GIST-TMOTPadma Bold" 
        "GIST-TMOTParvathi Bold" 
        "GIST-TMOTPattinathar" 
        "GIST-TMOTPattinathar Bold" 
        "GIST-TMOTPattinathar Bold Italic" 
        "GIST-TMOTPattinathar Italic" 
        "GIST-TMOTSuman Bold" 
        "Hind Madurai" 
        "Hind Madurai Bold" 
        "Hind Madurai Light" 
        "Hind Madurai Medium" 
        "Hind Madurai Semi-Bold" 
        "Karla Tamil Inclined Bold Italic" 
        "Karla Tamil Inclined Italic" 
        "Karla Tamil Upright" 
        "Karla Tamil Upright Bold" 
        "Kavivanar" 
        "Latha" 
        "Lohit Tamil" 
        "Lohit Tamil Classical" 
        "Meera Inimai" 
        "Mukta Malar" 
        "Mukta Malar Bold" 
        "Mukta Malar Light" 
        "Mukta Malar Medium" 
        "Mukta Malar Semi-Bold" 
        "Mukta Malar Ultra-Bold" 
        "Nirmala UI" 
        "Noto Sans Tamil" 
        "Noto Sans Tamil Bold" 
        "Noto Sans Tamil UI Bold" 
        "Noto Serif Tamil" 
        "Noto Serif Tamil Bold" 
        "Pavanam" 
        "Post No Bills Jaffna" 
        "Post No Bills Jaffna Bold" 
        "Post No Bills Jaffna ExtraBold, Ultra-Bold" 
        "Post No Bills Jaffna Light, Light" 
        "Post No Bills Jaffna Medium, Medium" 
        "Post No Bills Jaffna SemiBold, Semi-Bold" 
        "SUNDARAM-0806" 
        "SUNDARAM-0807" 
        "SUNDARAM-0808" 
        "SUNDARAM-0810" 
        "SUNDARAM-0812" 
        "SUNDARAM-0819" 
        "SUNDARAM-0820" 
        "SUNDARAM-0821" 
        "SUNDARAM-0823" 
        "SUNDARAM-0824" 
        "SUNDARAM-0827" 
        "SUNDARAM-0830" 
        "SUNDARAM-0831" 
        "SUNDARAM-1341"
        "SUNDARAM-1351" 
        "SUNDARAM-1352" 
        "SUNDARAM-2852" 
        "SUNDARAM-2865" 
        "SUNDARAM-3811" 
        "SakalBharati" 
        "TABUni-Tamil021" 
        "TABUni-Tamil032" 
        "TAMUni-Tamil042" 
        "TAMUni-Tamil046" 
        "TAMUni-Tamil150" 
        "TAMUni-Tamil195" 
        "TAMu_Kadambri" 
        "TAMu_Kalyani" 
        "TAMu_Maduram" 
        "TAU-Achu" 
        "TAU-Achu Italic," 
        "TAU-Barathi" 
        "TAU-Barathi Bold" 
        "TAU-Barathi Bold Italic" 
        "TAU-Barathi Italic" 
        "TAU-Ezhil" 
        "TAU-Ezhil Bold, Bold" 
        "TAU-Ezhil Italic, Italic" 
        "TAU-Kabilar" 
        "TAU-Kabilar Bold" 
        "TAU-Kabilar Bold Italic" 
        "TAU-Kabilar Italic" 
        "TAU-Kambar" 
        "TAU-Kambar Bold" 
        "TAU-Kambar Bold Italic" 
        "TAU-Kambar Italic" 
        "TAU-Kaveri" 
        "TAU-Kaveri Bold" 
        "TAU-Kaveri Bold Italic" 
        "TAU-Kaveri Italic" 
        "TAU-Kurinji" 
        "TAU-Kurinji Bold, Bold" 
        "TAU-Kurinji Italic, Medium Italic" 
        "TAU-Malar" 
        "TAU-Malar Bold, Bold" 
        "TAU-Malar Italic, Italic" 
        "TAU-Marutham"
        "TAU-Marutham Bold," 
        "TAU-Marutham Italic," 
        "TAU-Mullai Bold, Bold" 
        "TAU-Mullai Italic" 
        "TAU-Mullai Italic, Italic" 
        "TAU-Neythal" 
        "TAU-Neythal Bold, Bold" 
        "TAU-Neythal Italic, Italic" 
        "TAU-Nilavu Bold, Bold" 
        "TAU-Nilavu Italic" 
        "TAU-Nilavu Italic, Italic" 
        "TAU-Valluvar" 
        "TAU-Valluvar Bold" 
        "TAU-Valluvar Bold Italic" 
        "TAU-Valluvar Italic" 
        "Vijaya Bold" 

Step 02:    
    "Chemmozhi Comic"
    "Chemmozhi Paranar"
    "Chemmozhi Thendral"
    "Chemmozhi Thenee Regular"
    "Chemmozhi Times" 
    "Chemmozhi Vaigai" 
    "Sri Tamil Bold"
    "Sri Tamil Oblique"
    "Sri Tamil, Oblique Bold"
    "Sri Tamil Regular" 
    "Sri Tamil Sans Regular"
    "Sri Tamil Sans Oblique"

            Sinhala
======================================
"Noto Sans Sinhala, Bold" 
"BhashitaScreen" 
"Bhashitha2Sans" 
"Bhashitha,Bold" 
"BhashitaComplex" 
"BhashitaComplexSans" 
"BhashitaComplexSans,Bold" 
"DinaminaUniWeb" 
"SARASAVI UNICODE" 
"Malithi web" 
"Hodipotha" 
"WARNA" 
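
Several names in the Tamil Step 01 list appear more than once (for example "Lohit Tamil" and "Droid Sans Tamil"). A minimal sketch, in Python, of removing such repeats before passing a list to a training script; the helper name `dedupe_fonts` is hypothetical and not part of this PR or of Tesseract:

```python
def dedupe_fonts(names):
    """Strip whitespace and drop repeated font names, keeping first-seen order."""
    seen = set()
    unique = []
    for raw in names:
        name = raw.strip()
        if name and name not in seen:
            seen.add(name)
            unique.append(name)
    return unique


# A few entries copied from the Step 01 list above, including repeats.
sample = [
    "TAMu_Kadambri",
    "Lohit Tamil",
    "Droid Sans Tamil Bold",
    "Droid Sans Tamil",
    "Droid Sans Tamil",
    "Droid Sans Tamil Bold",
    "Lohit Tamil",
    "TAMu_Kadambri",
]
unique_sample = dedupe_fonts(sample)
print(unique_sample)
```

Names are compared as exact strings, so entries such as "TAU-Mullai Italic" and "TAU-Mullai Italic, Italic" are treated as different fonts.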

Pages: 250
Iterations: 5000

Tamil
Finished! Error rate = 0.485
Sinhala
Finished! Error rate = 4.237
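
The figures above look like the final error rate printed at the end of training, presumably a character error rate in percent; the thread does not confirm this. For an independent check on held-out text, one common approach is a character error rate based on edit distance. The sketch below is an illustration of that approach under those assumptions, not Tesseract's internal calculation, and the function names are hypothetical:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate_percent(ref, hyp):
    """Edit distance divided by reference length, expressed in percent."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Toy example: one wrong character out of four.
toy_cer = char_error_rate_percent("abcd", "abxd")
print(toy_cer)
```

For Tamil and Sinhala, one visible character is often several code points (consonant plus vowel sign), so a code-point based rate like this one can differ from a rate counted per grapheme cluster.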

@Shreeshrii
Contributor

Shreeshrii commented Nov 17, 2021

Thank you for submitting the improved Tamil and Sinhala traineddata. However, this repo only holds the files that were trained at Google by Ray Smith. You should submit these to the tessdata_contrib repo. Please make sure to include a readme file similar to https://github.com/tesseract-ocr/tessdata_contrib/blob/main/khmLimon.md that shows the improvements of your new version of the files. Thanks!

Edit:
Research Paper: Tamizhi-Net OCR: Creating A Quality Large Scale Tamil-Sinhala-English Parallel Corpus Using Deep Learning Based Printed Character Recognition (PCR)
Github Repo: aaivu-tamizhi-net-OCR

@Shreeshrii
Contributor

@stweil Is it possible to move this PR to tessdata_contrib?

@Chaarangan
Author

> Thank you for submitting the improved Tamil and Sinhala traineddata. However, this repo only holds the files that were trained at Google by Ray Smith. You should submit these to the tessdata_contrib repo. Please make sure to include a readme file similar to https://github.com/tesseract-ocr/tessdata_contrib/blob/main/khmLimon.md that shows the improvements of your new version of the files. Thanks!
>
> Edit: Research Paper: Tamizhi-Net OCR: Creating A Quality Large Scale Tamil-Sinhala-English Parallel Corpus Using Deep Learning Based Printed Character Recognition (PCR). GitHub Repo: aaivu-tamizhi-net-OCR

Thanks a lot, sir. I will follow the given details and update it soon.

@stweil
Contributor

stweil commented Nov 18, 2021

> @stweil Is it possible to move this PR to tessdata_contrib?

I am afraid that is not possible. We need a new pull request for tessdata_contrib.

@Chaarangan, please add documentation as suggested by @Shreeshrii to your model files. It is also important to describe all steps (command lines, fonts used, installation instructions for fonts, training texts, ...) which were used to generate the new models.
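
As an illustration of the kind of command-line record requested here, the following Python sketch assembles a `tesstrain.sh` call and an `lstmtraining` call as argument lists so they can be printed into a readme. It assumes the standard Tesseract 4 training tools were used, that "Pages: 250" corresponds to `--maxpages 250`, and that "Iterations: 5000" corresponds to `--max_iterations 5000`; none of this is stated in the PR. All paths, the checkpoint name, and the helper function names are hypothetical placeholders:

```python
import shlex


def tesstrain_command(lang, fonts, max_pages, fonts_dir, langdata_dir,
                      tessdata_dir, output_dir):
    """Build a tesstrain.sh argument list that generates LSTM line data."""
    cmd = [
        "tesstrain.sh",
        "--lang", lang,
        "--linedata_only",
        "--noextract_font_properties",
        "--fonts_dir", fonts_dir,
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--output_dir", output_dir,
        "--maxpages", str(max_pages),
        "--fontlist",  # font names follow as separate arguments
    ]
    cmd.extend(fonts)
    return cmd


def lstmtraining_command(checkpoint, traineddata, listfile, model_output,
                         max_iterations):
    """Build an lstmtraining argument list for fine-tuning from a checkpoint."""
    return [
        "lstmtraining",
        "--continue_from", checkpoint,
        "--traineddata", traineddata,
        "--train_listfile", listfile,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]


# Hypothetical paths; only the font names and the numbers come from the PR.
fonts = ["Lohit Tamil", "Noto Sans Tamil", "Latha"]
gen = tesstrain_command("tam", fonts, 250, "/usr/share/fonts", "./langdata",
                        "./tessdata", "./tam_train")
train = lstmtraining_command("./tam.lstm", "./tam_train/tam/tam.traineddata",
                             "./tam_train/tam.training_files.txt",
                             "./output/tam_custom", 5000)
print(shlex.join(gen))
print(shlex.join(train))
```

Building the commands as lists and quoting them with `shlex.join` (Python 3.8+) keeps font names that contain spaces or commas intact when the lines are pasted into documentation.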
