filak changed the title from "ALTO output - add support for LANG attribute in TextBlock/TextLine elements" to "Feature Request: ALTO output - add support for LANG attribute in TextBlock/TextLine elements" on Apr 5, 2023
The ALTO specification says "Attribute to record language of the string. The language should be recorded at the highest level possible." So an implementation must not set LANG for textlines when all lines in a textblock have the same language.
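The "highest level possible" rule could be sketched roughly as below. This is only an illustration, not code from altorenderer.cpp: `BlockLangPlan` and `PlanBlockLang` are hypothetical names, and an empty string is assumed to mean "no language known".

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical plan for one TextBlock: an empty string means "omit LANG".
struct BlockLangPlan {
  std::string block_lang;               // LANG for the <TextBlock>
  std::vector<std::string> line_langs;  // LANG for each <TextLine>
};

// Hoist the language to the block when all lines agree on a non-empty
// language; otherwise leave the block without LANG and keep it per line.
BlockLangPlan PlanBlockLang(const std::vector<std::string>& line_langs) {
  BlockLangPlan plan;
  plan.line_langs.assign(line_langs.size(), std::string());

  bool uniform = !line_langs.empty() && !line_langs.front().empty();
  for (const std::string& lang : line_langs) {
    if (lang != line_langs.front()) uniform = false;
  }

  if (uniform) {
    plan.block_lang = line_langs.front();  // lines stay without LANG
    return plan;
  }
  for (std::size_t i = 0; i < line_langs.size(); ++i) {
    plan.line_langs[i] = line_langs[i];  // may still be empty (unknown)
  }
  return plan;
}
```

With this approach a block whose lines are all "en" gets LANG once on the TextBlock, while a mixed block carries LANG only on its TextLines.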
And there is another problem. Strictly speaking Tesseract does not detect the language of a text. It uses models for the recognition. Some of those models include a dictionary for a certain language and are named using 3-letter ISO codes. But even if the text was detected by eng.traineddata that does not always mean that the detected text is English.
How would we handle a typical case where a self-trained model without a dictionary, or a script model like Latin.traineddata, was used?
Would LANG be set as expected when Tesseract was called with more than one language model?
My point is that Tesseract outputs language info into hOCR, but in ALTO there is none.
There is some conditional logic: if there is a paragraph_lang, no lang is output for the TextLine. Is that sufficient to satisfy the "highest level possible" requirement?
The automatic mapping seems like overkill. What if it is left to the user to decide what value goes into the LANG attribute(s), by means of some optional parameter?
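A minimal sketch of the user-supplied variant, assuming the value arrives as a plain string. `LangAttribute` and `user_lang` are hypothetical names; no such Tesseract parameter is implied to exist.

```cpp
#include <cctype>
#include <string>

// Returns the attribute text to append to an element, or "" to omit LANG.
// `user_lang` stands for a hypothetical user-supplied option; an empty
// value means the option was not set.
std::string LangAttribute(const std::string& user_lang) {
  if (user_lang.empty()) return "";
  for (unsigned char c : user_lang) {
    // Accept only letters, digits and '-', so the XML stays well-formed.
    if (!std::isalnum(c) && c != '-') return "";
  }
  return " LANG=\"" + user_lang + "\"";
}
```

The renderer would then append the returned text to the element it is writing, which leaves the choice of code entirely to the user.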
Your Feature Request
It might be relatively simple to do this by looking at the hocrrenderer:
tesseract/src/api/hocrrenderer.cpp, line 243 in 5f297dc
tesseract/src/api/hocrrenderer.cpp, line 302 in 5f297dc
It could be adapted in altorenderer:
tesseract/src/api/altorenderer.cpp, line 215 in 424b17f
i.e.
The language codes should be converted from Tesseract codes to standard 2-letter codes.
A mapping structure needs to be created (I have done the mapping before in codes_lookup.xml, but it definitely must be updated) which can be used in a function, i.e.
I can create the mapping file, but I do not feel competent to do the coding.
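A lookup function of the kind described above might look roughly like this. It is a sketch only: `TessLangToIso6391` is a hypothetical name, and the table is a small illustrative subset rather than the contents of codes_lookup.xml.

```cpp
#include <string>
#include <unordered_map>

// Maps a Tesseract model name to a 2-letter language code.
// Returns an empty string when no code is known, for example for script
// models such as "Latin" or for self-trained models, so that the caller
// can omit the LANG attribute instead of guessing.
std::string TessLangToIso6391(const std::string& tess_lang) {
  // Illustrative subset; a real table would be generated from a
  // maintained mapping file.
  static const std::unordered_map<std::string, std::string> kTable = {
      {"eng", "en"}, {"deu", "de"}, {"fra", "fr"},
      {"ces", "cs"}, {"spa", "es"},
  };
  auto it = kTable.find(tess_lang);
  return it == kTable.end() ? std::string() : it->second;
}
```

Returning an empty string for unknown names keeps the script-model and custom-model cases explicit: nothing is written unless the mapping is certain.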