Feature Request: ALTO output - add support for LANG attribute in TextBlock/TextLine elements #4046

Open
filak opened this issue Apr 5, 2023 · 2 comments
Labels
output issues related output formats

Comments

filak commented Apr 5, 2023

Your Feature Request

It might be relatively simple to do this by following what hocrrenderer already does:

paragraph_lang = res_it->WordRecognitionLanguage();
if (paragraph_lang) {
  hocr_str << " lang='" << paragraph_lang << "'";
}

const char *lang = res_it->WordRecognitionLanguage();
if (lang && (!paragraph_lang || strcmp(lang, paragraph_lang))) {
  hocr_str << " lang='" << lang << "'";
}

It could be adapted in altorenderer, at the existing check

if (res_it->IsAtBeginningOf(RIL_PARA)) {

i.e.:

    if (res_it->IsAtBeginningOf(RIL_PARA)) {
      alto_str << "\t\t\t\t\t<TextBlock ID=\"block_" << tcnt << "\"";
      AddBoxToAlto(res_it, RIL_PARA, alto_str);
      paragraph_lang = res_it->WordRecognitionLanguage();
      if (paragraph_lang) {
        alto_str << " LANG='" << paragraph_lang << "'";
      }
      alto_str << "\n";
    }

    if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
      alto_str << "\t\t\t\t\t\t<TextLine ID=\"line_" << lcnt << "\"";
      AddBoxToAlto(res_it, RIL_TEXTLINE, alto_str);
      const char *lang = res_it->WordRecognitionLanguage();
      if (lang && (!paragraph_lang || strcmp(lang, paragraph_lang))) {
        alto_str << " LANG='" << lang << "'";
      }
      alto_str << "\n";
    }

The lang codes shall be converted from Tesseract codes to standard 2-letter codes.

A mapping structure needs to be created (I have done such a mapping before, in codes_lookup.xml, but it definitely must be updated), which can then be used in a function, e.g.:

 alto_str << " LANG='" << GetLangCodeForAlto(lang) << "'";

I can create the mapping file but I do not feel competent doing the coding.
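
As a rough illustration of the mapping idea, here is a minimal sketch of a lookup helper. The name GetLangCodeForAlto comes from the proposal above and does not exist in Tesseract; the table holds only a few sample entries and is an assumption, not the content of codes_lookup.xml. An unknown name (for example a script model such as Latin) yields an empty string so that the caller can leave LANG out.

```cpp
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical helper, not part of Tesseract.
// Maps a Tesseract model name (e.g. "eng") to a 2-letter ISO 639-1 code.
// Returns an empty string when no mapping is known, so LANG can be omitted.
std::string GetLangCodeForAlto(std::string_view tess_lang) {
  // Sample entries only; a real table would have to cover all models.
  static const std::unordered_map<std::string_view, std::string_view> kMap = {
      {"eng", "en"}, {"deu", "de"}, {"fra", "fr"},
      {"ces", "cs"}, {"spa", "es"}, {"ita", "it"},
  };
  const auto it = kMap.find(tess_lang);
  return it == kMap.end() ? std::string() : std::string(it->second);
}
```

With such a helper, the renderer would write the attribute only when the returned string is not empty.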

@filak filak changed the title ALTO output - add support for LANG attribute in TextBlock/TextLine elements Feature Request: ALTO output - add support for LANG attribute in TextBlock/TextLine elements Apr 5, 2023
Contributor

stweil commented Apr 5, 2023

The ALTO specification says "Attribute to record language of the string. The language should be recorded at the highest level possible." So an implementation must not set LANG for textlines when all lines in a textblock have the same language.
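
One way to read the "highest level possible" rule is sketched below, assuming the language of every line in a block is already known before the TextBlock tag is written (the renderer would have to look ahead for that). The struct and function names are hypothetical and not taken from Tesseract.

```cpp
#include <string>
#include <vector>

// Hypothetical result type: which LANG values to write for one block.
struct BlockLangPlan {
  std::string block_lang;               // LANG for <TextBlock>, empty = omit
  std::vector<std::string> line_langs;  // LANG per <TextLine>, empty = omit
};

// If all lines share one language, record it once on the block.
// Otherwise leave the block without LANG and record it on each line.
BlockLangPlan PlanBlockLang(const std::vector<std::string>& langs) {
  BlockLangPlan plan;
  plan.line_langs.assign(langs.size(), "");
  if (langs.empty()) {
    return plan;
  }
  bool uniform = true;
  for (const auto& lang : langs) {
    if (lang != langs.front()) {
      uniform = false;
      break;
    }
  }
  if (uniform) {
    plan.block_lang = langs.front();
  } else {
    plan.line_langs = langs;
  }
  return plan;
}
```

For a block whose lines are all "en", only the block gets LANG; for a block mixing "en" and "de", only the lines do.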

And there is another problem. Strictly speaking Tesseract does not detect the language of a text. It uses models for the recognition. Some of those models include a dictionary for a certain language and are named using 3-letter ISO codes. But even if the text was detected by eng.traineddata that does not always mean that the detected text is English.

How would we handle a typical case where a self-trained model without dictionary or a script model like Latin.traineddata was used?

Would LANG be set as expected when Tesseract was called with more than one language model?

Author

filak commented Apr 6, 2023

My point is that Tesseract outputs language info into hocr but in alto there is none.

There is some conditional logic: if a TextLine has the same language as paragraph_lang, no LANG is output for that TextLine. Is that sufficient to satisfy the "highest level possible" requirement?

The automatic mapping seems like overkill. What if it is left to the user to decide what value goes into the LANG attribute(s), via some optional parameter?

e.g.:

 tesseract input.tiff output -l eng --altolang en
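
A minimal sketch of how a user-supplied value could be turned into the attribute text follows. The function name is hypothetical, and user_lang simply stands for whatever the proposed --altolang option (which does not exist in Tesseract) would carry. The character check is only a basic sanity filter to keep the XML well-formed, not a full language-tag validation.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not part of Tesseract.
// Builds the LANG attribute from a user-supplied value.
// Returns an empty string (no attribute) if the value is empty or
// contains anything other than ASCII letters, digits or '-'.
std::string AltoLangAttribute(const std::string& user_lang) {
  if (user_lang.empty()) {
    return "";
  }
  for (unsigned char c : user_lang) {
    if (!std::isalnum(c) && c != '-') {
      return "";
    }
  }
  return " LANG=\"" + user_lang + "\"";
}
```

The renderer could append the returned string directly after the ID attribute of the TextBlock element.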

@amitdo amitdo added the output issues related output formats label Apr 13, 2023
3 participants