ALTO renderer: move to v4, add Glyphs #2815

Open
wants to merge 3 commits into
base: main

bertsky
Contributor

@bertsky bertsky commented Dec 12, 2019

This adds RIL_SYMBOL bboxes and text to the ALTO output via Glyph, which was introduced with v4, hence the namespace update. Looking at the changelog of the schema XSD, I don't see any problems in terms of backwards incompatibility (at least for the features we have been using so far here).

The output seems to always validate on what I have seen so far. But there might be surprises. @jakesebright or @stweil maybe you want to take a look?

I don't know of any tools that can actually visualize ALTO Glyphs yet. PageViewer does not render them yet AFAICT. I have no Aletheia though.
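For readers unfamiliar with the element, here is a minimal sketch of a word-level `String` that carries one `Glyph` child per symbol. It is not the code from this PR: the helper name and the input tuple shape are assumptions made for illustration, and only the element and attribute names (`String`, `Glyph`, `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`) and the v4 namespace URI come from the ALTO schema.

```python
import xml.etree.ElementTree as ET

# Namespace URI of ALTO v4, the version that introduced Glyph.
ALTO_V4_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def string_with_glyphs(word, boxes):
    """Build an ALTO String element with one Glyph child per symbol.

    boxes holds one (char, hpos, vpos, width, height) tuple per symbol
    (hypothetical input shape, chosen for this sketch).
    """
    # The word box is the union of its symbol boxes.
    left = min(x for _, x, _, _, _ in boxes)
    top = min(y for _, _, y, _, _ in boxes)
    right = max(x + w for _, x, _, w, _ in boxes)
    bottom = max(y + h for _, _, y, _, h in boxes)
    string = ET.Element(f"{{{ALTO_V4_NS}}}String", {
        "CONTENT": word,
        "HPOS": str(left), "VPOS": str(top),
        "WIDTH": str(right - left), "HEIGHT": str(bottom - top),
    })
    for char, x, y, w, h in boxes:
        ET.SubElement(string, f"{{{ALTO_V4_NS}}}Glyph", {
            "CONTENT": char,
            "HPOS": str(x), "VPOS": str(y),
            "WIDTH": str(w), "HEIGHT": str(h),
        })
    return string


if __name__ == "__main__":
    element = string_with_glyphs("Hi", [("H", 10, 20, 8, 12), ("i", 19, 20, 3, 12)])
    print(ET.tostring(element, encoding="unicode"))
```

Every symbol becomes its own XML element in this layout, which is the source of the file-size concern raised later in the thread.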

@bertsky
Contributor Author

bertsky commented Dec 13, 2019

The second commit adds VariantType, i.e. OCR alternatives via ChoiceIterator (as in the hOCR renderer). With LSTM models this yields only 1 Variant, equal to the Glyph content itself, unless lstm_choice_mode is non-zero. Otherwise (i.e. old models, or LSTMs with lstm_choice_mode>0) it yields any number of variants.
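A rough sketch of the resulting structure, assuming the alternatives arrive as a best-first list of (text, confidence) pairs. That input shape and the function name are hypothetical; `Variant`, its `VC` attribute and the Glyph-level `GC` attribute are taken from the ALTO v4 schema. This only illustrates the shape of the output, not the renderer code in the commit.

```python
import xml.etree.ElementTree as ET

ALTO_V4_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def glyph_with_variants(choices):
    """Build a Glyph whose Variant children list the OCR alternatives.

    choices is a non-empty, best-first list of (text, confidence) pairs
    with confidence in 0..1 (assumed input shape for this sketch).
    """
    best_text, best_conf = choices[0]
    glyph = ET.Element(f"{{{ALTO_V4_NS}}}Glyph", {
        "CONTENT": best_text,
        "GC": f"{best_conf:.2f}",  # glyph confidence
    })
    for text, conf in choices:
        ET.SubElement(glyph, f"{{{ALTO_V4_NS}}}Variant", {
            "CONTENT": text,
            "VC": f"{conf:.2f}",  # variant confidence
        })
    return glyph


if __name__ == "__main__":
    # A single choice gives a single Variant equal to the Glyph content.
    single = glyph_with_variants([("e", 0.97)])
    # Several choices give several Variants.
    many = glyph_with_variants([("e", 0.61), ("c", 0.27), ("o", 0.12)])
    print(ET.tostring(single, encoding="unicode"))
    print(ET.tostring(many, encoding="unicode"))
```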

@stweil
Contributor

stweil commented Dec 13, 2019

Which software uses the additional information in the ALTO file?

@bertsky
Contributor Author

bertsky commented Dec 13, 2019

> Which software uses the additional information in the ALTO file?

As for current tools, I don't know. (v4 is still pretty new.)

In principle, any software that wants to have precise coordinates of individual characters, and with the second commit, notably, post-processing software.

With visualizers like PageViewer (which will very likely also show Glyphs for ALTO soon) this is also a nice option for debugging.

@bertsky
Contributor Author

bertsky commented Dec 13, 2019

@stweil do you want me to make glyph and/or variant output optional, let's say with a variable alto_add_glyphs?

@stweil
Contributor

stweil commented Dec 13, 2019

My primary focus regarding ALTO is best support for those tools which require ALTO data, especially for the DFG viewer. Storing ALTO and delivering it via HTTP works best with small files, so I would not want to add information which is not useful. Issue #2700 is an example of important information which is currently missing.

Of course it should be possible to record all OCR results including glyphs and alternate choices. But isn't hOCR sufficient for that?

@bertsky
Contributor Author

bertsky commented Dec 13, 2019

> My primary focus regarding ALTO is best support for those tools which require ALTO data, especially for the DFG viewer. Storing ALTO and delivering it via HTTP works best with small files, so I would not want to add information which is not useful.

That's but one of many equally valid use-cases. Besides, why not compress the xml when sending? (The impact of these extra annotations on compressed file size should be marginal.)
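One way to check the compression claim on real output is to compare raw and gzip-compressed sizes of files with and without glyphs. The sketch below uses synthetic, highly regular markup as a stand-in, so its numbers say nothing about actual Tesseract output; the function names and the generated content are assumptions for illustration only.

```python
import gzip


def raw_and_gzip_size(xml_text):
    """Return (raw byte count, gzip-compressed byte count) of a document."""
    data = xml_text.encode("utf-8")
    return len(data), len(gzip.compress(data))


def synthetic_line(n_words, with_glyphs):
    """Generate a fake TextLine, optionally with per-character Glyph elements."""
    parts = []
    for i in range(n_words):
        word = f"word{i}"
        attrs = f'CONTENT="{word}" HPOS="{i * 60}" VPOS="100" WIDTH="50" HEIGHT="20"'
        if with_glyphs:
            glyphs = "".join(
                f'<Glyph CONTENT="{c}" HPOS="{i * 60 + 8 * k}" VPOS="100" WIDTH="8" HEIGHT="20"/>'
                for k, c in enumerate(word)
            )
            parts.append(f"<String {attrs}>{glyphs}</String>")
        else:
            parts.append(f"<String {attrs}/>")
    return "<TextLine>" + "".join(parts) + "</TextLine>"


if __name__ == "__main__":
    for label, flag in (("without glyphs", False), ("with glyphs", True)):
        raw, packed = raw_and_gzip_size(synthetic_line(200, flag))
        print(f"{label}: {raw} bytes raw, {packed} bytes gzipped")
```

To test the argument properly, replace `synthetic_line` with the contents of two real ALTO files produced from the same page.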

Regardless, the proposed extra option should completely take care of this – whether or not this is to be enabled by default would be debatable.

> Issue #2700 is an example of important information which is currently missing.

I didn't start off to solve everything ALTO here or prioritize. Besides, your #2705 already solves that, doesn't it?

> Of course it should be possible to record all OCR results including glyphs and alternate choices. But isn't hOCR sufficient for that?

hOCR is inadequate for many reasons. Besides, ALTO does have that representation itself – why ignore it? Different output renderers should not compete with each other IMO – they should each try to provide the best they can.

@bertsky
Contributor Author

bertsky commented Dec 17, 2019

> With visualizers like PageViewer (which will very likely also show Glyphs for ALTO soon) this is also a nice option for debugging.

It already does now!

@bertsky
Contributor Author

bertsky commented Dec 18, 2019

Partial CI failure (on macos) is unrelated (it cannot find cmake)...

@stweil
Contributor

stweil commented Dec 20, 2019

@bertsky, I just tested the new code. The size of the ALTO output for a single page increased from 40091 to 223133 bytes, mainly because each glyph now gets its own XML element.
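For scale, the two sizes quoted above work out as follows (plain arithmetic on the reported numbers, nothing measured anew):

```python
old_size = 40091   # bytes, ALTO output before the change (as reported)
new_size = 223133  # bytes, ALTO output with glyphs (as reported)

growth = new_size - old_size
factor = new_size / old_size

print(f"growth: {growth} bytes, factor: {factor:.2f}x")
```

That is an increase of 183042 bytes, or roughly 5.6 times the original size for this one page.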

I think the default should be close to the old output, that means no glyphs and compatible to the DFG viewer. Do we require a new parameter to enable more detailed output, or would it be sufficient to use lstm_choice_mode for ALTO output, too?

@bertsky
Contributor Author

bertsky commented Dec 20, 2019

> and compatible to the DFG viewer

@stweil you still did not elaborate on why exactly the output is incompatible now. Is it really file size? (I cannot believe this.) Or rather the v4 namespace? (Then we need an option for/against that.)

> Do we require a new parameter to enable more detailed output, or would it be sufficient to use lstm_choice_mode for ALTO output, too?

This is completely unrelated I am afraid. lstm_choice_mode can give you glyph variants. But you are already dissatisfied with glyphs. Besides, there are always legacy oem and models.

If the issue is really size, not namespace, then it should be something like alto_add_glyphs – as already proposed above. The question then becomes: default to 1 or to 0?

@stweil
Contributor

stweil commented Dec 20, 2019

> Partial CI failure (on macos) is unrelated (it cannot find cmake)...

@bertsky, @zdenop, that was caused by a software update of the Travis build infrastructure which also replaced cmake by a newer version. Tesseract's build cache still used the old cmake link which was no longer valid.

I cleaned the cache, so it passes now.

@M3ssman
Contributor

M3ssman commented Jan 2, 2020

@bertsky The ALTO glyph data is not relevant for presentation in viewers. Increasing file size by such magnitudes is per se not preferable because of something that has no value for web users. As you mentioned, it can be a handy debugging feature, which is only relevant in development context, so having a flag disabled by default makes sense IMHO.

@bertsky
Contributor Author

bertsky commented Jan 2, 2020

@M3ssman thanks for reviving the discussion!

> The ALTO glyph data is not relevant for presentation in viewers.

You mean for document presentation scenarios like DFG-Viewer I guess. But viewers could also be targeting evaluation (e.g. showing errors/differences between different versions visually) and GT production.

I'll gladly add the extra parameter. But before I do, can someone please confirm the issue with DFG-Viewer is not the v4 namespace? (Because if it is, then it would make more sense to make the parameter about v3 vs v4 instead of glyphs or not.)

- use TextBlock, Illustration, GraphicalElement (not just TextBlock),
  as appropriate for the internal block types
- do not enter RIL_TEXTLINE, RIL_WORD, RIL_SYMBOL and ChoiceIterator
  on anything other than TextBlocks
- refactor loop to make it more readable
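
The commit message above can be pictured as a lookup from internal block type to ALTO element, plus a guard that only descends into lines, words and symbols for text. This is a hedged sketch: the `PT_*` names are Tesseract's block type constants, but which type maps to which element, and the fallback to `TextBlock`, are assumptions of this example rather than a copy of the commit.

```python
# Assumed mapping for illustration; block types not listed fall back to TextBlock.
NON_TEXT_ELEMENTS = {
    "PT_FLOWING_IMAGE": "Illustration",
    "PT_HEADING_IMAGE": "Illustration",
    "PT_PULLOUT_IMAGE": "Illustration",
    "PT_HORZ_LINE": "GraphicalElement",
    "PT_VERT_LINE": "GraphicalElement",
}


def alto_element(block_type):
    """Pick the ALTO block element name for an internal block type."""
    return NON_TEXT_ELEMENTS.get(block_type, "TextBlock")


def descends_into_text(block_type):
    """Only TextBlocks are traversed down to lines, words and symbols."""
    return alto_element(block_type) == "TextBlock"


if __name__ == "__main__":
    for block_type in ("PT_FLOWING_TEXT", "PT_FLOWING_IMAGE", "PT_HORZ_LINE"):
        print(block_type, alto_element(block_type), descends_into_text(block_type))
```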
@bertsky
Contributor Author

bertsky commented Jan 28, 2020

Sorry, I just rebased to current master.

> I'll gladly add the extra parameter. But before I do, can someone please confirm the issue with DFG-Viewer is not the v4 namespace? (Because if it is, then it would make more sense to make the parameter about v3 vs v4 instead of glyphs or not.)

@cneud @wrznr, do you have information / opinions on this?

@wrznr

wrznr commented Mar 4, 2020

I agree that glyph information should not be in the ALTO output by default. Concerning the v3-vs.-v4 question, I will try to reach out to the DFG viewer team.

@sebastian-meyer

sebastian-meyer commented Mar 4, 2020

The DFG Viewer completely ignores the namespace and parses ALTO files only down to the textline level, so it should handle v4 well. Do you have some example files I can use to verify?
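A consumer of the kind described above can be approximated as follows. This is not DFG Viewer code, just a small sketch of the same idea: match elements by local name so the namespace version does not matter, and stop at the text line, reading only the word-level `CONTENT` attributes. The function names are made up for this example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def text_lines(alto_xml):
    """Return the text of every TextLine, regardless of the ALTO namespace."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Only direct String children are read; Glyph children are never visited.
        words = [child.get("CONTENT", "") for child in element
                 if local_name(child.tag) == "String"]
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    template = (
        '<alto xmlns="{ns}"><Layout><Page><PrintSpace><TextBlock><TextLine>'
        '<String CONTENT="Hello"><Glyph CONTENT="H"/></String><SP/>'
        '<String CONTENT="world"/>'
        '</TextLine></TextBlock></PrintSpace></Page></Layout></alto>'
    )
    for ns in ("http://www.loc.gov/standards/alto/ns-v3#",
               "http://www.loc.gov/standards/alto/ns-v4#"):
        print(ns, text_lines(template.format(ns=ns)))
```

A parser written this way yields the same lines for v3 and v4 input and is unaffected by extra Glyph elements, apart from the larger download.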

(Side note: While displaying the fulltext in the DFG-Viewer should work for v4, indexing in Kitodo.Presentation is broken because of a currently fixed namespace URI. We'll have to make this more flexible: kitodo/kitodo-presentation#488)

@bertsky
Contributor Author

bertsky commented Mar 4, 2020

Ok, so assuming the namespace URI will be more flexible in Kitodo.Presentation, does that already warrant producing v4 here, or does backwards-compatibility (for supposedly lots of old viewer installations) still take precedence?

(In the former case, I would add an option alto_char_boxes=0, whereas in the latter case I would add an option alto_v4=0 which then also implies glyph output when enabled.)

@sebastian-meyer

I'd prefer backwards compatibility! So +1 for alto_v4=0. 👍

@stweil
Contributor

stweil commented Mar 5, 2020

What about using alto_version=3 instead of alto_v4=0? That would also work when there is a version 5 some day. alto_char_boxes=0 would still be needed if there are good reasons for ALTO v4 (or later v5) without character boxes.
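To make the interplay of the two proposed parameters concrete, here is a sketch of how they could combine. Both `alto_version` and `alto_char_boxes` are only proposals at this point in the thread, so the defaults, the error handling and the function itself are assumptions; the only facts used are the two namespace URIs and that Glyph requires v4.

```python
# Namespace URIs of the two ALTO versions under discussion.
ALTO_NAMESPACES = {
    3: "http://www.loc.gov/standards/alto/ns-v3#",
    4: "http://www.loc.gov/standards/alto/ns-v4#",
}


def alto_output_plan(alto_version=3, alto_char_boxes=False):
    """Return (namespace URI, emit_glyphs) for the proposed settings.

    Hypothetical helper: defaults follow the preference for backwards
    compatibility voiced in the thread (v3, no glyphs).
    """
    if alto_version not in ALTO_NAMESPACES:
        raise ValueError(f"unsupported ALTO version: {alto_version}")
    # Glyph was introduced with v4, so character boxes need at least v4.
    emit_glyphs = bool(alto_char_boxes) and alto_version >= 4
    return ALTO_NAMESPACES[alto_version], emit_glyphs


if __name__ == "__main__":
    print(alto_output_plan())                                   # v3, no glyphs
    print(alto_output_plan(alto_version=4))                     # v4, no glyphs
    print(alto_output_plan(alto_version=4, alto_char_boxes=True))
```

Keeping the version as a number rather than a boolean means a later v5 only needs another dictionary entry, which is the point of the suggestion above.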

@amitdo
Collaborator

amitdo commented Apr 27, 2020

@bertsky, please update the PR. Maybe @stweil will finally merge it :-)

@bertsky
Contributor Author

bertsky commented Apr 28, 2020

> please update the PR. Maybe @stweil will finally merge it :-)

I will. Sorry about the delay!

@Shreeshrii
Collaborator

@bertsky Is this PR ready to merge?

@bertsky
Contributor Author

bertsky commented Dec 19, 2020

@Shreeshrii no, I still have to add alto_char_boxes and alto_version parameters. Plus I have a half-finished version of PSM_AUTO_ONLY (--psm 2) that could come with the same PR. Sorry for the delay everyone. Please have a little more patience.

@amitdo
Collaborator

amitdo commented May 1, 2021

@bertsky, do you remember this PR?

@bertsky
Contributor Author

bertsky commented May 1, 2021

> @bertsky, do you remember this PR?

@amitdo, yes I do. Apologies for keeping you all waiting for so long. I first have to bisect lots of uncommitted changes, which include an implementation of PSM_AUTO_ONLY and fixes to the page/result iterator functions (to avoid missing segments or stopping short under certain rare conditions).

@amitdo
Collaborator

amitdo commented May 1, 2021

About the implementation of PSM_AUTO_ONLY. Please consider doing it in another PR.

@bertsky
Contributor Author

bertsky commented May 1, 2021

> About the implementation of PSM_AUTO_ONLY. Please consider doing it in another PR.

Of course! I'll factor all said changes into separate PRs and test them thoroughly before publishing.

@amitdo amitdo added the stale label May 10, 2021
@amitdo
Collaborator

amitdo commented Aug 27, 2021

Last chance before the final release of 5.0.0.

@bertsky
Contributor Author

bertsky commented Sep 13, 2021

> Last chance before the final release of 5.0.0.

@amitdo what's your (exact) timeline here? (I also have other important bugfixes related to the result iterators in the queue, which I detailed to @stweil a while ago...)

@amitdo
Collaborator

amitdo commented Sep 13, 2021

@bertsky
Contributor Author

bertsky commented Sep 13, 2021

#3331

https://groups.google.com/g/tesseract-ocr/c/pd_8B0wGBZc

Thanks.

IIUC this board contains conflicting statements about this PR (which contains the feature of producing image and table regions in ALTO):
https://github.com/tesseract-ocr/tesseract/projects/1#card-61640188 says this is postponed, while https://github.com/tesseract-ocr/tesseract/projects/1#card-58171156 says it is yet to do.

Anyway, I will try to get the above discussed blockers out of the way quickly.

@stweil stweil added enhancement and removed stale labels Oct 27, 2021
@bertsky
Contributor Author

bertsky commented Jul 12, 2022 via email

@zdenop
Contributor

zdenop commented Nov 24, 2022

@bertsky: we would like to release 5.3.0 in mid-December. Can you finish this PR for it?

@bertsky
Contributor Author

bertsky commented Nov 24, 2022

@zdenop I'll revisit soon, yes.

8 participants