follow-up to dev mailing list discuss + issue #3998: RFC: Consistent tprintf() use, ERROR: & WARNING: prefixes. #4020

GerHobbelt · 2023-02-25T13:41:13Z

follow-up to dev mailing list discuss + issue #3998: RFC: Info messages to stdout from training utils (Issue #3998)

Here's the extract/backport from my master branch:

use tprintf() for all error/warning/info/debug-diagnostics output --> fprintf(stderr, msg) --> tprintf(msg). This also anticipates migrating tesseract to using the fmt library (C++20) as done by @stweil in a dev branch.
all error messages are prefixed with 'ERROR: '. This enables users, f.e., to grep/filter the stderr/log using a regex like
/^ERROR: /
to filter out the error messages.

Application code which overrides the tprintf() API is another example: there error messages can be easily picked up and sent to different monitor channels / GUI windows / etc. in environments where there's no stdout/stderr (MSWindows GUI) or where end users are not accustomed to watching stdout/stderr.
ditto for 'WARNING: ' level messages.

CAVEAT: this is a manual transfer from my master branch + code review while I was at it. I hope I got them all... :-)

@stweil

…stdout from training utils (Issue tesseract-ocr#3998) Here's the extract/backport from my master branch: - use tprintf() for all error/warning/info/debug-diagnostics output --> fprintf(stderr, msg) --> tprintf(msg). This also anticipates migrating tesseract to using the fmt library (C++20) as done by @stweil in a dev branch. - all error messages are prefixed with 'ERROR: '. This enables users, f.e., to grep/filter the stderr/log using a regex like '/^EERROR: /' to filter out the error messages. Application code which overrides the tprintf() API is another example: there error messages can be easily picked up and sent to different monitor channels / GUI windows / etc. in environments where there's no stdout/stderr (MSWindows GUI) or where end users are not accustomed to watching stdout/stderr. - ditto for 'WARNING: ' level messages. CAVEAT: this is a manual transfer from my master branch + code review whilee I was at it. I hope I got them all... :-)

GerHobbelt · 2023-02-25T13:44:40Z

Notes:

adds an error message when UZN file loading fails -- rather than being utterly silent about that.
Edits a few error messages minimally to make them easier to understand -- at least for me. 😉
As a rule of thumb: version info goes to stdout, but anything else that's not direct OCR results goes to stderr, via tprintf. Thus one can easily redirect stdout to file and not be bothered with a having to pull apart a mix of data and diagnostics messages.

… course, one only sees it when scrolling through the pullreq after the fact.

egorpugin · 2023-02-25T15:34:24Z

This is not right.
Proper way is to add fun/macro error() or warning() or tprintf_error() ... that will wrap text with prefix ERROR: or MESSAGE: .

GerHobbelt · 2023-02-25T19:41:31Z

Ok. Do note however, that a few of those prefixes aren't really prefixes as they are preceded by a "\n" that's part of the same message string, so a function will not cover 100% of cases. 😉

Given that, I suggest I do a macro as then the result remains the same at compile time, e.g. tprintf("\n" TPRNT_ERROR("blah blah"), ....) --> expands to a single constant string. Also nice because fmt depends on the format string to be a const_expr, in case tesseract mainline decides to migrate to using fmt as was done by @stweil in a dev branch of his (and which I blatantly ripped for my own).

BTW: definitely not an ERROR() or error() macro as that has a huge chance of clashing with foreign code. ERROR, for example, is a 100% collision with a system-level ERROR define in Windows (been there, seen that, t-shirt, elsewhere)

Therefore I propose TPRNT_ERROR(msg) and TPRNT_WARNING(msg) macros. Deal? 😉

GerHobbelt · 2023-02-25T19:50:48Z

BTW: there are two important levels for the tprintf() messages:

Error
Warning

and the rest is not prefixed (ok, a rare 'INFO:' may be observed; I'm only consistent 95% of the time 😄 ) so all info/debug messages come without a prefix, like before.
This keeps the general diagnostics output (which is loud and needed if you want to find out what tesseract is doing at all, I find) clean from useless prefixes that only clutter console/logfile space.

Meanwhile the important stuff (errors and warnings) can be easily filtered/routed in applications, where those are relevant to report/log without being obscured by the diag output noise.

If you are of a different opinion re the above levels & prefixes 'rule', I'm all ears and I'll see how this patch can be tweaked.

egorpugin · 2023-02-25T20:31:57Z

If current tprint means logger, we can name new entities log_error(), log_warning(), log_info() etc.
If you want TPRNT_ERROR I think it's better to name it as TPRINT_*. No need to shorten 1 letter.

…nalyse process

…aught by CI

GerHobbelt · 2023-02-27T18:33:41Z

@egorpugin : before I tackle this, there's also tlog() and ERROR::error() which invokes fprintf(stderr, ...) directly, so the question is: what would you like?

I'm myself of two minds about this: I could do it as a tprint_error/tprint_warning/tprint_info set¹ (do we want the tlog() to be merged/mixed in with that, as we're talking code refactoring anyway), which has my minor preference, or we could do it as tprintf-the-whole-darn-lot-with-ERROR+WARNING-macros, or, syslog-like levels, e.g. tprintf(ERROR, msg...) 🤔

If we can get some kind of a management decision on this refactor action, that would be great. 😄 Then I can do it in my own fork and backport it, like the stuff done so far, but I'd rather do the refactor once as it's... some work. And finding out I did it in a direction that's not liked, well... that'll cost me later for a feature I don't mind otherwise while I keep tracking your mainline repo: increased expected cost of git merge collisions. 😭 Anyway, so's you know why I ask this.

Thanks! 👍

Side note: ERROR::error() also has an 'abort'/fatal mode a la glog (google log library, see github). I tackle that sort of thing in my backend by pumping everything through the supposed channels and in case of a fatal, doing a hard flush. That means we'll have something like log_flush() and log_abort() too: I like to keep the flush and actual abort logic separate -- own backend has debugger detection built-in so the fail-on-null-pointer-dereference is then exchanged for a jump-into-debugger call. By providing log_flush() + log_abort() anyone can override those actions easily, like I did for my own backend.

OK or do you want it done differently?

which meshes nicely with my back-end where I have different error/warning/info/debug calls, so log_error(), log_warning(), log_info() is you suggested earlier makes that a cheaper tactic at run-time. 👍 Hence the light preference for this approach. ↩

src/ccstruct/blread.cpp

egorpugin · 2023-02-27T19:50:47Z

Just leave it as is with "ERROR: .
Further refactorings will improve it.

In the second part you are talking about more like full featured logger rather than some one-function-printer.
If tesseract really needs this with sources & sinks & MT support etc, it can be handled with switching to boost log or spd log or small but consistent own implementation. It is not this ticket topic I guess.

GerHobbelt · 2023-02-27T23:01:40Z

Ok, if I understood that correctly, you want to keep this mostly as-is.

I'll do the tprintf() et al -> log_error() et al migration as a separate pullreq later. BTW: the key argument there is to NOT have spdlog/glog/whatever in tesseract itself, but to keep tesseract mostly as-is, BUT provide a somewhat easy way for OTHERS to modify or hook tesseract up to such systems if THEIR use of the tesseract library requires such measures.

In other words:

tesseract-the-application should simply keep writing to stderr as before -- only through a small set of functions, such as tprintf.
tesseract-the-library can do the same, or...
tesseract-the-library can have its tprintf(),etc. output hooked into and rerouted to whatever the application programmer needs, e.g. glog or completely different means.

Hope I made myself clear on the goal there -- which is ultimately what I need / use.
Being consistent in the format (prefix, rather) of error and warning messages is a first step towards all that.
Any refactoring in that direction can be done in a separate pullreq as I understand you suggest; it would nicely separate the changes intended for a different goal.

Apologies for any confusion I might've caused.

…s expected. Cause: omission in backporting this from tprintf+fmt code. Caught by CI.

egorpugin · 2023-02-27T23:43:23Z

How do you imagine hooking?

GerHobbelt · 2023-02-28T19:05:48Z

Something simple. After all, tesseract (the application) doesn't need the fancy footwork, nor do many uses as a library.

Currently what I've got is a blunt, hardcoded override in tprintf.cpp, but that's just for now as I'm still in the testing phase and mostly on the command line myself. In a while tesseract will be incorporated in a server, where you can fire requests at it. Which is where it becomes important to have tprintf/tlog diagnostics available per tesseract session¹. Haven't coded it yet, but I reckon I might need access to the tesseract instance as an aide to identify which session we're accessing the logging/diagnostics of, plus some application-specifics (to be determined).

Classically I'd do it as one or more C-style function pointers (callbacks), to be 'registered' via a new Tesseract API. But that's a little convoluted/messy re cleanup, and having had a look at current tprintf.cpp it looks like proper file closing (debug_file != stderr) isn't featured today, so it might be nice to drop that in a class destructor perhaps.

Right now I'm considering doing it with a C++ class instance:

create an application-specific derived class' instantiation of the tesseract-defined (virtual) base interface class (say class TesseractLoggingInterface).
tesseract always provides its own instantiation, with is hooked up/invoked by default. (Current tprintf/tlog in a class wrapper - keeping those calls but replacing vTessPrintf() so little visible change to most of the codebase)
application can register/replace its implementation-specific instance via a Tesseract BaseAPI call.

That way, little changes for tesseract itself; maybe I could clean up the current tprintf/tlog a little, but not too much. (tprintf_error + tprintf_warning + tprintf_info migration would help me, so that's a prerequisite for me to do, I think?)

Thus I get myself a cleaner interface that's easier to use and maintain once I create the server: that one can do it the way I like then. (The idea there is to use SQLite for a lot of stuff (fast & cheap storage cross-platform and also suitable for non-server / desktop machines), so current thought is to dump the log/diagnostics in SQLite as well, so callers can inspect after the fact by firing queries at the server for whatever they want. Get the image. Get the OCR output in format X. Get the thresholded image or other intermediate. Get the diagnostics/log for it. That sort of thing. Anyway, all that's is out of scope for tesseract and no running code yet.)

I could do a 'it might look like this' RFC/pullreq in the next bunch of weeks: I need to field-test this anyway and then you & others can have a look and comment/steer it towards possible integration (for the virtual base class + default class def that equals current behaviour).

Any which way, I agree it shouldn't be made part of this pullreq. Better for me to file a new / follow-up PR for this. (Which was why I did the "ERROR: ..." stuff like I did; some of the interfacing is complicated enough to not do it all in one go. 😉 )

"session", "job", whatever you name it. The grand picture there is: you queue page images (you may want to fiddle with some priorities to move some of those 'pages' up or down, because... 🤷 ), tesseract chuggs away, you get the results and any collateral you want for review. Tesseract tprintf() output is one of the outputs that's related to a page, 's all. Hence "logging available per session") ↩

Whoops! one of the fmt format lines got through. My mistake! Sigh. Of…

ccae24f

… course, one only sees it when scrolling through the pullreq after the fact.

wordlist2dawg: one ERROR should be listed as a WARNING.

99e35c6

GerHobbelt added 2 commits February 27, 2023 11:59

fix error (printf --> trintf (misnamed!) --> tprintf), caught by CI A…

9791350

…nalyse process

fix compile eerror due to missing tesseract::tprintf() prototype -- c…

5d708fa

…aught by CI

github-advanced-security bot found potential problems Feb 27, 2023

View reviewed changes

src/ccstruct/blread.cpp Fixed Show fixed Hide fixed

fix QA error due to std::string arg used in tprintf, where C string i…

390b6e8

…s expected. Cause: omission in backporting this from tprintf+fmt code. Caught by CI.

zdenop mentioned this pull request Oct 24, 2023

Send output of combine_tessdata -d to stdout instead of stderr. Fixes #4149 #4150

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

follow-up to dev mailing list discuss + issue #3998: RFC: Consistent tprintf() use, ERROR: & WARNING: prefixes. #4020

follow-up to dev mailing list discuss + issue #3998: RFC: Consistent tprintf() use, ERROR: & WARNING: prefixes. #4020

GerHobbelt commented Feb 25, 2023

GerHobbelt commented Feb 25, 2023

egorpugin commented Feb 25, 2023

GerHobbelt commented Feb 25, 2023 •

edited

GerHobbelt commented Feb 25, 2023

egorpugin commented Feb 25, 2023 •

edited

GerHobbelt commented Feb 27, 2023

egorpugin commented Feb 27, 2023

GerHobbelt commented Feb 27, 2023

egorpugin commented Feb 27, 2023

GerHobbelt commented Feb 28, 2023

follow-up to dev mailing list discuss + issue #3998: RFC: Consistent tprintf() use, ERROR: & WARNING: prefixes. #4020

Are you sure you want to change the base?

follow-up to dev mailing list discuss + issue #3998: RFC: Consistent tprintf() use, ERROR: & WARNING: prefixes. #4020

Conversation

GerHobbelt commented Feb 25, 2023

GerHobbelt commented Feb 25, 2023

egorpugin commented Feb 25, 2023

GerHobbelt commented Feb 25, 2023 • edited

GerHobbelt commented Feb 25, 2023

egorpugin commented Feb 25, 2023 • edited

GerHobbelt commented Feb 27, 2023

Footnotes

egorpugin commented Feb 27, 2023

GerHobbelt commented Feb 27, 2023

egorpugin commented Feb 27, 2023

GerHobbelt commented Feb 28, 2023

Footnotes

GerHobbelt commented Feb 25, 2023 •

edited

egorpugin commented Feb 25, 2023 •

edited