Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PrintVariables produces config file that ReadConfigFile does not properly read #3943

Open
Balearica opened this issue Oct 15, 2022 · 13 comments · May be fixed by #4074
Open

PrintVariables produces config file that ReadConfigFile does not properly read #3943

Balearica opened this issue Oct 15, 2022 · 13 comments · May be fixed by #4074

Comments

@Balearica
Copy link

Expected Behavior:

The API function PrintVariables prints current parameters to a file, and ReadConfigFile reads parameters from a file. Intuitively, ReadConfigFile should be able to read the files that PrintVariables writes. This is explicitly assumed within the ProcessPage function, where these functions are used together to "Save current config variables before switching modes" and then "Restore saved config variables".

tesseract/src/api/baseapi.cpp

Lines 1293 to 1306 in a873553

// Save current config variables before switching modes.
FILE *fp = fopen(kOldVarsFile, "wb");
if (fp == nullptr) {
tprintf("Error, failed to open file \"%s\"\n", kOldVarsFile);
} else {
PrintVariables(fp);
fclose(fp);
}
// Switch to alternate mode for retry.
ReadConfigFile(retry_config);
SetImage(pix);
Recognize(nullptr);
// Restore saved config variables.
ReadConfigFile(kOldVarsFile);

Current Behavior:

Unfortunately, this does not currently work properly. The issue is that PrintVariables prints parameter descriptions alongside key/value pairs (e.g. chs_trailing_punct1 ).,;:?! 1st Trailing punctuation), and ReadConfigFile reads the description as a value (for string parameters). An example showing this is below.

#include <tesseract/baseapi.h>

int main()
{
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    if (api->Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    static const char *kOldVarsFile = "failed_vars.txt";

    // Print default value of chs_trailing_punct1
    printf("Initial value: %s\n", api->GetStringVariable("chs_trailing_punct1"));
    FILE *fp = fopen(kOldVarsFile, "wb");
    api->PrintVariables(fp);
    fclose(fp);
    api->ReadConfigFile(kOldVarsFile);
    printf("After PrintVariables/ReadConfigFile: %s\n", api->GetStringVariable("chs_trailing_punct1"));

    api->End();
    delete api;
    return 0;
}

This returns the following:

Initial value: ).,;:?!
After PrintVariables/ReadConfigFile: ).,;:?!	1st Trailing punctuation

The impact of this is:

  1. ProcessPage does not work correctly when used with retry_config
  2. There is no simple interface for generating a config file with the user's current settings
    1. This is useful for saving/restoring configurations (as ProcessPage attempts to do)

Suggested Fix:

The simplest solution would be to remove the descriptions from the PrintVariables output (or at least hide that behavior behind an option). I can write a PR if others agree this makes sense. Editing ReadConfigFile to ignore the descriptions is likely also possible, but could be higher effort.


Environment

Tesseract Version: 5.2.0
Commit Number: 15200c6
Platform: Linux ubuntu 5.15.0-43-generic

@amitdo
Copy link
Collaborator

amitdo commented Oct 15, 2022

The simplest solution would be to remove the descriptions from the PrintVariables output

We don't want to do this because that function is used by the command line tool and we need the descriptions in that case.

tesseract/src/tesseract.cpp

Lines 721 to 724 in 0daf18c

if (print_parameters) {
FILE *fout = stdout;
fprintf(stdout, "Tesseract parameters:\n");
api.PrintVariables(fout);

@Balearica
Copy link
Author

@amitdo Good point. How about adding an option to PrintVariables for whether info text should be included (enabled by default, but disabled for the call within ProcessPage).

@amitdo
Copy link
Collaborator

amitdo commented Oct 16, 2022

How about using:

if (fp == stdout)

or:

(fp == stdout) ? ... : ...

in:

void ParamUtils::PrintParams(FILE *fp, const ParamsVectors *member_params) {

@Balearica
Copy link
Author

Balearica commented Oct 19, 2022

@amitdo Works for me. Added PR #3947.

I confirmed (1) running tesseract --print-parameters from the command line still prints the info text and (2) the issue demonstrated in the example above is resolved (the value of chs_trailing_punct1 is not changed).

@stweil
Copy link
Contributor

stweil commented Oct 19, 2022

The regression was added in c51691f in 2014, a long time ago, so all relevant releases are affected.

@stweil
Copy link
Contributor

stweil commented Oct 19, 2022

Obviously the argument retry_config for the API functions ProcessPages and ProcessPagesFileList which triggers the regression was not used for years. Is it useful? If not, we simply might drop the support for it.

If retry_config is useful, maybe the regression could be fixed by extending ParamUtils::ReadParamsFromFp to remove the comment part from the input lines. Or even better, TessBaseAPI::ProcessPage should save and restore parameters in memory without any file on disk.

@Balearica
Copy link
Author

I think the ability to dump the user's current parameters and easily restore them later is useful (we're implementing a feature in Tesseract.js that tries to do this using PrintVariables/ReadConfigFile).

Adding a new function that writes a config file without the info text (suggested in the PR #3947) seems like an easy and effective solution. I looked into whether ReadConfigFile could be tweaked to ignore the info text, however this seems potentially tricky/error-prone. My first thought was to read everything to the first space character as the parameter value, however at least one parameter uses space character(s) as values (the default value of page_separator is \n).

@amitdo
Copy link
Collaborator

amitdo commented Oct 19, 2022

I think the ability to dump the user's current parameters and easily restore them later is useful

Related issue: #3260

@stweil
Copy link
Contributor

stweil commented Oct 19, 2022

I looked into whether ReadConfigFile could be tweaked to ignore the info text, however this seems potentially tricky/error-prone. My first thought was to read everything to the first space character as the parameter value, however at least one parameter uses space character(s) as values (the default value of page_separator is \n).

ParamUtils::PrintParams uses KEY TAB VALUE TAB COMMENT.

@amitdo
Copy link
Collaborator

amitdo commented Oct 19, 2022

#3670 (comment)

This comment and my comments below it are also related to this issue.

@Balearica
Copy link
Author

ParamUtils::PrintParams uses KEY TAB VALUE TAB COMMENT.

Okay, if we are confident that VALUE will never include tabs, then I can edit ReadConfigFile to only read until the first tab. If this is not the case, then I can do the separate DumpParameters function.

@Balearica
Copy link
Author

@stweil Do you believe it's safe to assume that tabs will not be used as parameter values (so it is safe for ReadConfigFile to read everything up to the tab and ignore everything after)?

@Balearica
Copy link
Author

I created a new PR (#4074) based on @stweil's idea to implement by adding a new function in #3947 (review).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants