Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create pain points before running associator to resolve #3892 #3893

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Balearica
Copy link

See #3892

@amitdo
Copy link
Collaborator

amitdo commented Oct 9, 2022

The problem is that we don't know if there might be some cases where this patch will cause worst results.

@stweil, @zdenop, maybe we can accept this addition if it will be optional, using a config variable and will be turned off by default?

@Balearica
Copy link
Author

@amitdo Do you have a specific scenario in mind where this change would plausibly cause worse results? Alternatively, is there some corpus of benchmark documents where we could assess the impact of this change empirically?

The behavior this PR addresses (fully documented in #3892) is clearly a bug, as there is no reason why the associator should randomly skip letters at the end of certain words. I don't think it makes sense to avoid changing this behavior (using the default settings) in the absence of specific concerns regarding this fix. As stated above, I am happy to run an accuracy benchmark if one already exists.

@amitdo
Copy link
Collaborator

amitdo commented Oct 9, 2022

Do you have a specific scenario in mind where this change would plausibly cause worse results?

No.

Don't assume that the few currently active developers deeply know all the algorithms used in Tesseract.

Alternatively, is there some corpus of benchmark documents where we could assess the impact of this change empirically?

In the past the UNLV dataset and tools were used for testing Tesseract's accuracy.

See:

But the UNLV dataset have just English and Spanish written texts. Do you think your patch is fine for all the scripts that Tesseract supports?

Related issue: #3402.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants