OCRing images written in Hebrew with diacritics is completely not working #4119

Open
Maxwell175 opened this issue Aug 18, 2023 · 6 comments

Comments

@Maxwell175

Current Behavior

Running Tesseract on a Hebrew scan: tesseract --oem 1 -l heb image00041.jpg image00041.jpg pdf

Try copying text from the resulting PDF file and observe that the copied text is nothing like the original.

Tried with the default models installed from the Arch repos and with the tessdata_best model.

Expected Behavior

OCR text should match original.

Suggested Fix

No response

tesseract -v

tesseract 5.3.2
leptonica-1.83.1
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5.1) : libpng 1.6.40 : libtiff 4.5.1 : zlib 1.2.13 : libwebp 1.3.1 : libopenjp2 2.5.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.7.1 zlib/1.2.13 liblzma/5.4.3 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.2.1 OpenSSL/3.1.2 zlib/1.2.13 brotli/1.0.9 zstd/1.5.5 libidn2/2.3.4 libpsl/0.21.2 (+libidn2/2.3.4) libssh2/1.11.0 nghttp2/1.55.1

Operating System

No response

Other Operating System

Manjaro

uname -a

Linux Maxwell-Main 6.3.13-2-MANJARO #1 SMP PREEMPT_DYNAMIC Sun Jul 16 16:48:53 UTC 2023 x86_64 GNU/Linux

Compiler

N/A

CPU

AMD Ryzen Threadripper 2950X

Virtualization / Containers

No response

Other Information

image00041.jpg.pdf
image00041

@stweil
Contributor

stweil commented Aug 18, 2023

Does it work with other output formats like txt, alto or hocr? And did it work with other releases of Tesseract?
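To make that comparison easier, here is a minimal sketch that builds one Tesseract command line producing several output formats in a single run. It assumes the `txt`, `hocr`, `alto` and `pdf` config names are available in the installed tessdata configs; the helper name `build_tesseract_command` and its parameters are hypothetical, not part of Tesseract.

```python
import shlex


def build_tesseract_command(image, output_base, lang="heb", oem=1,
                            formats=("txt", "hocr", "alto", "pdf")):
    """Build an argument list for the tesseract CLI (hypothetical helper).

    The option order mirrors the command from the report:
    options first, then input image, output base name, and output configs.
    """
    return ["tesseract", "--oem", str(oem), "-l", lang,
            image, output_base, *formats]


if __name__ == "__main__":
    cmd = build_tesseract_command("image00041.jpg", "image00041")
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Running the printed command should leave one output file per requested format next to the given base name, so the txt, hOCR and ALTO results can be checked against the PDF text layer.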

@stweil
Contributor

stweil commented Aug 18, 2023

Please have a look at issue #238. Is that the same problem, related to RTL script?

@Maxwell175
Author

Maxwell175 commented Aug 18, 2023

> Does it work with other output formats like txt, alto or hocr? And did it work with other releases of Tesseract?

It does not work in txt either. It's just harder to tell which output corresponds to which source text, but even looking into it in detail, it's fairly obvious that none of the words are correct.

I have not tried old versions of tesseract, but I have tried the legacy engine and it gives different, but also incorrect results.

> Please have a look at issue #238. Is that the same problem, related to RTL script?

I wouldn't be able to tell at this point because however you read it, the letters are just wrong.

@amitdo
Collaborator

amitdo commented Aug 18, 2023

You didn't attach the txt output.

The training data used for training Hebrew with nikud marks had many issues, so it's not surprising that OCRing Hebrew images with nikud gives you bad results.

tesseract-ocr/langdata#82
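One way to check whether only the nikud is wrong or the base letters as well is to strip the marks from both the expected text and the OCR output before comparing. The following is a rough sketch, not something Tesseract provides: the function names are hypothetical, and it assumes that every mark to drop has Unicode category `Mn` (which covers Hebrew points and cantillation marks).

```python
import difflib
import unicodedata


def strip_nikud(text: str) -> str:
    """Remove nonspacing marks (nikud, cantillation), keeping base letters."""
    # NFD splits precomposed presentation forms into base letter + marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


def base_letter_similarity(expected: str, ocr: str) -> float:
    """Similarity ratio in [0, 1] between the two texts, ignoring marks."""
    matcher = difflib.SequenceMatcher(None, strip_nikud(expected),
                                      strip_nikud(ocr))
    return matcher.ratio()


if __name__ == "__main__":
    expected = "שָׁלוֹם"   # pointed ground truth typed by hand
    ocr = "שלום"          # hypothetical OCR output without points
    print(base_letter_similarity(expected, ocr))
```

A ratio close to 1 would suggest the letters are recognized and mainly the diacritics are off; a low ratio would match the report that the letters themselves are wrong.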

@amitdo changed the title from "Hebrew OCR is completely not working for scanned image" to "OCRing images written in Hebrew with diacritics is completely not working" on Aug 18, 2023
@Maxwell175
Author

> that picture looks like chicken scratch...a human would have a hard time reading it...nevermind a computer

I am by no means a Hebrew expert and I have no issue reading it. This is not something unusual for uncommon or rare Hebrew books.

4 participants
@Maxwell175 @stweil @amitdo and others