OCRing images written in Hebrew with diacritics is completely not working #4119

Maxwell175 · 2023-08-18T06:01:41Z

Current Behavior

Running Tesseract on a Hebrew scan: tesseract --oem 1 -l heb image00041.jpg image00041.jpg pdf

Try copying text from the resulting PDF file and observe that the copied text is nothing like the original.

Tried with the default models installed from the Arch repos and with the tessdata_best model.

Expected Behavior

OCR text should match original.

Suggested Fix

No response

tesseract -v

tesseract 5.3.2
leptonica-1.83.1
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5.1) : libpng 1.6.40 : libtiff 4.5.1 : zlib 1.2.13 : libwebp 1.3.1 : libopenjp2 2.5.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.7.1 zlib/1.2.13 liblzma/5.4.3 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.2.1 OpenSSL/3.1.2 zlib/1.2.13 brotli/1.0.9 zstd/1.5.5 libidn2/2.3.4 libpsl/0.21.2 (+libidn2/2.3.4) libssh2/1.11.0 nghttp2/1.55.1

Operating System

No response

Other Operating System

Manjaro

uname -a

Linux Maxwell-Main 6.3.13-2-MANJARO #1 SMP PREEMPT_DYNAMIC Sun Jul 16 16:48:53 UTC 2023 x86_64 GNU/Linux

Compiler

N/A

CPU

AMD Ryzen Threadripper 2950X

Virtualization / Containers

No response

Other Information

image00041.jpg.pdf


stweil · 2023-08-18T06:22:08Z

Does it work with other output formats like txt, alto or hocr? And did it work with other releases of Tesseract?
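For anyone reproducing this, a minimal sketch of how the output formats mentioned above could be compared in one run. It assumes the reporter's file name and options (image00041.jpg, --oem 1, -l heb); the output base name image00041, the function name ocr_all_formats, and the DRY_RUN switch are illustrative and not part of the report. txt, hocr, alto and pdf are config names shipped with Tesseract, and several can be given in a single invocation.

```shell
#!/bin/sh
# Sketch: produce txt, hOCR, ALTO and PDF output from the same recognition pass.
# IMAGE and OUTBASE are placeholders; override them in the environment.
IMAGE="${IMAGE:-image00041.jpg}"
OUTBASE="${OUTBASE:-image00041}"

ocr_all_formats() {
    # Build the command as an argument list so file names with spaces survive.
    set -- tesseract "$IMAGE" "$OUTBASE" --oem 1 -l heb txt hocr alto pdf
    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Default: only print the command (hypothetical safety switch).
        echo "$*"
    else
        "$@"
    fi
}

ocr_all_formats
```

Run with DRY_RUN=0 to actually invoke Tesseract. Each renderer writes its own file next to the output base, so the plain-text result can be inspected without going through the PDF text layer.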

stweil · 2023-08-18T06:25:54Z

Please have a look at issue #238. Is that the same problem, related to RTL script?

Maxwell175 · 2023-08-18T06:29:36Z

> Does it work with other output formats like txt, alto or hocr? And did it work with other releases of Tesseract?

It does not work in txt either. It's just harder to tell which output corresponds to which source text, but looking at it in detail, it's fairly obvious that none of the words are correct.

I have not tried older versions of Tesseract, but I have tried the legacy engine and it gives different but also incorrect results.
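A rough sketch of the engine comparison described above, for readers who want to repeat it. --oem 0 selects the legacy engine and --oem 1 the LSTM engine. The output names out_oem0 and out_oem1, the function name run_ocr, and the DRY_RUN switch are assumptions made for illustration. Note that the legacy engine needs a traineddata file that contains a legacy model; the tessdata_best and tessdata_fast files are LSTM-only.

```shell
#!/bin/sh
# Sketch: run the same image through the legacy (0) and LSTM (1) engines.
# IMAGE is a placeholder; override it in the environment.
IMAGE="${IMAGE:-image00041.jpg}"

run_ocr() {
    # $1 = engine mode; output goes to out_oem<mode>.txt
    set -- tesseract "$IMAGE" "out_oem$1" --oem "$1" -l heb txt
    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Default: only print the command (hypothetical safety switch).
        echo "$*"
    else
        "$@"
    fi
}

for oem in 0 1; do
    run_ocr "$oem"
done
```

With DRY_RUN=0, the two text files can then be compared side by side, for example with diff out_oem0.txt out_oem1.txt.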

> Please have a look at issue #238. Is that the same problem, related to RTL script?

I wouldn't be able to tell at this point because however you read it, the letters are just wrong.

amitdo · 2023-08-18T11:19:10Z

You didn't attach the txt output.

The training data used for training Hebrew with nikud marks had many issues, so it's not surprising that OCRing Hebrew images with nikud is giving you bad results.

tesseract-ocr/langdata#82

Maxwell175 · 2023-08-30T21:37:47Z

> that picture looks like chicken scratch...a human would have a hard time reading it...nevermind a computer

I am by no means a Hebrew expert and I have no issue reading it. This is not unusual for uncommon or rare Hebrew books.

amitdo changed the title from "Hebrew OCR is completely not working for scanned image" to "OCRing images written in Hebrew with diacritics is completely not working" · Aug 18, 2023
