Invisible glyph bounds at wrong positions in PDF #2879

Open
THausherr opened this issue Feb 6, 2020 · 57 comments

@THausherr

THausherr commented Feb 6, 2020

Environment

Call:

"C:\Program Files\Tesseract-OCR\tesseract" scan.tif scan-ocr pdf

Current Behavior:

Text bounds do not match the visible glyphs in Adobe Reader. Example:

[screenshot]

Expected Behavior:

Text bounds should match the visible glyphs in Adobe Reader. In the screenshot, the blue highlight should cover the "n".

Suggested Fix:

I suspect that the /W array is missing from the font dictionary:
[screenshot]
So Adobe will use the /DW 500 entry (screenshot from the PDF 32000 specification):
[screenshot]

scan-ocr.pdf
scan.tif.zip

THausherr changed the title from "Invisible glyph bounds at wrong position in PDF" to "Invisible glyph bounds at wrong positions in PDF" on Feb 6, 2020
@jbreiden
Contributor

jbreiden commented Feb 6, 2020 via email

@THausherr
Author

Sorry, I don't understand what you mean. My argument is that the highlight widths don't match. Adobe gets these from the font data, and widths differ in a proportional font. And it isn't just the "n". When trying to highlight the "I", it looks like this:
[screenshot]

@jbreiden
Contributor

jbreiden commented Feb 6, 2020 via email

@THausherr
Author

I had a look with the glyph contour display of PDFBox and there it matches the word bounds:
[screenshot]

So maybe Adobe is to blame, but users will of course see this differently :-(

@THausherr
Author

THausherr commented Feb 6, 2020

I think I found a bit more... "Introduction" has 12 characters but looks like this in the PDF content stream:
1 0 0 1 77.76 738.16 Tm /f-0-0 11 Tf 107.076 Tz [ <0049006E00740072006F00640075006300740069006F006E0020> ] TJ
this is 13 characters. The last one (0020) is a space. This space is positioned over the final "n".
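
The count can be checked mechanically. The sketch below assumes two bytes (four hex digits) per character code, as in the stream quoted above; `DecodeHexCodes` and `HasTrailingSpace` are hypothetical helper names, not Tesseract or PDFBox functions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decode the body of a PDF hex string (without the angle brackets) into
// 16-bit character codes, assuming four hex digits per code.
std::vector<uint16_t> DecodeHexCodes(const std::string &hex) {
  assert(hex.size() % 4 == 0);  // sketch only: expects well-formed input
  std::vector<uint16_t> codes;
  for (std::string::size_type i = 0; i + 4 <= hex.size(); i += 4) {
    codes.push_back(
        static_cast<uint16_t>(std::stoul(hex.substr(i, 4), nullptr, 16)));
  }
  return codes;
}

// True if the last code is U+0020, i.e. the word carries a trailing space.
bool HasTrailingSpace(const std::vector<uint16_t> &codes) {
  return !codes.empty() && codes.back() == 0x0020;
}
```

Fed the hex string from the TJ operator above, this yields 13 codes, the last of which is the space.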

@THausherr
Author

When removing "3 Tr" so that the "invisible" font becomes visible, it looks like this:
[screenshot]
This is really 13 characters. For some reason, Adobe doesn't want to mark the final space.

@THausherr
Author

I just noticed that the PDFBox screenshot shows it too: "ISO" has 4 characters, "32000" has 6 characters.

Maybe the original idea was to put the space there for text extraction? However it isn't needed, good text extractors "imagine" the space from the position differences.

If the space character is needed, then it should be positioned over the actual space.
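
To illustrate what "imagining" the space from position differences can look like, here is a minimal sketch. It is not how any particular extractor works: the `Run` struct, the `JoinRuns` name and the `min_gap` threshold are all assumptions made for the example, and it only handles a single left-to-right line.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A piece of text with its horizontal extent on the page (hypothetical type).
struct Run {
  std::string text;
  double left;
  double right;
};

// Join runs that are sorted left to right, inserting a space wherever the
// horizontal gap between neighbours is at least min_gap.
std::string JoinRuns(const std::vector<Run> &runs, double min_gap) {
  assert(min_gap >= 0.0);
  std::string out;
  for (std::vector<Run>::size_type i = 0; i < runs.size(); ++i) {
    if (i > 0 && runs[i].left - runs[i - 1].right >= min_gap) {
      out += ' ';
    }
    out += runs[i].text;
  }
  return out;
}
```

With this approach no U+0020 glyph is needed in the content stream at all; the trade-off is that the result depends on the threshold the extractor picks.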

@amitdo
Collaborator

amitdo commented Feb 6, 2020

#1900

@THausherr
Author

Thanks. After reading that one, I think this issue is also somewhat of a duplicate of ocrmypdf/OCRmyPDF#450.

@amitdo
Collaborator

amitdo commented Feb 6, 2020

You should check the bounding box of the whole word 'Introduction' with the hocr format. Does it also end before the last glyph?

@jbreiden
Contributor

jbreiden commented Feb 7, 2020 via email

@THausherr
Author

Yeah I understand that this feature was implemented to "help" low quality text extractors.

How about making the feature configurable for PDF? IMHO the majority user expectation is whatever Adobe does, that is the gold standard.

Zero width space also sounds like an interesting idea to explore. You probably have to add appropriate /W entries.

(The reason I created this issue: we're using a commercial OCR tool on a project that is growing fast. The OCR is fine, but licensing is a pain, it doesn't use all CPU cores, logging is almost nonexistent, and the whole thing is a black box. So I was thinking about replacing it with Tesseract, but before we discuss this with the client I need to be sure that the client, and its clients, would be satisfied.)

@THausherr
Author

@amitdo The bounding box is correct:

   <div class='ocr_carea' id='block_1_2' title="bbox 324 400 643 442">
    <p class='ocr_par' id='par_1_2' lang='eng' title="bbox 324 400 643 442">
     <span class='ocr_line' id='line_1_2' title="bbox 324 400 643 442; baseline 0 -1; x_size 47.393444; x_descenders 6.3934426; x_ascenders 11">
      <span class='ocrx_word' id='word_1_3' title='bbox 324 400 643 442; x_wconf 95'>Introduction</span>
     </span>
    </p>
   </div>
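
To check a word's extent programmatically, the `bbox` values can be pulled out of the hOCR `title` attribute. A minimal sketch, assuming the hOCR convention that the four integers are left, top, right and bottom in image pixels; `BBox` and `ParseBBox` are hypothetical names.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Bounding box in image pixels, in hOCR order.
struct BBox {
  int left;
  int top;
  int right;
  int bottom;
};

// Extract "bbox x0 y0 x1 y1" from an hOCR title attribute value.
// Returns false if the title holds no complete bbox.
bool ParseBBox(const std::string &title, BBox *box) {
  assert(box != nullptr);
  const std::string::size_type pos = title.find("bbox ");
  if (pos == std::string::npos) {
    return false;
  }
  return std::sscanf(title.c_str() + pos, "bbox %d %d %d %d", &box->left,
                     &box->top, &box->right, &box->bottom) == 4;
}
```

For the `ocrx_word` title above this gives a word width of 643 - 324 = 319 pixels, the same extent as the enclosing line, so the word box itself does not stop short of the final glyph.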

@amitdo
Collaborator

amitdo commented Feb 7, 2020

Adobe Acrobat is not as popular as it used to be 10 years ago.

Default PDF viewers:

  • Windows 10 - Chromium based Edge - Pdfium
  • macOS - Preview
  • ChromeOS - Pdfium
  • Chromium / Chrome / Edge - Pdfium
  • Firefox - pdf.js
  • Linux - Evince/Okular (Poppler)

So most users will use the built-in PDF viewer of their OS or browser, which is not Adobe's viewer.

The best solution is to find a method that works in all of these viewers, without a special parameter for a specific viewer.

@amitdo
Collaborator

amitdo commented Feb 7, 2020

I tested your pdf file with Chromium (pdfium), Firefox (pdf.js) and Evince (poppler).

The word bounding boxes look very good when the page is viewed with pdfium/pdf.js.

Poppler suffers from the same issue you raised above combined with a 'zebra effect'.

@THausherr
Author

With PDF.js on Firefox, a double click marks the whole word, but when I mark the final "n", I get a space.

With Chrome, a double click shows the same effect as with Adobe Reader.

With MS Edge, I see the same effect as with PDF.js.

@jbreiden
Contributor

jbreiden commented Feb 8, 2020 via email

@THausherr
Author

THausherr commented Feb 9, 2020

Thanks for the nice comment. My problem is that I haven't done C/C++ for almost 10 years, except for maintenance of my existing software. I don't even have a dev system up that supports current language standards, so I would have to install, understand and learn that first. However, I'll keep this issue in mind when I have more time at work (because this is a work issue).

@amitdo
Collaborator

amitdo commented Feb 9, 2020

@jbarlow83, maybe you can help us here.

@jbreiden
Contributor

jbreiden commented Feb 9, 2020 via email

@jbreiden
Contributor

jbreiden commented Feb 9, 2020 via email

@zvezdochiot

Alternative: (tesseract hocr) + (hocr-pdf (https://github.com/ImageProcessing-ElectronicPublications/hocr-tools)).

@jbreiden3

jbreiden3 commented Feb 9, 2020

Tried modifying the font to add a specific entry for U+0020. Same results, Adobe good, pdfium bad. This is the point where I pause, and people take a look for mistakes. If nobody finds anything, the next step is probably asking for help. That's Ken Sharp about the overall approach & especially the font, and Pdfium folks to help debug why the /W entry does not appear to be honored.

--- pdfrenderer.cpp.orig	2019-07-07 08:23:24.000000000 -0700
+++ pdfrenderer.cpp	2020-02-09 12:00:57.961541649 -0800
@@ -468,7 +468,6 @@
     } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
     if (res_it->IsAtBeginningOf(RIL_WORD)) {
       pdf_word += "0020";
-      pdf_word_len++;
     }
     if (word_length > 0 && pdf_word_len > 0) {
       double h_stretch =
@@ -535,6 +536,7 @@
     "  /Subtype /CIDFontType2\n"
     "  /Type /Font\n"
     "  /DW " << (1000 / kCharWidth) << "\n"
+    "  /W [ 1 [500 1] ]\n"
     ">>\n"
     "endobj\n";
   AppendPDFObject(stream.str().c_str());
@@ -544,8 +546,11 @@
   const std::unique_ptr<unsigned char[]> cidtogidmap(
       new unsigned char[kCIDToGIDMapSize]);
   for (int i = 0; i < kCIDToGIDMapSize; i++) {
-    cidtogidmap[i] = (i % 2) ? 1 : 0;
+    cidtogidmap[i] = (i % 2) ? 0x01 : 0x00;
   }
+  const int kSpaceCID = 20;
+  cidtogidmap[kSpaceCID * 2] = 0x00;
+  cidtogidmap[kSpaceCID * 2 + 1] = 0x02;
   size_t len;
   unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
   stream.str("");

debug.pdf
font.zip

@jbarlow83

jbarlow83 commented Feb 9, 2020

@amitdo I will look.

I'd consider using a separate Tz for the trailing space rather than modifying the font.

1.0 Tz [ <0049006E00740072006F00640075006300740069006F006E> ] TJ 0.001 Tz [ <0020> ] TJ

Seems like it would be simpler and less reliant on fonts being parsed correctly.

However I do think some artifact of the glyphlessfont is causing trouble, since using a hidden Arial (e.g. the hOCR transform method) does not have these problems for the same content stream.

@THausherr
Author

The /W entry as it is now
[screenshot]
means CID 1 has a width of 500 and CID 2 has a width of 1. I assume that all others have the default width (500). If you wanted to change the width of the space, you should have done something for CID 32.
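
A small sketch of that reading of /W, covering only the `c [w1 w2 ... wn]` form used in the patch (the specification also allows a `cfirst clast w` form, which is left out here). `WidthRun`, `ExpandWidths` and `WidthOf` are hypothetical names for illustration.

```cpp
#include <cassert>
#include <map>
#include <vector>

// One /W entry of the form: c [w1 w2 ... wn].
// The widths apply to consecutive CIDs starting at first_cid.
struct WidthRun {
  int first_cid;
  std::vector<int> widths;
};

// Expand the runs into an explicit CID -> width table.
std::map<int, int> ExpandWidths(const std::vector<WidthRun> &runs) {
  std::map<int, int> widths;
  for (const WidthRun &run : runs) {
    assert(run.first_cid >= 0);
    for (std::vector<int>::size_type i = 0; i < run.widths.size(); ++i) {
      widths[run.first_cid + static_cast<int>(i)] = run.widths[i];
    }
  }
  return widths;
}

// Width of a CID, falling back to the /DW value when /W has no entry for it.
int WidthOf(const std::map<int, int> &widths, int cid, int default_width) {
  const auto it = widths.find(cid);
  return it == widths.end() ? default_width : it->second;
}
```

Under this reading, `/W [ 1 [500 1] ]` leaves CID 32 at the default 500, while `/W [ 32 [0] ]` is what sets the width of the space to 0.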

@jbreiden3

You are correct. Result works on both Acroread & Pdfium. File attached and ready for compatibility testing. If nobody finds trouble, I'm comfortable submitting. This variant makes no changes to the font, and sets the width of space to zero.

--- pdfrenderer.cpp.orig	2019-07-07 08:23:24.000000000 -0700
+++ pdfrenderer.cpp	2020-02-09 13:26:33.743553816 -0800
@@ -468,7 +468,6 @@
     } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
     if (res_it->IsAtBeginningOf(RIL_WORD)) {
       pdf_word += "0020";
-      pdf_word_len++;
     }
     if (word_length > 0 && pdf_word_len > 0) {
       double h_stretch =
@@ -535,6 +536,7 @@
     "  /Subtype /CIDFontType2\n"
     "  /Type /Font\n"
     "  /DW " << (1000 / kCharWidth) << "\n"
+    "  /W [ 32 [0] ]\n"
     ">>\n"
     "endobj\n";
   AppendPDFObject(stream.str().c_str());
@@ -544,8 +546,11 @@
   const std::unique_ptr<unsigned char[]> cidtogidmap(
       new unsigned char[kCIDToGIDMapSize]);
   for (int i = 0; i < kCIDToGIDMapSize; i++) {
-    cidtogidmap[i] = (i % 2) ? 1 : 0;
+    cidtogidmap[i] = (i % 2) ? 0x01 : 0x00;
   }
+  const int kSpaceCID = 0x0020;
+  cidtogidmap[kSpaceCID * 2] = 0x00;
+  cidtogidmap[kSpaceCID * 2 + 1] = 0x00;
   size_t len;
   unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
   stream.str("");

testme1.pdf

@jbreiden3

@jbarlow83 The problem with hidden Arial is coverage. Tesseract supports the entire basic multilingual plane and beyond. The glyphless font is equally happy with Cherokee and English.

@amitdo
Collaborator

amitdo commented Feb 9, 2020

Chromium, Evince - the page looks good.
Firefox - no effect, the issue still exists.

@amitdo
Collaborator

amitdo commented Feb 5, 2021

We still haven't heard from Mac users. How well does this patch work with macOS Preview?

@jbarlow83

This patch unfortunately does not improve results on macOS Preview (Preview 10.1, macOS 10.14.6). Assuming I compared the right files. I did not apply the patch.

Visually, it's better:

Without patch (scan-ocr.pdf):
image

With patch (testme1.pdf):
image

However it removes spaces from the copy and paste text:

Without patch (scan-ocr.pdf):

programming language, PDF is based ona structured binary file format that is optimized for high performance in interactive viewing. PDF also includes objects, such as annotations and hypertext links, that are not part of the page content itself but are useful for interactive viewing and document interchange.

With patch (testme1.pdf):

programminglanguage,PDFisbasedona structuredbinaryfileformatthatisoptimizedforhighperformance in interactive viewing. PDF also includes objects, such as annotations and hypertext links, that are not part of thepagecontentitselfbutareusefulforinteractiveviewinganddocumentinterchange.

I compared the previously uploaded files indicated above without applying the patch.

@egorpugin
Contributor

Definitely looks like an off-by-one bug.

@amitdo
Collaborator

amitdo commented Feb 6, 2021

Maybe it does not like the zero width space, and it will honor a 1 unit width.

@shadylpstan

Hi, do we have any update on this issue?

@bbqf

bbqf commented Sep 25, 2022

I'd love to contribute and finally get the fix released, but I have no access to Mac, and as I reported earlier, the fix works for me on Windows. Is there a cross-platform way to test it? I am fine with Linux/Docker/VMs but I can't help with Mac.

@jbarlow83

@bbqf It's possible to set up a VM with a macOS guest on Windows or Linux, e.g. https://www.makeuseof.com/tag/macos-windows-10-virtual-machine/
I can often be persuaded to test new files and I have access to all platforms.

@amitdo
Collaborator

amitdo commented Sep 30, 2022

@jbarlow83,

#2879 (comment)

Can you please try to implement your suggestion and test it?

@arifd

arifd commented Mar 14, 2023

Hello!

I would like to implement this fix. Since we feed the PDFs Tesseract generates into Poppler, it's not a problem if it breaks behaviour on other PDF renderers.

But I don't want to maintain a fork of Tesseract and have to compile it myself. So my idea was to extract the essence of the fix and apply it after the fact to the PDFs that Tesseract generates. However, I am not having any success. Perhaps someone can advise me on exactly which mutation I need to carry out on the PDF in order to benefit from this fix?

I have tried the following (code examples are in Rust):

  • removing the space (U+0020) at the end of every word
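for operation in &mut content.operations {
    // Only the text-showing operators carry the word strings.
    if !matches!(operation.operator.as_ref(), "Tj" | "TJ") {
        continue;
    }
    for operand in operation.operands.iter_mut() {
        if let Object::Array(arr) = operand {
            for obj in arr.iter_mut() {
                // TJ arrays may also hold numbers, so skip non-strings
                // instead of unwrapping.
                if let Ok(bytes) = obj.as_str_mut() {
                    // ends_with avoids a slice panic on strings shorter
                    // than two bytes.
                    if bytes.ends_with(&[0, 32]) {
                        bytes.truncate(bytes.len() - 2);
                    }
                }
            }
        }
    }
}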
  • setting the width of the space character in all the fonts to 0
let fonts = doc
    .objects
    .iter()
    .filter_map(|(id, obj)| (obj.type_name() == Ok("Font")).then_some(id.to_owned()))
    .collect::<Vec<_>>();
for font in fonts {
    if let Ok(font) = doc.get_dictionary_mut(font) {
        let _32 = Object::Integer(32);
        let _0 = Object::Array(vec![Object::Integer(0)]);
        font.set(b"W".to_vec(), Object::Array(vec![_32, _0]));
    }
}

(Admittedly, I'm not sure what this part of the diff is doing.)

-    cidtogidmap[i] = (i % 2) ? 1 : 0;
+    cidtogidmap[i] = (i % 2) ? 0x01 : 0x00;
   }
+  const int kSpaceCID = 0x0020;
+  cidtogidmap[kSpaceCID * 2] = 0x00;
+  cidtogidmap[kSpaceCID * 2 + 1] = 0x00;

Neither option results in any visible difference in the PDF for me. Can anyone advise?
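
On the `cidtogidmap` part of the diff: as far as the PDF specification describes it, a /CIDToGIDMap stream stores a two-byte glyph index for each CID `c` at bytes `2*c` and `2*c + 1`, high-order byte first. The sketch below assumes a table with 65536 entries (the value of `kCIDToGIDMapSize` is not shown in this thread) and uses the hypothetical helpers `BuildCidToGidMap` and `GidForCid`. It suggests that the loop maps every CID to glyph 1, and that the added lines remap only CID 0x0020 to glyph 0.

```cpp
#include <cassert>
#include <vector>

// Assumption: one entry per 16-bit CID, two bytes per entry.
constexpr int kNumCids = 1 << 16;
constexpr int kCIDToGIDMapSize = 2 * kNumCids;
constexpr int kSpaceCID = 0x0020;

// Build the table the way the loop in the diff does. Every even byte is 0x00
// and every odd byte is 0x01, so each CID points at glyph index 0x0001.
std::vector<unsigned char> BuildCidToGidMap(bool remap_space) {
  std::vector<unsigned char> map(kCIDToGIDMapSize);
  for (int i = 0; i < kCIDToGIDMapSize; i++) {
    map[i] = (i % 2) ? 0x01 : 0x00;
  }
  if (remap_space) {
    // The lines added by the second patch: CID 0x0020 -> glyph 0.
    map[kSpaceCID * 2] = 0x00;
    map[kSpaceCID * 2 + 1] = 0x00;
  }
  return map;
}

// Read the big-endian glyph index stored for a CID.
int GidForCid(const std::vector<unsigned char> &map, int cid) {
  assert(cid >= 0 && cid < kNumCids);
  return (map[cid * 2] << 8) | map[cid * 2 + 1];
}
```

The change from `1 : 0` to `0x01 : 0x00` in the diff is only notation, so the behavioural part is the remapping of the space. Since that map is a compressed stream inside the PDF, editing the /W array alone does not reproduce the whole patch.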

@jbarlow83

At this point I believe improvements would come from having Tesseract generate tagged PDFs with structural markup that indicates word boundaries. @arifd that would mean implementing section 14.8 of the PDF RM.

@westner

westner commented Feb 23, 2024

Any chance that this issue will be fixed sometime, after 3 years?

@egorpugin
Contributor

Try this file.
scan.pdf

If the word boundaries look "kinda" OK, I'll commit this one-off fix.
PDF viewers do not highlight the space, so we should not add it to the word.

index 6b9e248d..fb4bcd2f 100644
--- a/src/api/pdfrenderer.cpp
+++ b/src/api/pdfrenderer.cpp
@@ -466,7 +466,6 @@ char *TessPDFRenderer::GetPDFTextObjects(TessBaseAPI *api, double width, double
     } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
     if (res_it->IsAtBeginningOf(RIL_WORD)) {
       pdf_word += "0020";
-      pdf_word_len++;
     }
     if (word_length > 0 && pdf_word_len > 0) {
       double h_stretch = kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));

@westner

westner commented Feb 23, 2024

In my opinion it looks good now.

@THausherr
Author

It's an improvement.

@amitdo
Collaborator

amitdo commented Feb 26, 2024

This patch was rejected before. See #3139.

@jbarlow83

It works significantly better to output the word and the space separately, and to use horizontal scaling to calculate the width of the space so that it falls exactly between the end of the current word and the beginning of the next word.

The "words mixed together" issue happens because the actual position of the space overlaps the word boxes instead of being between them, especially if the word is particularly wide or narrow (wwwwwww vs iiiiii). So the PDF renderers are accurately reporting what they "see".
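
A rough sketch of that calculation, not the OCRmyPDF implementation. It assumes horizontal left-to-right text, a glyphless font in which every glyph advances 1/`kCharWidth` em (consistent with /DW 500), and coordinates already in PDF units. `Scaling` and `SplitWordAndSpace` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Assumption: each glyph advances 1/kCharWidth em at 100% scaling.
constexpr int kCharWidth = 2;

struct Scaling {
  double word_tz;   // Tz operand (percent) for the glyphs of the word
  double space_tz;  // Tz operand (percent) for the single space glyph
};

// Scale the word to span exactly [word_left, word_right] and the space to
// span exactly the gap up to next_word_left.
Scaling SplitWordAndSpace(double word_left, double word_right,
                          double next_word_left, double fontsize,
                          int glyph_count) {
  assert(fontsize > 0.0 && glyph_count > 0);
  const double advance = fontsize / kCharWidth;  // unscaled glyph advance
  // Overlapping boxes leave no room for a space, so clamp the gap at zero.
  const double gap = std::fmax(0.0, next_word_left - word_right);
  Scaling s;
  s.word_tz = 100.0 * (word_right - word_left) / (advance * glyph_count);
  s.space_tz = 100.0 * gap / advance;
  return s;
}
```

The word would then be shown with one `Tz`/`TJ` pair and the space with a second pair, so the space no longer influences the scaling of the word. When the gap is zero a caller would probably skip the space or use a tiny positive scale; that choice is left open here.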

I implemented this in OCRmyPDF's hOCR based renderer. I won't have time to add it to Tesseract for a few months but that is how to move forward.

The next thing to do, I believe, is to add a double-width character and a negative-displacement character to the GlyphlessFont, to better handle Asian and RTL scripts respectively.

amitdo reopened this on Feb 26, 2024
@amitdo
Collaborator

amitdo commented Mar 11, 2024

@stweil, @zdenop, @egorpugin

Let's continue the discussion that started in
#3673 (comment).

See the following comments.

@amitdo
Collaborator

amitdo commented Mar 11, 2024

"Different viewer behavior means that someone is correct and others are not."

No. It means that the PDF format is very complex and the spec itself is not clear even to PDF experts.

Also, every PDF viewer uses its own 'clever guesses' techniques for some features of the format. This is very relevant here.

@amitdo
Collaborator

amitdo commented Mar 11, 2024

@stweil, I don't like the patch that Egor applied, but if you explicitly say you have no issue with it, I will stop talking about it.

@egorpugin
Contributor

Why don't you like it?
Do we have a better patch?

@amitdo
Collaborator

amitdo commented Mar 11, 2024

"Why don't you like it?"

Because it is not good for Apple's Preview. Evince also has some issues with it.

@egorpugin
Contributor

It does not matter.
The patch fixes an incorrect word length (one too many). For example, the word 'a' had a length of 2.
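
The effect of the extra count can be sketched numerically. This assumes the stretch formula from the diff above with `kCharWidth` equal to 2 (which matches /DW 500) and leaves out the `prec` rounding. `HStretch` and `RenderedWidth` are hypothetical helper names, not Tesseract functions.

```cpp
#include <cassert>

// Assumption: each glyph advances 1/kCharWidth em at 100% scaling.
constexpr int kCharWidth = 2;

// Horizontal scaling (Tz operand, in percent) that makes counted_glyphs
// glyphs span word_length at the given font size.
double HStretch(double word_length, double fontsize, int counted_glyphs) {
  assert(fontsize > 0.0 && counted_glyphs > 0);
  return kCharWidth * 100.0 * word_length / (fontsize * counted_glyphs);
}

// Width covered by the first `glyphs` glyphs under that scaling.
double RenderedWidth(int glyphs, double fontsize, double h_stretch) {
  return glyphs * (fontsize / kCharWidth) * (h_stretch / 100.0);
}
```

For a one-letter word that is 10 units wide at font size 10, counting 2 glyphs gives a scaling of 100%, so the letter covers only 5 units and the space sits on the other 5. Counting 1 glyph gives 200%, so the letter covers the full 10 units and the space falls after the word box.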

@amitdo
Collaborator

amitdo commented Mar 11, 2024

For now, a better alternative is to keep the status quo (the code before the latest applied patch). Although the text selection looks somewhat ugly (off by one), copy and paste and search functionality work better in Apple's Preview and column selection works better in Evince.
