Invisible glyph bounds at wrong positions in PDF #2879
Comments
Interesting suggestion. If correct, why would it show up as an n - 1
problem in highlighting?
|
The glyphless font deliberately uses equal width for every character. I
stretch the word using Tz in the PDF to make it fit. So I expect word
highlighting to look correct, but not character highlighting within a word.
This design was chosen to maximize compatibility across all the scripts
supported by Tesseract while minimizing complexity.
|
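To make the Tz stretch described above concrete, here is a minimal sketch of the arithmetic. It assumes every glyph advances 500/1000 em (the `/DW 500` value quoted elsewhere in this thread); the function name and parameters are hypothetical and are not taken from pdfrenderer.cpp.

```cpp
#include <cassert>

// Assumption: every glyph in the glyphless font advances 500/1000 em,
// matching the "/DW 500" entry quoted in this thread.
constexpr double kGlyphAdvanceEm = 0.5;

// Returns the Tz operand (a percentage) that makes `glyph_count` equal-width
// glyphs set at `font_size` points span exactly `word_width` points.
double HorizontalStretchPercent(double word_width, double font_size,
                                int glyph_count) {
  assert(font_size > 0.0 && glyph_count > 0);
  // Width of the run before any horizontal scaling is applied.
  const double natural_width = glyph_count * font_size * kGlyphAdvanceEm;
  return 100.0 * word_width / natural_width;
}
```

Because the stretch is computed for the run as a whole, the word box can be matched exactly while the individual character boxes inside it are only an even split of that box.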
I think I found a bit more... "Introduction" has 12 characters but looks like this in the PDF content stream: |
I just noticed that the PDFBox screenshot shows it too: "ISO" has 4 characters, "32000" has 6 characters. Maybe the original idea was to put the space there for text extraction? However, it isn't needed; good text extractors "imagine" the space from the position differences. If the space character is needed, then it should be positioned over the actual space. |
Thanks, after reading that one, I think this issue is also somewhat of a duplicate of ocrmypdf/OCRmyPDF#450. |
You should check the bounding box of the whole word 'Introduction' with the hocr format. Does it also end before the last glyph? |
Tesseract's recognizer just finds words, and doesn't tell us anything about
spaces. Which makes sense: how would an OCR program know if there is one
space, two spaces, etc? We add the space in during PDF generation to help
some viewers with copy-paste; otherwise it is common for words to run
together. Apple's viewer is notorious for this. I'm a little reluctant to
put a space outside the word bounding box - there is no guarantee there
will be room for it, and I don't really want the PDF output module to get
into the layout analysis game. One possibility might be to play with the
font such that U+0020 gets zero (or non-zero) width, while every other
character maintains the same fixed width we've always had. Then adjust the
Tz word stretch appropriately.
https://github.com/tesseract-ocr/tesseract/blob/master/src/api/pdfrenderer.cpp#L471
I haven't touched the font in a while, so not sure how easy it is to make a
change like this. If you want to play with this yourself, I recommend using
the program "ttx" from fonttools to transform the font into an XML file.
Edit the file, then transform it back. I have a feeling it won't be trivial
but it might be possible. See also the design discussion at the top of
pdfrenderer.cpp, which explains how everything works.
|
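The off-by-one highlighting follows from counting the trailing U+0020 inside the word's bounding box. Here is a small sketch of that arithmetic, assuming a viewer splits the stretched run evenly across its glyphs; the function names and the box width in the usage note are illustrative only.

```cpp
#include <cassert>

// Width a viewer assigns to each glyph when a run of equal-width glyphs is
// stretched to `box_width`: an even split of the box.
double PerGlyphWidth(double box_width, int glyph_count) {
  assert(glyph_count > 0);
  return box_width / glyph_count;
}

// Width covered by the visible letters when one extra glyph (the trailing
// U+0020) is counted inside the word's bounding box.
double VisibleWidthWithTrailingSpace(double box_width, int letter_count) {
  return letter_count * PerGlyphWidth(box_width, letter_count + 1);
}
```

For "Introduction" (12 letters) in a hypothetical 130-unit box, the run holds 13 glyphs of 10 units each, so selecting the 12 letters covers 120 units and the last slot, which sits over the visible "n", belongs to the space.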
Yeah, I understand that this feature was implemented to "help" low-quality text extractors. How about making the feature configurable for PDF? IMHO the expectation of most users is whatever Adobe does; that is the gold standard. A zero-width space also sounds like an interesting idea to explore. You probably have to add appropriate /W entries. (The reason I created this issue: we're using a commercial OCR tool on a project that is growing fast. The OCR is fine, but licensing is a pain, it doesn't use all CPU cores, and the logging is almost nonexistent; the whole thing is a black box. So I was thinking about replacing it with tesseract, but before we discuss this with the client I need to be sure that the client, and its clients too, would be satisfied.) |
@amitdo The bounding box is correct:
|
Adobe Acrobat is not as popular as it used to be 10 years ago. Default PDF viewers:
So most users will use the OS/browser's built-in PDF viewer, which is not Adobe's viewer. The best solution is to find a method that will work on all these viewers, without a special parameter for a specific viewer. |
I tested your pdf file with Chromium (pdfium), Firefox (pdf.js) and Evince (poppler). The words bounding boxes look very good when the page is viewed with pdfium/pdf.js. Poppler suffers from the same issue you raised above combined with a 'zebra effect'. |
With PDF.js on Firefox, double-click marks the whole word; when I mark the final "n", I get a space. With Chrome, double-click shows the same effect as with Adobe Reader. With MS Edge, same effect as with PDF.js. |
I took a look at the code. It looks like one can pretty easily remap U+0020
to an alternate glyph in the cidtogmap. It's been five years since the last
significant change, and my memory is terrible, but I'm confident we
currently map everything down to a single "glyph" in the font. That
slightly misleading code at line 549 of pdfrenderer.cpp is just filling out
the 2 byte entries one byte at a time.
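As an illustration of "filling out the 2 byte entries one byte at a time", here is a sketch that builds the same kind of table with explicit 16-bit entries. It assumes the map covers CIDs 0x0000 through 0xFFFF with one big-endian glyph index each; the helper names are hypothetical and not part of pdfrenderer.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: one 2-byte big-endian glyph index per CID, CIDs 0x0000..0xFFFF.
constexpr std::size_t kNumCids = 0x10000;

// Builds a map that sends every CID to `default_gid`.
std::vector<unsigned char> BuildCidToGidMap(std::uint16_t default_gid) {
  std::vector<unsigned char> map(kNumCids * 2);
  for (std::size_t cid = 0; cid < kNumCids; ++cid) {
    map[cid * 2] = static_cast<unsigned char>(default_gid >> 8);  // high byte
    map[cid * 2 + 1] = static_cast<unsigned char>(default_gid & 0xFF);  // low
  }
  return map;
}

// Points a single CID at a different glyph index.
void RemapCid(std::vector<unsigned char>& map, std::uint32_t cid,
              std::uint16_t gid) {
  assert(cid < kNumCids);
  map[cid * 2] = static_cast<unsigned char>(gid >> 8);
  map[cid * 2 + 1] = static_cast<unsigned char>(gid & 0xFF);
}

// Reads back the glyph index stored for `cid`.
std::uint16_t GidForCid(const std::vector<unsigned char>& map,
                        std::uint32_t cid) {
  assert(cid < kNumCids);
  return static_cast<std::uint16_t>((map[cid * 2] << 8) | map[cid * 2 + 1]);
}
```

With a default of glyph 1, `RemapCid(map, 0x0020, 0)` changes only the entry for U+0020. Note that the index must be hexadecimal 0x0020 (decimal 32); a decimal `20` addresses CID 20 instead.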
So then there's the question of adding another glyph to the font. The
design notes from Ken say we've got an unused glyph at index 0. Unused
because it gives heartburn to the Adobe parser. And then one at index one
which is used everywhere. It's not quite trivial, but I don't yet see any
reason we can't add another entry at index 2 that is identical or near to
the entry in index 1. This means transforming the font to xml using ttx from
fonttools, doing some careful copy pasting, transforming it back, and
hoping nothing too scary happens.
Next there is the question of assigning the zero width (or near zero
width) to just that new entry. As of right now, I'm not sure exactly how to
do that. But I think Tilman's suggestion of adding a /W array to the
/CIDFont dictionary is the first thing to try. (Currently line 526 in
pdfrenderer.cpp). There's probably a spot inside the font as well to specify
width, that we'll want to also set, for consistency, compatibility, and
minimal confusion.
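For reference, here is a sketch of how a width would be resolved from /DW and a /W array written in the `c [w1 w2 ... wn]` form, where widths are keyed by CID. This is a simplified model of the PDF rules, not viewer code; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// One "/W" run in the "c [w1 w2 ... wn]" form: widths for consecutive CIDs
// starting at `first_cid`, in 1/1000 em.
struct WidthRun {
  int first_cid;
  std::vector<int> widths;
};

// Resolves the advance width for `cid`: an explicit /W entry wins,
// otherwise the /DW default applies.
int WidthForCid(const std::vector<WidthRun>& w_array, int default_width,
                int cid) {
  assert(default_width >= 0);
  for (const WidthRun& run : w_array) {
    const int offset = cid - run.first_cid;
    if (offset >= 0 && offset < static_cast<int>(run.widths.size())) {
      return run.widths[offset];
    }
  }
  return default_width;
}
```

Under this model, `/W [ 32 [0] ]` with `/DW 500` gives the space (CID 32) a width of 0 and leaves every other CID at 500. An entry such as `/W [ 0 [0 500] ]` only describes CIDs 0 and 1, so a viewer that follows this reading would still give CID 32 the default width.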
Finally, I already mentioned that the bounding box stretch can be computed
without considering the U+0020, which is basically removing line 471 from
pdfrenderer.cpp. After that - if it works at all - then just compatibility
testing with various renderers.
I really don't know if this will work or not, but there's a chance, and
it's my best suggestion for what to try. Might make sense to contact Ken
Sharp and see if he has an opinion on the topic. Tilman, I know it's a lot
of work but if you want to try this, you will probably get it done
significantly faster than me. (Unlike 5 years ago, my day job does not
currently intersect with PDF. That doesn't totally stop me, but it does
slow things down quite a lot.)
|
Thanks for the nice comment; my problem is that I haven't done C/C++ for almost 10 years except maintenance of my existing software. I don't even have a dev system up that supports current language standards, so I would have to install / understand / learn that first. However, I'll keep this issue in mind when I have more time at work (because this is a work issue). |
@jbarlow83, maybe you can help us here. |
I'll spend a little time right now and see what I can do.
|
I tried the simplest thing possible, leaving the font alone and trying to
use that glyph at index 0. I expected Adobe Reader to completely choke, and
Pdfium/Chrome to work great. Instead, my ancient copy of Adobe Reader 9.5.5
(e.g. the one for Linux) works fine. However, Pdfium/Chrome is highlighting
beyond the end of the word. That's what you would expect if Pdfium was
ignoring the zero width on index 0.
--- pdfrenderer.cpp.orig 2019-07-07 08:23:24.000000000 -0700
+++ pdfrenderer.cpp 2020-02-09 11:18:40.578544848 -0800
@@ -535,6 +536,7 @@
" /Subtype /CIDFontType2\n"
" /Type /Font\n"
" /DW " << (1000 / kCharWidth) << "\n"
+ " /W [ 0 [0 500] ]\n"
">>\n"
"endobj\n";
AppendPDFObject(stream.str().c_str());
@@ -546,6 +548,8 @@
for (int i = 0; i < kCIDToGIDMapSize; i++) {
cidtogidmap[i] = (i % 2) ? 1 : 0;
}
+ const int kSpaceCID = 20;
+ cidtogidmap[kSpaceCID * 2 + 1] = 0;
size_t len;
unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize,
&len);
stream.str("");
|
Alternative: (tesseract hocr) + (hocr-pdf (https://github.com/ImageProcessing-ElectronicPublications/hocr-tools)). |
Tried modifying the font to add a specific entry for U+0020. Same results, Adobe good, pdfium bad. This is the point where I pause, and people take a look for mistakes. If nobody finds anything, the next step is probably asking for help. That's Ken Sharp about the overall approach & especially the font, and Pdfium folks to help debug why the /W entry does not appear to be honored.
--- pdfrenderer.cpp.orig 2019-07-07 08:23:24.000000000 -0700
+++ pdfrenderer.cpp 2020-02-09 12:00:57.961541649 -0800
@@ -468,7 +468,6 @@
} while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
if (res_it->IsAtBeginningOf(RIL_WORD)) {
pdf_word += "0020";
- pdf_word_len++;
}
if (word_length > 0 && pdf_word_len > 0) {
double h_stretch =
@@ -535,6 +536,7 @@
" /Subtype /CIDFontType2\n"
" /Type /Font\n"
" /DW " << (1000 / kCharWidth) << "\n"
+ " /W [ 1 [500 1] ]\n"
">>\n"
"endobj\n";
AppendPDFObject(stream.str().c_str());
@@ -544,8 +546,11 @@
const std::unique_ptr<unsigned char[]> cidtogidmap(
new unsigned char[kCIDToGIDMapSize]);
for (int i = 0; i < kCIDToGIDMapSize; i++) {
- cidtogidmap[i] = (i % 2) ? 1 : 0;
+ cidtogidmap[i] = (i % 2) ? 0x01 : 0x00;
}
+ const int kSpaceCID = 20;
+ cidtogidmap[kSpaceCID * 2] = 0x00;
+ cidtogidmap[kSpaceCID * 2 + 1] = 0x02;
size_t len;
unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
stream.str(""); |
@amitdo I will look. I'd consider using a separate
Seems like it would be simpler and less reliant on fonts being parsed correctly. However I do think some artifact of the glyphlessfont is causing trouble, since using a hidden Arial (e.g. the hOCR transform method) does not have these problems for the same content stream. |
You are correct. Result works on both Acroread & Pdfium. File attached and ready for compatibility testing. If nobody finds trouble, I'm comfortable submitting. This variant makes no changes to the font, and sets the width of space to zero.
--- pdfrenderer.cpp.orig 2019-07-07 08:23:24.000000000 -0700
+++ pdfrenderer.cpp 2020-02-09 13:26:33.743553816 -0800
@@ -468,7 +468,6 @@
} while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
if (res_it->IsAtBeginningOf(RIL_WORD)) {
pdf_word += "0020";
- pdf_word_len++;
}
if (word_length > 0 && pdf_word_len > 0) {
double h_stretch =
@@ -535,6 +536,7 @@
" /Subtype /CIDFontType2\n"
" /Type /Font\n"
" /DW " << (1000 / kCharWidth) << "\n"
+ " /W [ 32 [0] ]\n"
">>\n"
"endobj\n";
AppendPDFObject(stream.str().c_str());
@@ -544,8 +546,11 @@
const std::unique_ptr<unsigned char[]> cidtogidmap(
new unsigned char[kCIDToGIDMapSize]);
for (int i = 0; i < kCIDToGIDMapSize; i++) {
- cidtogidmap[i] = (i % 2) ? 1 : 0;
+ cidtogidmap[i] = (i % 2) ? 0x01 : 0x00;
}
+ const int kSpaceCID = 0x0020;
+ cidtogidmap[kSpaceCID * 2] = 0x00;
+ cidtogidmap[kSpaceCID * 2 + 1] = 0x00;
size_t len;
unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
stream.str(""); |
@jbarlow83 The problem with hidden Arial is coverage. Tesseract supports the entire basic multilingual plane and beyond. The glyphless font is equally happy with Cherokee and English. |
Chromium, Evince - the page looks good. |
We still didn't hear from Mac users. How well does this patch work with macOS Preview? |
This patch unfortunately does not improve results on macOS Preview (Preview 10.1, macOS 10.14.6). Assuming I compared the right files. I did not apply the patch. Visually, it's better: However it removes spaces from the copy and paste text: Without patch (scan-ocr.pdf):
With patch (testme1.pdf):
I compared the previously uploaded files indicated above without applying the patch. |
Definitely looks like an off-by-one bug. |
Maybe it does not like the zero width space, and it will honor a 1 unit width. |
Hi, do we have any update on this issue? |
I'd love to contribute and finally get the fix released, but I have no access to Mac, and as I reported earlier, the fix works for me on Windows. Is there a cross-platform way to test it? I am fine with Linux/Docker/VMs but I can't help with Mac. |
@bbqf It's possible to set up VM for macOS guest on Windows or Linux. e.g. https://www.makeuseof.com/tag/macos-windows-10-virtual-machine/ |
Can you please try to implement your suggestion and test it? |
Hello! I would like to implement this fix. Since we feed the PDFs Tesseract generates into Poppler, it's not a problem if it breaks behaviour on other PDF renderers. But I don't want to maintain a fork of Tesseract and have to compile it myself. So my idea was to extract the essence of the fix and apply it after the fact, to the PDFs that Tesseract generates. However, I am not having success. Perhaps someone can advise me on exactly the mutation I need to carry out on the PDF in order to benefit from this fix? I have tried: (code examples are in Rust)
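for operation in &mut content.operations {
    match operation.operator.as_ref() {
        "Tj" | "TJ" => {
            for operand in operation.operands.iter_mut() {
                // Only array operands (the TJ form) are handled here.
                if let Object::Array(arr) = operand {
                    for obj in arr {
                        // TJ arrays mix strings with numeric adjustments;
                        // skip anything that is not a string.
                        if let Ok(bytes) = obj.as_str_mut() {
                            // Drop a trailing U+0020 (bytes 0x00 0x20).
                            if bytes.ends_with(&[0, 32]) {
                                bytes.truncate(bytes.len() - 2);
                            }
                        }
                    }
                }
            }
        }
        _ => {}
    }
}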
let fonts = doc
.objects
.iter()
.filter_map(|(id, obj)| (obj.type_name() == Ok("Font")).then_some(id.to_owned()))
.collect::<Vec<_>>();
for font in fonts {
if let Ok(font) = doc.get_dictionary_mut(font) {
let _32 = Object::Integer(32);
let _0 = Object::Array(vec![Object::Integer(0)]);
font.set(b"W".to_vec(), Object::Array(vec![_32, _0]));
}
}
(Admittedly, I'm not sure what this part of the diff is doing:)
- cidtogidmap[i] = (i % 2) ? 1 : 0;
+ cidtogidmap[i] = (i % 2) ? 0x01 : 0x00;
}
+ const int kSpaceCID = 0x0020;
+ cidtogidmap[kSpaceCID * 2] = 0x00;
+ cidtogidmap[kSpaceCID * 2 + 1] = 0x00;
Neither option results in any visible difference in the PDF for me. Can anyone advise? |
At this point I believe improvements would come from having Tesseract generate tagged PDFs with structural markup that indicates word boundaries. @arifd that would mean implementing section 14.8 of the PDF RM. |
Any chance that this issue will be fixed sometime, after 3 years? |
Try this file. If word boundaries look "kinda" ok, I'll commit this one-off fix.
|
In my opinion it looks good now. |
It's an improvement. |
This patch was rejected before. See #3139. |
It works significantly better to output the word and space separately, and use horizontal scaling to calculate the width of the space so it falls exactly between the end of the current word and the beginning of the next word. The "words mixed together" issue happens because the actual position of the space will overlap the word boxes instead of being between them, especially if the word is particularly wide or narrow (wwwwwww vs iiiiii). So the PDF renderers are accurately reporting what they "see". I implemented this in OCRmyPDF's hOCR-based renderer. I won't have time to add it to Tesseract for a few months, but that is how to move forward. The next thing to do, I believe, is to add a double-width character and a negative-displacement character to the GlyphlessFont, to better handle Asian and RTL scripts respectively. |
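A rough sketch of the "space between the boxes" idea, under stated assumptions: left-to-right text on a single line, a glyph advance of 500/1000 em, and word boxes given as left/right x coordinates. The types and names are hypothetical; this is not OCRmyPDF's or Tesseract's code.

```cpp
#include <cassert>

// Assumption: every glyph in the glyphless font advances 500/1000 em.
constexpr double kGlyphAdvanceEm = 0.5;

// Horizontal extent of a recognized word, in PDF units.
struct WordBox {
  double left;
  double right;
};

// Where to draw the separate space glyph and how much to stretch it.
struct SpacePlacement {
  bool emit;   // false when the words touch or overlap, so no space fits
  double x;    // x position where the space glyph starts
  double tz;   // Tz percentage applied to the single space glyph
};

// Fits one space glyph exactly into the gap between two adjacent words.
SpacePlacement PlaceSpace(const WordBox& current, const WordBox& next,
                          double font_size) {
  assert(font_size > 0.0);
  const double gap = next.left - current.right;
  if (gap <= 0.0) {
    return {false, current.right, 0.0};
  }
  const double natural_width = font_size * kGlyphAdvanceEm;
  return {true, current.right, 100.0 * gap / natural_width};
}
```

The word itself would then be stretched using only its own letters, so neither the word run nor the space run overlaps a neighbouring box.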
Let's continue the discussion that started in See the following comments. |
No. It means that the PDF format is very complex and the spec itself is not clear even for PDF experts. Also, every PDF viewer uses its own 'clever guesses' techniques for some features of the format. This is very relevant here. |
@stweil, I don't like the patch which Egor applied, but if you will explicitly say you have no issue with it, I will stop talking about it. |
Why don't you like it? |
Because it is not good for Apple's Preview. Evince also has some issues with it. |
It does not matter. |
For now, a better alternative is to keep the status quo (the code before the latest applied patch). Although the text selection looks somewhat ugly (off by one), copy and paste and search functionality work better in Apple's Preview and column selection works better in Evince. |
Environment
(downloaded from https://digi.bib.uni-mannheim.de/tesseract/ )
Call:
"C:\Program Files\Tesseract-OCR\tesseract" scan.tif scan-ocr pdf
Current Behavior:
text bounds are not identical to visible glyphs in Adobe Reader. Example:
Expected Behavior:
text bounds should be identical to visible glyphs in Adobe Reader. In the graphic, the blue color should cover the "n".
Suggested Fix:
I suspect that the /W array is missing in the font dictionary:
So Adobe will use the /DW 500 entry (screenshot from PDF 32000 specification):
scan-ocr.pdf
scan.tif.zip