searching with ctrl+f doesn't work with two words #9736
Comments
I would love to work on this, |
I would suggest first checking what we have in the text layer, because that may explain why the search is not working. My guess is that the space factor is not correct; see: Line 1303 in 7bb0664
This is most likely also the cause of many other open text selection issues. However, changing the value may break other PDF files, so it would require thorough testing. We may need to check how other open source PDF viewers (such as Poppler) handle this, because the PDF specification does not indicate when a space must be used for text selection. It only defines the spacing width between characters. |
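To make the space-factor idea concrete, here is a minimal sketch of a gap-based heuristic: a space is emitted whenever the horizontal gap between two glyphs reaches some fraction of the font's space width. The `Glyph` type, the `join_glyphs` function, and the factor values are all hypothetical and chosen for illustration; this is not the pdf.js code referenced above, nor its actual threshold.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph, in text-space units (assumed)
    width: float  # advance width of the glyph, same units (assumed)


def join_glyphs(glyphs, space_width, space_factor):
    """Concatenate glyphs, inserting a space wherever the gap to the
    previous glyph is at least space_factor * space_width.

    Hypothetical helper for illustration only.
    """
    out = []
    prev_end = None
    for g in glyphs:
        if prev_end is not None and g.x - prev_end >= space_factor * space_width:
            out.append(" ")
        out.append(g.char)
        prev_end = g.x + g.width
    return "".join(out)


# Made-up glyph positions: "two" ends at x=17 and "words" starts at x=20,
# so the only non-zero gap is 3 units.
glyphs = [
    Glyph("t", 0, 5), Glyph("w", 5, 7), Glyph("o", 12, 5),
    Glyph("w", 20, 7), Glyph("o", 27, 5), Glyph("r", 32, 4),
    Glyph("d", 36, 5), Glyph("s", 41, 4),
]

loose = join_glyphs(glyphs, space_width=5.0, space_factor=0.3)
strict = join_glyphs(glyphs, space_width=5.0, space_factor=1.0)
```

With the smaller factor the 3-unit gap counts as a space; with the larger one the two words are fused, which is the kind of text-layer content that would make a two-word search fail. The same trade-off is why changing the factor needs testing against many files: a value that separates words in one PDF may split letters within a word in another.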
Unfortunately #9736 (comment) won't help here, since this is a scanned file where every word is positioned individually with different font sizes and x/y coordinates; see e.g. the beginning of the
|
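For files like this one, where each word is its own positioned item, a per-glyph space factor has nothing to act on. A rough sketch of the alternative is to cluster word items into lines by baseline and join them with spaces before searching. Everything here (`Word`, `group_into_lines`, `find_phrase`, the tolerance, and the sample coordinates) is a hypothetical illustration, not how pdf.js builds its text layer.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float          # left edge of the word
    y: float          # baseline; PDF coordinates grow upward
    font_size: float


def group_into_lines(words, y_tolerance=0.5):
    """Cluster words whose baselines differ by at most
    y_tolerance * font_size, then join each cluster left to right."""
    lines = []
    # Visit words from the top of the page downward.
    for w in sorted(words, key=lambda w: (-w.y, w.x)):
        if lines and abs(lines[-1][-1].y - w.y) <= y_tolerance * w.font_size:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    ]


def find_phrase(words, phrase):
    """Return the reconstructed lines containing the phrase,
    ignoring case and runs of whitespace in the query."""
    needle = " ".join(phrase.lower().split())
    return [line for line in group_into_lines(words) if needle in line.lower()]


# Made-up OCR output: slightly different baselines and font sizes per word.
words = [
    Word("AUTOPILOT", 120, 700.4, 11),
    Word("CENTURY", 40, 700.0, 12),
    Word("IIB", 95, 699.2, 10),
    Word("SERVICE", 40, 680.0, 12),
    Word("MANUAL", 100, 680.6, 11.5),
]

lines = group_into_lines(words)
hits = find_phrase(words, "century  iib")
```

The tolerance is scaled by font size because OCR output rarely places words of one line on exactly the same baseline. This simple chaining does not handle skewed scans or multi-column layouts.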
Hello guys, as I'm sure you're aware, other PDF rendering projects suffer from this as well. I am currently using a web app (Nextcloud) that employs pdf.js as the PDF renderer for its browser application. Here's an example of a file that I have worked with in other utilities. This is a scanned excerpt from an aircraft's autopilot service manual, originally printed in the 1970s on unknown equipment. CenturyIIB-origscan.pdf
The first file is the original scan without a text layer. The second (hocr-uncleaned) is a PDF/A that has been processed with Tesseract (v4.0) to create a hidden text layer. The third (hocr-cleaned) has been de-skewed with unpaper (v6.1), then OCR'd with the same version of Tesseract and also output as a PDF/A. In both PDF/A cases the original scan has been transcoded to 300 dpi JPEG for the final output, and Tesseract's 'hocr' renderer was used for the OCR rendering stage (Tesseract has multiple internal renderers). If you take a look at Tesseract's issue tracker on GitHub, you'll see they have made some changes to their more recent renderer in an attempt to tackle this issue as well.
Here are some excerpts copied and pasted from various utilities...
hocr-uncleaned on Safari 11.1 (13605.1.33.1.4)
hocr-uncleaned on Chrome 66.0.3359.181
hocr-uncleaned on Adobe Acrobat Pro X
hocr-uncleaned on pdf.js (Firefox 60.0.1)
hocr-cleaned on the same version of Safari above
hocr-cleaned on the same version of Chrome above
hocr-cleaned on the same version of Adobe Acrobat Pro above
hocr-cleaned on the same version of pdf.js (Firefox) above
For anyone who might want to reproduce my toolchain for other sample files (main/dependency):
- tesseract 4.00.00alpha (for OCR)
- unpaper 6.1 (for de-skew, de-noise, etc.)
- qpdf 8.0.1 (for inspection/modification/creation of PDFs)
- OCRmyPDF 6.2.0 (Python 3 wrapper for the above utilities)
All of the above are in virtually any common Linux package repo, OCRmyPDF is in pip, and modern builds of all of them are in Homebrew for OSX as well (tesseract must be built from their git HEAD, since v4.0 is still marked beta). I have also run them all on FreeBSD (you must build Tesseract, Leptonica, and unpaper from source). Tesseract/Leptonica is a great baseline for making such test files, in my opinion. They've brought open source OCR forward by leaps and bounds. Here is an example from a scan of an 18th century document, on which it does an admirable job despite not knowing what a long s is and transcribing each one as a lowercase 'f'. |
WFM, most likely fixed by PR #13257. |
Attach (recommended) or Link to PDF file here:
dee752ed0f726d8785abf360ca783d91f96f9a2e.pdf
Configuration:
Steps to reproduce the problem:
pdftotext shows the correct text:
It works in Chrome's built-in PDF viewer, so it's not a problem with the PDF.
Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):
https://newspapers.lib.utah.edu/pdfjs1.9/web/viewer.html?file=/udn_files/de/e7/dee752ed0f726d8785abf360ca783d91f96f9a2e.pdf