searching with ctrl+f doesn't work with two words #9736

jazztickets · 2018-05-17T15:51:13Z

Attach (recommended) or Link to PDF file here:
dee752ed0f726d8785abf360ca783d91f96f9a2e.pdf

Configuration:

Web browser and its version: Firefox 60/Chromium 66
Operating system and its version: Linux/Windows 7
PDF.js version: v1.10.88 or v1.9.426 or the version built into Firefox 60

Steps to reproduce the problem:

Hit ctrl+f and search for "pioneer of"
Pioneer will be highlighted, but as soon as you type a space no results are found

pdftotext shows the correct text:

in nit ris hington 1D C
boerge W lacan a pioneer of butali
and an influential man aw at richfield last walk

It works in chrome's built-in PDF viewer, so it's not a problem with the pdf.

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):
https://newspapers.lib.utah.edu/pdfjs1.9/web/viewer.html?file=/udn_files/de/e7/dee752ed0f726d8785abf360ca783d91f96f9a2e.pdf

AbhimanyuVashisht · 2018-05-20T21:45:40Z

I would love to work on this.
@timvandermeij, could you point me to where I should start on this issue?

timvandermeij · 2018-05-21T11:36:21Z

I would suggest first checking what we have in the text layer, because that may explain why the search is not working. My guess is that the space factor is not correct; see:

pdf.js/src/core/evaluator.js

Line 1303 in 7bb0664

var SPACE_FACTOR = 0.3;

This is most likely also the cause of many other open text selection issues. However, changing the value may break other PDF files and would require good testing. We may need to check how other open source PDF viewers (such as Poppler) do this, because the problem is that the PDF specification does not indicate when a space must be used for text selection. It only defines the spacing width between characters.
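To make the idea of a space factor concrete, here is a minimal sketch of a gap-threshold heuristic. It is not the code in evaluator.js: the function `joinGlyphs`, the glyph shape `{ char, x, width }`, and the choice to scale the threshold by font size are all assumptions made for illustration. Only the constant's name and its value of 0.3 come from the line quoted above.

```javascript
// Illustrative only: a gap-threshold heuristic, not pdf.js's implementation.
const SPACE_FACTOR = 0.3; // value quoted above; its use here is an assumption

// glyphs: [{ char, x, width }] on one baseline, ordered left to right.
// A space is inserted when the gap between two glyphs exceeds
// spaceFactor * fontSize (hypothetical rule).
function joinGlyphs(glyphs, fontSize, spaceFactor = SPACE_FACTOR) {
  const threshold = spaceFactor * fontSize;
  let text = "";
  let prevEnd = null;
  for (const g of glyphs) {
    if (prevEnd !== null && g.x - prevEnd > threshold) {
      text += " ";
    }
    text += g.char;
    prevEnd = g.x + g.width;
  }
  return text;
}

const glyphs = [
  { char: "a", x: 0, width: 5 },
  { char: "b", x: 5.5, width: 5 }, // gap 0.5: kerning-sized, no space
  { char: "c", x: 15, width: 5 },  // gap 4.5: above the threshold of 3
];

console.log(joinGlyphs(glyphs, 10));      // "ab c"
console.log(joinGlyphs(glyphs, 10, 0.5)); // "abc"
```

The second call shows why changing the value is risky: the same glyphs lose their word break once the factor moves from 0.3 to 0.5.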

Snuffleupagus · 2018-05-21T11:55:55Z

Unfortunately #9736 (comment) won't help here, since this is a scanned file where every word is positioned individually with different font sizes and x/y coordinates; see e.g. the beginning of the /Contents stream:

BT
1 G
1 g
1 0 0 1 52 1145.37 Tm
/F1 11 Tf
(UTAH)Tj
1 0 0 1 90 1146.36 Tm
/F1 11 Tf
(NEWS)Tj
1 0 0 1 29 1124.14 Tm
/F1 10 Tf
(the)Tj
1 0 0 1 44 1126.03 Tm
/F1 8 Tf
(oregon)Tj
1 0 0 1 71 1124.47 Tm
/F1 9 Tf
(short)Tj
1 0 0 1 92 1125.47 Tm
/F1 9 Tf
(line)Tj

...
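As a rough illustration of what a viewer has to do with a stream like this, the sketch below groups the six words quoted above into lines by baseline and joins them with spaces. The x, y, and font size values are copied from the `Tm` and `Tf` operators; the helper `groupIntoLines` and its tolerance of half the font size are assumptions, not how pdf.js or any other viewer actually does it.

```javascript
// Words from the content stream above: Tm gives x/y, Tf gives the size.
const words = [
  { text: "UTAH", x: 52, y: 1145.37, size: 11 },
  { text: "NEWS", x: 90, y: 1146.36, size: 11 },
  { text: "the", x: 29, y: 1124.14, size: 10 },
  { text: "oregon", x: 44, y: 1126.03, size: 8 },
  { text: "short", x: 71, y: 1124.47, size: 9 },
  { text: "line", x: 92, y: 1125.47, size: 9 },
];

// Hypothetical helper: words whose baseline is within
// tolerance * fontSize of the current line's first word share that line.
function groupIntoLines(items, tolerance = 0.5) {
  // PDF y grows upward, so descending y reads top to bottom.
  const sorted = [...items].sort((a, b) => b.y - a.y || a.x - b.x);
  const lines = [];
  for (const item of sorted) {
    const line = lines[lines.length - 1];
    if (line && Math.abs(line.y - item.y) <= tolerance * item.size) {
      line.items.push(item);
    } else {
      lines.push({ y: item.y, items: [item] });
    }
  }
  // Within a line, order by x and separate words with a single space.
  return lines.map((l) =>
    l.items.sort((a, b) => a.x - b.x).map((i) => i.text).join(" ")
  );
}

console.log(groupIntoLines(words));
// [ 'UTAH NEWS', 'the oregon short line' ]
```

With the lines rebuilt this way, a two-word query such as "oregon short" has a contiguous string to match against, even though the baselines differ by up to about two units.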

RNCTX · 2018-05-23T18:52:19Z

Hello guys, as I'm sure you're aware, other PDF rendering projects suffer from this as well. I am currently using a web app (Nextcloud) that employs pdf.js as a PDF renderer for its browser application.

Here's an example of a file that I have worked with on other utilities. This is a scanned excerpt from an aircraft's autopilot service manual, originally printed in the 1970s on unknown equipment.

CenturyIIB-origscan.pdf
CenturyIIB-tesseract_hocr-uncleaned.pdf
CenturyIIB-tesseract_hocr-cleaned.pdf

The first file is the original scan without a text layer. The second (hocr-uncleaned) is a PDF/A that has been processed with Tesseract (v4.0) to create a hidden text layer. The third (hocr-cleaned) has been de-skewed with unpaper (v6.1) and then OCR'd with the same version of Tesseract and output as a PDF/A as well. In both PDF/A cases the original scan has been transcoded to 300 dpi JPEG for the final output.

In both the second and third cases, the 'hocr' rendering option with Tesseract was used for the OCR rendering stage (Tesseract has multiple internal renderers). If you take a look at Tesseract's issue tracker on GitHub, you'll see they have made some changes to their more recent renderer in an attempt to tackle this issue as well.

Here are some excerpts copied/pasted from various utilities...

hocr-uncleaned on Safari 11.1 (13605.1.33.1.4)

The Century IIB Autopilot is an "Open Loop" system which responds only to the dynamics of the aircraft in flight, thus the only ground checks that can be accomplished are functional checks as described in this bulletin.

hocr-uncleaned on Chrome 66.0.3359.181

The Century IIB Autopilot is an "Open Loop" system which responds only to the dynamics of the aircraft in flight, thus the only ground checks that can be accomplished are functional checks as described in this bulletin.

hocr-uncleaned on Adobe Acrobat Pro X

The Century IIB Autopilot is an "Open Loop" system which responds only to the
dynamics of the aircraft in flight, thus the only ground checks that can be
accomplished are functional checks as described in this bulletin.

hocr-uncleaned on pdf.js (Firefox 60.0.1)

The
Century
IIB
Autopilot
is
an
"Open Loop"
system
which
responds
only
to
the
dynamics
of
the
aircraft
in
flight,
thus
the
only
ground
checks
that
can
be
accomplished
are
functional
checks
as
described
in
this
bulletin.

hocr-cleaned on the same version of Safari above

The Century IIB Autopilot is an "Open Loop’ system which responds only to the dynamics of the aircraft in flight, thus the only ground checks that can be accomplished are functional checks as described in this bulletin.

hocr-cleaned on the same version of Chrome above

The Century IIB Autopilot is an "Open Loop’ system which responds only to the
dynamics of the aircraft in flight, thus the only ground checks that can be
accomplished are functional checks as described in this bulletin.

hocr-cleaned on the same version of Adobe Acrobat Pro above

The Century IIB Autopilot is an "Open Loop’ system which responds only to the
dynamics of the aircraft in flight, thus the only ground checks that can be
accomplished are functional checks as described in this bulletin.

hocr-cleaned on the same version of pdf.js (Firefox) above

The 
Century 
IIB 
Autopilot 
is 
an 
"Open 
Loop’ 
system 
which 
responds 
only 
to 
the 
dynamics 
of 
the 
aircraft 
in 
flight, 
thus 
the 
only 
ground 
checks 
that 
can 
be 
accomplished 
are 
functional 
checks 
as 
described 
in 
this 
bulletin.
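The one-word-per-line output above suggests one plausible reason a two-word query finds nothing: if the words are separated by line breaks rather than single spaces, a literal match on "Century IIB" fails. Here is a minimal sketch of whitespace-insensitive matching. The helpers `normalizeWhitespace` and `findPhrase` are hypothetical and are not part of pdf.js's find code.

```javascript
// Hypothetical helper: collapse any run of whitespace
// (spaces, tabs, newlines) into a single space.
function normalizeWhitespace(s) {
  return s.replace(/\s+/g, " ").trim();
}

// Case-insensitive phrase search over normalized text.
// Returns the index in the normalized text, or -1 if not found.
function findPhrase(pageText, query) {
  const haystack = normalizeWhitespace(pageText).toLowerCase();
  const needle = normalizeWhitespace(query).toLowerCase();
  return haystack.indexOf(needle);
}

// Text shaped like the hocr-cleaned pdf.js excerpt above:
// each word followed by a space and a line break.
const extracted = "The \nCentury \nIIB \nAutopilot \nis \nan";

console.log(extracted.includes("Century IIB"));    // false
console.log(findPhrase(extracted, "Century IIB")); // 4
```

Note that the returned index refers to the normalized string, so a real viewer would still need to map it back to the original text items to highlight the match.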

For anyone who might want to reproduce my toolchain for other sample files (main/dependency)...

tesseract 4.00.00alpha (for OCR)
leptonica 1.76.0
libjpeg-turbo 1.5.3
libpng 1.6.34+apng
libtiff 4.0.9

unpaper 6.1 (for de-skew, de-noise, etc)
libav 12.1
opencv 2.4.13.1
freetype2 2.8

qpdf 8.0.1 (for inspection/modification/creation of pdfs)
ghostscript 9.16

OCRmyPDF 6.2.0 (python v3 wrapper for the above utilities)

All of the above are in virtually any common Linux package repo, OCRmyPDF is in pip, and modern builds of all of them are in Homebrew for OSX as well (Tesseract must be tagged to their git HEAD since v4.0 is still marked beta). I have also run them all on FreeBSD (must build Tesseract, Leptonica, and unpaper from source). Tesseract/Leptonica is a great baseline to use for making such test files, in my opinion. They've brought open source OCR forward by leaps and bounds. Here is an example from a scan of an 18th-century document that it even does an admirable job on, despite not knowing what long S's are and transcribing them as lowercase f's.

Snuffleupagus · 2021-05-01T07:42:19Z

WFM, most likely fixed by PR #13257.

timvandermeij added the text-selection label May 17, 2018

RNCTX mentioned this issue Jun 13, 2018

Combined PDFs revisited the-paperless-project/paperless#365

Closed

timvandermeij mentioned this issue Nov 20, 2018

I can't able to search the using the text #10271

Closed

Snuffleupagus mentioned this issue Apr 30, 2020

Searching a sentence with spaces between the words doesn't work. #11861

Closed

timvandermeij closed this as completed May 1, 2021
