(getTextContent()): inaccurate spacing on current version vs older (v2.0.550) #17839
I've noticed that older versions of pdf.js produce more accurate spacing on the PDFs I'm parsing. I'm only using pdf.js to extract the text. I'd like to use the latest version of pdf.js, but its spacing is too inaccurate for me to fix in post-processing. Are there any options that would let me fine-tune the spacing?
Things I've tried:
.getDocument({... , disableFontFace: true})
.getTextContent({disableNormalization: true})
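For reference, a minimal sketch of how the two options above would be passed during extraction. `extractPageText` and `joinItems` are hypothetical helper names, `pdfjsLib` is assumed to be the imported pdf.js module, and `hasEOL` is treated as optional because older releases do not set it on text items.

```javascript
// The two options from the list above.
const DOCUMENT_OPTIONS = { disableFontFace: true };
const TEXT_OPTIONS = { disableNormalization: true };

// Hypothetical helper: concatenate item strings exactly as pdf.js returned them,
// adding a newline only where an item is flagged as ending a line.
function joinItems(items) {
  return items
    .filter((item) => typeof item.str === "string") // skip non-text entries
    .map((item) => item.str + (item.hasEOL ? "\n" : ""))
    .join("");
}

// Hypothetical helper: pdfjsLib is assumed to be the imported pdf.js module.
async function extractPageText(pdfjsLib, url, pageNumber) {
  const pdf = await pdfjsLib.getDocument({ url, ...DOCUMENT_OPTIONS }).promise;
  const page = await pdf.getPage(pageNumber);
  const textContent = await page.getTextContent(TEXT_OPTIONS);
  return joinItems(textContent.items);
}
```

Joining with an empty string keeps whatever spacing pdf.js itself produced, which makes differences between versions easier to compare.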
Attach (recommended) or Link to PDF file here: congressional-daily-record-170-13.pdf
Configuration:
Steps to reproduce the problem:
v4.0.379 outputs the following (text is shortened for GitHub):
Notice how spacing is missing, e.g.:
WEDNESDAY, JANUARY 24, 2024 No. 13House of RepresentativesThe
Notice how spacing is in the wrong location, e.g.:
called to order by the Honorable PETERW ELCH
v2.0.550 outputs the following (text is shortened for GitHub):
Notice how spacing is more accurate, e.g.:
WEDNESDAY, JANUARY 24, 2024 No. 13 House of Representatives The
Notice how spacing is in the correct location, e.g.:
called to order by the Honorable PETER WELCH
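For illustration, a rough sketch of what position-based post-processing could look like: it inserts a space when the horizontal gap between two consecutive text items exceeds a fraction of the font height, and a newline when the baseline changes. `joinByPosition` and the `gapRatio` default of 0.15 are assumptions, not part of pdf.js, and the sketch assumes unrotated, left-to-right text. It could only recover spaces missing between items; a misplaced space inside a single item's `str` would remain.

```javascript
// Hypothetical helper: rebuild spaces from item geometry rather than item order.
// Each item is assumed to follow the pdf.js text item shape: str, width, height,
// and transform = [a, b, c, d, x, y], where x/y is the start of the baseline.
function joinByPosition(items, gapRatio = 0.15) {
  let text = "";
  let prev = null;
  for (const item of items) {
    if (typeof item.str !== "string" || item.str === "") continue;
    const x = item.transform[4];
    const y = item.transform[5];
    if (prev !== null) {
      const fontSize = item.height || prev.height || 1;
      const sameLine = Math.abs(y - prev.y) < fontSize * 0.5;
      if (!sameLine) {
        text += "\n"; // baseline moved: treat as a new line
      } else if (
        x - prev.endX > fontSize * gapRatio && // visible horizontal gap
        !text.endsWith(" ") &&
        !item.str.startsWith(" ")
      ) {
        text += " ";
      }
    }
    text += item.str;
    prev = { endX: x + item.width, y, height: item.height };
  }
  return text;
}
```

The threshold would need tuning per document, since tightly kerned fonts and justified text produce very different gaps.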
Here's a screenshot of the PDF (first page); I've highlighted the mentioned text in red: