This repository has been archived by the owner on Feb 16, 2023. It is now read-only.

[BUG] pdfminer defaults cause excessive whitespaces in extracted text #1734

Open
tmbinc opened this issue Oct 2, 2022 · 0 comments


tmbinc commented Oct 2, 2022

I ran into the same problem as #1679 when processing PDFs that had already been OCR'ed with Abbyocr: the extracted text has spaces between individual letters.

The issue in my case was pdfminer's default laparams, especially word_margin's default of 0.1:

>>> from pdfminer.high_level import extract_text as pdfminer_extract_text
>>> pdfminer_extract_text("0000131.pdf")
'e S T A D T W E R K E\n\nxx\n\nV e r t r a g s k o n t o - N r . :[...]
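For context, here is a rough sketch of how word_margin seems to act, going by pdfminer's documentation: a space is inserted between two characters on the same line when the gap between them exceeds word_margin times the character width. This is a simplified model, not pdfminer's actual code. The function `join_glyphs` and the glyph coordinates are hypothetical, and only the glyph width is considered.

```python
# Simplified model of the word_margin rule, NOT pdfminer's actual code.
# Each glyph is (char, x0, x1) on one horizontal line; all numbers are made up.

def join_glyphs(glyphs, word_margin=0.1):
    """Join glyphs, inserting a space when the gap exceeds word_margin * glyph width."""
    out = []
    prev_x1 = None
    for char, x0, x1 in glyphs:
        width = x1 - x0
        if prev_x1 is not None and x0 - prev_x1 > word_margin * width:
            out.append(" ")
        out.append(char)
        prev_x1 = x1
    return "".join(out)


# Letter-spaced text: each glyph is 10 units wide, followed by a 5-unit gap.
glyphs = [(c, i * 15.0, i * 15.0 + 10.0) for i, c in enumerate("STADT")]

print(join_glyphs(glyphs, word_margin=0.1))  # gap 5 > 0.1 * 10, so spaces are inserted
print(join_glyphs(glyphs, word_margin=1.0))  # gap 5 <= 1.0 * 10, so letters stay together
```

Under this model, OCR output with slightly wide letter spacing crosses the 0.1 threshold at almost every character, which would match the output above.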

Setting word_margin to 1 fixed it for me, but I'm not sure whether that value is good in general. (I tried various values; 1.0 was the smallest that worked well for my documents.)

>>> from pdfminer.layout import LAParams
>>> laparams = LAParams(word_margin=1)
>>> pdfminer_extract_text("0000131.pdf", laparams=laparams)
'e STADTWERKE\n\nxxx\n\nVertragskonto-Nr.:[..]'
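To check other documents, one option is to sweep over several word_margin values and measure how many tokens are single characters. The sketch below is only a heuristic, not something paperless or pdfminer provides: the helper names `single_char_ratio` and `sweep_word_margin` and the list of margins are made up for illustration. It uses only `extract_text` and `LAParams` from pdfminer.

```python
def single_char_ratio(text):
    """Fraction of whitespace-separated tokens that are a single character."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if len(t) == 1) / len(tokens)


def sweep_word_margin(path, margins=(0.1, 0.25, 0.5, 1.0, 2.0)):
    """Return {word_margin: single_char_ratio} for one PDF (hypothetical helper)."""
    # Imported here so single_char_ratio can be used without pdfminer installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    results = {}
    for margin in margins:
        text = extract_text(path, laparams=LAParams(word_margin=margin))
        results[margin] = single_char_ratio(text)
    return results
```

Calling `sweep_word_margin("0000131.pdf")` would give one ratio per margin. A ratio close to 1.0 suggests the letters are still being split apart, while a sharp drop marks the smallest margin that joins them. The right threshold probably depends on the document.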

Relevant information

  • Host OS of the machine running paperless: debian
  • Browser: any
  • Version: "jonaswinkler/paperless-ng@sha256:b61d514e178ddfa4673e72d0440b3166d46ec977dc6bbc7a9a293adf64200f55"
  • Installation method: docker