[BUG] pdfminer defaults cause excessive whitespaces in extracted text #1734

tmbinc · 2022-10-02T17:03:33Z

I ran into the same problem as #1679 when processing PDFs that had already been OCR'ed with Abbyocr: the extracted text has spaces between individual letters.

The issue in my case was pdfminer's default laparams, especially word_margin's default of 0.1:

>>> from pdfminer.high_level import extract_text as pdfminer_extract_text
>>> pdfminer_extract_text("0000131.pdf")
'e S T A D T W E R K E\n\nxx\n\nV e r t r a g s k o n t o - N r . :[...]

Setting word_margin=1 fixed it for me, but I'm not sure whether that value is universally good. (I tried various margin values; 1.0 was the smallest that worked well for my documents.)

>>> from pdfminer.layout import LAParams
>>> laparams = LAParams(word_margin=1)
>>> pdfminer_extract_text("0000131.pdf", laparams=laparams)
'e STADTWERKE\n\nxxx\n\nVertragskonto-Nr.:[..]'
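To illustrate why the margin matters, here is a simplified model of the heuristic as I understand it: a space is inserted before a character when the horizontal gap to the previous character exceeds word_margin times the larger of the character's width and height. This is a sketch, not pdfminer's actual code; the `Glyph` class, `join_glyphs`, `letter_spaced`, and all coordinates are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical stand-in for a positioned character on one text line."""
    text: str
    x0: float      # left edge
    x1: float      # right edge
    height: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0


def join_glyphs(glyphs, word_margin):
    """Join glyphs left to right, inserting a space where the gap is wide.

    Assumed rule: a gap counts as a word break when it is larger than
    word_margin * max(glyph width, glyph height).
    """
    out = []
    prev_x1 = None
    for g in glyphs:
        if prev_x1 is not None and word_margin:
            margin = word_margin * max(g.width, g.height)
            if prev_x1 < g.x0 - margin:
                out.append(" ")
        out.append(g.text)
        prev_x1 = g.x1
    return "".join(out)


def letter_spaced(word, start=0.0, width=5.0, gap=3.0, height=10.0):
    """Build glyphs for a word whose letters are spread apart by `gap`."""
    glyphs = []
    x = start
    for ch in word:
        glyphs.append(Glyph(ch, x, x + width, height))
        x += width + gap
    return glyphs


glyphs = letter_spaced("STADT")
print(join_glyphs(glyphs, 0.1))  # S T A D T
print(join_glyphs(glyphs, 1.0))  # STADT
```

With these made-up numbers, a gap of 3 units exceeds the threshold at word_margin=0.1 (1 unit) but not at word_margin=1 (10 units). A larger margin also raises the bar for real word gaps, which may be why a single value is not universally good.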

Relevant information

Host OS of the machine running paperless: Debian
Browser: any
Version: "jonaswinkler/paperless-ng@sha256:b61d514e178ddfa4673e72d0440b3166d46ec977dc6bbc7a9a293adf64200f55"
Installation method: docker

tmbinc mentioned this issue Oct 2, 2022

change pdfminer's word_magin=1 for better text extraction results #1735

Open