Extracting order pre-definable? #18

Open
luke4u opened this issue Jan 6, 2020 · 3 comments

Comments


luke4u commented Jan 6, 2020

Hi Guys,

Just wondering whether the text extraction order can be defined for a PDF file. As pointed out here, is there a similar setting to adjust the extraction order?

This image shows the error.

parsing order issue

AUB_Financials_Dec_2018_pg9.pdf

Any insights would be much appreciated.

Thanks.
Luke

Owner

lebedov commented Jan 8, 2020

Does the sort option of the extract_text method do what you need? If not, you will have to wrap pdfbox's developer API yourself; by design, python-pdfbox only exposes pdfbox's command-line interface. I have posted a gist that demonstrates how to access the API from Python. You can use it as a starting point for wrapping the PDFTextStripper Java class so that you can call its setSortByPosition() method.
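To make the idea of sorting by position concrete, here is a minimal pure-Python sketch. It is only an illustration of the concept and not how pdfbox implements it: the `Fragment` type, the `sort_by_position` helper, the `line_tolerance` value, and the sample coordinates are all hypothetical. It groups text fragments into lines by their vertical position and then orders each line from left to right.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """A piece of text with a position on the page (hypothetical type)."""
    x: float  # left edge, in points
    y: float  # distance from the top of the page, in points
    text: str


def sort_by_position(fragments: List[Fragment], line_tolerance: float = 2.0) -> List[str]:
    """Return text lines ordered top-to-bottom, each read left-to-right.

    Fragments whose y values differ by at most `line_tolerance` from the
    first fragment of a line are treated as belonging to that line.
    """
    lines = []  # each entry: [reference_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: f.y):
        if lines and abs(frag.y - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return [
        " ".join(f.text for f in sorted(row, key=lambda f: f.x))
        for _, row in lines
    ]


# Made-up fragments in a "stream order" that differs from the visual order.
stream = [
    Fragment(300, 100, "2018"),
    Fragment(50, 120, "Revenue"),
    Fragment(50, 100, "Item"),
    Fragment(300, 120.5, "1,234"),
]
print(sort_by_position(stream))  # ['Item 2018', 'Revenue 1,234']
```

Without such a sort, the text would come out in stream order ("2018", "Revenue", "Item", "1,234"), which is the kind of scrambled output the question describes.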

lebedov closed this as completed Jan 8, 2020
lebedov reopened this Jan 8, 2020
Owner

lebedov commented Jan 8, 2021

@zevio, if you delete the pdfbox-app*jar file cached by python-pdfbox (in ~/.cache/python-pdfbox on Linux or ~/Library/Caches/python-pdfbox on macOS), the latest jar file will be downloaded the next time you import the package.
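The cache cleanup described above could be scripted roughly as follows. This is a sketch under stated assumptions: `remove_cached_jars` is a hypothetical helper, not part of python-pdfbox, and it simply deletes any file matching the pdfbox-app*jar pattern in the directory you pass in.

```python
from pathlib import Path
from typing import List


def remove_cached_jars(cache_dir: str) -> List[str]:
    """Delete files matching pdfbox-app*jar in cache_dir.

    Returns the names of the removed files. A missing directory
    simply yields an empty list.
    """
    removed = []
    for jar in sorted(Path(cache_dir).expanduser().glob("pdfbox-app*jar")):
        jar.unlink()
        removed.append(jar.name)
    return removed


# Example call for the Linux location mentioned above (not run here):
# remove_cached_jars("~/.cache/python-pdfbox")
```

On macOS the argument would be ~/Library/Caches/python-pdfbox instead.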


zevio commented Jan 8, 2021

I was about to correct my suggestion. Actually, I think the issue is not directly linked to the jar file version but to the -sort option, as you previously said. The same issue currently occurs with Apache Tika, which bundles PDFBox. However, neither calling setSortByPosition() nor changing the Apache Tika configuration file seems to work on my end. Still, using the -sort option with the jar file corrects most of my issues. Surprisingly, though, I obtained much better results with OCR (Pytesseract) for PDF content extraction.
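For anyone who wants to try the -sort option with the jar file directly, a rough sketch of driving it from Python follows. It assumes the PDFBox 2.x command-line syntax (the `ExtractText` command with `-sort`); the helper names and the file paths are hypothetical, and `java` must be on the PATH for the second function to work.

```python
import subprocess
from typing import List


def build_extract_command(jar_path: str, pdf_path: str, txt_path: str,
                          sort: bool = True) -> List[str]:
    """Build the argument list for the PDFBox 2.x ExtractText command."""
    cmd = ["java", "-jar", jar_path, "ExtractText"]
    if sort:
        cmd.append("-sort")  # ask PDFBox to sort the text before writing it
    cmd += [pdf_path, txt_path]
    return cmd


def run_extract(jar_path: str, pdf_path: str, txt_path: str,
                sort: bool = True) -> None:
    """Run the command; raises CalledProcessError if java exits non-zero."""
    subprocess.run(build_extract_command(jar_path, pdf_path, txt_path, sort),
                   check=True)


# Placeholder paths, shown only to illustrate the resulting command line.
print(" ".join(build_extract_command("pdfbox-app.jar", "in.pdf", "out.txt")))
```

Passing `sort=False` drops the flag, which makes it easy to compare the two outputs on the same file.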
