text extraction hangs on MacOS 10.14 #14

Open
devcsrj opened this issue Oct 1, 2019 · 11 comments
@devcsrj
devcsrj commented Oct 1, 2019

I am trying to use pdfbox, with this vanilla snippet:

import pdfbox

# pdf and txt are pathlib.Path objects for the input PDF and the output text file
converter = pdfbox.PDFBox()
converter.extract_text(
    input_path=str(pdf.absolute()),
    output_path=str(txt.absolute()))

But it becomes stuck. I debugged the stack tree, and it hangs at this line:

[Screenshot: Screen Shot 2019-10-02 at 6 25 24 AM]

I confirmed that a Java process is spawned:

➜ jps
5416 Jps
5385
329    <-- spawned process

But it is just stuck there.

Running the cached jar by python-pdfbox in the terminal works:

java -jar pdfbox-app-2.0.17.jar ExtractText '/Users/devcsrj/Projects/devcsrj/klerk/dist/17/SENATE/regular-1/journal-28.pdf' '/Users/devcsrj/Projects/devcsrj/klerk/dist/17/SENATE/regular-1/journal-28.txt'

So I am no longer sure what's going on. Thoughts?


Environment

Python

python-pdfbox = "==0.1.7"
python_version = "3.7"

Java

openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-20190711112007.graal.jdk8u-src-tar-gz-b08)
OpenJDK 64-Bit GraalVM CE 19.2.0 (build 25.222-b08-jvmci-19.2-b02, mixed mode)

OS

macOS Mojave 10.14.4
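
Since running the jar directly from the terminal works, one possible stopgap is to drive the same `ExtractText` command from Python in a separate JVM process, with a timeout so a stuck run raises instead of blocking forever. This is only a sketch, not part of python-pdfbox: the function names `build_extract_command` and `extract_text_via_cli` and the 120-second default are assumptions for illustration, and the jar path must point to wherever the pdfbox-app jar actually lives on your machine.

```python
import subprocess
from pathlib import Path


def build_extract_command(jar: Path, pdf: Path, txt: Path) -> list:
    """Build the argv for the same ExtractText call used in the terminal."""
    return ["java", "-jar", str(jar), "ExtractText", str(pdf), str(txt)]


def extract_text_via_cli(jar: Path, pdf: Path, txt: Path,
                         timeout: float = 120.0) -> None:
    """Run ExtractText in its own JVM process (hypothetical helper).

    subprocess.run kills the child and raises subprocess.TimeoutExpired if
    the timeout elapses; check=True raises CalledProcessError on a non-zero
    exit status.
    """
    subprocess.run(build_extract_command(jar, pdf, txt),
                   check=True, timeout=timeout)
```

This bypasses the in-process JVM entirely, so it does not explain the hang; it only avoids it.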

@adarsa
adarsa commented Nov 12, 2019

I have the same issue. Did you find a solution to this?

@devcsrj
Author

devcsrj commented Nov 12, 2019

@adarsa Not really, no. I ended up abandoning pdfbox altogether and used tesseract to extract the text instead.

@lebedov
Owner

lebedov commented Nov 12, 2019

Does this occur with all PDFs, or only with some? If the latter, can you attach it to this issue?

@devcsrj
Author

devcsrj commented Nov 12, 2019

@lebedov I haven't had the chance to try it on other PDFs, but as for the file I am using in the screenshot, it is this one.

@lebedov
Owner

lebedov commented Nov 13, 2019

I can't reproduce the hanging problem with the input PDF file you mentioned on Ubuntu Linux 18.0.4 with Python 3.7.3 and OpenJDK 11.0.4. I suspect some sort of platform-specific jpype weirdness, but I unfortunately don't have a MacOS box to debug this. I'll leave the issue open for the time being in case anyone who can investigate has further input.

lebedov changed the title from "ExtractText is stuck" to "text extraction hangs on MacOS 10.14" on Nov 13, 2019

@adarsa
adarsa commented Nov 25, 2019

I had this issue with all the PDFs I tried.

@suiyuan2009
+1

@sprakash93
+1

@lebedov
Owner

lebedov commented Aug 5, 2020

I finally obtained access to a MacOS box. I can't reproduce the problem with Python 3.8.5, OpenJDK 14.0.2, and python-pdfbox 0.1.8 on MacOS 10.15.6; processing the indicated file succeeds without any error.

@peterHeuz
I had the same issue also on macOS Mojave and this Java JDK version:
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

I installed OpenJDK 15 from here and that fixed the issue.
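
The hangs reported in this thread all involve a Java 8 runtime on macOS Mojave, while the runs that succeeded used OpenJDK 11, 14, or 15. That is an observation from the thread, not a confirmed root cause. A quick way to see which major version the `java` on your PATH reports is sketched below. The helper names `parse_java_major` and `installed_java_major` are hypothetical and not part of python-pdfbox.

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Return the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a version string")
    major = int(match.group(1))
    # Java 8 and earlier report themselves as "1.x" (e.g. "1.8.0_151").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major() -> int:
    """Ask the `java` on PATH for its version; it prints to stderr."""
    result = subprocess.run(["java", "-version"],
                            capture_output=True, text=True, check=True)
    return parse_java_major(result.stderr)
```

For example, `parse_java_major('java version "1.8.0_151"')` returns 8, matching the runtime quoted above.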

@lebedov
Owner

lebedov commented Jan 12, 2021

@peterHeuz Given that more than one person has encountered the issue on MacOS, I added a note to the package README.
