PDFPageCountError: Unable to get page count. I/O Error: Couldn't open file 'C:\Users\cdragomir2\Desktop\dataiku\Non Phub Samples\New folder (3)\007-084841-1 to 31 Dec'22': No error. #251

Open
Crispisu opened this issue Jan 11, 2023 · 5 comments

Comments

@Crispisu

Crispisu commented Jan 11, 2023

Hi All,
I am trying to use pdf2image, but I am getting this error:

PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'C:\Users\user_name\Desktop\folder_name\folder2_name\folder3_name\007-084841-1 to 31 Dec'22': No error.

It is confusing because it doesn't report an actual error; it just says 'No error'.

My code is:

import os

import pytesseract
from pdf2image import convert_from_path
from PIL import Image

# Path exactly as used when the error occurred
pdf_path = "C:\\Users\\user_name\\Desktop\\folder_name\\folder2_name\\folder3_name\\007-084841-1 to 31 Dec'22"
doc = convert_from_path(pdf_path)
path, fileName = os.path.split(pdf_path)
fileBaseName, fileExtension = os.path.splitext(fileName)

for page_number, page_data in enumerate(doc):
    txt = pytesseract.image_to_string(Image.fromarray(page_data)).encode("utf-8")
    print("Page # {} - {}".format(str(page_number),txt))

Can anyone help me please?

@jjbiggins

I investigated this a bit. More information would be helpful to nail it down.

What version of pdf2image are you using? And which Python version?

I don't have an easily accessible Windows machine, so I couldn't confirm this, but it looks like the Popen call in the pdfinfo function is throwing an error. I couldn't replicate it, but I know that similar issues in the past occurred because pdfinfo was not in PATH, so I would check that first.
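
A quick way to run the PATH check suggested above is sketched here. The helper names `locate_binary` and `binary_version_banner` are made up for illustration, and the sketch assumes the installed pdfinfo accepts the `-v` flag to print its version banner.

```python
import shutil
import subprocess


def locate_binary(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


def binary_version_banner(name):
    """Run `<name> -v` and return whatever it prints, or None if not on PATH."""
    exe = locate_binary(name)
    if exe is None:
        return None
    # Some tools print the version banner to stderr, so merge both streams.
    result = subprocess.run(
        [exe, "-v"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    return result.stdout.decode("utf-8", errors="replace")


print("pdfinfo found at:", locate_binary("pdfinfo"))
print("version banner:", binary_version_banner("pdfinfo"))
```

If the first line prints `None`, pdfinfo is not reachable through PATH from the Python process.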

Aside from that, it appears stderr isn't being handled correctly. I believe it would work if stderr=PIPE were replaced with stderr=STDOUT, which redirects stderr into the same stream as stdout.

Also, on Windows the STARTUPINFO class affects how stdin, stdout, and stderr are handled. In the most up-to-date code in the repo, you'll notice that the process instances are created using STARTUPINFO.

The pdfinfo function has evolved over the various versions of pdf2image, as has subprocess, particularly on Windows, from Python 3.7 until now. So knowing those versions would help narrow down the issue.
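
To make the stderr=PIPE versus stderr=STDOUT distinction concrete, here is a minimal sketch using only the standard library. It does not reproduce pdf2image's own code. A small Python child process stands in for pdfinfo and writes a made-up message to stderr.

```python
import subprocess
import sys

# Stand-in for pdfinfo: a child process that writes only to stderr.
CHILD = [sys.executable, "-c", "import sys; sys.stderr.write('boom')"]

# stderr=PIPE keeps the two streams separate.
separate = subprocess.run(CHILD, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# stderr=STDOUT sends the child's stderr into the stdout pipe instead.
merged = subprocess.run(CHILD, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

print("separate:", separate.stdout, separate.stderr)
print("merged:  ", merged.stdout, merged.stderr)
```

With PIPE the message arrives on `stderr`. With STDOUT it arrives on `stdout` and the `stderr` attribute is `None`, so any code that reads the error text has to look in the right place.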

@Crispisu
Author

@jjbiggins Thank you so much for looking into it. I have just managed to figure it out. It was a stupid mistake on my side, and even though I feel embarrassed to say what it was, I will say it in case someone else makes the same mistake:
I forgot to add the file extension! :(
Thank you once again for your help!
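
For anyone hitting the same thing, a small pre-check before calling convert_from_path can catch a missing extension or a wrong path early. This is only a sketch; `check_pdf_path` is a hypothetical helper, not part of pdf2image.

```python
from pathlib import Path


def check_pdf_path(path_str):
    """Return a Path if it names an existing .pdf file, otherwise raise."""
    path = Path(path_str)
    # Catch the "forgot the extension" case before pdfinfo ever runs.
    if path.suffix.lower() != ".pdf":
        raise ValueError("Expected a .pdf file, got: {!r}".format(path.name))
    if not path.is_file():
        raise FileNotFoundError("No such file: {}".format(path))
    return path
```

The result could then be passed on, for example `convert_from_path(str(check_pdf_path(my_path)))`, so the failure shows up as a readable Python exception instead of "No error".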

@jjbiggins

jjbiggins commented Jan 16, 2023

I see. I was curious about that filename.

It makes sense to me that this would raise the PDFPageCountError. However, the error message, "No error", seems undesirable.

For example, in your case, where the file doesn't exist because the extension was omitted, I would expect an error such as:

pdf2image.exceptions.PDFPageCountError: Unable to get page count. I/O Error: Couldn't open file 'hello.pdf': No such file or directory.

I generated that on macOS with Python 3.11 and the most recent pdf2image code. I would have to defer to someone more knowledgeable, but, intuitively, saying "No error" when there clearly is one seems like an issue.

However, depending on your version of pdf2image, this may already have been resolved.

@Crispisu
Author

Agreed, that "No error" message was very confusing for me as well, as I had no clue how to debug it.
It would be great if an error message like the one you described were raised.

pdf2image version:
pdf2image 1.16.2

Thank you so much!

@jjbiggins

After looking into this, the message comes directly from the pdfinfo binary, so it depends on which version of pdfinfo is being used.

For example, with the pdfinfo binary from Xpdf-4.04, no message would be displayed. With pdfinfo version 22.09.0 from poppler, however, you get the more detailed output.

In both cases, pdfinfo calls fopen() to open the PDF, which fails with errno 2 (ENOENT, "No such file or directory").

Only the poppler version appends the errno description, "No such file or directory", to pdfinfo's error message, which makes it available in the stderr output that pdf2image captures.

There's not a great way to handle this.
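
As a rough illustration of the errno-to-description mapping discussed above, the same lookup can be seen from Python. This is not what pdfinfo or pdf2image does internally; `describe_open_failure` is a made-up helper, and the description text comes from the platform's C library.

```python
import errno
import os


def describe_open_failure(path):
    """Try to open path; return (errno name, description) on failure, else None."""
    try:
        with open(path, "rb"):
            return None
    except OSError as exc:
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        # os.strerror() looks up the same kind of text a binary may append.
        return name, os.strerror(exc.errno)


print(describe_open_failure("hello-missing-file.pdf"))
```

A caller that knows the path in advance can run this kind of check itself, which sidesteps the question of whether the installed pdfinfo includes the description in its output.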
