Wrong page range given: the first page (1) can not be after the last page (0). #234

Open
camipozas opened this issue May 24, 2022 · 9 comments

@camipozas
Contributor

Describe the bug
I am running a Docker image that reads a PDF, converts it to images, and then extracts the text (the documents are scanned). I get the following error. Does anyone know why? I can't share the document :(

To Reproduce

  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 479, in pdfinfo_from_path
    raise ValueError
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/build/app/main.py", line 67, in <module>
    main()
  File "/opt/build/app/main.py", line 48, in main
    text_contract = read_pdf(contract)
  File "/opt/build/app/main.py", line 26, in read_pdf
    images_from_path = convert_from_path(
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 98, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 488, in pdfinfo_from_path
    raise PDFPageCountError(
pdf2image.exceptions.PDFPageCountError: Unable to get page count.
Syntax Error: Gen inside xref table too large (bigger than INT_MAX)
Syntax Error: Invalid XRef entry 3
Syntax Error: Top-level pages object is wrong type (null)
Command Line Error: Wrong page range given: the first page (1) can not be after the last page (0).
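
For readers following the traceback: the bare ValueError followed by PDFPageCountError fits a parser that looks for a "Pages" field in the pdfinfo output and does not find one. The sketch below is a simplified illustration of that pattern, not pdf2image's actual code. The function names and the sample output strings are hypothetical, and it assumes pdfinfo printed no "Pages:" line for the damaged file.

```python
def parse_pdfinfo_output(output: str) -> dict:
    """Turn 'Key: value' lines (the format pdfinfo prints) into a dict."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon at all
            info[key.strip()] = value.strip()
    return info


def page_count(output: str) -> int:
    """Return the page count, or raise ValueError when 'Pages' is missing."""
    info = parse_pdfinfo_output(output)
    if "Pages" not in info:
        raise ValueError("pdfinfo output has no 'Pages' field")
    return int(info["Pages"])


# Hypothetical sample outputs, for illustration only.
good_output = "Title:          Contract\nPages:          3\nEncrypted:      no\n"
broken_output = ""  # assumption: only error messages were produced, no fields

print(page_count(good_output))
try:
    page_count(broken_output)
except ValueError as exc:
    print("could not get page count:", exc)
```

With no page count available, the later "first page (1) can not be after the last page (0)" message is what one would expect from a tool that treats the document as having zero pages.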

Desktop (please complete the following information):

  • OS: Linux (Pop!_OS)

Additional context
Dockerfile

FROM python:3.9
ENV LANG en_US.UTF-8

WORKDIR /opt/build

RUN apt update && apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev pkg-config poppler-utils


ADD requirements.txt requirements.txt
RUN pip install -r requirements.txt
# Copy env variables
ADD .env .env

# trained models
ADD tessdata/ tessdata/
ENV TESSDATA_PREFIX /opt/build/tessdata/

ADD app/ app/
RUN mkdir input

ENTRYPOINT ["python"]
CMD ["app/main.py"]
@Belval
Owner

Belval commented May 24, 2022

Thank you for taking the time to fill out the issue template; it makes it much easier to help.

Is this only with one or a few PDFs?

Also, can you run pdftoppm -r 200 -jpeg your_file.pdf out and see if that also gives you an error?

@camipozas
Contributor Author

Hello, I analyzed the PDFs that gave me an error and they were all signed with DocuSign, although other DocuSign PDFs usually convert correctly. I don't know how to upgrade poppler-utils in Docker. I had read this before: Pdf2Image library failing to read pdf signed using docusign

@camipozas
Contributor Author

camipozas commented Jun 3, 2022

Hello, I solved the problem. The solution is to start from an Ubuntu base image, install Python (in my case), and then install everything else. It's the only way for now...
When I got inside the original container, I saw this version of poppler:

poppler-utils:
  Installed: 20.09.0-3.1
  Candidate: 20.09.0-3.1
  Version table:
 * 20.09.0-3.1 500
        500 http://deb.debian.org/debian bullseye/main amd64 Packages
        100 /var/lib/dpkg/status

I know that I need 21.03.0 or later, so after applying the fix, the image has:

poppler-utils:
  Installed: 22.02.0-2
  Candidate: 22.02.0-2
  Version table:
 * 22.02.0-2 500
        500 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages
        100 /var/lib/dpkg/status

If anyone has a question please contact me, happy to help.
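
A quick way to check the installed poppler from Python, as a minimal sketch. It assumes that `pdftoppm -v` prints a line such as "pdftoppm version 22.02.0". The (21, 3) threshold is the one reported in this thread, not an official cutoff, and the function names are hypothetical.

```python
import re
import subprocess

# Threshold reported in this thread; treat it as an assumption.
MIN_POPPLER = (21, 3)


def parse_poppler_version(text: str):
    """Extract (major, minor[, patch]) from a 'version X.Y.Z' string."""
    match = re.search(r"version (\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups() if part is not None)


def installed_poppler_version(binary: str = "pdftoppm"):
    """Ask the poppler binary for its version; None if it is not installed."""
    try:
        result = subprocess.run(
            [binary, "-v"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None
    # Search both streams so it does not matter which one carries the banner.
    return parse_poppler_version(result.stderr + result.stdout)


def is_new_enough(version) -> bool:
    """True when the major/minor pair meets the assumed threshold."""
    return version is not None and version[:2] >= MIN_POPPLER


print(installed_poppler_version())
```

Run inside the container, this prints the detected version tuple, or None when pdftoppm is missing or its banner does not match the assumed format.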

@faltunik

faltunik commented Jun 9, 2022

What I still don't understand is: what causes the miscount?

@camipozas
Contributor Author

@faltunik Sorry, I don't know what causes the issue in detail. I only know the general cause and the solution.

@Belval
Owner

Belval commented Jun 27, 2022

This is a poppler issue unfortunately so there is not much that can be done on my side. I might add a check that raises a warning so that people are aware.

@camipozas
Contributor Author

If you want, I can add documentation for this to your project. I can fork it and open a PR.

@Belval
Owner

Belval commented Jun 27, 2022

I appreciate the offer, but I am not sure what's the best way/place to document this yet.

It could be:

For the code warning, it would use the warnings module (https://docs.python.org/3/library/warnings.html#warnings.warn):

warnings.warn(f"Detected poppler version {poppler_version_major}.{poppler_version_minor} is known to fail on some PDFs in rare cases")

A code warning is more intrusive and might be overkill, depending on how common this issue is.
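
To make the idea concrete, here is a self-contained sketch of such a check. It is only an illustration and not code from pdf2image. The function name and the FIXED_IN threshold are assumptions, with the threshold taken from the version reported earlier in this thread.

```python
import warnings

# Assumed cutoff, based on the version reported in this thread.
FIXED_IN = (21, 3)


def warn_if_poppler_may_fail(poppler_version_major: int, poppler_version_minor: int) -> bool:
    """Emit a RuntimeWarning for poppler versions below the assumed cutoff.

    Returns True when a warning was issued, False otherwise.
    """
    if (poppler_version_major, poppler_version_minor) < FIXED_IN:
        warnings.warn(
            f"Detected poppler version {poppler_version_major}.{poppler_version_minor:02d} "
            "is known to fail on some PDFs in rare cases",
            RuntimeWarning,
            stacklevel=2,  # point the warning at the caller, not this helper
        )
        return True
    return False
```

Calling warn_if_poppler_may_fail(20, 9) issues the warning, while warn_if_poppler_may_fail(22, 2) stays silent. Users who find it too noisy can silence it with the standard warnings filters.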

@puneetjindal

@camipozas How do you check whether a particular pdf is a scanned pdf or not?
