Wrong page range given: the first page (1) can not be after the last page (0). #234

camipozas · 2022-05-24T14:56:29Z

Describe the bug
I am running a Docker image that reads a PDF, converts it to images, and then extracts the text (the documents are scanned). I get the following error. Does anyone know why? I can't share the document :(

To Reproduce

  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 479, in pdfinfo_from_path
    raise ValueError
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/build/app/main.py", line 67, in <module>
    main()
  File "/opt/build/app/main.py", line 48, in main
    text_contract = read_pdf(contract)
  File "/opt/build/app/main.py", line 26, in read_pdf
    images_from_path = convert_from_path(
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 98, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/usr/local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 488, in pdfinfo_from_path
    raise PDFPageCountError(
pdf2image.exceptions.PDFPageCountError: Unable to get page count.
Syntax Error: Gen inside xref table too large (bigger than INT_MAX)
Syntax Error: Invalid XRef entry 3
Syntax Error: Top-level pages object is wrong type (null)
Command Line Error: Wrong page range given: the first page (1) can not be after the last page (0).

Desktop (please complete the following information):

OS: Linux (Pop!_OS)

Additional context
Dockerfile

FROM python:3.9
ENV LANG en_US.UTF-8

WORKDIR /opt/build

RUN apt-get update && apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev pkg-config poppler-utils


ADD requirements.txt requirements.txt
RUN pip install -r requirements.txt
# Copy env variables
ADD .env .env

# trained models
ADD tessdata/ tessdata/
ENV TESSDATA_PREFIX /opt/build/tessdata/

ADD app/ app/
RUN mkdir input

ENTRYPOINT ["python"]
CMD ["app/main.py"]

Belval · 2022-05-24T15:57:41Z

Thank you for taking the time to fill the issue template, it's much easier to help.

Is this only with one or a few PDFs?

Also, can you run pdftoppm -r 200 -jpeg your_file.pdf out and see if that also gives you an error?
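A similar check can be scripted so it is easy to run over a batch of PDFs inside the container. The sketch below is only an illustration, not part of pdf2image: it assumes the pdfinfo binary from poppler-utils is on PATH, and the helper names parse_page_count and check_pdf are hypothetical. It runs pdfinfo directly, prints whatever poppler writes to stderr, and returns the page count if one was reported.

```python
import re
import subprocess
from typing import Optional


def parse_page_count(pdfinfo_output: str) -> Optional[int]:
    """Return the integer after 'Pages:' in pdfinfo output, or None if absent."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def check_pdf(path: str) -> Optional[int]:
    """Run pdfinfo on one file, print poppler's stderr, and return the page count."""
    # Assumption: the pdfinfo binary from poppler-utils is on PATH.
    result = subprocess.run(
        ["pdfinfo", path],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    if result.stderr:
        print(result.stderr.strip())
    return parse_page_count(result.stdout)
```

If check_pdf("your_file.pdf") returns None and prints the same syntax errors as above, the failure is likely happening in poppler itself rather than in the Python wrapper.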

camipozas · 2022-06-01T16:22:30Z

Hello, I analyzed the PDFs that gave me an error and they were all signed with DocuSign, although other DocuSign PDFs usually convert correctly. I don't know how to upgrade poppler-utils in Docker. I had read about this before: Pdf2Image library failing to read pdf signed using docusign

camipozas · 2022-06-03T23:00:57Z

Hello, I solved the problem. The solution is to start from an Ubuntu base image, then install Python (in my case), and then install everything else. It's the only way for now...
When I got inside the original container, I saw this version of poppler:

poppler-utils:
  Installed: 20.09.0-3.1
  Candidate: 20.09.0-3.1
  Version table:
 * 20.09.0-3.1 500
        500 http://deb.debian.org/debian bullseye/main amd64 Packages
        100 /var/lib/dpkg/status

And I know that I need 21.03.0 or later... so after applying the solution, the image has:

poppler-utils:
  Installed: 22.02.0-2
  Candidate: 22.02.0-2
  Version table:
 * 22.02.0-2 500
        500 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages
        100 /var/lib/dpkg/status

If anyone has a question please contact me, happy to help.
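For anyone who wants to check the poppler version from Python instead of apt, here is a minimal sketch. It assumes pdftoppm is on PATH and that `pdftoppm -v` prints a banner like "pdftoppm version 22.02.0" on stderr. The helper names and MIN_VERSION are hypothetical, and the 21.03.0 threshold is simply the number mentioned in this comment, not something verified against poppler's changelog.

```python
import re
import subprocess
from typing import Optional, Tuple


def parse_poppler_version(banner: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from a banner such as 'pdftoppm version 22.02.0'."""
    match = re.search(r"version\s+(\d+(?:\.\d+)*)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_poppler_version() -> Optional[Tuple[int, ...]]:
    """Ask the installed pdftoppm for its version banner and parse it."""
    # Assumption: pdftoppm is on PATH and writes the banner to stderr.
    result = subprocess.run(
        ["pdftoppm", "-v"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    return parse_poppler_version(result.stderr or result.stdout)


# Threshold taken from this thread (21.03.0), not from poppler's changelog.
MIN_VERSION = (21, 3, 0)
```

Comparing the returned tuple against MIN_VERSION separates the two images shown above: the bullseye package (20.09.0) falls below it and the jammy package (22.02.0) does not.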

faltunik · 2022-06-09T02:55:09Z

What I still don't understand is: what causes the miscount?

camipozas · 2022-06-09T14:00:30Z

@faltunik sorry, I don't know what causes the issue in detail. I only know the general cause and the solution.

Belval · 2022-06-27T02:18:28Z

This is a poppler issue unfortunately so there is not much that can be done on my side. I might add a check that raises a warning so that people are aware.

camipozas · 2022-06-27T02:23:53Z

If you want, I can add the documentation to your project. I can make a fork and then open a PR.

Belval · 2022-06-27T03:00:06Z

I appreciate the offer, but I am not sure what's the best way/place to document this yet.

It could be:

Add a warning here: https://github.com/Belval/pdf2image#limitations--known-issues
Add a warning here: https://github.com/Belval/pdf2image/blob/master/docs/installation.md#installing-poppler
Add a code warning here: https://github.com/Belval/pdf2image/blob/master/pdf2image/pdf2image.py#L123

For the code warning, it would use the warnings module (https://docs.python.org/3/library/warnings.html#warnings.warn):

warnings.warn(f"Detected poppler version {poppler_version_major}.{poppler_version_minor} is known to fail on some PDFs in rare cases")

Code warning is more intrusive and might be overkill depending on how common this issue is.
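To make the code-warning option concrete, a self-contained sketch of what such a check might look like follows. This is not the code in pdf2image: the function name warn_if_affected and the constant FIRST_GOOD_VERSION are hypothetical, and the 21.03 cutoff is an assumption based on the version reported earlier in this thread.

```python
import warnings

# Assumption: versions below 21.03 are affected, as reported earlier in this thread.
FIRST_GOOD_VERSION = (21, 3)


def warn_if_affected(poppler_version_major: int, poppler_version_minor: int) -> bool:
    """Emit a UserWarning for poppler versions below the assumed threshold."""
    if (poppler_version_major, poppler_version_minor) < FIRST_GOOD_VERSION:
        warnings.warn(
            f"Detected poppler version {poppler_version_major}.{poppler_version_minor} "
            "is known to fail on some PDFs in rare cases"
        )
        return True
    return False
```

Because this goes through the warnings module, users who find it too noisy could silence it with the standard warning filters, which softens the intrusiveness concern somewhat.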

puneetjindal · 2024-01-12T02:59:31Z

@camipozas How do you check whether a particular pdf is a scanned pdf or not?

Belval added the documentation label Jun 27, 2022

camipozas mentioned this issue Aug 24, 2022

documentation about solution for docusign issue #240

Merged
