Page number duplicated in multi-page PDFs #247

Open
kym6464 opened this issue Dec 4, 2022 · 1 comment

kym6464 commented Dec 4, 2022

Describe the bug

Given a multi-page PDF, the page number is encoded twice in the output file name: once by pdf2image and again by pdftoppm/pdftocairo.

To Reproduce
Steps to reproduce the behavior:

(1) Download multipage.pdf

(2) Run this code from the same directory as multipage.pdf:

import pathlib
from pdf2image import convert_from_path

pdf_file = pathlib.Path(r"./multipage.pdf")
convert_from_path(pdf_file, output_folder=".", output_file=pdf_file.stem, fmt='jpeg')

(3) The previous step should produce 10 JPG files. Notice that each filename follows the format {PPM-root}{PPPP}-{number}.jpg

Expected behavior

Filenames should encode the page number only once, which the pdfto* tools already do: {PPM-root}-{number}.jpg

Screenshots

File tree showing outputs for pdf2image, pdftoppm, and pdftocairo:

│   driver.py
│   multipage.pdf
│
├───output_pdf2image
│       multipage0001-01.jpg
│       multipage0001-02.jpg
│       multipage0001-03.jpg
│       multipage0001-04.jpg
│       multipage0001-05.jpg
│       multipage0001-06.jpg
│       multipage0001-07.jpg
│       multipage0001-08.jpg
│       multipage0001-09.jpg
│       multipage0001-10.jpg
│
├───output_pdftocairo
│       multipage-01.jpg
│       multipage-02.jpg
│       multipage-03.jpg
│       multipage-04.jpg
│       multipage-05.jpg
│       multipage-06.jpg
│       multipage-07.jpg
│       multipage-08.jpg
│       multipage-09.jpg
│       multipage-10.jpg
│
└───output_pdftoppm
        multipage-01.jpg
        multipage-02.jpg
        multipage-03.jpg
        multipage-04.jpg
        multipage-05.jpg
        multipage-06.jpg
        multipage-07.jpg
        multipage-08.jpg
        multipage-09.jpg
        multipage-10.jpg
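
For files that have already been written, the extra four-digit block can be stripped after the fact. The snippet below is only a sketch: strip_counter is a hypothetical helper, not part of pdf2image, and it assumes the extra block is exactly four digits sitting between the root and the hyphen, as in the output_pdf2image listing above.

```python
import re


def strip_counter(filename, root, padding=4):
    """Drop the zero-padded counter between `root` and the '-NN' page suffix.

    Hypothetical helper for illustration; names that do not match the
    '{root}{counter}-{page}.{ext}' shape are returned unchanged.
    """
    pattern = re.compile(rf"^{re.escape(root)}\d{{{padding}}}(-\d+\.\w+)$")
    match = pattern.match(filename)
    if match is None:
        return filename
    return f"{root}{match.group(1)}"


produced = [f"multipage0001-{page:02d}.jpg" for page in range(1, 11)]
cleaned = [strip_counter(name, "multipage") for name in produced]
```

Applying the result on disk would be a matter of calling pathlib.Path.rename for each pair of old and new names. Note that a root which itself ends in digits would make the pattern ambiguous, so this is best treated as a stopgap.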

Desktop (please complete the following information):

  • OS: Windows
  • Version: 1.16.0

Workaround

I think the issue is with counter_generator. If we pass a generator for output_file, then counter_generator is never called and we can produce the expected outputs:

import pathlib
from pdf2image import convert_from_path

pdf_file = pathlib.Path(r"./multipage.pdf")

def constant_generator():
    while True:
        yield pdf_file.stem

convert_from_path(pdf_file, output_folder=".", output_file=constant_generator(), fmt='jpeg')
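
The same workaround can be written so that it does not close over a global. This is a sketch only: constant_generator taking a name argument is my own variation, and the type check at the end merely illustrates what "passing a generator" means in Python. How pdf2image actually inspects output_file is not verified here.

```python
import itertools
import types


def constant_generator(name):
    """Yield the same file-name root forever (parameterised variant)."""
    while True:
        yield name


names = constant_generator("multipage")
first_three = [next(names) for _ in range(3)]

# A generator function call produces a real generator object ...
is_generator = isinstance(names, types.GeneratorType)
# ... whereas itertools.repeat yields the same values but is a different type,
# so it may not be treated the same way by code that checks for generators.
repeat_is_generator = isinstance(itertools.repeat("multipage"), types.GeneratorType)
```

With this variant the call would become convert_from_path(pdf_file, output_folder=".", output_file=constant_generator(pdf_file.stem), fmt='jpeg').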

jerryrelmore commented

I saw this behavior on a project yesterday - like you, I wasn't expecting that output in the file names. I checked generators.py to look at the counter_generator function. If you look more closely at the output file names, it's not duplicating page numbers - rather, it's appending the number of the thread that handles the page conversion.

A simple fix is to change this in generators.py:

@threadsafe
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Returns a joined prefix, iteration number, and suffix"""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)

to:

@threadsafe
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Returns a joined prefix, iteration number, and suffix"""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(suffix)

Looks like there's a PR out waiting on merge to do just that and a bit more.
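
To make the difference concrete, here is a small simulation of how the two versions end up in the final file names. It is a sketch under stated assumptions: the generators are copied from the snippets above without the @threadsafe decorator, patched_counter_generator is just a label for the second version, and simulated_outputs is a hypothetical stand-in for the "-NN.jpg" suffix that pdftoppm/pdftocairo add, not a call into either tool.

```python
from itertools import islice


def counter_generator(prefix="", suffix="", padding_goal=4):
    """Copy of the current generator quoted above, minus @threadsafe."""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)


def patched_counter_generator(prefix="", suffix="", padding_goal=4):
    """Copy of the proposed change: the counter is no longer emitted."""
    while True:
        yield str(prefix) + str(suffix)


def simulated_outputs(root, pages):
    """Hypothetical stand-in for the page suffix added by the pdfto* tools."""
    width = len(str(pages))
    return [f"{root}-{str(page).zfill(width)}.jpg" for page in range(1, pages + 1)]


# Each next() call hands out one root; the counter advances per call, not per page.
roots = list(islice(counter_generator("multipage"), 3))

original_names = simulated_outputs(next(counter_generator("multipage")), 10)
patched_names = simulated_outputs(next(patched_counter_generator("multipage")), 10)
```

In this simulation a single root is drawn for the whole 10-page run, which is why every file in the report carries 0001 while the page suffix runs from 01 to 10.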
