BUG: ImageExtraction not extracting all the images in pdf #162

Open
luojunhui1 opened this issue Apr 30, 2023 · 6 comments

@luojunhui1

Describe the bug
ImageExtraction does not extract all the images in a PDF.

To Reproduce

  1. Take a PDF file with 9 pages that has one image on each of pages 6, 7 and 8 (page numbers start at 0).
  2. ImageExtraction only detects the image on page 7 and ignores the images on pages 6 and 8.
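    import typing

    # import paths as used with borb 2.1.x (assumed); adjust if your version differs
    from borb.pdf import Document, PDF
    from borb.toolkit import ImageExtraction, SimpleTextExtraction

    # read the Document (file_path is the path to the input PDF)
    doc: typing.Optional[Document] = None
    text_l: SimpleTextExtraction = SimpleTextExtraction()
    image_l: ImageExtraction = ImageExtraction()

    with open(file_path, "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [text_l, image_l])

    # check whether we have read a Document
    assert doc is not None

    images = []

    # list the XObject resources declared on each page
    for page in range(0, 9):
        resources = doc.get_page(page)["Resources"]
        if "XObject" in resources:
            for k, v in resources["XObject"].items():
                print("%d\t%s" % (page, k))

    # list the pages for which the listener actually returned images
    for page, content in image_l.get_images().items():
        images += content
        print(f"image page: {page}")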

Expected behaviour
The ImageExtraction listener should return all the images.
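
The mismatch between the two loops above can be made explicit with a small helper. This is only a sketch: `pages_missing_images` is a hypothetical function, not part of borb, and the XObject name "Im0" is a made-up placeholder. It also assumes every XObject on a page is an image, which is not guaranteed (an XObject can be a form as well).

```python
from typing import Dict, Iterable, List


def pages_missing_images(
    xobject_names_by_page: Dict[int, List[str]],
    extracted_pages: Iterable[int],
) -> List[int]:
    """Return pages that declare XObjects but produced no extracted image."""
    extracted = set(extracted_pages)
    return sorted(
        page
        for page, names in xobject_names_by_page.items()
        if names and page not in extracted
    )


# Values mirroring the report: XObjects on pages 6, 7 and 8,
# but the listener only returned images for page 7.
xobjects = {6: ["Im0"], 7: ["Im0"], 8: ["Im0"]}
print(pages_missing_images(xobjects, [7]))  # [6, 8]
```

In the reproduction code, the first argument would be filled from the "XObject" loop and the second from the keys of image_l.get_images().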

Screenshots
image

Desktop (please complete the following information):

  • OS: Windows10
  • borb version 2.1.10

@luojunhui1 changed the title from "BUG" to "BUG: ImageExtraction not extracting all the images in pdf" on Apr 30, 2023
@jorisschellekens
Owner

Please attach the input PDF

@luojunhui1
Author

@jorisschellekens I deleted some sensitive information from the original PDF, and the output is still incorrect. The complete test code is below.

    def test_pdf_with_borb(self):
        doc: typing.Optional[Document] = None
        text_l: SimpleTextExtraction = SimpleTextExtraction()
        image_l: ImageExtraction = ImageExtraction()

        file_path = PROJECT_DIR + "data/test/input_doc2.pdf"
        with open(file_path, "rb") as in_file_handle:
            doc = PDF.loads(in_file_handle, [text_l, image_l])

        # check whether we have read a Document
        assert doc is not None

        images = []
        page_num = int(doc.get_document_info().get_number_of_pages())
        print(f"page num: {page_num}")

        # list the XObject resources declared on each page
        for page in range(0, page_num):
            resources = doc.get_page(page)["Resources"]
            if "XObject" in resources:
                for k, v in resources["XObject"].items():
                    print("%d\t%s" % (page, k))

        # list the pages for which the listener actually returned images
        for page, content in image_l.get_images().items():
            images += content
            print(f"image page: {page}")

The test output is shown in this screenshot:
image

input_doc2.pdf

@jorisschellekens
Owner

I checked the images in your PDF.
It turns out borb does not support them yet.
That's why they are not extracted.

@luojunhui1
Author

What can I do to extract these images correctly? Could you give me any advice? Thanks a lot.

@jorisschellekens
Owner

You would have to implement your own version of an ImageTransformer (see the io and read packages).

Essentially you need to:

  • identify when this transformer needs to be triggered
  • work out what this transformer needs to do to convert the raw bytes to a PIL Image
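
The two steps above can be sketched without touching borb itself. Everything here is an assumption for illustration: the plain dict stands in for a PDF image stream, the key "Bytes" for its compressed payload, and the function names are hypothetical. The thread does not say which image type borb rejected, so the sketch picks one simple case: an image XObject compressed with FlateDecode, 8 bits per component, no DecodeParms predictor.

```python
import zlib
from typing import Any, Dict, Tuple

# PDF colour space -> (PIL mode, components per pixel), 8 bits per component only
_MODES = {"DeviceGray": ("L", 1), "DeviceRGB": ("RGB", 3), "DeviceCMYK": ("CMYK", 4)}


def can_be_transformed(stream: Dict[str, Any]) -> bool:
    """Step 1: decide whether this object is an image the sketch can handle."""
    return (
        stream.get("Type", "XObject") == "XObject"  # Type is optional on images
        and stream.get("Subtype") == "Image"
        and stream.get("Filter") == "FlateDecode"
        and stream.get("ColorSpace") in _MODES
        and stream.get("BitsPerComponent") == 8
        and "DecodeParms" not in stream  # predictors are not handled here
    )


def decode_samples(stream: Dict[str, Any]) -> Tuple[str, Tuple[int, int], bytes]:
    """Step 2a: inflate the payload and work out the PIL mode and size."""
    mode, components = _MODES[stream["ColorSpace"]]
    size = (stream["Width"], stream["Height"])
    data = zlib.decompress(stream["Bytes"])
    if len(data) != size[0] * size[1] * components:
        raise ValueError("unexpected sample count for this image")
    return mode, size, data


def transform(stream: Dict[str, Any]):
    """Step 2b: build the PIL Image (requires Pillow, imported lazily)."""
    from PIL import Image

    mode, size, data = decode_samples(stream)
    return Image.frombytes(mode, size, data)
```

A real transformer would need to subclass borb's own transformer base class and match its method signatures, which this sketch does not attempt.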

@hdoer

hdoer commented Aug 23, 2023

I also ran into this problem. My PDF contains some pictures in PNG format, and borb could not extract them. I worked around it with the following steps:

  1. Write a PngImageTransformer.
  2. Write a new loads function modelled on PDF.loads().
  3. Add code to insert a PngImageTransformer instance into the ReadAnyObjectTransformer: readAnyObjectTransformer.get_children().insert(0, PngImageTransformer())
  4. Get the images using the get_images function.

I should say that I am still learning the code, so this may not be the best solution.
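
Why step 3 inserts at index 0 rather than appending can be illustrated with a toy dispatcher. The classes below are hypothetical stand-ins and do not use borb; the sketch assumes the root transformer hands each object to the first child that accepts it, which is what inserting at the front of get_children() suggests.

```python
from typing import Any, List, Optional


class Transformer:
    """Hypothetical stand-in for a transformer in a chain of responsibility."""

    def can_be_transformed(self, obj: Any) -> bool:
        raise NotImplementedError

    def transform(self, obj: Any) -> Any:
        raise NotImplementedError


class FallbackTransformer(Transformer):
    """Accepts every object, like a generic catch-all handler."""

    def can_be_transformed(self, obj: Any) -> bool:
        return True

    def transform(self, obj: Any) -> Any:
        return "fallback"


class PngImageTransformer(Transformer):
    """Accepts only image objects (the check is simplified for the sketch)."""

    def can_be_transformed(self, obj: Any) -> bool:
        return isinstance(obj, dict) and obj.get("Subtype") == "Image"

    def transform(self, obj: Any) -> Any:
        return "png"


class RootTransformer(Transformer):
    """Dispatches to the first child that accepts the object."""

    def __init__(self) -> None:
        self._children: List[Transformer] = []

    def get_children(self) -> List[Transformer]:
        return self._children

    def can_be_transformed(self, obj: Any) -> bool:
        return any(c.can_be_transformed(obj) for c in self._children)

    def transform(self, obj: Any) -> Optional[Any]:
        for child in self._children:
            if child.can_be_transformed(obj):
                return child.transform(obj)
        return None


root = RootTransformer()
root.get_children().append(FallbackTransformer())
root.get_children().insert(0, PngImageTransformer())  # must come before the catch-all
print(root.transform({"Subtype": "Image"}))  # png
```

Had the new transformer been appended instead, the catch-all child would have claimed the image first in this toy setup.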
