BUG: ImageExtraction not extracting all the images in pdf #162

luojunhui1 · 2023-04-30T08:03:33Z

Describe the bug
ImageExtraction does not extract all the images in a PDF.

To Reproduce

In a 9-page PDF file, pages 6, 7, and 8 each contain one image (page numbers start at 0).
ImageExtraction only detected the image on page 7 and ignored the images on pages 6 and 8.

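import typing

# import paths assumed for borb 2.1.x; adjust them to your installed version
from borb.pdf import Document, PDF
from borb.toolkit import ImageExtraction, SimpleTextExtraction

# read the Document (file_path is the path to the input PDF)
doc: typing.Optional[Document] = None
text_l: SimpleTextExtraction = SimpleTextExtraction()
image_l: ImageExtraction = ImageExtraction()

with open(file_path, "rb") as in_file_handle:
    doc = PDF.loads(in_file_handle, [text_l, image_l])

# check whether we have read a Document
assert doc is not None

images = []

# list every XObject name found in the resources of each page
for page in range(0, 9):
    if "XObject" in doc.get_page(page)["Resources"]:
        for k, v in doc.get_page(page)["Resources"]["XObject"].items():
            print("%d\t%s" % (page, k))

# list the pages for which ImageExtraction actually returned images
for page, content in image_l.get_images().items():
    images += content
    print(f"image page: {page}")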

Expected behaviour
The ImageExtraction listener should return all the images.

Desktop:

OS: Windows 10
borb version: 2.1.10


jorisschellekens · 2023-04-30T19:53:49Z

Please attach the input PDF

luojunhui1 · 2023-05-01T07:20:54Z

@jorisschellekens I deleted some sensitive information from the original PDF, and the output is still not correct. The complete test code is below.

def test_pdf_with_borb(self):
    doc: typing.Optional[Document] = None
    text_l: SimpleTextExtraction = SimpleTextExtraction()
    image_l: ImageExtraction = ImageExtraction()

    file_path = PROJECT_DIR + "data/test/input_doc2.pdf"
    with open(file_path, "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [text_l, image_l])

    # check whether we have read a Document
    assert doc is not None

    images = []
    page_num = int(doc.get_document_info().get_number_of_pages())
    print(f"page num: {page_num}")

    # list every XObject name found in the resources of each page
    for page in range(0, page_num):
        if "XObject" in doc.get_page(page)["Resources"]:
            for k, v in doc.get_page(page)["Resources"]["XObject"].items():
                print("%d\t%s" % (page, k))

    # list the pages for which ImageExtraction actually returned images
    for page, content in image_l.get_images().items():
        images += content
        print(f"image page: {page}")

The test output is in the attached screenshot, and the input file is input_doc2.pdf.

jorisschellekens · 2023-05-01T08:36:05Z

I checked the images in your PDF.
It turns out borb does not support them yet, which is why they are not extracted.
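For anyone who wants to see how the images in their own PDF are stored before going further, here is a minimal sketch that lists the /Subtype and /Filter of every entry in a page's XObject dictionary. The helper name summarize_xobjects and the sample data are made up for illustration, and the sketch assumes each entry behaves like a dict with those keys.

```python
from typing import Any, List, Mapping, Tuple


def summarize_xobjects(
    xobjects: Mapping[Any, Mapping[Any, Any]]
) -> List[Tuple[str, str, str]]:
    """Return (name, subtype, filter) for every entry of an XObject dictionary."""
    rows = []
    for name, stream in xobjects.items():
        subtype = str(stream.get("Subtype", "?"))
        # /Filter may be absent, a single name, or an array of names
        flt = stream.get("Filter", "none")
        if isinstance(flt, (list, tuple)):
            flt = "+".join(str(f) for f in flt)
        rows.append((str(name), subtype, str(flt)))
    return rows


# Stand-in data (hypothetical), shaped like one page's XObject dictionary
sample = {
    "Im0": {"Subtype": "Image", "Filter": "DCTDecode"},
    "Im1": {"Subtype": "Image", "Filter": ["FlateDecode"]},
    "Fm0": {"Subtype": "Form"},
}

for row in summarize_xobjects(sample):
    print("%s\t%s\t%s" % row)
```

With borb you would presumably pass doc.get_page(page)["Resources"]["XObject"] instead of the sample dict, as in the loop from the report. Entries whose subtype is Form are not images at all, and the filter column shows which encoding an image transformer would have to understand.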

luojunhui1 · 2023-05-01T15:28:02Z

what can i do to extract these images correctly? could you give me any advice, thanks a lot

jorisschellekens · 2023-05-02T16:49:39Z

You would have to implement your own version of an ImageTransformer (see the io.read package).

Essentially you need to decide:

- when this transformer needs to be triggered
- what this transformer needs to do to convert the raw bytes to a PIL Image
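The two points above can be sketched independently of borb. This is only an illustration under stated assumptions: the image is a Flate-compressed stream with 8 bits per component and a simple device colour space, there are no /DecodeParms predictors, and the function names (should_handle, decode_samples, to_pil_image) are hypothetical, not borb's API.

```python
import zlib
from typing import Any, Mapping

# Assumed mapping for this sketch: PDF colour space -> (PIL mode, bytes per pixel)
_MODES = {"DeviceGray": ("L", 1), "DeviceRGB": ("RGB", 3), "DeviceCMYK": ("CMYK", 4)}


def should_handle(stream: Mapping[str, Any]) -> bool:
    """Point 1: trigger only for image streams this sketch knows how to decode."""
    return (
        str(stream.get("Subtype")) == "Image"
        and str(stream.get("Filter")) == "FlateDecode"
        and str(stream.get("ColorSpace")) in _MODES
        and stream.get("BitsPerComponent") == 8
    )


def decode_samples(stream: Mapping[str, Any], raw: bytes) -> bytes:
    """Point 2, first half: inflate the raw bytes and check the sample count."""
    _, bytes_per_pixel = _MODES[str(stream["ColorSpace"])]
    samples = zlib.decompress(raw)
    expected = int(stream["Width"]) * int(stream["Height"]) * bytes_per_pixel
    if len(samples) != expected:
        raise ValueError("expected %d bytes, got %d" % (expected, len(samples)))
    return samples


def to_pil_image(stream: Mapping[str, Any], raw: bytes):
    """Point 2, second half: wrap the decoded samples in a PIL Image."""
    # Pillow is imported here so the trigger check can run without it
    from PIL import Image

    mode, _ = _MODES[str(stream["ColorSpace"])]
    size = (int(stream["Width"]), int(stream["Height"]))
    return Image.frombytes(mode, size, decode_samples(stream, raw))
```

A real transformer would also have to cover whatever filter and colour space the unsupported images actually use, which this sketch does not try to guess.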

hdoer · 2023-08-23T09:23:36Z

I also encountered this problem. My PDF contains some images in PNG format, and I found they could not be extracted. I worked around it with the following steps:

1. Write a PngImageTransformer.
2. Write a new loads function modelled on PDF.loads().
3. In that function, insert a PngImageTransformer instance into the ReadAnyObjectTransformer: readAnyObjectTransformer.get_children().insert(0, PngImageTransformer())
4. Get the images with the get_images function.

I should say that I am still learning the code, so this may not be the best solution.
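To show why the new transformer is inserted at index 0 rather than appended, here is a toy model of a parent transformer that hands each object to the first child that accepts it. All classes below are simplified stand-ins written for this example, not borb's actual classes, and the PNG-signature check is just a toy trigger condition.

```python
from typing import Any, List, Tuple


class Transformer:
    """Hypothetical stand-in for a read transformer."""

    def can_be_transformed(self, obj: Any) -> bool:
        raise NotImplementedError

    def transform(self, obj: Any) -> Tuple[str, Any]:
        raise NotImplementedError


class FallbackTransformer(Transformer):
    """Accepts everything, like a generic handler that is already registered."""

    def can_be_transformed(self, obj: Any) -> bool:
        return True

    def transform(self, obj: Any) -> Tuple[str, Any]:
        return ("fallback", obj)


class PngImageTransformer(Transformer):
    """Toy trigger: accept byte strings that start with the PNG signature."""

    PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

    def can_be_transformed(self, obj: Any) -> bool:
        return isinstance(obj, bytes) and obj.startswith(self.PNG_SIGNATURE)

    def transform(self, obj: Any) -> Tuple[str, Any]:
        return ("png", len(obj))


class AnyObjectTransformer(Transformer):
    """Delegates to the first child whose can_be_transformed() returns True."""

    def __init__(self, children: List[Transformer]) -> None:
        self._children = list(children)

    def get_children(self) -> List[Transformer]:
        return self._children

    def can_be_transformed(self, obj: Any) -> bool:
        return any(c.can_be_transformed(obj) for c in self._children)

    def transform(self, obj: Any) -> Tuple[str, Any]:
        for child in self._children:
            if child.can_be_transformed(obj):
                return child.transform(obj)
        raise ValueError("no transformer accepts this object")


root = AnyObjectTransformer([FallbackTransformer()])
data = PngImageTransformer.PNG_SIGNATURE + b"...image data..."

root.get_children().append(PngImageTransformer())
print(root.transform(data)[0])  # the generic handler still wins

root.get_children().insert(0, PngImageTransformer())
print(root.transform(data)[0])  # the new transformer is now tried first
```

If the real ReadAnyObjectTransformer dispatches in a similar first-match way, appending the new transformer would leave it shadowed by any earlier child that also accepts the object, which would explain the insert(0, ...) in step 3.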

luojunhui1 changed the title ~~BUG~~ BUG: ImageExtraction not extracting all the images in pdf Apr 30, 2023
