Problem converting from Image coordinates to PDF coordinates #261

Pocoyo7798 · 2023-03-26T16:08:25Z

Hi!
I have code that extracts tables from PDF files. To identify the tables I'm using layoutparser, so I need to convert image coordinates into PDF coordinates. The PDF is converted into images using pdf2image, the layout model runs on each page image to get the coordinates and type of each block, the image size is obtained using Pillow, and the PDF page size is obtained using PyPDF2. With these, the conversion is done using the following equation for all 4 box coordinates (x1, y1, x2, y2), with the heights used instead of the widths for the y coordinates:
x1 = image_box_x_1 * pdf_width / image_width
The code is the following:

from pdf2image import convert_from_path


def find_blocks_layoutparser(file_path: str, pdf, model):
    # Render every PDF page to a PIL image
    page_list = convert_from_path(file_path)
    block_boxes = []
    extracted_blocks = {}
    for page_index, page in enumerate(page_list):
        page.save(f'page{page_index}.jpg')
        # Detect all blocks in the page
        layout = model.detect(page)
        boxes = []
        # Image size in pixels
        width, height = page.size
        # Page size in PDF units, taken from the PyPDF2 media box
        pdf_page = pdf.pages[page_index]
        pdf_size = pdf_page.mediabox
        pdf_width = pdf_size[2] - pdf_size[0]
        pdf_height = pdf_size[3] - pdf_size[1]
        for entry in layout:
            # Retrieve the bounding box and scale it to PDF units
            x1 = entry.block.x_1 / width * float(pdf_width)
            x2 = entry.block.x_2 / width * float(pdf_width)
            y1 = entry.block.y_1 / height * float(pdf_height)
            y2 = entry.block.y_2 / height * float(pdf_height)
            boxes.append([x1, y1, x2, y2])
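
The scaling step can also be isolated into a small standalone function, which makes it easier to check against known numbers without running the model. This is only a minimal sketch: `image_box_to_pdf` and its parameters are hypothetical names, and both the added media box origin offset and the optional vertical flip (`flip_y`) are assumptions about what a consumer of PDF coordinates might expect, not something confirmed to matter for this PDF. With a media box that starts at (0, 0) and `flip_y=False` it gives the same result as the equation above.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]


def image_box_to_pdf(
    box: Sequence[float],
    image_size: Tuple[float, float],
    mediabox: Sequence[float],
    flip_y: bool = False,
) -> Box:
    """Scale an image-space box (x1, y1, x2, y2) into PDF units.

    mediabox is (left, bottom, right, top) in PDF units.
    flip_y is an assumption: set it when the consumer expects a
    bottom-left origin instead of the image's top-left origin.
    """
    img_w, img_h = image_size
    left, bottom, right, top = (float(v) for v in mediabox)

    # PDF units per pixel along each axis
    sx = (right - left) / img_w
    sy = (top - bottom) / img_h

    x1, y1, x2, y2 = box
    px1 = left + x1 * sx
    px2 = left + x2 * sx
    if flip_y:
        # Measure from the top edge so that y grows upwards
        py1 = top - y2 * sy
        py2 = top - y1 * sy
    else:
        py1 = bottom + y1 * sy
        py2 = bottom + y2 * sy
    return (px1, py1, px2, py2)


if __name__ == "__main__":
    # A 1700 x 2200 pixel image of a 612 x 792 point page
    print(image_box_to_pdf((170, 220, 850, 1100), (1700, 2200), (0, 0, 612, 792)))
```

For the example call, every coordinate is multiplied by 612 / 1700 = 0.36, so the box comes out at approximately (61.2, 79.2, 306.0, 396.0).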

The rectangles obtained are the following:

This is the first PDF where I have had this problem; every test before this one was fine. Since the block coordinates are correct for each page image (I verified this), I think the problem is with the conversion of the PDF to an image. Does anyone have any idea how to solve this problem?
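
One quick way to test that suspicion is to compare the aspect ratio of the rendered image with that of the PDF page box used for scaling. If the two differ, the image and the box do not describe the same rectangle (for example, because of page rotation or a different page box), and a plain width/height scaling cannot line up. The following is a minimal sketch; `aspect_ratios_match` and the 1% tolerance are hypothetical choices, not part of pdf2image or PyPDF2.

```python
def aspect_ratios_match(image_size, pdf_size, tolerance=0.01):
    """Return True if the image and the PDF page have the same shape.

    image_size is (width, height) in pixels and pdf_size is
    (width, height) in PDF units. tolerance is a relative difference;
    0.01 is an arbitrary value that allows for pixel rounding.
    """
    img_w, img_h = image_size
    pdf_w, pdf_h = (float(v) for v in pdf_size)
    img_ratio = img_w / img_h
    pdf_ratio = pdf_w / pdf_h
    return abs(img_ratio - pdf_ratio) <= tolerance * pdf_ratio


if __name__ == "__main__":
    # Portrait image against a portrait page: shapes agree
    print(aspect_ratios_match((1700, 2200), (612, 792)))
    # Landscape image against a portrait page: shapes disagree
    print(aspect_ratios_match((2200, 1700), (612, 792)))
```

In the function from the question, this could be called with `page.size` and `(pdf_width, pdf_height)` for each page to flag the pages where the simple scaling is not expected to work.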

Thanks in advance!
