bug/some tables in PDF not getting recognized #2997

Ritesh1137 · 2024-05-09T18:27:49Z

Describe the bug
I am processing PDF files of insurance sales brochures to identify tables. With the hi_res strategy and infer_table_structure set to True, most tables in the document are identified consistently, but two particular tables are not identified for some reason.

To Reproduce
Process the file at this link: PDF File with tables - Insurance sales brochure

Code to process the docs:

! pip install langchain unstructured[all-docs] pydantic lxml langchainhub
! sudo apt-get install poppler-utils tesseract-ocr

from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

# Directory that holds the PDF (placeholder value; it must end with a path separator)
path = "./"

# Get elements
raw_pdf_elements = partition_pdf(
    filename=path + "EndowmentPlan_JeevanLakshya.pdf",
    # High-resolution strategy, as stated in the description above
    strategy="hi_res",
    # Do not extract embedded image blocks
    extract_images_in_pdf=False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # for v1, v2 = 3000, 1000
    max_characters=3500,
    new_after_n_chars=1500,
    combine_text_under_n_chars=250,
    image_output_dir_path=path,
)

Expected behavior
Every table in the document should be recognized and available as HTML. In the processed output, all tables except the two shown below are processed and available as HTML.

Screenshots
These tables are not processed correctly: they come back as text elements rather than table elements.
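One way to confirm which chunks came back as tables is to tally the element categories and collect the HTML of the recognized tables. The following is a minimal sketch, not part of the original report: summarize_elements is a hypothetical helper, and it assumes each element exposes a category attribute and a metadata.text_as_html field. The stand-in objects only exist so the sketch runs without unstructured installed.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_elements(elements):
    """Count elements per category and collect the HTML of recognized tables."""
    # Assumption: each element has a `category` string such as "Table".
    counts = Counter(getattr(el, "category", type(el).__name__) for el in elements)
    tables_html = [
        el.metadata.text_as_html
        for el in elements
        if getattr(el, "category", None) == "Table"
        and getattr(el.metadata, "text_as_html", None)
    ]
    return counts, tables_html


# Stand-in objects (hypothetical data) that mimic the two attributes used above.
demo = [
    SimpleNamespace(
        category="CompositeElement",
        metadata=SimpleNamespace(text_as_html=None),
    ),
    SimpleNamespace(
        category="Table",
        metadata=SimpleNamespace(text_as_html="<table><tr><td>Age</td></tr></table>"),
    ),
]
counts, tables_html = summarize_elements(demo)
```

With the real output, the call would be summarize_elements(raw_pdf_elements). A table that was missed would then show up in the counts under a text category instead of "Table".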

Environment Info
Running on Google Colab, default free tier.


christinestraub · 2024-05-10T05:22:05Z

I recommend using our API and specifying the model with hi_res_model_name="layout_v1.1.0". This model is not supported in open source.

from unstructured.partition.api import partition_via_api

elements = partition_via_api(
    filename=filename,
    api_key="<api_key>",  # placeholder: replace with your own API key
    strategy="hi_res",
    hi_res_model_name="layout_v1.1.0",
    chunking_strategy="by_title",
    max_characters=3500,
    new_after_n_chars=1500,
    combine_text_under_n_chars=250,
)

If you want to stay with open source, I suggest trying to "zero out the background color" as a preprocessing step before passing the document into partition:

first identify the background color
then convert those pixels to a white background
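The two steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's own preprocessing: it treats the most frequent color in the image as the background, and the function names and the tolerance default are hypothetical.

```python
from collections import Counter

WHITE = (255, 255, 255)


def dominant_color(pixels):
    """Step 1: take the most frequent RGB tuple as the background color."""
    return Counter(pixels).most_common(1)[0][0]


def whiten_background(pixels, tolerance=30):
    """Step 2: replace every pixel close to the background color with white.

    `pixels` is a flat list of (r, g, b) tuples. `tolerance` is the largest
    per-channel difference still treated as background (an assumed default).
    """
    background = dominant_color(pixels)
    cleaned = []
    for pixel in pixels:
        is_background = all(
            abs(channel - bg) <= tolerance
            for channel, bg in zip(pixel, background)
        )
        cleaned.append(WHITE if is_background else pixel)
    return cleaned


# Tiny demo: a light-blue "page" with two dark text pixels and one
# slightly different shade of the background.
page = [(200, 220, 255)] * 6 + [(10, 10, 10)] * 2 + [(205, 222, 250)]
cleaned = whiten_background(page)
```

With Pillow, list(img.convert("RGB").getdata()) yields a pixel list of this shape and putdata() writes the cleaned list back into an RGB image. If the shaded table covers only part of the page, the dominant color may already be white, so cropping to the table region first may be needed.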

MthwRobinson · 2024-05-28T12:46:17Z

Per @christinestraub's suggestion, I recommend using the API for access to higher-performance table extraction models.

Ritesh1137 added the bug Something isn't working label May 9, 2024

scanny added the pdf label May 9, 2024

scanny added the awaiting-response label May 10, 2024

MthwRobinson closed this as completed May 28, 2024
