bug/some tables in PDF not getting recognized #2997

Closed
Ritesh1137 opened this issue May 9, 2024 · 2 comments
Labels
awaiting-response bug Something isn't working pdf

Comments

@Ritesh1137

Describe the bug
I am processing PDF files of insurance sales brochures to identify tables. With the high-resolution (hi_res) strategy and infer_table_structure set to True, most tables in the document are identified consistently, but two particular tables are not.

To Reproduce
Process the file at this link PDF File with tables - Insurance sales brochure

code to process docs:

! pip install langchain unstructured[all-docs] pydantic lxml langchainhub
! sudo apt-get install poppler-utils tesseract-ocr

from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf
# Get elements
raw_pdf_elements = partition_pdf(
    filename=path + "EndowmentPlan_JeevanLakshya.pdf",
    # Unstructured first finds embedded image blocks
    extract_images_in_pdf=False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # for v1, v2 = 3000, 1000
    max_characters=3500,
    new_after_n_chars=1500,
    combine_text_under_n_chars=250,
    image_output_dir_path=path,
)
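
To check which tables were actually recognized, the returned elements can be summarized by type. The following is a minimal sketch, not part of the original report: summarize_elements is a hypothetical helper, and it assumes that table elements have the class name Table or TableChunk and expose their HTML as metadata.text_as_html.

```python
from collections import Counter


def summarize_elements(elements):
    """Count elements by class name and collect HTML from table elements.

    Assumption: table elements are instances of classes named "Table" or
    "TableChunk" and carry their HTML in `metadata.text_as_html`.
    """
    counts = Counter(type(el).__name__ for el in elements)
    tables_html = []
    for el in elements:
        if type(el).__name__ in ("Table", "TableChunk"):
            # Fall back to None when no HTML was attached to the element.
            html = getattr(getattr(el, "metadata", None), "text_as_html", None)
            tables_html.append(html)
    return counts, tables_html
```

Calling summarize_elements(raw_pdf_elements) and comparing the number of table elements with the number of tables visible in the PDF shows how many were missed.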

Expected behavior
All tables in the document should be processed and returned as HTML. Currently this holds for every table except the two shown below.

Screenshots
These tables are not processed correctly: they are returned as text elements rather than table elements.

tables

Environment Info
Running on Google Colab, default free tier.

@Ritesh1137 Ritesh1137 added the bug Something isn't working label May 9, 2024
@scanny scanny added the pdf label May 9, 2024
@christinestraub
Contributor

I recommend using our API and specifying the model with hi_res_model_name="layout_v1.1.0". This model is not available in the open-source library.

from unstructured.partition.api import partition_via_api

elements = partition_via_api(
    filename=filename,
    api_key="<api_key>",  # placeholder: replace with your API key
    strategy="hi_res",
    hi_res_model_name="layout_v1.1.0",
    chunking_strategy="by_title",
    max_characters=3500,
    new_after_n_chars=1500,
    combine_text_under_n_chars=250,
)

If you want to stay with the open-source library, I suggest zeroing out the background color as a preprocessing step before passing the document to partition:

  • first, identify the background color
  • then, convert the pixels of that color to white
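
The two steps above can be sketched in plain Python over a flat list of RGB tuples. This is an illustration only, not the maintainers' implementation: dominant_color, zero_out_background, and the tolerance parameter are hypothetical names, and the sketch assumes the table background is the most frequent non-white color on the page.

```python
from collections import Counter

WHITE = (255, 255, 255)


def dominant_color(pixels, ignore=(WHITE,)):
    """Return the most frequent RGB color, skipping colors listed in `ignore`.

    Assumption: the shaded table background is the most common non-white color.
    """
    counts = Counter(p for p in pixels if p not in ignore)
    return counts.most_common(1)[0][0] if counts else None


def zero_out_background(pixels, background, tolerance=10):
    """Replace pixels within `tolerance` (per channel) of `background` with white."""

    def is_close(pixel):
        return all(abs(a - b) <= tolerance for a, b in zip(pixel, background))

    return [WHITE if is_close(p) else p for p in pixels]
```

With Pillow, the pixel list for an RGB page image could be read with list(img.getdata()) and written back with img.putdata(cleaned). Rendering each PDF page to an image first and partitioning the cleaned images afterwards is an assumed workflow rather than something confirmed in this thread.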

@MthwRobinson
Contributor

Per @christinestraub's suggestion, we recommend using the API for access to higher-performance table extraction models.


4 participants