Describe the bug
I am processing PDF files of insurance sales brochures to identify tables. With the high-res (hi_res) strategy and infer_table_structure set to True, most tables in the document are identified consistently, but two particular tables are not identified.
To Reproduce
Process the file at this link: PDF File with tables - Insurance sales brochure
Code to process docs:
! pip install langchain "unstructured[all-docs]" pydantic lxml langchainhub
! sudo apt-get install poppler-utils tesseract-ocr
from typing import Any
from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

# Get elements
# Note: path is assumed to be the directory prefix for the PDF; its definition is not shown here
raw_pdf_elements = partition_pdf(
    filename=path + "EndowmentPlan_JeevanLakshya.pdf",
    # Unstructured first finds embedded image blocks
    extract_images_in_pdf=False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # for v1, v2 = 3000, 1000
    max_characters=3500,
    new_after_n_chars=1500,
    combine_text_under_n_chars=250,
    image_output_dir_path=path,
)
Expected behavior
All tables in the document should be detected and returned as table elements with HTML. In the processed output, all tables except the two shown below are processed and available as HTML.
Screenshots
These tables are not processed correctly: they are returned as text elements rather than table elements.
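To check which elements came back as tables and which as text, a quick summary along these lines may help. This is a minimal sketch: `summarize_elements` is a hypothetical helper, not part of unstructured, and it assumes that table elements have a class name starting with `Table` and expose their HTML through `metadata.text_as_html`.

```python
from collections import Counter


def summarize_elements(elements):
    """Count elements by class name and collect the HTML of table elements.

    Assumption: table elements are instances of a class whose name starts
    with "Table" and carry their HTML in metadata.text_as_html.
    """
    # Tally how many elements of each class the partitioner returned
    counts = Counter(type(el).__name__ for el in elements)

    tables_html = []
    for el in elements:
        if type(el).__name__.startswith("Table"):
            # text_as_html may be missing, so fall back to None
            metadata = getattr(el, "metadata", None)
            tables_html.append(getattr(metadata, "text_as_html", None))

    return counts, tables_html
```

Calling `counts, tables_html = summarize_elements(raw_pdf_elements)` and printing `counts` should show how many table elements were detected; the two missing tables would then be expected to appear inside the text (composite) elements instead.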
Environment Info
Running on Google Colab, default free tier.