Table picker for PDF #2

Open
sambitdash opened this issue Jul 12, 2017 · 4 comments


@sambitdash
Owner

Natural tabular objects in a PDF document should ideally be picked up for extraction.

The intent of the project is API development, hence it will be headless for the most part. Unlike a reader, there may not be a WYSIWYG picker available. A heuristic table picker should scan the document for table-like structures and dump them in tabular HTML/CSS format or as extracted image objects. If document tagging is enabled, the table picker can use the tagged text.
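As an illustration of the "dump in tabular HTML" step, here is a minimal sketch. It is written in Python purely for illustration and is not PDFIO code; the function name `table_to_html` and the assumption that a detected table arrives as a list of rows of cell strings are hypothetical.

```python
from html import escape


def table_to_html(rows, header=True):
    """Render a list of rows (each a list of cell values) as an HTML table.

    Hypothetical helper: assumes table detection has already produced
    a rectangular grid of cell texts.
    """
    lines = ["<table>"]
    for i, row in enumerate(rows):
        # Treat the first row as the header row when requested.
        tag = "th" if header and i == 0 else "td"
        # Escape cell text so characters like & and < stay valid HTML.
        cells = "".join(f"<{tag}>{escape(str(c))}</{tag}>" for c in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(table_to_html([["Item", "Qty"], ["Apple", "3"]]))
```

Styling (the CSS part) could then be attached through a class on the `<table>` element; that is left out here to keep the sketch short.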

@hhaensel

hhaensel commented May 9, 2022

I have written some lines of code to extract tabular data. Currently it uses keywords to determine which text layouts to include. I also managed to make a short IJulia notebook where you can interactively select text in a Plotly chart.
@sambitdash Would you be interested in including that code in your package?
Otherwise I might release my own package but I feel that this functionality would nicely fit into PDFIO.

@sambitdash
Owner Author

@hhaensel thank you for your interest. I want to understand how complex the cases are that this software can handle. If you submit a PR, I can review it and let you know whether it is useful for this SDK.

@hhaensel

hhaensel commented May 9, 2022

Sounds perfect, I'll submit a PR tomorrow.
The code extracts a vector of TextLayouts as a function of page(s) and keywords, then scans for common elements in rows and columns as a function of their layout box. The layout boxes can be scaled in order to reduce the probability of overlapping areas. Optionally a Plotly graph displays the elements and their recognised arrangement with a color code.
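A rough sketch of the row/column scan described above, in Python for illustration only. It is not the code from the planned PR and does not use PDFIO's API: the `Box` type, the function names, and the default `factor=0.8` are assumptions, and coordinates are assumed to follow PDF user space (y grows upward). Boxes are shrunk around their centers, then grouped into rows and columns wherever their intervals still overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a text layout: text plus bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def scale(box, factor):
    """Scale a box around its center; factor < 1 shrinks it."""
    cx, cy = (box.x0 + box.x1) / 2, (box.y0 + box.y1) / 2
    hw = (box.x1 - box.x0) / 2 * factor
    hh = (box.y1 - box.y0) / 2 * factor
    return Box(box.text, cx - hw, cy - hh, cx + hw, cy + hh)


def group_by_axis(boxes, lo, hi):
    """Cluster boxes whose [lo, hi] intervals overlap along one axis."""
    groups = []  # each entry: [group_lo, group_hi, members]
    for box in sorted(boxes, key=lo):
        if groups and lo(box) < groups[-1][1]:
            groups[-1][1] = max(groups[-1][1], hi(box))
            groups[-1][2].append(box)
        else:
            groups.append([lo(box), hi(box), [box]])
    return [members for _, _, members in groups]


def arrange(boxes, factor=0.8):
    """Return a grid (list of rows) of cell texts, top row first."""
    shrunk = [scale(b, factor) for b in boxes]
    # Rows come out bottom-to-top (ascending y), so reverse them.
    rows = group_by_axis(shrunk, lambda b: b.y0, lambda b: b.y1)[::-1]
    cols = group_by_axis(shrunk, lambda b: b.x0, lambda b: b.x1)
    col_of = {id(b): j for j, col in enumerate(cols) for b in col}
    grid = []
    for row in rows:
        cells = [""] * len(cols)
        for b in sorted(row, key=lambda b: b.x0):
            j = col_of[id(b)]
            # Join texts if several boxes land in the same cell.
            cells[j] = (cells[j] + " " + b.text).strip()
        grid.append(cells)
    return grid


if __name__ == "__main__":
    boxes = [
        Box("Name", 10, 94, 40, 110), Box("Qty", 60, 94, 80, 110),
        Box("Apple", 10, 85, 45, 95), Box("3", 60, 85, 66, 95),
    ]
    for row in arrange(boxes):
        print(row)
```

In the sample data the header boxes and the data boxes overlap vertically by one unit, so with `factor=1.0` they would merge into a single row; shrinking the boxes separates them, which is the effect the scaling is meant to have.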

Looking forward to your feedback.

@hhaensel

Sorry, currently in overload, will take some more time ...
