Different number of cells and bbox #4

Open
valedica-core opened this issue Apr 17, 2024 · 4 comments

valedica-core commented Apr 17, 2024

Hi, thanks for sharing your work. I'm facing a problem that the library doesn't seem to handle, even though it is somewhat expected: what if the table structure extraction model outputs a different number of non-empty HTML cells than the number of bboxes produced by the table cell bbox detection model?

[reproducible example removed, since private data, thanks @ridhoalattas for yours!]

Did you encounter this problem before? Did you have some smart way to solve this problem?
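A minimal sketch for detecting the mismatch described above, before trying to merge the two outputs. It assumes the structure model's output is available as a list of HTML tokens in which non-empty cells carry a placeholder such as `[]`; the function name `cell_count_gap`, the placeholder convention, and the `(x0, y0, x1, y1)` bbox format are assumptions for illustration, not the library's confirmed API.

```python
def cell_count_gap(html_tokens, bboxes, placeholder="[]"):
    """Compare the number of non-empty HTML cells with the number of bboxes.

    html_tokens: list of str from the structure branch (assumed format).
    bboxes: list of (x0, y0, x1, y1) tuples from the bbox branch (assumed format).
    Returns (n_cells, n_bboxes, n_cells - n_bboxes).
    """
    # A token counts as a non-empty cell if it contains the placeholder.
    n_cells = sum(1 for tok in html_tokens if placeholder in tok)
    return n_cells, len(bboxes), n_cells - len(bboxes)


tokens = ["<tr>", "<td>[]</td>", "<td></td>", "<td>[]</td>", "</tr>"]
boxes = [(0, 0, 10, 10)]
print(cell_count_gap(tokens, boxes))  # (2, 1, 1)
```

A non-zero third value signals that the two branches disagree and that a naive one-to-one fill of cells with bboxes would shift content.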


ridhoalattas commented Apr 19, 2024

Did you find a solution, @valedica-core?

I have exactly the same case:
image

but I got a mismatched result:
image

I'm already slicing the image every 200 pixels along the height to avoid GPU memory problems. The result is indeed better, but I still get those mismatched results.

Please help, @ShengYun-Peng.

@ShengYun-Peng
Contributor

Thank you both for the question! I believe the above example is out-of-distribution (OOD) relative to our training set, as the spanning cell text is not aligned with any column in the header. A high-quality training dataset with abundant tables of various styles, colors, and designs would be ideal to resolve this.

How to perfectly align the cells predicted by the bbox branch and the structure branch is currently an open question. At present, UniTable divides table parsing into structure + bbox + cell content. I hope there will be a new way to divide and parse the table so that UniTable and the new method can double-check each other's outputs. Feel free to share your thoughts on this @valedica-core @ridhoalattas !

@valedica-core
Author

Thanks @ShengYun-Peng for the reply and again for this library.

I saw that some other authors (Nam Tuan Ly et al., 2023) have proposed multi-task models to align the subtasks a bit more, but I'd say the beauty of this library's approach is its simplicity. I wonder, then, why not employ a single decoder model that outputs both HTML tags and bboxes at the same time? Is it feasible?

As for how I tried to fix the current issue, I first framed it as a minimization problem: find the best assignment of bboxes to HTML cells (possibly skipping cells, or splitting a cell to fit more than one box) that minimizes the number of misplaced bboxes, where "misplaced" means that left-right or above-below relationships between bboxes are broken once you assign them to cells. However, with big tables and a lot of empty cells, the problem seems intractable.
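To make the objective concrete, here is a minimal sketch of the cost being minimized, under stated assumptions: bboxes are `(x0, y0, x1, y1)` tuples, each bbox is assigned to one `(row, col)` cell, and spatial order is judged by bbox centers. The name `count_broken_relations` and the center-based comparison are illustrative choices, not the implementation used in this thread. It only scores a given assignment; searching over assignments (with skipped or split cells) is the part that becomes expensive.

```python
from itertools import combinations


def center(bbox):
    """Center point of an (x0, y0, x1, y1) bbox."""
    x0, y0, x1, y1 = bbox
    return (x0 + x1) / 2, (y0 + y1) / 2


def count_broken_relations(bboxes, cells):
    """Count pairwise order violations for an assignment.

    cells[i] is the (row, col) assigned to bboxes[i].
    A violation is a pair whose column order contradicts its left-right
    order, or whose row order contradicts its above-below order.
    """
    broken = 0
    for i, j in combinations(range(len(bboxes)), 2):
        (xi, yi), (xj, yj) = center(bboxes[i]), center(bboxes[j])
        (ri, ci), (rj, cj) = cells[i], cells[j]
        # Opposite signs mean the grid order contradicts the spatial order.
        if (ci - cj) * (xi - xj) < 0:
            broken += 1
        if (ri - rj) * (yi - yj) < 0:
            broken += 1
    return broken


boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (0, 20, 10, 30)]
print(count_broken_relations(boxes, [(0, 0), (0, 1), (1, 0)]))  # 0
print(count_broken_relations(boxes, [(0, 1), (0, 0), (1, 0)]))  # 1
```

In the second call the first two boxes are swapped within the row, which breaks one left-right relationship.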

The simpler, more drastic approach I ended up with is to use only the bboxes, drawing the axes of the grid at local minima of the number of intersections with bboxes. It has some problems with spanning cells or columns, but I'm not too concerned about that, since my goal is a lossy markdown representation as output.
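One possible reading of this bbox-only approach, as a rough sketch rather than the actual implementation (which was not shared): for every pixel position along an axis, count how many bboxes a line at that position would cross, then place a grid line in the middle of each interior plateau that is lower than its neighbours. All function names, the integer pixel coordinates, and the `(x0, y0, x1, y1)` bbox format are assumptions.

```python
from bisect import bisect_right


def crossing_profile(intervals, length):
    """profile[p] = number of (lo, hi) intervals with lo <= p < hi."""
    diff = [0] * (length + 1)
    for lo, hi in intervals:
        lo, hi = max(0, int(lo)), min(length, int(hi))
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    profile, running = [], 0
    for d in diff[:length]:
        running += d
        profile.append(running)
    return profile


def local_minima_cuts(profile):
    """Midpoint of each interior plateau lower than both of its neighbours."""
    cuts, i, n = [], 0, len(profile)
    while i < n:
        j = i
        while j + 1 < n and profile[j + 1] == profile[i]:
            j += 1
        left_higher = i > 0 and profile[i - 1] > profile[i]
        right_higher = j + 1 < n and profile[j + 1] > profile[i]
        if left_higher and right_higher:
            cuts.append((i + j) // 2)
        i = j + 1
    return cuts


def grid_axes(bboxes, width, height):
    """Vertical cut positions (xs) and horizontal cut positions (ys)."""
    xs = local_minima_cuts(crossing_profile([(b[0], b[2]) for b in bboxes], width))
    ys = local_minima_cuts(crossing_profile([(b[1], b[3]) for b in bboxes], height))
    return xs, ys


def cell_index(bbox, xs, ys):
    """(row, col) of the grid cell containing the bbox center."""
    cx, cy = (bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2
    return bisect_right(ys, cy), bisect_right(xs, cx)


boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (0, 20, 10, 30), (20, 20, 30, 30)]
xs, ys = grid_axes(boxes, width=30, height=30)
print(xs, ys)                        # [14] [14]
print(cell_index(boxes[1], xs, ys))  # (0, 1)
```

Because a spanning cell raises the crossing count across the columns or rows it covers, cuts can be missed or misplaced there, which matches the limitation noted above.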

@ShengYun-Peng
Contributor

Thanks @valedica-core ! It seems like table cell alignment is a good research direction. We tried GPT-4o for spanning cells, and UniTable still did better on table structure understanding. https://x.com/RealAnthonyPeng/status/1790431978829087123
