
New figure/table segmentation approach and models #963

Draft · kermitt2 wants to merge 29 commits into master from new-figure-table-models
Conversation

@kermitt2 (Owner) commented Nov 7, 2022

Unstable and work in progress!

(follow-up of the fix-vector-graphics branch)

This is a working version of a revision of the cascade process in Grobid, which changes the overall approach for figures and tables and simplifies the fulltext model (see also #950):

  • as the very first step, bitmap and vector graphics are clustered and considered as possible parts of figure/table areas; this includes in particular a revised vector graphics analysis and clustering
  • from these candidate figure and table positions, two figure/table segmentation models (figure-segmenter) try to identify the valid complete figure and table zones, including captions, figure/table titles and notes, as well as sub-figures; this is done by extending up and down through the layout tokens from the candidate graphic positions (see the sketch after this list)
  • then the segmentation model is applied to the document as before, excluding the identified figure/table zones
  • the fulltext model now excludes the complete figure and table zones, which makes it simpler (fewer tokens and labels) and smaller (shorter input sequences)
  • the figure and table models are still applied to the identified full figure and table zones to identify captions, figure/table titles, figure/table reference labels and notes, as before
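
Below is a minimal sketch of the anchoring and extension logic from the first two steps, just to make the intended data flow concrete. All class and method names here are illustrative assumptions (not the actual Grobid API), and the greedy vertical extension is a crude stand-in for the trained figure-segmenter models:

```java
import java.util.ArrayList;
import java.util.List;

public class FigureTableZoneSketch {

    /** Minimal stand-in for a layout token with page coordinates. */
    static class LayoutToken {
        final String text;
        final double x, y, width, height;
        LayoutToken(String text, double x, double y, double width, double height) {
            this.text = text; this.x = x; this.y = y;
            this.width = width; this.height = height;
        }
    }

    /** Candidate figure/table zone anchored on a cluster of bitmap/vector graphics. */
    static class GraphicZone {
        double x, y, width, height;                  // bounding box in page coordinates
        final List<LayoutToken> tokens = new ArrayList<>();
    }

    /**
     * Extend each graphic cluster up and down through the layout tokens, so that
     * the zone absorbs nearby captions, titles and notes.
     */
    static List<GraphicZone> extendZones(List<LayoutToken> tokens, List<GraphicZone> clusters) {
        for (GraphicZone zone : clusters) {
            for (LayoutToken t : tokens) {
                boolean overlapsX = t.x < zone.x + zone.width && t.x + t.width > zone.x;
                boolean adjacentY = Math.abs(t.y - (zone.y + zone.height)) < 20.0
                                 || Math.abs(zone.y - (t.y + t.height)) < 20.0;
                if (overlapsX && adjacentY) {
                    zone.tokens.add(t);
                    double top = Math.min(zone.y, t.y);
                    double bottom = Math.max(zone.y + zone.height, t.y + t.height);
                    zone.y = top;
                    zone.height = bottom - top;
                }
            }
        }
        return clusters;
    }

    /** The downstream segmentation/fulltext models only see the remaining tokens. */
    static List<LayoutToken> tokensOutsideZones(List<LayoutToken> tokens, List<GraphicZone> zones) {
        List<LayoutToken> remaining = new ArrayList<>(tokens);
        for (GraphicZone zone : zones) {
            remaining.removeAll(zone.tokens);
        }
        return remaining;
    }
}
```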

It is expected that these changes will result in better figure and table recognition (now more strictly anchored on graphic elements), fewer errors in the text body where passages are currently mismatched with figure/table parts (because they contain figure and table markers), and overall simpler training data to produce (because the figure/table mess is taken out of the fulltext model).

One goal is to anchor the process on identified clustered graphic elements without machine vision techniques, and thus to avoid the very costly rasterization step. If this doesn't give state-of-the-art accuracy, we can still use an R-CNN just for this step as a fallback (so unfortunately with rasterization; that would be the simpler approach, but let's first try to do it differently and keep it lightweight).

As a possible continuation, we can then divide the fulltext model into two models: one dedicated to the overall backbone of the text body (sections, paragraphs) and one to the recurrent content (paragraph content). The input sequences of the second one might then be short enough to use deep learning models efficiently.
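
For illustration, a minimal sketch of what this two-model split could look like, assuming two generic sequence labelers; the interface and the `<paragraph-end>` label are hypothetical, not existing Grobid names:

```java
import java.util.ArrayList;
import java.util.List;

/** One label per token, e.g. a CRF or deep learning sequence labeler. */
interface SequenceLabeler {
    List<String> label(List<String> tokens);
}

class TwoPassFulltextSketch {
    private final SequenceLabeler backboneModel;   // sections, paragraph boundaries
    private final SequenceLabeler paragraphModel;  // recurrent paragraph-level content

    TwoPassFulltextSketch(SequenceLabeler backbone, SequenceLabeler paragraph) {
        this.backboneModel = backbone;
        this.paragraphModel = paragraph;
    }

    void process(List<String> tokens) {
        // Pass 1: label the whole body with the structural backbone model.
        List<String> backboneLabels = backboneModel.label(tokens);

        // Pass 2: run the content model paragraph by paragraph; each chunk is
        // short enough for a deep learning model to handle efficiently.
        int start = 0;
        for (int i = 0; i < tokens.size(); i++) {
            boolean paragraphEnds = "<paragraph-end>".equals(backboneLabels.get(i));
            if (paragraphEnds || i == tokens.size() - 1) {
                List<String> paragraph = new ArrayList<>(tokens.subList(start, i + 1));
                List<String> contentLabels = paragraphModel.label(paragraph);
                // ... consume contentLabels (citation markers, formulas, etc.)
                start = i + 1;
            }
        }
    }
}
```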

Problem: this is not going to work for tables or figures that are just text, without any graphic elements; this is not frequent, but it happens.

@kermitt2 kermitt2 marked this pull request as draft November 7, 2022 17:33
@kermitt2 kermitt2 added this to the 0.8.0 milestone Nov 7, 2022
@coveralls commented Dec 7, 2022

Coverage Status

coverage: 39.915% (-0.04%) from 39.959% when pulling d189cb5 on new-figure-table-models into 6bd974d on master

@kermitt2 kermitt2 modified the milestones: 0.8.0, 0.9.0 Nov 18, 2023