
New figure/table segmentation approach and models #963

Draft · kermitt2 wants to merge 29 commits into master from new-figure-table-models
Conversation

@kermitt2 (Owner) commented Nov 7, 2022

Unstable and work in progress!

(follow-up of the fix-vector-graphics branch)

This is a working version of a revision of the cascade process in Grobid, which changes the overall approach for figures and tables and simplifies the fulltext model (see also #950):

  • as the very first step, bitmap and vector graphics are clustered and considered as possible parts of figure/table areas; this includes in particular a revised vector graphics analysis and clustering
  • from these candidate figure and table positions, two figure/table segmentation models (figure-segmenter) try to identify the valid complete figure and table zones, including captions, figure/table titles and notes, as well as sub-figures; this is done by extending up and down through the layout tokens from the candidate graphic positions (see the sketch after this list)
  • then the segmentation model is applied to the document as before, excluding the identified figure/table zones
  • the fulltext model now excludes the complete figure and table zones, which makes it simpler (fewer tokens and labels) and smaller (shorter input sequences)
  • the figure and table models are still applied to the identified full figure and table zones to identify captions, figure/table titles, figure/table reference labels and notes, as before
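
Below is a minimal sketch of the anchoring and extension logic from the first two steps, just to make the intended data flow concrete. All class and method names here are illustrative assumptions (not the actual Grobid API), and the greedy vertical extension is a crude stand-in for the trained figure-segmenter models:

```java
import java.util.ArrayList;
import java.util.List;

public class FigureTableZoneSketch {

    /** Minimal stand-in for a layout token with page coordinates. */
    static class LayoutToken {
        final String text;
        final double x, y, width, height;
        LayoutToken(String text, double x, double y, double width, double height) {
            this.text = text; this.x = x; this.y = y;
            this.width = width; this.height = height;
        }
    }

    /** Candidate figure/table zone anchored on a cluster of bitmap/vector graphics. */
    static class GraphicZone {
        double x, y, width, height;                  // bounding box in page coordinates
        final List<LayoutToken> tokens = new ArrayList<>();
    }

    /**
     * Extend each graphic cluster up and down through the layout tokens, so that
     * the zone absorbs nearby captions, titles and notes.
     */
    static List<GraphicZone> extendZones(List<LayoutToken> tokens, List<GraphicZone> clusters) {
        for (GraphicZone zone : clusters) {
            for (LayoutToken t : tokens) {
                boolean overlapsX = t.x < zone.x + zone.width && t.x + t.width > zone.x;
                boolean adjacentY = Math.abs(t.y - (zone.y + zone.height)) < 20.0
                                 || Math.abs(zone.y - (t.y + t.height)) < 20.0;
                if (overlapsX && adjacentY) {
                    zone.tokens.add(t);
                    double top = Math.min(zone.y, t.y);
                    double bottom = Math.max(zone.y + zone.height, t.y + t.height);
                    zone.y = top;
                    zone.height = bottom - top;
                }
            }
        }
        return clusters;
    }

    /** The downstream segmentation/fulltext models only see the remaining tokens. */
    static List<LayoutToken> tokensOutsideZones(List<LayoutToken> tokens, List<GraphicZone> zones) {
        List<LayoutToken> remaining = new ArrayList<>(tokens);
        for (GraphicZone zone : zones) {
            remaining.removeAll(zone.tokens);
        }
        return remaining;
    }
}
```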

It is expected that these changes will result in better figure and table recognition (now more strictly anchored on graphic elements), fewer errors in the text body where passages are currently mismatched with figure/table parts (because they contain figure and table markers), and overall simpler training data to produce (because the figure/table mess is taken out of the fulltext model).

One goal is to anchor the process on identified clustered graphic elements without machine vision techniques, and thus to avoid the very costly rasterization step. If this doesn't give state-of-the-art accuracy, we can still use an R-CNN just for this step as a fallback (so unfortunately with rasterization; that would be the simpler approach, but let's first try to do it differently and keep it lightweight).

As a possible continuation, we can then divide the fulltext model into two models: one dedicated to the overall backbone of the text body (sections, paragraphs) and one to the recurrent content (paragraph content). The input sequences of the second one might then be short enough to use deep learning models efficiently.
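
For illustration, a minimal sketch of what this two-model split could look like, assuming two generic sequence labelers; the interface and the `<paragraph-end>` label are hypothetical, not existing Grobid names:

```java
import java.util.ArrayList;
import java.util.List;

/** One label per token, e.g. a CRF or deep learning sequence labeler. */
interface SequenceLabeler {
    List<String> label(List<String> tokens);
}

class TwoPassFulltextSketch {
    private final SequenceLabeler backboneModel;   // sections, paragraph boundaries
    private final SequenceLabeler paragraphModel;  // recurrent paragraph-level content

    TwoPassFulltextSketch(SequenceLabeler backbone, SequenceLabeler paragraph) {
        this.backboneModel = backbone;
        this.paragraphModel = paragraph;
    }

    void process(List<String> tokens) {
        // Pass 1: label the whole body with the structural backbone model.
        List<String> backboneLabels = backboneModel.label(tokens);

        // Pass 2: run the content model paragraph by paragraph; each chunk is
        // short enough for a deep learning model to handle efficiently.
        int start = 0;
        for (int i = 0; i < tokens.size(); i++) {
            boolean paragraphEnds = "<paragraph-end>".equals(backboneLabels.get(i));
            if (paragraphEnds || i == tokens.size() - 1) {
                List<String> paragraph = new ArrayList<>(tokens.subList(start, i + 1));
                List<String> contentLabels = paragraphModel.label(paragraph);
                // ... consume contentLabels (citation markers, formulas, etc.)
                start = i + 1;
            }
        }
    }
}
```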

Problem: this is not going to work for tables or figures that are just text, without any graphic elements; this is not frequent, but it happens.

@kermitt2 kermitt2 marked this pull request as draft November 7, 2022 17:33
@kermitt2 kermitt2 added this to the 0.8.0 milestone Nov 7, 2022
@coveralls commented Dec 7, 2022

Coverage Status

coverage: 39.915% (-0.04%) from 39.959% when pulling d189cb5 on new-figure-table-models into 6bd974d on master

@kermitt2 kermitt2 modified the milestones: 0.8.0, 0.9.0 Nov 18, 2023