How to get section hierarchy from fulltext? #1074

ValeKnappich · 2024-01-09T14:41:24Z

Hi,

I am using grobid to extract the pdf full text (/processFulltextDocument).
It works great except that all sections are put on the same level and there doesn't seem to be a way to extract the hierarchical relationships.

Is there any way to do that with grobid?

Thanks

kermitt2 · 2024-01-14T19:29:40Z

Hi @ValeKnappich !

Yes, currently the sections are "flat". In the past, grobid was actually creating a hierarchy of sections, but it was not working well and could lead to well-formedness problems in the resulting XML (although not frequent). Given that it was not reliable, it was removed until something better is done for supporting this feature.

See issue #377

ValeKnappich · 2024-01-14T20:35:23Z

Hi @kermitt2!

Thanks for the reply. As issue #377 is from 2019, I assume the problem is not going to be resolved any time soon.

Do you know any good workarounds for this? I tried using pdfplumber to get all the text, find the headings, and compare font sizes, but it's somewhat tedious and far from perfect.

kermitt2 · 2024-01-14T21:21:26Z

Yes @ValeKnappich, it is not a priority, as compared to the new figure/table recognition approach for example.

In my previous approach, I was clustering the section headers based on font size, style and font name to try to identify header "levels" (all this font information in Grobid comes from pdfalto, which is similar to pdfplumber but written in C++ and 20-50 times faster; scaling is one of the top requirements in Grobid). But as for you, it was far from perfect.
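A minimal sketch of the font-based clustering idea described above, not Grobid's actual code. It assumes each header already comes with a font size, a bold flag and a font name (the `Header` type and `assign_levels` function are hypothetical names), and it simply ranks the distinct font signatures from largest to smallest:

```python
from typing import NamedTuple


class Header(NamedTuple):
    # Hypothetical record; a real pipeline would fill it from pdfalto or pdfplumber output.
    text: str
    font_size: float
    bold: bool
    font_name: str


def signature(header):
    """Font signature used as the cluster key (size rounded to 0.1 pt)."""
    return (round(header.font_size, 1), header.bold, header.font_name)


def assign_levels(headers):
    """Return (level, text) pairs, where level 1 is the most prominent font signature."""
    signatures = {signature(h) for h in headers}
    # Larger fonts rank first; at equal size, bold outranks regular.
    ranked = sorted(signatures, key=lambda s: (-s[0], not s[1], s[2]))
    level_of = {sig: i + 1 for i, sig in enumerate(ranked)}
    return [(level_of[signature(h)], h.text) for h in headers]


headers = [
    Header("Introduction", 14.0, True, "Times-Bold"),
    Header("Proposed Method", 14.0, True, "Times-Bold"),
    Header("Architecture", 12.0, True, "Times-Bold"),
    Header("Training", 12.0, True, "Times-Bold"),
    Header("Conclusion", 14.0, True, "Times-Bold"),
]
for level, text in assign_levels(headers):
    print("  " * (level - 1) + text)
```

Exact-match signatures are the simplest possible "clustering"; as noted in this thread, templates where subsection titles share the body font size will defeat it.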

We could also use the PDF outline information (the kind of table of contents embedded in the PDF). When present, it gives a reliable section hierarchy with the coordinates of the section headers in maybe 50% of the cases. The problem is that in the other 50% of the cases it is crap and noise and would lead to errors, so I have also disabled the usage of PDF outline information for the moment.

When present, the numbering also gives information about the hierarchy.
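As a rough illustration of the numbering cue, the sketch below reads the depth off a leading section number. The `numbering_depth` name and the accepted number format ("2", "2.1", "3.2.4.") are assumptions made for this example, not something Grobid provides:

```python
import re

# Leading section number such as "2", "2.1" or "3.2.4." followed by whitespace (assumed format).
NUMBER_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+")


def numbering_depth(heading):
    """Return the depth implied by a leading number, or None if the heading is unnumbered."""
    match = NUMBER_RE.match(heading)
    if match is None:
        return None
    # "2" has depth 1, "2.1" has depth 2, and so on.
    return match.group(1).count(".") + 1


for heading in ["1 Introduction", "2.1 Data", "3.2.4. Details", "Acknowledgements"]:
    print(heading, "->", numbering_depth(heading))
```

A heading that merely starts with a number (for example a year) would be misread as numbered, so this cue probably works best combined with the font features.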

Maybe an interesting approach could be to combine all these features in a dedicated classifier, which would predict the hierarchical levels of a list of headers.

ValeKnappich · 2024-01-15T10:02:43Z

Thanks @kermitt2! If I decide to spend more time on this and find a decently working solution, I will post an update here.

com3dian · 2024-03-29T11:02:09Z

@ValeKnappich I came across this issue by chance, but maybe my new Python package grobidmonkey might help. To get the outline hierarchy, you can simply do:

from grobidmonkey import reader
monkeyReader = reader.MonkeyReader('monkey') # or 'lxml' or 'x2d'

# read paper outline
outline = monkeyReader.readOutline('/path/to/your/paper.pdf.tei.xml')

outline is an anytree.RenderTree object; to print it you can use

for pre, fill, node in outline:
    print("%s%s" % (pre, node.name))

and the output will be like

Article
├── 1 Introduction
├── 2 Proposed Method
│   ├── 2.1 ...
│   ├── 2.2 ...
│   └── 2.3 ...
├── 3 Experiments and Results
│   ├── 3.1 ...
│   ├── 3.2 ...
│   └── 3.3 ...
└── 4 Conclusion

My approach is based on the section 'index' in the TEI XML output. If you have any feedback, please let me know. Hope this helps!

ValeKnappich · 2024-03-29T13:34:45Z

@com3dian thanks for reaching out.

Indeed, your approach works as long as the headings are numbered (I guess that's where the <head n= comes from).

However, that's not always the case for me.

You might want to account for that in your implementation in some way. At the moment, it will throw an error if the <head> tag does not have the attribute n.
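One possible way to tolerate a missing n attribute, sketched with the standard library only. This is not how grobidmonkey works; the `head_levels` function, the `default_level` fallback and the sample document are assumptions for illustration. Unnumbered heads are simply treated as top-level:

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def head_levels(tei_xml, default_level=1):
    """Return (level, title) pairs for every <head> that is a direct child of a <div>."""
    root = ET.fromstring(tei_xml)
    result = []
    for head in root.iterfind(".//tei:div/tei:head", TEI_NS):
        number = (head.get("n") or "").strip(".")  # empty string when n is missing
        # "2.1" gives level 2; without a number we fall back to default_level (an assumption).
        level = number.count(".") + 1 if number else default_level
        title = "".join(head.itertext()).strip()
        result.append((level, title))
    return result


# Hand-written sample that mimics the shape of a TEI body; not real Grobid output.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head n="1">Introduction</head><p>Some text.</p></div>
<div><head n="2">Method</head></div>
<div><head n="2.1">Model</head></div>
<div><head>Acknowledgements</head></div>
</body></text></TEI>"""

for level, title in head_levels(sample):
    print("  " * (level - 1) + title)
```

With a fully unnumbered paper this degrades to the same flat list Grobid already produces, which is at least not an error.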

com3dian · 2024-03-29T15:44:27Z

@ValeKnappich Hi, thanks for the feedback.

You are right, this package was developed from wrapper code I used in my own project, so there are likely some issues. Can you share some papers/TEI-XMLs where <head> does not include the attribute n, so that I can see whether I can improve the package?

PS: I have also tried the font size and font type solution. I feel that including both features in your classifier might help. I have seen some journal templates where subsection titles have almost the same font size as the body text, but in that case they usually use a different font type.

kermitt2 added the duplicate label on Jan 14, 2024