How to get section hierarchy from fulltext? #1074

Open
ValeKnappich opened this issue Jan 9, 2024 · 7 comments
Labels
duplicate

Comments

@ValeKnappich

Hi,

I am using grobid to extract the pdf full text (/processFulltextDocument).
It works great except that all sections are put on the same level and there doesn't seem to be a way to extract the hierarchical relationships.

Is there any way to do that with grobid?

Thanks

@kermitt2
Owner

Hi @ValeKnappich !

Yes, currently the sections are "flat". In the past, grobid was actually creating a hierarchy of sections, but it was not working well and could lead to well-formedness problems in the resulting XML (although not frequent). Given that it was not reliable, it was removed until something better is done for supporting this feature.

See issue #377

@kermitt2 added the duplicate label Jan 14, 2024
@ValeKnappich
Author

Hi @kermitt2!

Thanks for the reply. As issue #377 is from 2019, I assume the problem is not going to be resolved any time soon.

Do you know any good workarounds for this? I tried using pdfplumber to get all the text, find the headings and compare font sizes. But it's somewhat tedious and far from perfect.

@kermitt2
Owner

Yes @ValeKnappich, it is not a priority, as compared to the new figure/table recognition approach for example.

In my previous approach, I clustered the section headers based on font size, style and font name to try to identify header "levels" (all this font information in Grobid comes from pdfalto, similar to pdfplumber, but written in C++ and 20-50 times faster; scaling is one of the top requirements in Grobid). But, as for you, it was far from perfect.
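
The font-based clustering described above might look roughly like the following minimal sketch. It assumes the headers have already been extracted together with their font attributes; the `Header` record, its field names and `assign_levels` are hypothetical illustrations, not GROBID or pdfalto structures. Distinct styles are ranked by size, then boldness, and the rank is used as the level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Header:
    # Hypothetical record; field names are illustrative, not a pdfalto schema.
    text: str
    font_name: str
    font_size: float
    bold: bool


def assign_levels(headers):
    """Map each header to a level (1 = top) by ranking its font style."""
    # One cluster per distinct (size, bold, font name) combination.
    styles = {(h.font_size, h.bold, h.font_name) for h in headers}
    # Larger fonts rank first; bold outranks regular at equal size.
    ranked = sorted(styles, key=lambda s: (-s[0], not s[1], s[2]))
    level_of = {style: i + 1 for i, style in enumerate(ranked)}
    return [(level_of[(h.font_size, h.bold, h.font_name)], h.text) for h in headers]


headers = [
    Header("1 Introduction", "Times-Bold", 14.0, True),
    Header("2 Proposed Method", "Times-Bold", 14.0, True),
    Header("2.1 Model", "Times-Italic", 12.0, False),
    Header("3 Conclusion", "Times-Bold", 14.0, True),
]
levels = assign_levels(headers)
for level, text in levels:
    print("  " * (level - 1) + text)
```

Exact style matching like this is brittle: small size differences or templates that reuse one style for two levels would produce wrong clusters, which is consistent with the "far from perfect" experience reported here.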

We could also use the PDF outline information (the kind of table of contents embedded in the PDF). When present, it gives a reliable section hierarchy, with the coordinates of the section headers, in maybe 50% of the cases. The problem is that in the other 50% of the cases it is crap and noise and would lead to errors, so I have also disabled the usage of PDF outline information for the moment.
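
For anyone who wants to inspect the embedded outline outside of Grobid, here is a minimal sketch using the third-party pypdf package (an assumption for illustration; it is not what Grobid uses internally). It relies on pypdf's convention that a nested list holds the children of the entry before it. The helper names `flatten_outline` and `read_pdf_outline` are made up for this example, and the demo data is a stand-in so the sketch runs without a PDF.

```python
from types import SimpleNamespace


def flatten_outline(outline, level=1):
    """Turn a nested outline into (level, title) pairs.

    Assumes the pypdf convention: a nested list holds the children of the
    entry that precedes it.
    """
    entries = []
    for item in outline:
        if isinstance(item, list):
            entries.extend(flatten_outline(item, level + 1))
        else:
            entries.append((level, item.title))
    return entries


def read_pdf_outline(path):
    """Read the embedded outline of a PDF (it may be empty or unreliable)."""
    from pypdf import PdfReader  # third-party; imported lazily

    return flatten_outline(PdfReader(path).outline)


# Stand-in data shaped like a pypdf outline.
demo = [
    SimpleNamespace(title="1 Introduction"),
    SimpleNamespace(title="2 Proposed Method"),
    [SimpleNamespace(title="2.1 Model"), SimpleNamespace(title="2.2 Training")],
    SimpleNamespace(title="3 Conclusion"),
]
demo_entries = flatten_outline(demo)
for level, title in demo_entries:
    print("  " * (level - 1) + title)
```

As noted above, the outline is often missing or noisy, so the result should be treated as one hint among several rather than as ground truth.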

When present, the numbering also gives information about the hierarchy.
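
As a small illustration of the numbering cue, the sketch below derives a depth from a leading decimal section number. The function name and the regular expression are assumptions made for this example; Roman numerals and lettered appendices are deliberately not handled.

```python
import re

# Leading section number such as "2", "2.1" or "3.2.1." followed by whitespace.
NUMBER_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s")


def level_from_numbering(heading):
    """Return the depth implied by the leading number, or None if unnumbered."""
    match = NUMBER_RE.match(heading)
    if match is None:
        return None
    return match.group(1).count(".") + 1


for heading in ["1 Introduction", "2.1 Model", "3.2.1. Details", "Acknowledgements"]:
    print(heading, "->", level_from_numbering(heading))
```

A heading that merely starts with a number (a year, for instance) would be misread as a level-1 section, so this cue is best combined with the font features.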

Maybe an interesting approach could be to combine all these features in a dedicated classifier, which would predict the hierarchical levels of a list of headers.

@ValeKnappich
Author

Thanks @kermitt2! If I decide to spend more time on this and find a decently working solution, I will post an update here.

@com3dian

@ValeKnappich I came across this issue randomly, but maybe my new Python package grobidmonkey might help? To get the outline hierarchy you can simply do:

from grobidmonkey import reader
monkeyReader = reader.MonkeyReader('monkey') # or 'lxml' or 'x2d'

# read paper outline
outline = monkeyReader.readOutline('/path/to/your/paper.pdf.tei.xml')

outline is an anytree.RenderTree object; to print it you can use

for pre, fill, node in outline:
    print("%s%s" % (pre, node.name))

and the output will be like

Article
├── 1 Introduction
├── 2 Proposed Method
│   ├── 2.1 ...
│   ├── 2.2 ...
│   └── 2.3 ...
├── 3 Experiments and Results
│   ├── 3.1 ...
│   ├── 3.2 ...
│   └── 3.3 ...
└── 4 Conclusion

My approach is based on the section 'index' in the TEI XML output. If you have any feedback, please let me know. Hope this helps!

@ValeKnappich
Author

@com3dian thanks for reaching out.

Indeed, your approach works as long as the headings are numbered (I guess that's where the <head n= comes from).

However, that's not always the case for me.

You might want to account for that in your implementation in some way. At the moment, it will throw an error if the <head> tag does not have the attribute n.
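
One way to tolerate heads without the attribute n is sketched below with the standard library only. This is not grobidmonkey's implementation; the function name is made up, and the fallback (an unnumbered head inherits the level of the previous head) is an assumption that will be wrong for some documents. The sample string imitates the TEI layout of `<div>` elements that each hold a `<head>`.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def heading_levels(tei_xml):
    """Return (level, title) for each <head> in the TEI body.

    Numbered heads get their depth from the n attribute. Unnumbered heads
    fall back to the level of the previous head (1 if there is none);
    that fallback is a guess, not something GROBID provides.
    """
    root = ET.fromstring(tei_xml)
    result = []
    previous_level = 1
    for head in root.iterfind(".//tei:body//tei:div/tei:head", TEI_NS):
        number = (head.get("n") or "").strip(". ")
        title = "".join(head.itertext()).strip()
        if number:
            level = number.count(".") + 1
            title = f"{number} {title}"
        else:
            level = previous_level
        result.append((level, title))
        previous_level = level
    return result


sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head n="1">Introduction</head><p>...</p></div>
<div><head n="2">Proposed Method</head></div>
<div><head n="2.1">Model</head></div>
<div><head>Implementation details</head></div>
<div><head n="3">Conclusion</head></div>
</body></text></TEI>"""

sample_levels = heading_levels(sample)
for level, title in sample_levels:
    print("  " * (level - 1) + title)
```

Using `head.get("n")` instead of indexing the attribute directly is what avoids the error on unnumbered heads; how to place those heads in the tree remains a judgment call.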

@com3dian

@ValeKnappich Hi thanks for the feedback.

You are right, this package was developed based on the wrapped code I used in my own project, so there are likely some issues. Can you share some papers/TEI-XMLs where <head> does not include the attribute n, so that I can see whether I can improve the package?

PS: I have also tried a font-size and font-type solution. I feel that including both features in your classifier might help. I have seen some journal templates where subsection titles have almost the same font size as the body text, but in that case they usually use a different font type.

3 participants