general paragraph text wrongly recognized as "figDesc/div/p" #1077

Open
sawyerzheng opened this issue Jan 25, 2024 · 4 comments
Labels
bug: From Hemiptera and especially its suborder Heteroptera
error cases: Some error/test case for future improvements

Comments

@sawyerzheng

I am running GROBID in a Docker container, pulled with docker pull lfoppiano/grobid:0.8.0

I also tested v0.7.3.

  • What is your Java version (java --version)?

I just used the official Docker image: lfoppiano/grobid

  • In case of build or run errors, please submit the error while running gradlew with --stacktrace and --info for better log traces (e.g. ./gradlew run --stacktrace --info) or attach the log file logs/grobid-service.log.

There is no such file, as I am using Docker.


Problem

  1. General paragraph text that does not belong to a figure is wrongly recognized as a figDesc.
  2. Part of the text wrongly recognized as figDesc also appears in the general paragraph text "body/div/p".
    • This means it is repeated in two parts of the TEI XML: "body/figure/figDesc/div/p" and "body/div/p".
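
One way to check a TEI result for this duplication is sketched below. It is only a heuristic, not part of GROBID: it assumes the standard TEI namespace and the two paths named above ("body/div/p" and "body/figure/figDesc/div/p"), and it flags a pair when one normalized text contains the other. The function names and the min_chars threshold are hypothetical choices made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Standard TEI namespace; GROBID output is assumed to use it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def normalized_text(element):
    """Join all text inside an element and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", "".join(element.itertext())).strip()


def find_duplicated_figdesc_text(root, min_chars=40):
    """Return (figDesc text, body text) pairs where one text contains the other.

    min_chars is an arbitrary cut-off that skips very short paragraphs,
    which would otherwise match by accident.
    """
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return []

    # Regular paragraphs: body/div/p
    body_paragraphs = [
        normalized_text(p) for p in body.findall("tei:div/tei:p", TEI_NS)
    ]
    # Paragraphs inside figure descriptions: body/figure/figDesc/div/p
    figdesc_paragraphs = [
        normalized_text(p)
        for p in body.findall("tei:figure/tei:figDesc/tei:div/tei:p", TEI_NS)
    ]

    duplicates = []
    for fig_text in figdesc_paragraphs:
        if len(fig_text) < min_chars:
            continue
        for body_text in body_paragraphs:
            if len(body_text) < min_chars:
                continue
            if fig_text in body_text or body_text in fig_text:
                duplicates.append((fig_text, body_text))
    return duplicates
```

To run it on a saved result, parse the file first, for example find_duplicated_figdesc_text(ET.parse("176_liu2010.pdf.tei.xml").getroot()). Plain substring containment will miss overlaps where the two copies differ slightly, so an empty result does not prove the output is clean.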

original pdf area

image

extracted xml

image

Reference materials

Used pdf

176_liu2010.pdf

Result tei xml

Note: GitHub does not accept .xml files, so I changed the extension to .txt.

176_liu2010.pdf.tei.xml.txt

@lfoppiano
Collaborator

Thanks @sawyerzheng for reporting the issue.

Indeed, there are two problems:

  1. The paragraph is wrongly labeled as a figure. This is a common problem that we are (slowly) working on in PR #963, "New figure/table segmentation approach and models".
    For the time being, we could add your example as training data; unfortunately, because this is an Elsevier article and it is copyrighted, it cannot be redistributed as training data.
    Nevertheless, should you find the same problem in other papers with a Creative Commons licence, we could use them as test cases.

  2. The paragraph starting from "In inert gas" is duplicated and out of order. This is probably related to the figure processing. Please give me a couple of weeks; I should be able to fix it.

@lfoppiano self-assigned this Jan 25, 2024
@lfoppiano added the bug label Jan 25, 2024
@sawyerzheng
Author

Thank you very much for your time.

So far, I have only been able to find this one example PDF. If I come across a non-copyrighted PDF with a similar problem in the future, I will upload it here.

@sawyerzheng
Author

sawyerzheng commented Apr 1, 2024

I found an open-access PDF that has a similar problem.

This parse result is from the GROBID GPU Docker image: grobid/grobid:0.7.2

Snipaste_2024-04-01_16-32-30

image


image

pdf: https://www.nature.com/articles/s41597-024-03160-z
s41597-024-03160-z.pdf

@sawyerzheng reopened this Apr 1, 2024
@lfoppiano added the error cases label Apr 1, 2024
@lfoppiano
Collaborator

Indeed. Thanks for finding an example; we will surely add it to the training data.
