Error case of a PDF with tons of hidden duplicated text #1083

Open
lfoppiano opened this issue Feb 7, 2024 · 4 comments
Labels
error cases Some error/test case for future improvements

Comments

@lfoppiano
Collaborator

I stumbled upon an interesting case where I thought something was wrong in Grobid; on closer analysis, however, I found that the PDF contains a lot of hidden, duplicated text. Perhaps we could use this as training data for the segmentation model?

image

PDF: energies-14-08509.pdf
TEI: energies-14-08509.pdf.tei.xml.txt

@lfoppiano lfoppiano added the error cases Some error/test case for future improvements label Feb 7, 2024
@kermitt2
Owner

kermitt2 commented Feb 7, 2024

Hi Luca!

This is mentioned in issue #826, see "the repeated hidden text". I think "dealing with the invisible" should in general rely on visual cues, e.g. detecting white-on-white text, as a preprocessing step before the segmentation model.
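
As a rough illustration of the white-on-white idea, here is a minimal sketch in Python. It is not Grobid code: the `Span` structure, the `visible_spans` function, the assumption of a plain white page background, and the `min_contrast` threshold are all hypothetical. It treats a text span as invisible when its fill color has almost no contrast against the background (using the WCAG relative-luminance formula), or when it uses PDF text rendering mode 3 (neither fill nor stroke).

```python
from dataclasses import dataclass
from typing import List, Tuple

RGB = Tuple[float, float, float]  # color components in the range 0..1


@dataclass
class Span:
    """Hypothetical text span, as a PDF parser might expose it."""
    text: str
    fill: RGB
    render_mode: int = 0  # PDF text rendering mode; 3 means invisible text


def relative_luminance(rgb: RGB) -> float:
    """WCAG relative luminance of an sRGB color."""
    def linearize(c: float) -> float:
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: RGB, bg: RGB) -> float:
    """WCAG contrast ratio, from 1.0 (identical) to about 21.0 (black on white)."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def visible_spans(spans: List[Span],
                  background: RGB = (1.0, 1.0, 1.0),  # assumption: white page
                  min_contrast: float = 1.1) -> List[Span]:
    """Keep only spans a reader could plausibly see on the page."""
    return [
        s for s in spans
        if s.render_mode != 3
        and contrast_ratio(s.fill, background) >= min_contrast
    ]
```

A real implementation would also need the actual background behind each span (images, colored boxes), which this sketch deliberately ignores.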

@lfoppiano
Collaborator Author

Oh. Indeed. I did not know there was already an issue open on the same subject.

@lfoppiano
Collaborator Author

It seems quite common in MDPI papers.

@kermitt2
Owner

Indeed, it's widespread at MDPI!

I think the transparent text comes from the peer-review content. They might be running some kind of version control that keeps older versions of some text content, and instead of removing it, it might be included white on white?

For instance, in the following article:
https://hal.science/hal-03313471/ (same with the PDF on the MDPI site)
Some white-on-white text has "FOR PEER REVIEW" in it where it repeats the page header:

Soc. Sci. 2021, 10, x FOR PEER REVIEW 

25 of 38 <lb/>

4. Discussion <lb/>Although the interviews...

Sometimes the repeated white text does not match the final printed black text, which supports the hypothesis that it is leftover peer-review text.
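
To make the "matches or does not match" check concrete, here is a small Python sketch using the standard library's `difflib`. The function names and the `0.9` threshold are assumptions for illustration only, not anything that exists in Grobid. It labels a hidden line as a near duplicate of some visible line, or as divergent text (which would be consistent with an older draft).

```python
from difflib import SequenceMatcher
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())


def similarity(hidden: str, visible: str) -> float:
    """Similarity in [0, 1] between a hidden line and a visible line."""
    return SequenceMatcher(None, normalize(hidden), normalize(visible)).ratio()


def classify_hidden(hidden: str,
                    visible_lines: Iterable[str],
                    threshold: float = 0.9) -> str:
    """Return 'duplicate' if the hidden line closely matches any visible
    line, otherwise 'divergent'. The threshold is an arbitrary assumption."""
    best = max((similarity(hidden, v) for v in visible_lines), default=0.0)
    return "duplicate" if best >= threshold else "divergent"
```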
