Error case of a PDF with tons of hidden duplicated text #1083

lfoppiano · 2024-02-07T06:19:33Z

I stumbled upon an interesting case where I thought something was wrong in Grobid; however, on closer analysis I found that the PDF contains a lot of hidden, duplicated text. Perhaps we could use this as training data for the segmentation model?

PDF: energies-14-08509.pdf
TEI: energies-14-08509.pdf.tei.xml.txt

kermitt2 · 2024-02-07T11:48:04Z

Hi Luca!

This is mentioned in issue #826, see "the repeated hidden text". I think "dealing with the invisible" in general should be handled with visual clues, e.g. detecting white-on-white text, as a preprocessing step before the segmentation model.
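As a rough illustration of the "visual clues" idea, here is a minimal Python sketch that drops tokens whose text colour is (almost) the same as the colour behind them. This is not Grobid code: the `Token` class, its `fill_rgb` and `background_rgb` fields, and the `tolerance` value are all hypothetical, and it assumes the PDF layer can supply both colours for each token.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical layout token; not a Grobid or pdfalto type."""
    text: str
    fill_rgb: tuple[int, int, int]        # text colour, 0-255 per channel
    background_rgb: tuple[int, int, int]  # colour behind the token (assumed known)


def is_invisible(token: Token, tolerance: int = 8) -> bool:
    """True if every colour channel of the text is within `tolerance`
    of the background, e.g. white text on a white page."""
    return all(
        abs(fill - back) <= tolerance
        for fill, back in zip(token.fill_rgb, token.background_rgb)
    )


def visible_tokens(tokens: list[Token], tolerance: int = 8) -> list[Token]:
    """Keep only the tokens a reader could actually see."""
    return [t for t in tokens if not is_invisible(t, tolerance)]
```

Something like `visible_tokens(tokens)` would then run before the tokens reach the segmentation model. Colour matching alone would miss other ways of hiding text (for example an invisible rendering mode or text covered by an image), so this only covers the white-on-white case mentioned above.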

lfoppiano · 2024-02-07T13:34:30Z

Oh. Indeed. I did not know there was already an issue open on the same subject.

lfoppiano · 2024-02-09T01:24:01Z

It seems to be quite common in MDPI articles.

kermitt2 · 2024-02-11T16:40:26Z

Indeed, it's massive in MDPI!

I think the transparent text comes from the peer review content... they might be using some kind of version control that keeps older versions of some text content and, instead of removing them, includes them white on white?

For instance, in the following article:
https://hal.science/hal-03313471/ (same with the PDF on the MDPI site)
some of the white-on-white text has "FOR PEER REVIEW" in it where it repeats the page header:

Soc. Sci. 2021, 10, x FOR PEER REVIEW 

25 of 38 <lb/>

4. Discussion <lb/>Although the interviews...

Sometimes the repeated white text does not match the final text printed in black, which supports the hypothesis that it is leftover peer-review text.
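To check this kind of observation on more documents, a small Python sketch like the following could flag hidden spans that carry the peer-review header and measure how closely a hidden span matches the visible text. The function names are hypothetical, the regular expression is only inferred from the single header shown above, and it assumes the hidden and visible spans have already been separated.

```python
import re
from difflib import SequenceMatcher

# Pattern inferred from the one header quoted above; other journals may differ.
PEER_REVIEW_MARKER = re.compile(r"\bFOR\s+PEER\s+REVIEW\b")


def has_peer_review_marker(text: str) -> bool:
    """True if the span contains the 'FOR PEER REVIEW' header text."""
    return PEER_REVIEW_MARKER.search(text) is not None


def normalise(text: str) -> str:
    """Drop <lb/> line-break markers, collapse whitespace, lowercase."""
    return " ".join(text.replace("<lb/>", " ").split()).lower()


def similarity(hidden: str, visible: str) -> float:
    """Ratio in [0, 1]; 1.0 means the two spans are identical after normalisation."""
    return SequenceMatcher(None, normalise(hidden), normalise(visible)).ratio()
```

A hidden span with a low `similarity` to every visible span on the same page, or one for which `has_peer_review_marker` is true, would be a candidate for leftover review-stage text rather than a plain duplicate.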

lfoppiano added the "error cases" label (Some error/test case for future improvements) · Feb 7, 2024
