I stumbled upon an interesting case where I thought something was wrong in Grobid; however, on closer analysis I found that there is a lot of hidden text that is duplicated. Could we consider this as training data for the segmentation model, perhaps?
This is mentioned in issue #826, see "the repeated hidden text". I think "dealing with the invisible" in general should be done by visual cues, e.g. detecting white-on-white text, as a preprocessing step before the segmentation model.
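A minimal sketch of such a white-on-white filter, assuming PyMuPDF (`pip install pymupdf`) is acceptable as a preprocessing dependency; this is not part of Grobid itself, just an illustration. PyMuPDF reports each text span's fill color as an sRGB integer, so spans whose color is (near-)white against a white page background are candidates for hidden text:

```python
def is_near_white(srgb: int, tol: int = 8) -> bool:
    """True if every RGB channel of the packed sRGB int is within `tol` of 255."""
    r, g, b = (srgb >> 16) & 0xFF, (srgb >> 8) & 0xFF, srgb & 0xFF
    return all(255 - c <= tol for c in (r, g, b))

def hidden_spans(pdf_path: str):
    """Yield (page_number, text) for spans drawn in (near-)white.

    Naive sketch: assumes a white page background; a real filter would
    also compare against the background actually painted under the span.
    """
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        for page_no, page in enumerate(doc, start=1):
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        if is_near_white(span["color"]):
                            yield page_no, span["text"]
```

Running this over the example PDF should surface the repeated "FOR PEER REVIEW" headers discussed below, which could then be dropped (or kept as labeled negatives for training).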
I think the transparent text comes from the peer review content... they might be managing some version control that keeps older versions of some text content, and instead of removing it, it might be included white on white?
For instance, in the following article: https://hal.science/hal-03313471/ (same with the PDF on the MDPI site)
Some white-on-white text contains "FOR PEER REVIEW" where it repeats the page header:
Soc. Sci. 2021, 10, x FOR PEER REVIEW
25 of 38 <lb/>
4. Discussion <lb/>Although the interviews...
Sometimes the repeated white text does not match the final black printed text, which supports the leftover "peer review text" hypothesis.
PDF: energies-14-08509.pdf
TEI: energies-14-08509.pdf.tei.xml.txt