Questions about data annotation in GROBID #1067

SirLYC · 2023-12-05T12:25:26Z

Hello kermitt2,

I'm currently working on annotating data for the GROBID project and have a few questions regarding the annotation process. I would appreciate it if you could provide some guidance on the following issues:

1. In the General Principles section of the documentation, it is mentioned that the text flow should not be changed. Does this mean that the order and content of the text flow in the pre-annotated data cannot be altered or removed? Or does it mean that the internal content of each XML text node cannot be modified, but the external order can be freely adjusted? For example, in the pre-annotated data of references.referenceSegmenter.tei.xml, because of the order in which the text is stored in the PDF, the extracted text flow contains the content that follows each reference's number first, followed by the numbers themselves, with some main text content mixed in between. In this case, am I allowed to:
   - Remove non-reference content
   - Move the text flow of the reference number to its corresponding reference entry

2. When I finish modifying segmentation.xml and proceed to modify fulltext.xml, I find that some content passed into fulltext.xml does not belong to the body, or that some body content was recognized as front during the segmentation stage. In this case, should I remove the content that does not belong to the body and add back the missing body content? Additionally, am I allowed to adjust the order of text tokens if the extracted body order does not follow the human reading order (while still ensuring that the text child nodes remain unchanged)?

I look forward to your response, and thank you for your assistance!

Best regards!


kimn1944 · 2023-12-10T07:17:55Z

I would like to second @SirLYC's point 2. In my own testing I see that a piece of text marked as <body> in segmentation.xml incorrectly shows up in header.xml instead of fulltext.xml. I also wonder whether we can retrain this behavior by moving that element from header.xml to fulltext.xml.

lfoppiano · 2024-01-12T06:26:01Z

Dear @SirLYC
Sorry for the late answer.

The text flow should not be modified, which means that you can only add, move or remove the XML tags that define each entity. If the data is out of order because of the transformation from PDF, it should not be corrected; otherwise the model will learn a condition that is unlikely to occur in practice.
In general I prefer to work transversely because certain annotations might be complex and we become more efficient in correcting them by type rather than by document. So I would first correct all segmentation files (ignoring the ones that are already corrected), then when finished, re-train the segmentation model, re-generate the training data, and move to the next model in the cascade: e.g. full text or header.
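
The text-flow rule above can be checked mechanically before a corrected file is added to the training set. The following is only a minimal sketch, not part of GROBID: the function names `text_flow` and `same_text_flow` are hypothetical, the XML fragments are simplified toy examples rather than real training files, and the comparison ignores all whitespace on the assumption that only tag placement and indentation changed between the two versions.

```python
import re
import xml.etree.ElementTree as ET


def text_flow(xml_string: str) -> str:
    """Return the text of an XML document with all tags stripped.

    Assumption: whitespace is removed entirely, so re-indenting the file
    or shifting a space across a tag boundary is not reported as a change.
    """
    root = ET.fromstring(xml_string)
    raw = "".join(root.itertext())
    return re.sub(r"\s+", "", raw)


def same_text_flow(original_xml: str, corrected_xml: str) -> bool:
    """True if both documents contain the same text in the same order."""
    return text_flow(original_xml) == text_flow(corrected_xml)


# Toy fragments (hypothetical, simplified from the segmentation format).
original = (
    "<text><front>A Title<lb/>1 Introduction<lb/></front>"
    "<body>Some body text<lb/></body></text>"
)
# Allowed: only the tag boundary moved, the text order is untouched.
retagged = (
    "<text><front>A Title<lb/></front>"
    "<body>1 Introduction<lb/>Some body text<lb/></body></text>"
)
# Not allowed: the text itself was reordered.
reordered = (
    "<text><front>A Title<lb/></front>"
    "<body>Some body text<lb/>1 Introduction<lb/></body></text>"
)

print(same_text_flow(original, retagged))
print(same_text_flow(original, reordered))
```

In this sketch the first comparison passes because only tags moved, while the second fails because the tokens were reordered, which is the kind of edit the rule forbids.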

lfoppiano added the labels "question" (There's no such thing as a stupid question) and "training guidelines" (Related to the annotation guidelines for training data) · Apr 26, 2024

lfoppiano mentioned this issue Apr 26, 2024

Fix heading annotation in fulltext evaluation and add header levels #1105

Open
