Questions about data annotation in GROBID #1067

Open
SirLYC opened this issue Dec 5, 2023 · 2 comments
Labels
question There's no such thing as a stupid question training guidelines Related to the annotation guidelines for training data

Comments

@SirLYC
SirLYC commented Dec 5, 2023

Hello kermitt2,

I'm currently working on annotating data for the GROBID project and have a few questions regarding the annotation process. I would appreciate it if you could provide some guidance on the following issues:

  1. The General Principles section of the documentation says that the text flow should not be changed. Does this mean that the order and content of the text in the pre-annotated data can be neither reordered nor removed? Or does it mean that the content of each XML text node cannot be modified, while the order of the nodes can be freely adjusted? For example, in the pre-annotated references.referenceSegmenter.tei.xml, because of the text order in the PDF, the extracted text flow contains the body of each reference first, followed by its reference number, with some main text content mixed in between. In this case, am I allowed to:

    • Remove non-reference content
    • Move the text flow of the reference number to its corresponding reference entry
  2. After correcting segmentation.xml, I moved on to fulltext.xml and found that some of the content passed to fulltext.xml does not belong to the body, while some body content was recognized as front during the segmentation stage. In this case, should I remove the content that does not belong to the body and add back the missing body content? Also, am I allowed to reorder text tokens when the extracted body order does not match the human reading order (while keeping the text child nodes themselves unchanged)?

I look forward to your response, and thank you for your assistance!

Best regards!

@kimn1944
kimn1944 commented Dec 10, 2023

I would like to second @SirLYC's point 2. In my own testing, a piece of text marked as <body> in segmentation.xml incorrectly shows up in header.xml instead of fulltext.xml. I also wonder whether we can retrain this behavior by moving that element from header.xml to fulltext.xml.

@lfoppiano
Collaborator

Dear @SirLYC,
sorry for the late answer.

  1. The text flow should not be modified, which means that you can only add, move, or remove the XML tags that define each entity. If the text is out of order because of the conversion from PDF, it should not be corrected; otherwise the model will learn a condition that is unlikely to occur in practice.

  2. In general I prefer to work transversally, that is, by model rather than by document: some annotations can be complex, and correcting them by type is more efficient. So I would first correct all the segmentation files (skipping those already corrected), then re-train the segmentation model, re-generate the training data, and move on to the next model in the cascade, e.g. full text or header.
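
As a rough illustration of point 1, one way to check that a corrected training file still has the same text flow as the pre-annotated one is to strip all tags from both versions and compare the remaining tokens. This is only a sketch, not part of GROBID: the helper names `text_flow` and `same_text_flow` are hypothetical, the sample strings are made up, and the element names (`listBibl`, `bibl`, `label`) are used purely for illustration. Comparing whitespace-separated tokens is also an assumption, chosen so that re-indenting the XML does not count as a change.

```python
import xml.etree.ElementTree as ET


def text_flow(xml_string: str) -> list[str]:
    """Return the whitespace-separated tokens of an XML document, ignoring all tags."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text fragment in document order, regardless of tagging.
    return "".join(root.itertext()).split()


def same_text_flow(original_xml: str, corrected_xml: str) -> bool:
    """True if both documents contain the same tokens in the same order."""
    return text_flow(original_xml) == text_flow(corrected_xml)


# Hypothetical pre-annotated fragment: the reference number was extracted
# AFTER the reference text because of the PDF text order.
original = "<listBibl><bibl>Smith J. Some title. 2020. [1]</bibl></listBibl>"

# Allowed: only a tag was added around existing text; the text itself is untouched.
retagged = "<listBibl><bibl>Smith J. Some title. 2020. <label>[1]</label></bibl></listBibl>"

# Not allowed: the number was moved to the front, which changes the text flow.
reordered = "<listBibl><bibl><label>[1]</label> Smith J. Some title. 2020.</bibl></listBibl>"

print(same_text_flow(original, retagged))   # True
print(same_text_flow(original, reordered))  # False
```

For real files one would presumably read both versions from disk (for example with `ET.parse`) before comparing them; the check only says whether text was altered or moved, not whether the tagging itself is correct.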

@lfoppiano lfoppiano added question There's no such thing as a stupid question training guidelines Related to the annotation guidelines for training data labels Apr 26, 2024