
Annotating formulas, listings and figures #1100

Open
Schroedi opened this issue Apr 18, 2024 · 4 comments

@Schroedi

Schroedi commented Apr 18, 2024

Hi, thanks for your awesome work!

I have some annotation questions:

  1. Formula labeling
    https://grobid.readthedocs.io/en/latest/training/fulltext/#formulas
    advises not to include the brackets in the label, but the training data does include them. One of several examples: https://github.com/kermitt2/grobid/blob/be9e6523d71518544e1394f5be56bda0e55819ef/grobid-trainer/resources/dataset/shorttext/corpus/tei/submission_106.training.shorttext.tei.xml#L10C177-L10C177

  2. Listings
    How should I annotate listings like Algorithm 1 in [1]?
    Are they figures? If so, what would be the label?

<figure>
  <head>Algorithm </head>
  <label>1</label>
  <figDesc>Online fitting of E from events and images<lb/></figDesc>
</figure>
  3. Figures
    I assume I should add missing figures to the figure.tei.xml file?
    They probably should follow the order in which they appear within the fulltext?
The following is obsolete: I found the `trash` tag in the training data

Should they contain all text+tags from the fulltext and additionally annotate the relevant parts (head, label, figDesc)? Here is an example from [1] again:

```xml
Random Saccades 50 100 150 200 250 300 160 180 200 220 240 260 280 300 Smooth Pursuit 140 160 180 200 220 240 260 0 50 100 150 200 250 300 Pixel Coordinate Pixel Coordinate Pupil in Camera Space Gaze Point in Screen Space Gaze Point in Screen Space 20°6 3°2 0°6 3°4 0°9 5°9 5°9 5°P ixel Coordinate Pixel Coordinate Fig. 6 . Fitted pupil locations and gaze point estimates for smooth pursuit motion and random saccadic motion are shown for four different users in different colors. The figure is organized into grids; the first row plots smooth pursuit data and the second row plots random saccadic data.
```

Should I keep the first part or remove it?

[1] arXiv:2004.03577v3

@lfoppiano
Collaborator

@Schroedi regarding point 1, the documentation refers to the fulltext model, so you should check the data under grobid-trainer/resources/dataset/fulltext. The annotations there should all follow the guidelines.

The example you linked belongs to another model, which I'm not sure is even used (its last update was in 2017 😅).
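
To check whether a given training file follows the no-brackets guideline, a small script can help. The following is only a minimal sketch: it assumes the file is well-formed XML and that formula labels are annotated as a `<label>` nested inside a `<formula>` element. The function names `local_name` and `bracketed_formula_labels` are made up for this example and are not part of GROBID.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Matches label text wrapped in round or square brackets, e.g. "(1)" or "[2]".
BRACKETED = re.compile(r"^\s*[\(\[].*[\)\]]\s*$", re.DOTALL)


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://www.tei-c.org/ns/1.0}'."""
    return tag.rsplit("}", 1)[-1]


def bracketed_formula_labels(tei_xml: str) -> List[str]:
    """Return the text of every <label> inside a <formula> that is wrapped in brackets.

    Assumption (not confirmed by the thread): labels are nested inside <formula>.
    """
    root = ET.fromstring(tei_xml)
    hits = []
    for element in root.iter():
        if local_name(element.tag) != "formula":
            continue
        for child in element.iter():
            if local_name(child.tag) == "label":
                text = "".join(child.itertext())
                if BRACKETED.match(text):
                    hits.append(text.strip())
    return hits


# Hand-written sample, not taken from the GROBID corpus.
sample = """<text>
<p>Some text <formula>E = mc^2 <label>(1)</label></formula>
and <formula>a = b <label>2</label></formula></p>
</text>"""
print(bracketed_formula_labels(sample))  # ['(1)']
```

Running this over the files under grobid-trainer/resources/dataset/fulltext would list any labels that still carry brackets, if the assumptions above hold for those files.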

@lfoppiano
Collaborator

lfoppiano commented Apr 26, 2024

Regarding point 3, if figures are missing from the figure.tei.xml, it means that something was wrongly tagged by one of the upstream models, either the segmentation model or the fulltext model.

In this case you should examine the training data generated by the upstream models. See Fig 2 in https://grobid.readthedocs.io/en/latest/Principles/ for more information on what is upstream and downstream.
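
As a rough illustration of the upstream/downstream idea, the sketch below lists the models to inspect when a downstream file is missing content. It covers only the three models named in this thread (segmentation, fulltext, figure) and is a deliberately partial view of the cascade. The `UPSTREAM` table and `models_to_inspect` are hypothetical names for this example, not GROBID code.

```python
from typing import Dict, List, Optional

# Partial, hypothetical view of the cascade: each model maps to the model directly
# upstream of it. Only the three models mentioned in this thread are included.
UPSTREAM: Dict[str, Optional[str]] = {
    "segmentation": None,
    "fulltext": "segmentation",
    "figure": "fulltext",
}


def models_to_inspect(model: str) -> List[str]:
    """Return the upstream models of `model`, ordered from the top of the cascade down."""
    chain = []
    parent = UPSTREAM[model]
    while parent is not None:
        chain.append(parent)
        parent = UPSTREAM[parent]
    # Reverse so that the earliest model in the cascade is checked first.
    return list(reversed(chain))


print(models_to_inspect("figure"))  # ['segmentation', 'fulltext']
```

Checking the earliest model first makes sense here because an error in the segmentation output propagates to every model below it.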

I recommend working in batches of documents: check one model's data for the whole batch, then move to the next model. It usually takes time to get familiar with each model's structure, so working on the same model before moving to the next may be more efficient. It's just a recommendation, though.

I'll do my best to explain what I have been doing; feel free to point out any unclear parts. 😅

  1. First check the generated data for the segmentation model:
    1. If you made corrections, keep the corrected file.
    2. If the model's output is already good, ignore the file.
  2. Then move to the fulltext model's generated files. There are three possibilities:
    1. The segmentation model did not lose any data, so the body of the article is complete. You can correct the file.
    2. The segmentation model mislabeled a substantial part of the document, and this part is missing. Ignore the file for the time being, until the segmentation model has been retrained with the current document's segmentation training file.
    3. The segmentation model mislabeled a substantial part of the document in the sense that more data than expected is available. You can remove the surplus and correct the rest of the file.
  3. Repeat point 2 for the next downstream model (e.g. the figure model).
  4. After you have finished a batch of documents, retrain the segmentation model and regenerate the training data for the documents whose fulltext files were missing data (point 2.2).
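
The decision in point 2 and the follow-up in point 4 can be sketched as a small triage helper. This is only an illustration of the bookkeeping: the judgement about whether a file is complete, missing data, or has surplus data still has to be made by a person, and all names below (`BodyCoverage`, `triage_fulltext_file`, `documents_to_regenerate`) are hypothetical.

```python
from enum import Enum
from typing import Dict, List


class BodyCoverage(Enum):
    """How the generated fulltext file compares with the real article body."""
    COMPLETE = "complete"  # case 2.1: nothing was lost upstream
    MISSING = "missing"    # case 2.2: a substantial part is absent
    SURPLUS = "surplus"    # case 2.3: extra, mislabeled content is present


def triage_fulltext_file(coverage: BodyCoverage) -> str:
    """Map a manual assessment to the action described in point 2."""
    if coverage is BodyCoverage.COMPLETE:
        return "correct"
    if coverage is BodyCoverage.MISSING:
        return "defer"
    return "trim-and-correct"


def documents_to_regenerate(batch: Dict[str, BodyCoverage]) -> List[str]:
    """Point 4: documents to regenerate once the segmentation model is retrained."""
    return sorted(
        doc for doc, coverage in batch.items()
        if triage_fulltext_file(coverage) == "defer"
    )


# Made-up document names and assessments, for illustration only.
batch = {
    "doc_a": BodyCoverage.COMPLETE,
    "doc_b": BodyCoverage.MISSING,
    "doc_c": BodyCoverage.SURPLUS,
}
print(documents_to_regenerate(batch))  # ['doc_b']
```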

Like the training process itself, this procedure can be carried out iteratively. Let me know if any points are not clear.

@Schroedi
Author

Thank you for taking the time and for your detailed answer! It really helped me.

> @Schroedi regarding point 1, the documentation refers to the fulltext model, so you should check the data under grobid-trainer/resources/dataset/fulltext. The annotations there should all follow the guidelines.
>
> The example you linked belongs to another model, which I'm not sure is even used (its last update was in 2017 😅).

You're right. I think this one should be fixed though: #1107

The last open point is 2 (listings). Is there any special handling, or should they just be figures?

@lfoppiano
Collaborator

Thanks for the PR #1107. We might merge it at the next iteration on the models (which might happen in a few months) so that we don't forget about it.

As for the listings, I don't really know. I checked quickly but did not find any training data.
