Add information to the header about how the document was processed #1103

Open
lfoppiano opened this issue Apr 24, 2024 · 0 comments

Collaborator

lfoppiano commented Apr 24, 2024

I sometimes find it difficult to detect whether certain processing steps have been performed by the GROBID engine. One example is sentence segmentation: detecting it requires scanning all or part of the document.
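To illustrate the current workaround, here is a minimal sketch of such a scan. It assumes that sentence-segmented output wraps each sentence in a TEI `<s>` element; the function name `has_sentence_segmentation` is made up for this example and is not part of GROBID.

```python
import xml.etree.ElementTree as ET

# TEI namespace in ElementTree's "{uri}tag" notation
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def has_sentence_segmentation(tei_xml: str) -> bool:
    """Return True if at least one <s> element occurs anywhere in the TEI.

    Assumption: segmented output marks sentences with <s> elements.
    The helper name is hypothetical, for illustration only.
    """
    root = ET.fromstring(tei_xml)
    # iter() walks the tree lazily; in the worst case (no <s> at all)
    # the whole document is traversed, which is the cost described above.
    return next(root.iter(f"{TEI_NS}s"), None) is not None


segmented = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p><s>First sentence.</s><s>Second sentence.</s></p>"
    "</body></text></TEI>"
)
plain = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p>First sentence. Second sentence.</p>"
    "</body></text></TEI>"
)
print(has_sentence_segmentation(segmented))
print(has_sentence_segmentation(plain))
```

A document without any `<s>` element forces a full traversal before the answer is known, and a very short or empty document gives no reliable signal at all.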

I suggest recording somewhere in the header, perhaps in the part related to the application, which types of processing were used to produce the output TEI. This would help consumers understand part of the structure of the underlying TEI.

Some of the information that could be included:

  • whether sentence segmentation was applied
  • git revision (the version alone might not be enough, since patch releases are not very frequent, although this would be more relevant for development / testing)
  • whether consolidation was used
  • the architecture of the models used

In the case of consolidation, for example, this information could be used to avoid re-running it in subsequent processes. E.g., DataStet would not need to repeat consolidation when processing TEI documents that are already consolidated.
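A rough sketch of how a downstream tool could use such information, if it existed. Everything about the layout here is an assumption made for illustration: the `<label type="parameters">` element inside `<application>`, the key names `sentenceSegmentation` and `consolidation`, the comma-separated `key=value` encoding, and the placeholder version `x.y.z` are all hypothetical and do not describe an existing GROBID output format.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical header layout, invented for this sketch.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><encodingDesc><appInfo>
    <application ident="GROBID" version="x.y.z">
      <label type="parameters">sentenceSegmentation=true, consolidation=true</label>
    </application>
  </appInfo></encodingDesc></teiHeader>
</TEI>"""


def read_processing_info(tei_xml: str) -> dict:
    """Parse the hypothetical 'parameters' label into a dict of strings."""
    root = ET.fromstring(tei_xml)
    label = root.find(
        ".//tei:appInfo/tei:application/tei:label[@type='parameters']", NS
    )
    if label is None or not label.text:
        return {}  # older output without the label: nothing is known
    info = {}
    for item in label.text.split(","):
        key, sep, value = item.partition("=")
        if sep:  # ignore fragments that are not key=value pairs
            info[key.strip()] = value.strip()
    return info


def needs_consolidation(tei_xml: str) -> bool:
    """Consolidate unless the header explicitly says it was already done."""
    return read_processing_info(tei_xml).get("consolidation") != "true"


print(read_processing_info(SAMPLE))
print(needs_consolidation(SAMPLE))
```

With this kind of lookup, the check is a single header read rather than a scan of the body, and documents produced before the information was added simply fall back to being processed again.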
