Add some information in the header about how the document was processed #1103

lfoppiano · 2024-04-24T02:04:58Z

I sometimes find it difficult to detect whether certain processing steps have been performed by the GROBID engine. Sentence segmentation is one example: detecting it requires passing through all or part of the document.

I would suggest adding somewhere in the header, maybe in the part related to the application, which types of processing were used to produce the output TEI. This would help in understanding part of the structure of the underlying TEI.

Some of the information that could be included:

- whether sentence segmentation was applied
- git revision (the version alone might not be enough, since patch releases are not very frequent, although this would be more relevant for development/testing)
- whether consolidation was used
- architecture of the models used
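
To make the list above concrete, here is a minimal sketch of how such information could be attached to the TEI `application` element. It is only an illustration: the `label` children, their `type`/`subtype` values, the function name `build_application`, and all sample values are assumptions made up for this example, not an existing GROBID format.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_application(version, revision, segment_sentences, consolidated, architectures):
    """Build a hypothetical <application> element describing the processing.

    The <label type="..."> layout is an assumption for illustration only.
    """
    app = ET.Element(f"{{{TEI_NS}}}application", {"ident": "GROBID", "version": version})

    def add_label(label_type, value, subtype=None):
        attrs = {"type": label_type}
        if subtype is not None:
            attrs["subtype"] = subtype
        label = ET.SubElement(app, f"{{{TEI_NS}}}label", attrs)
        label.text = value

    add_label("revision", revision)
    add_label("sentenceSegmentation", str(segment_sentences).lower())
    add_label("consolidation", str(consolidated).lower())
    # One label per model, e.g. which architecture produced the header.
    for model, architecture in sorted(architectures.items()):
        add_label("modelArchitecture", architecture, subtype=model)
    return app


# Placeholder values, not taken from a real run.
app = build_application("0.0.0", "abc1234", True, False, {"header": "CRF"})
```

Keeping each piece of information in its own typed element would let a consumer look up a single flag without parsing a free-text description.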

In the case of consolidation, for example, this could help avoid re-running it in subsequent processes. E.g., DataStet processing TEI documents that are already consolidated would not need to repeat the process.
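
As a rough sketch of that use case, a downstream tool could read the flags from the header and skip consolidation when it has already been done. The `label` elements and their `type` values below are hypothetical (nothing in this issue fixes a format yet), and `processing_flags` / `is_consolidated` are names invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical header layout; the <label> elements are an assumption.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <encodingDesc>
      <appInfo>
        <application ident="GROBID" version="0.0.0">
          <label type="consolidation">true</label>
          <label type="sentenceSegmentation">false</label>
        </application>
      </appInfo>
    </encodingDesc>
  </teiHeader>
</TEI>"""


def processing_flags(tei_xml):
    """Return a dict mapping each label type to its text value."""
    root = ET.fromstring(tei_xml)
    flags = {}
    for label in root.iterfind(".//tei:application/tei:label", NS):
        flags[label.get("type")] = (label.text or "").strip()
    return flags


def is_consolidated(tei_xml):
    """True only if the header explicitly says consolidation was applied."""
    return processing_flags(tei_xml).get("consolidation") == "true"


if is_consolidated(SAMPLE):
    print("already consolidated, skipping consolidation")
```

A document without the flag is treated as not consolidated, so older TEI files would simply be processed as they are today.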

lfoppiano added the enhancement label Apr 24, 2024

lfoppiano mentioned this issue Apr 24, 2024

Add some information in the header about how the document was processed #1102

Closed
