Implements font styles in the output XML #936

Open: lfoppiano wants to merge 26 commits into master


lfoppiano
Collaborator

This PR implements the italic, bold, superscript, and subscript styles in the output XML.
See #160 for more information.

@coveralls

coveralls commented Jul 25, 2022

Coverage Status

coverage: 40.696% (+0.8%) from 39.903%
when pulling 188cda5 on feature/add-styles-xml
into be9e652 on master.

@lfoppiano lfoppiano marked this pull request as ready for review September 13, 2022 06:57
@kermitt2 kermitt2 added this to the 0.7.3 milestone Nov 7, 2022
@kermitt2 kermitt2 modified the milestones: 0.7.3, 0.8.0 Apr 23, 2023
@kermitt2
Owner

Hi @lfoppiano !

This branch will require quite a few tests, I think (I suspect it will cause problems for some of the GROBID modules, and I need to check its consistency with Pub2TEI), so I pushed its release to version 0.8.0.

One point about "document structure" versus "narrative style" is the bold style of section titles. As with the italic/bold style of reference markers, the logical "section title" structure is already captured by the <head> element, so I would ignore the style for all section titles.

For example in the attached pdf, the style should be ignored here:

           <div xmlns="http://www.tei-c.org/ns/1.0">
                <head n="1"><hi rend="bold">Introduction</hi></head>

In contrast, the style here should be kept because it corresponds to a highlight within the flow of the paragraph text:

                <p>12. <hi rend="bold">Average tf-idf similarity between citance and title of the cited paper (F12):</hi> We calculate the similarity of each citance with the title of the cited paper and take an average of it.</p>
                <p>13. <hi rend="bold">Maximum tf-idf similarity between citance and title of the cited paper (F13):</hi> We take the maximum of similarity of the citances with the title of the cited paper.</p>

Does it make sense?

qss_a_00170.pdf

@lfoppiano
Collaborator Author

@kermitt2 yes, no problem with pushing it to a later release.

OK with the change you propose.

@lfoppiano lfoppiano self-assigned this Apr 27, 2023
@lfoppiano
Collaborator Author

The crazy part was merging master back into this branch 😅

> For example in the attached pdf, the style should be ignored here:

I've made the change and now the text within the <head> will not have the style applied:

           <div xmlns="http://www.tei-c.org/ns/1.0">
                <head n="1">Introduction</head>
                <p>Literature searches are crucial to discover
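
The rule discussed above (drop the style inside <head>, keep it in running paragraph text) could be sketched roughly as follows. This is only an illustration, not the actual TEIFormatter code: the class `HiStyleSketch`, its `Run` type, and the `render` method are hypothetical names, and element attributes such as n="1" are left out for brevity.

```java
import java.util.List;

// Hypothetical sketch; NOT GROBID's TEIFormatter implementation.
final class HiStyleSketch {

    // A run of text with an optional style name such as "bold" (null means unstyled).
    static final class Run {
        final String text;
        final String style;

        Run(String text, String style) {
            this.text = text;
            this.style = style;
        }
    }

    // Escapes the characters that are significant in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Serializes the runs as the content of the given element.
    // Assumption: styles are ignored only for "head", because the element
    // itself already marks the text as a section title.
    static String render(String element, List<Run> runs) {
        boolean keepStyles = !"head".equals(element);
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(element).append('>');
        for (Run run : runs) {
            String text = escape(run.text);
            if (keepStyles && run.style != null) {
                sb.append("<hi rend=\"").append(run.style).append("\">")
                  .append(text).append("</hi>");
            } else {
                sb.append(text);
            }
        }
        sb.append("</").append(element).append('>');
        return sb.toString();
    }
}
```

With this sketch, a bold "Introduction" run rendered as "head" comes out as plain text inside <head>, while the same kind of run rendered as "p" stays wrapped in <hi rend="bold">.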

> In contrast, the style here should be kept because it corresponds to a highlight within the flow of the paragraph text:

                <p>12. <hi rend="bold">Average tf-idf similarity between citance and title of the cited paper (F12):</hi> We calculate the similarity of each citance with the title of the cited paper and take an average of it.</p>
                <p>13. <hi rend="bold">Maximum tf-idf similarity between citance and title of the cited paper (F13):</hi> We take the maximum of similarity of the citances with the title of the cited paper.</p>

I'm not sure what you mean in this case 🙂

# Conflicts:
#	grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java
#	grobid-core/src/test/java/org/grobid/core/document/TEIFormatterTest.java
Development

Successfully merging this pull request may close these issues.

Sub/superscript are displayed as plain text characters in the TEI output
3 participants