Sub/superscript are displayed as plain text characters in the TEI output #160

Open
MedKhem opened this issue Feb 13, 2017 · 5 comments · May be fixed by #936
@MedKhem
Collaborator

MedKhem commented Feb 13, 2017

First thought: identify pieces of text as sub/superscript based on position, font, etc.

@MedKhem MedKhem self-assigned this Feb 13, 2017
@benjaminkreen

Hey there, I had a quick question. I just started tinkering with GROBID and was wondering whether superscript/subscript identification could be added through training, for example by providing training data like the following:

β-cell Endoplasmic Reticulum Ca2+

<titleStmt>
  <title level="a" type="main">β-cell Endoplasmic Reticulum Ca<sup>2+</sup></title>
</titleStmt>

thanks for the input

@kermitt2
Owner

Subscript and superscript flags are already attached to the tokens, so yes, we could serialize them with <sup> and <sub> elements.
Similarly, we could add <hi> for bold and italic tokens.
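
As a rough illustration of the serialization idea above, here is a minimal sketch in which each token carries superscript/subscript flags and consecutive tokens with the same flag are wrapped in a single element. `StyledToken` and `ScriptSerializer` are hypothetical names made up for this example; they are not GROBID classes, and the real token model and serializer may look quite different.

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical token model for illustration only; not GROBID's own token class. */
final class StyledToken {
    final String text;
    final boolean superscript;
    final boolean subscript;

    StyledToken(String text, boolean superscript, boolean subscript) {
        this.text = text;
        this.superscript = superscript;
        this.subscript = subscript;
    }

    /** Element name implied by the flags, or null for plain text. */
    String element() {
        if (superscript) return "sup";
        if (subscript) return "sub";
        return null;
    }
}

/** Hypothetical serializer: wraps each run of same-flag tokens in one element. */
final class ScriptSerializer {
    static String serialize(List<StyledToken> tokens) {
        StringBuilder out = new StringBuilder();
        String open = null; // name of the element currently open, if any
        for (StyledToken token : tokens) {
            String wanted = token.element();
            if (!Objects.equals(open, wanted)) {
                // The style changed: close the previous element, open the new one.
                if (open != null) out.append("</").append(open).append('>');
                if (wanted != null) out.append('<').append(wanted).append('>');
                open = wanted;
            }
            out.append(escape(token.text));
        }
        if (open != null) out.append("</").append(open).append('>');
        return out.toString();
    }

    /** Escapes the characters that are unsafe in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

With the tokens `Ca` (plain) and `2+` (superscript), `ScriptSerializer.serialize` yields `Ca<sup>2+</sup>`, matching the training example earlier in the thread.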

@lfoppiano lfoppiano assigned lfoppiano and unassigned MedKhem Jul 12, 2022
@lfoppiano
Collaborator

lfoppiano commented Jul 20, 2022

I'm starting to work on implementing this feature.

What should be done when a token combines several styles, such as italic + bold, or italic + bold + superscript?

Also, it seems the place to add this would be TEIFormatter.java, which is already quite big. In particular, I would like to avoid modifying the method segmentIntoSentences, but that seems hard to avoid...

@kermitt2 any advice on this?

@kermitt2
Owner

With the current recognition, the "style" features could in principle support at least italic, bold, and superscript/subscript.

The TEI guidelines introduce <hi> to encode "graphically distinct" text, and there is no constraint on the values. We often see space-separated values (for example <hi rend="bold italic superscript">).
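
To make the space-separated convention concrete, here is a small sketch that builds a `rend` value from a combination of style flags and wraps text in `<hi>`. The class `RendValue`, its method names, and the fixed ordering of the values are assumptions for this example only; nothing here is taken from GROBID's code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of GROBID. */
final class RendValue {
    /** Returns e.g. "bold italic superscript", or "" when no style flag is set. */
    static String of(boolean bold, boolean italic, boolean superscript, boolean subscript) {
        List<String> parts = new ArrayList<>();
        // A fixed order (an assumption) keeps the output stable for the same flags.
        if (bold) parts.add("bold");
        if (italic) parts.add("italic");
        if (superscript) parts.add("superscript");
        if (subscript) parts.add("subscript");
        return String.join(" ", parts);
    }

    /** Wraps already-escaped text in hi, or returns it unchanged when unstyled. */
    static String wrap(String escapedText, String rend) {
        if (rend.isEmpty()) return escapedText;
        return "<hi rend=\"" + rend + "\">" + escapedText + "</hi>";
    }
}
```

For a token that is bold, italic and superscript, `RendValue.of(true, true, true, false)` gives `bold italic superscript`, so a single `<hi>` element carries the whole combination instead of three nested elements.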

In TEI there is also the <rendition> element, which uses CSS; it might be more predictable and would not require further customization of the XML schema.

I think the complicated part is the interaction, and the possible clashes, with other structures and tagging.

  • when this style information should be ignored: For instance, a reference marker is often in bold or is a superscript number. But the logical "reference" structure is already captured by the <ref> labeling, so there is no point in keeping rendering information there. The <hi> is only relevant to text that has no other explicit logical mark-up. The only exception, I think, would be superscript/subscript inside a formula.

  • to maintain hierarchical structures: The <hi> element would be an inline annotation (like <ref>), so it always sits under structure tags like <p> or <s>. This is indeed complicated for sentence segmentation, because the sentence segments are introduced after the initial serialization, directly on the TEI objects (this was done to simplify the serialization: working with a tree structure ensures that we have well-formed XML at the end). The <hi> tags should be manageable in the same way as the <ref> tags in segmentIntoSentences, except when the bold/italic/etc. covers more than one sentence. If a highlight style to be labeled extends beyond the current sentence, we would need to close it at the end of the sentence and re-open it in the next sentence.
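
The close-and-re-open rule in the second point can be sketched as a range split. The code below is only an illustration under simplifying assumptions: sentences are taken to be contiguous character ranges starting at offset 0, and `HighlightSplitter` is a hypothetical name. It does not show how segmentIntoSentences works, which operates on TEI objects rather than raw offsets.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not GROBID's segmentIntoSentences. */
final class HighlightSplitter {
    /**
     * Splits the styled range [start, end) at sentence boundaries.
     * sentenceEnds holds the exclusive end offset of each sentence, ascending;
     * the first sentence is assumed to start at offset 0.
     * Returns one {from, to} pair per sentence that the range overlaps.
     */
    static List<int[]> split(int start, int end, int[] sentenceEnds) {
        List<int[]> pieces = new ArrayList<>();
        int sentenceStart = 0;
        for (int sentenceEnd : sentenceEnds) {
            // Clip the styled range to the current sentence.
            int from = Math.max(start, sentenceStart);
            int to = Math.min(end, sentenceEnd);
            if (from < to) pieces.add(new int[] {from, to});
            sentenceStart = sentenceEnd;
        }
        return pieces;
    }
}
```

For example, a bold range covering offsets 5 to 25 over sentences ending at 10, 20 and 30 is split into three pieces (5 to 10, 10 to 20, 20 to 25), each of which can then be wrapped in its own <hi> inside the corresponding <s>.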

@lfoppiano lfoppiano linked a pull request Jul 22, 2022 that will close this issue
@lfoppiano lfoppiano linked a pull request Jul 25, 2022 that will close this issue
@lfoppiano
Collaborator

I think it's now implemented by injecting <hi rend="bold italic">. The flow of decoration is interrupted by references (that was easy) and sentences (that was a pain).

I've also tried to modularise the code a bit into methods, so that they can be unit tested as separate components.

I tried not to run the realignment of the code 😅, which usually makes a mess...

I'm sending some examples:
Examples.zip
