Sub/superscript are displayed as plain text characters in the TEI output #160

MedKhem · 2017-02-13T15:16:50Z

First thought: identify pieces of text as sub/superscript based on position, font, etc.

benjaminkreen · 2017-04-20T15:58:55Z

Hey there, I had a quick question. I just started tinkering with GROBID and I was wondering whether superscript/subscript identification can be added through training, for example by providing training data like the following:

β-cell Endoplasmic Reticulum Ca²⁺

<titleStmt>
  <title level="a" type="main">β-cell Endoplasmic Reticulum Ca<sup>2+</sup></title>
</titleStmt>

thanks for the input

kermitt2 · 2020-08-13T18:54:30Z

Subscript and superscript flags are attached to the tokens, so yes, we could serialize them with <sup> and <sub> elements.
Similarly, we could add <hi> for bold and italic tokens.
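
To illustrate the idea, here is a minimal sketch of serializing flagged tokens with <sup> and <sub>. It is not GROBID code: the `Token` class, its two boolean flags, and the `serialize` method are hypothetical stand-ins for whatever the real token objects expose, and the output is built as a plain string rather than through the TEI object tree.

```java
import java.util.List;
import java.util.Objects;

public class SupSubSerializer {

    /** Hypothetical token: its text plus the two style flags (not the GROBID class). */
    public static final class Token {
        public final String text;
        public final boolean superscript;
        public final boolean subscript;

        public Token(String text, boolean superscript, boolean subscript) {
            this.text = text;
            this.superscript = superscript;
            this.subscript = subscript;
        }

        /** Element name for this token, or null for plain text. */
        String tag() {
            if (superscript) return "sup";
            if (subscript) return "sub";
            return null;
        }
    }

    /** Escapes the characters that are not allowed in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps each run of consecutive tokens sharing a style in one element. */
    public static String serialize(List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        String open = null; // name of the currently open element, if any
        for (Token t : tokens) {
            String tag = t.tag();
            if (!Objects.equals(tag, open)) {
                if (open != null) out.append("</").append(open).append('>');
                if (tag != null) out.append('<').append(tag).append('>');
                open = tag;
            }
            out.append(escape(t.text));
        }
        if (open != null) out.append("</").append(open).append('>');
        return out.toString();
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
            new Token("Ca", false, false),
            new Token("2", true, false),
            new Token("+", true, false));
        System.out.println(serialize(tokens)); // prints Ca<sup>2+</sup>
    }
}
```

Grouping consecutive tokens with the same flag keeps "2" and "+" inside a single <sup> element instead of producing one element per token.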

lfoppiano · 2022-07-20T06:14:15Z

I'm starting to work on implementing this feature.

What should be done when the token contains combinations? Like italic + bold, or italic+bold+superscript?

Also, it seems that the place to add this would be TEIFormatter.java, which is already quite big. In particular, I wish I could avoid having to modify the method segmentIntoSentences, but that seems quite hard...

@kermitt2 any advice on this?

kermitt2 · 2022-07-20T10:21:31Z

With the current recognition, the "style" features could in principle support at least italic, bold, and superscript/subscript.

The TEI guidelines introduce <hi> to encode "graphically distinct" text and there is no constraint on the values, see here. We often see values space-separated (for example <hi rend="bold italic superscript">).
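
For combinations such as italic + bold + superscript, one possible approach is a single <hi> with space-separated values. The sketch below assumes four boolean style flags and a fixed value order; the class and method names are hypothetical and the value vocabulary ("bold", "italic", "superscript", "subscript") is an illustration, not something TEI prescribes.

```java
import java.util.StringJoiner;

public class RendValue {

    /** Builds a space-separated rend value from the style flags, in a fixed order. */
    public static String rend(boolean bold, boolean italic,
                              boolean superscript, boolean subscript) {
        StringJoiner values = new StringJoiner(" ");
        if (bold) values.add("bold");
        if (italic) values.add("italic");
        if (superscript) values.add("superscript");
        if (subscript) values.add("subscript");
        return values.toString();
    }

    /** Wraps already-escaped text in <hi>, or returns it unchanged if it has no style. */
    public static String hi(String escapedText, boolean bold, boolean italic,
                            boolean superscript, boolean subscript) {
        String rend = rend(bold, italic, superscript, subscript);
        if (rend.isEmpty()) return escapedText;
        return "<hi rend=\"" + rend + "\">" + escapedText + "</hi>";
    }

    public static void main(String[] args) {
        // prints <hi rend="bold italic superscript">2+</hi>
        System.out.println(hi("2+", true, true, true, false));
    }
}
```

A fixed order means the same combination of styles always yields the same attribute value, which makes the output easier to compare and query.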

In TEI, there's also the <rendition> element, which uses CSS; it might be more predictable and would not require further customization of the XML schema.

I think what's complicated are the relations and the possible clash with other structures/tagging.

When this style information should be ignored: for instance, a reference marker is often in bold or set as a superscript number. But the logical "reference" structure is already captured by the <ref> labeling, so there is no point in keeping rendering information there. The <hi> is only relevant to text without any other explicit logical mark-up. The only exception, I think, would be superscript/subscript inside a formula.
To maintain hierarchical structures: the <hi> element would be an inline annotation (like <ref>), so it is always under structure tags like <p> or <s>. This is indeed complicated for sentence segmentation, because the sentence segments are introduced after the initial serialization, directly on the TEI objects (this was to simplify the serialization: working with a tree structure ensures that we have well-formed XML at the end). The <hi> tags should be manageable like the <ref> tags in segmentIntoSentences, except when the bold/italic/etc. covers more than the sentence. If a highlight style to be labeled covers more than the current sentence, we would need to close it at the end of the sentence and re-open it in the next sentence.
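
The close-and-re-open rule can be sketched on character offsets: clip the highlight range against each sentence range and emit one <hi> per non-empty piece. This is only an illustration under assumed names (`Span`, `splitBySentences`); the actual segmentIntoSentences works on TEI objects, not on offset pairs.

```java
import java.util.ArrayList;
import java.util.List;

public class HighlightSplitter {

    /** Half-open character range [start, end). Hypothetical helper type. */
    public static final class Span {
        public final int start;
        public final int end;

        public Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /**
     * Returns the parts of the highlight that fall inside each sentence,
     * so that no <hi> element ever crosses a sentence boundary.
     */
    public static List<Span> splitBySentences(Span highlight, List<Span> sentences) {
        List<Span> pieces = new ArrayList<>();
        for (Span s : sentences) {
            int from = Math.max(highlight.start, s.start);
            int to = Math.min(highlight.end, s.end);
            if (from < to) {
                pieces.add(new Span(from, to)); // close at sentence end, re-open in the next
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        List<Span> sentences = List.of(new Span(0, 20), new Span(21, 50));
        Span bold = new Span(10, 30); // starts in sentence 1, ends in sentence 2
        System.out.println(splitBySentences(bold, sentences)); // prints [[10,20), [21,30)]
    }
}
```

The same clipping could be applied against <ref> ranges, except that the overlapping part would be dropped rather than kept, since the rendering information is not wanted inside a reference marker.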

lfoppiano · 2022-07-28T02:51:15Z

I think it's now implemented by injecting <hi rend="bold italic">. The flow of decoration is interrupted by references (that was easy) and sentences (that was a pain).

I've also tried to modularise the code a bit into methods, so that they can be unit tested as separate components.

I tried not to run the realignment of the code 😅 which usually makes a mess...

I'm sending some examples:
Examples.zip

MedKhem added the enhancement label Feb 13, 2017

MedKhem self-assigned this Feb 13, 2017

de-code mentioned this issue May 3, 2017

Space between regular character and sub-script character/number #179

Open

lfoppiano assigned lfoppiano and unassigned MedKhem Jul 12, 2022

lfoppiano linked a pull request Jul 22, 2022 that will close this issue

Implements font styles in the output XML #936

Open

lfoppiano linked a pull request Jul 25, 2022 that will close this issue

Implements font styles in the output XML #936

Open
