Mask certain tokens during annotation #681

myedibleenso · 2022-11-19T01:06:33Z

There are many cases where a token is better off invisible to a sequence tagger or shift-reduce parser (e.g., the bullet in a bulleted list, ®, °). If such symbols were not seen in training, they may have unexpected effects on the output sequence. It would be convenient to be able to mask tokens for specific steps in the annotation pipeline, selecting them with a regular expression or a list of strings.

kwalcock · 2023-02-23T16:16:04Z

This is with the veil:

This is without:

Here's the extra code:

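    // Veil every '®' and '*' character, one single-character range per occurrence.
    val veil = text.indices
        .filter { index =>
          val char = text.charAt(index)
          char == '®' || char == '*'
        }
        .map { index => index to index }
    // Build the document from the veiled text, then annotate it as usual.
    val veiledText = new VeiledText(text, veil)
    val document = processor.annotate(veiledText.mkDocument(processor))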
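
The issue asks for masking by regular expression, so the character filter could plausibly be generalized along these lines. This is only a sketch: `VeilSketch` and `veilRanges` are hypothetical names, not part of processors, and it assumes the veil is a sequence of inclusive character ranges like the one built in the snippet above.

```scala
import scala.util.matching.Regex

object VeilSketch {
  // Hypothetical helper (not part of the processors library):
  // returns one inclusive character range per non-empty match of the pattern.
  def veilRanges(text: String, pattern: Regex): Seq[Range] =
    pattern.findAllMatchIn(text)
      .filter(m => m.end > m.start)      // skip zero-length matches
      .map(m => m.start to (m.end - 1))  // Regex ends are exclusive, ranges here are inclusive
      .toVector
}
```

If the assumption about the veil's shape holds, the result of something like `VeilSketch.veilRanges(text, "[®*]".r)` could be passed where `veil` is used above.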

MihaiSurdeanu · 2023-02-23T16:27:38Z

Nice!

myedibleenso added the enhancement label Nov 19, 2022

myedibleenso mentioned this issue Nov 20, 2022

Optionally preserve unrecognized tokens #680

Open
