Mask certain tokens during annotation #681

Open
myedibleenso opened this issue Nov 19, 2022 · 2 comments

@myedibleenso
Member

There are many cases where a token is better off invisible to a sequence tagger or shift-reduce parser (e.g., the bullet in a bulleted list, ®, °, etc.). If such symbols were not seen in training, they may have unexpected effects on the output sequence. It would be convenient to be able to mask tokens for specific steps in the annotation pipeline, with the tokens to mask specified by a regular expression or a list of strings.

@kwalcock
Member

This is with the veil:

[Image: Screenshot 2023-02-23 at 9.06.05 AM]

This is without:

[Image: Screenshot 2023-02-23 at 9.09.55 AM]

Here's the extra code:

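    // Collect the index of every character that should be hidden.
    val veil = text.indices
        .filter { index =>
          val char = text.charAt(index)
          char == '®' || char == '*'
        }
        // Each veiled span is an inclusive range, here one character wide.
        .map { index => index to index }
    val veiledText = new VeiledText(text, veil)
    // Make the document from the veiled text, then annotate it as usual.
    val document = processor.annotate(veiledText.mkDocument(processor))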
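
The issue asks for masks given as a regular expression or a list of strings, so here is a minimal sketch of how such ranges might be computed. `VeilRanges`, `fromRegex`, and `fromStrings` are hypothetical names used only for illustration; they are not part of the library. The sketch assumes the veil is a sequence of inclusive character ranges, as in the snippet above.

```scala
import scala.util.matching.Regex

// Hypothetical helpers (not part of the library) that turn a regular
// expression or a list of strings into inclusive character ranges.
object VeilRanges {

  // One inclusive range per non-empty regex match.
  def fromRegex(text: String, regex: Regex): Seq[Range] =
    regex.findAllMatchIn(text)
      .filter(m => m.end > m.start)     // skip zero-length matches
      .map(m => m.start to (m.end - 1)) // m.end is exclusive
      .toVector

  // Each string is quoted so that it is matched literally.
  def fromStrings(text: String, strings: Seq[String]): Seq[Range] = {
    val nonEmpty = strings.filter(_.nonEmpty)
    if (nonEmpty.isEmpty) Vector.empty
    else fromRegex(text, nonEmpty.map(Regex.quote).mkString("|").r)
  }
}
```

The result could then stand in for `veil` above, for example `new VeiledText(text, VeilRanges.fromStrings(text, Seq("®", "*")))`, assuming the constructor accepts any sequence of ranges. Matches do not overlap, and when two listed strings could match at the same position, the one listed first wins.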

@MihaiSurdeanu
Contributor

Nice!
