Optionally preserve unrecognized tokens #680

Open
myedibleenso opened this issue Nov 19, 2022 · 4 comments

Comments

@myedibleenso
Member

The tokenizer cannot handle every character sequence, and symbols it does not recognize are left out of its output. There are scenarios where a user would like to preserve these symbols in the Document that is produced.

I haven't had a hand in the implementation of the tokenizer, but is it possible to reinsert these symbols as tokens in the doc we return? Lemmata, tags, etc. should probably default to some sort of UNK tag.
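To make the idea concrete, here is a minimal sketch of reinserting skipped symbols after tokenization, assuming each token carries character offsets into the original text. The `Token` class, `reinsert_unrecognized`, and the `"UNK"` default are all hypothetical names for illustration, not the processors API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    # Hypothetical token record; not the processors data model.
    word: str
    start: int  # inclusive character offset into the original text
    end: int    # exclusive character offset
    tag: str


def _gap_tokens(text: str, start: int, end: int, unk: str) -> List[Token]:
    """Turn each run of non-whitespace characters in text[start:end] into an UNK token."""
    out = []
    i = start
    while i < end:
        if text[i].isspace():
            i += 1
            continue
        j = i
        while j < end and not text[j].isspace():
            j += 1
        out.append(Token(text[i:j], i, j, unk))
        i = j
    return out


def reinsert_unrecognized(text: str, tokens: List[Token], unk: str = "UNK") -> List[Token]:
    """Return tokens plus UNK-tagged tokens for any text the tokenizer skipped."""
    result = []
    cursor = 0
    for tok in sorted(tokens, key=lambda t: t.start):
        # Anything between the previous token and this one was not recognized.
        result.extend(_gap_tokens(text, cursor, tok.start, unk))
        result.append(tok)
        cursor = max(cursor, tok.end)
    # Trailing characters after the last recognized token.
    result.extend(_gap_tokens(text, cursor, len(text), unk))
    return result
```

Because the original tokens are passed through unchanged and the gaps are filled in afterwards, this step could run after annotation, so the annotators never see the extra symbols.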

Related: if processors moves to a transformers backbone for annotation as planned, will the tokenizer be replaced by a wordpiece tokenizer or will predictions be re-mapped to the word-like tokens recognized by the current tokenizer?

@MihaiSurdeanu
Contributor

  • We will preserve our word-level tokenizer, and then use subword tokenization just within words. So, from the outside, the API stays the same.
  • I have mixed feelings about preserving weird symbols. Tokenization and simpler tasks may work, but this will mess up parsing for sure...
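The first point above (word-level tokens outside, subword pieces inside) can be sketched roughly as follows. This is only an illustration under stated assumptions: `split_into_subwords` is a toy fixed-width splitter standing in for a real wordpiece model, and taking the label of the first piece as the word's label is one common convention, not necessarily what processors will do.

```python
from typing import List, Optional, Tuple


def split_into_subwords(word: str, max_len: int = 4) -> List[str]:
    """Toy stand-in for a wordpiece tokenizer: fixed-width chunks, '##' marks continuations."""
    if not word:
        return []
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def encode(words: List[str]) -> Tuple[List[str], List[int]]:
    """Split each word-level token into pieces and remember which word each piece came from."""
    pieces: List[str] = []
    word_ids: List[int] = []
    for idx, word in enumerate(words):
        sub = split_into_subwords(word)
        pieces.extend(sub)
        word_ids.extend([idx] * len(sub))
    return pieces, word_ids


def to_word_level(piece_labels: List[str], word_ids: List[int], n_words: int) -> List[Optional[str]]:
    """Map per-piece predictions back to words by keeping the first piece's label."""
    labels: List[Optional[str]] = [None] * n_words
    for label, wid in zip(piece_labels, word_ids):
        if labels[wid] is None:
            labels[wid] = label
    return labels
```

Callers only ever see one label per word-level token, which is why the external API can stay the same regardless of how the words are split internally.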

@myedibleenso
Member Author

> I have mixed feelings about preserving weird symbols. Tokenization and simpler tasks may work, but this will mess up parsing for sure...

Agreed. It needs to be done in a way that doesn't affect annotation (see #681).

@myedibleenso
Member Author

Although if processors moves to using transformers, re-introducing tokens after annotation might be confusing when inspecting things like attention weights.

@kwalcock
Member

See #716 and #290. The UNK value is presently an empty string.
