Allow whitespace-only pieces #984

Open
bauwenst opened this issue Feb 26, 2024 · 0 comments

bauwenst commented Feb 26, 2024

From what I understand, the allow_whitespace_only_pieces training argument, implemented in the word-level pretokeniser at this line, allows multiple consecutive spaces to stay together in the strings that the pretokeniser produces (let's call them "pre-tokens"). Because the trainer only draws its candidate substrings from inside pre-tokens, a pre-token that contains several spaces lets it learn tokens consisting of more than one space.
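
The behaviour described above can be sketched roughly as follows. This is a simplified Python model written for illustration, not the library's actual code: the function name pretokenise is hypothetical, spaces are mapped to the ▁ meta symbol by a plain string replace, and no dummy prefix is added to the first word. The example sentence is assumed to contain four spaces between "a" and "test".

```python
SPACE = "\u2581"  # the "▁" meta symbol that stands in for a space


def pretokenise(text: str, allow_whitespace_only_pieces: bool) -> list[str]:
    """Simplified model of word-level pre-tokenisation (illustrative only)."""
    normalised = text.replace(" ", SPACE)
    pre_tokens: list[str] = []
    in_ws_run = False  # True while we are inside a run of consecutive spaces

    for i, ch in enumerate(normalised):
        is_ws = ch == SPACE
        # Start a new pre-token at the very beginning, or at a space that
        # either opens a run or (when the flag is off) at every single space.
        if i == 0 or (is_ws and (not in_ws_run or not allow_whitespace_only_pieces)):
            pre_tokens.append("")
            in_ws_run = True
        if in_ws_run and not is_ws:
            in_ws_run = False
        pre_tokens[-1] += ch

    return pre_tokens


sentence = "This is a" + " " * 4 + "test sentence."
print(" ".join(pretokenise(sentence, allow_whitespace_only_pieces=False)))
# This ▁is ▁a ▁ ▁ ▁ ▁test ▁sentence.
print(" ".join(pretokenise(sentence, allow_whitespace_only_pieces=True)))
# This ▁is ▁a ▁▁▁▁test ▁sentence.
```

With the flag off, every space opens a new pre-token, so no pre-token can hold more than one space; with the flag on, a run of spaces stays attached to the word that follows it.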

I have two questions:

  1. Is this not a confusing name for the option? When allow_whitespace_only_pieces is false, the pretokeniser produces pre-tokens that consist of whitespace only, which is completely counterintuitive. (It also means that at least one whitespace-only token will be allowed.)
  2. For my application, I need what you would actually expect an option called "allow whitespace-only pieces" to do: produce pre-tokens that contain only whitespace, and never mix whitespace with non-whitespace in a token. Is this straightforward to do by setting training options, or does it need extra implementation?

To illustrate all of this with an example: the sentence This is a    test sentence. (with four spaces between "a" and "test") is split as follows in the three cases outlined above:

  • allow_whitespace_only_pieces = false: This ▁is ▁a ▁ ▁ ▁ ▁test ▁sentence. (seemingly allows pieces that are whitespace-only)
  • allow_whitespace_only_pieces = true: This ▁is ▁a ▁▁▁▁test ▁sentence.
  • What I need: This ▁ is ▁ a ▁▁▁▁ test ▁ sentence.
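
For reference, the split in the last bullet can be modelled as cutting the text into maximal runs of whitespace and non-whitespace. The sketch below only illustrates the desired output; the helper name split_whitespace_runs is hypothetical and is not an existing training option or library function, and the example sentence is assumed to contain four spaces between "a" and "test".

```python
from itertools import groupby

SPACE = "\u2581"  # the "▁" meta symbol that stands in for a space


def split_whitespace_runs(text: str) -> list[str]:
    """Split text into alternating runs of spaces and non-spaces (illustrative only)."""
    normalised = text.replace(" ", SPACE)
    # groupby yields one group per maximal run of characters sharing the same key,
    # so whitespace and non-whitespace never end up in the same pre-token.
    return ["".join(run) for _, run in groupby(normalised, key=lambda ch: ch == SPACE)]


sentence = "This is a" + " " * 4 + "test sentence."
print(" ".join(split_whitespace_runs(sentence)))
# This ▁ is ▁ a ▁▁▁▁ test ▁ sentence.
```

Because every pre-token is either all whitespace or free of whitespace, a trainer that only draws substrings from inside pre-tokens could learn multi-space tokens but never a token that mixes spaces with other characters.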

Thanks.
