duplicate tokens in tokenizers #78

mmoskal · 2024-03-18T20:40:19Z

For example, the llama tokenizer has "<0x20>" as 35 and "▁" (space) as 29871, as well as "<0x21>" as 36 and "!" as 29991, etc.

We need to:

pick the canonical form (probably 29871)
keep a side mapping so that whenever 29871 is allowed in a TokenSet, 35 is allowed too (apply it after "compute_bias()" etc.)
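
A minimal sketch of the two steps above, not the actual implementation. It assumes tokens count as duplicates when they decode to the same bytes, and it models a TokenSet as a plain Python set of token ids. `VOCAB`, `find_duplicates` and the signature of `apply_duplicates` are hypothetical, and token 1000 is made up for illustration. Picking the largest id as canonical is only a heuristic that happens to match the 29871-over-35 example.

```python
# Hypothetical vocabulary slice: token id -> bytes the token decodes to.
VOCAB = {
    35: b" ",        # "<0x20>" byte-fallback token
    29871: b" ",     # "▁" (space)
    36: b"!",        # "<0x21>" byte-fallback token
    29991: b"!",     # "!"
    1000: b"hello",  # made-up token with no duplicate
}


def find_duplicates(vocab):
    """Map each canonical token id to the other ids with the same bytes."""
    by_bytes = {}
    for tok_id, data in vocab.items():
        by_bytes.setdefault(data, []).append(tok_id)

    duplicates = {}
    for ids in by_bytes.values():
        if len(ids) > 1:
            # Assumption: treat the largest id as the canonical form.
            canonical = max(ids)
            duplicates[canonical] = sorted(i for i in ids if i != canonical)
    return duplicates


def apply_duplicates(allowed, duplicates):
    """Return a copy of `allowed` that also allows duplicates of canonical ids."""
    result = set(allowed)
    for canonical, others in duplicates.items():
        if canonical in result:
            result.update(others)
    return result


duplicates = find_duplicates(VOCAB)
allowed = apply_duplicates({29871, 1000}, duplicates)
print(sorted(allowed))
```

The mapping is one-directional in this sketch: allowing a non-canonical id such as 36 does not pull in 29991, which mirrors the "if 29871 is allowed also allows 35" wording.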


mmoskal · 2024-03-18T21:25:31Z

Mostly done. Still need to call apply_duplicates() in more places, in particular somewhere around return_logit_bias(), and possibly after any user-level update to the token set.

mmoskal self-assigned this Mar 18, 2024

mmoskal added a commit that referenced this issue Mar 18, 2024

account for duplicate tokens; see #78

1791cde
