
token count is inconsistent with OpenAI tokenizer #17

Open
GorvGoyl opened this issue Nov 21, 2023 · 1 comment

Comments

@GorvGoyl

As shown below:

(screenshots: the token counts reported by the two tokenizers for the same text)

text:

<|im_start|>dd<|im_sep|>OpenAI's large language models (sometimes referred to as GPT's) process text using tokens, which are common sequences of characters found in a set of text. The models learn to understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.<|im_end|><|im_start|>assistant<|im_sep|><|im_end|><|im_start|>assistant<|im_sep|>
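
For what it's worth, this looks like it may just be a difference in how the `<|im_start|>` / `<|im_sep|>` / `<|im_end|>` markers are handled: one tool encodes them as single special tokens, while the other splits them into ordinary sub-word tokens. Here's a minimal tiktoken sketch (Python, assuming the `cl100k_base` encoding; `<|endoftext|>` stands in for the ChatML markers, which may not be registered as special tokens in the public encoding) that reproduces both behaviours:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Hello <|endoftext|> world"

# Treat the marker as ordinary text: it is split into several
# sub-word tokens, giving a higher count.
as_plain_text = enc.encode(text, disallowed_special=())
print(len(as_plain_text))

# Treat the marker as a registered special token: it becomes a
# single token id, giving a lower count.
as_special = enc.encode(text, allowed_special="all")
print(len(as_special))
```

If the two tokenizers disagree on which of these modes to use for the ChatML markers, the totals will differ by exactly this kind of gap.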
@syntaxtrash

Any update on this? The two tokenizers agree once the special characters are removed.

https://platform.openai.com/tokenizer
(screenshot of the count on the OpenAI tokenizer)

https://tiktokenizer.vercel.app/
(screenshot of the count on tiktokenizer)
