yenniejun/tokenizers-languages

title: Tokenizers Languages
emoji: 🐠
colorFrom: pink
colorTo: green
sdk: streamlit
sdk_version: 1.19.0
app_file: app.py
pinned: false
license: cc

LLM Tokenizers in Multiple Languages

This is the repo for the HuggingFace Space that accompanies the article "All languages are NOT created (tokenized) equal".

Screenshot of the corresponding HuggingFace Space

The Space compares the number of tokens that various LLM tokenizers produce for the same text across many different languages.

Introduction to the project

Large language models such as ChatGPT process and generate text sequences by first splitting the text into smaller units called tokens. This process of tokenization is not uniform across languages, leading to disparities in the number of tokens produced for equivalent expressions in different languages. For example, a sentence in Burmese or Amharic may require 10x more tokens than a similar message in English.
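The disparity described above can be expressed as a simple ratio of token counts. The sketch below is illustrative only and is not the code used in the Space: `utf8_byte_tokenize` is a hypothetical stand-in that emits one token per UTF-8 byte, and `token_ratio` is a helper name introduced here. A real comparison would pass in an actual LLM tokenizer's encode function instead.

```python
def utf8_byte_tokenize(text):
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def token_ratio(tokenize, text, reference):
    """Return how many times as many tokens `text` needs as `reference`."""
    return len(tokenize(text)) / len(tokenize(reference))


english = "hello"
# Three Ethiopic-script characters, used here only as sample text.
# Each one takes 3 bytes in UTF-8, versus 1 byte per ASCII letter.
amharic = "\u1230\u120b\u121d"

ratio = token_ratio(utf8_byte_tokenize, amharic, english)
print(f"{ratio:.1f}x as many tokens as the English text")
```

Even this crude byte-level stand-in shows the effect: the shorter string costs more tokens because its script needs more bytes per character. Learned tokenizers can widen or narrow that gap depending on how well a language was represented in their training data.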

Dataset

MASSIVE is a parallel dataset introduced by Amazon, consisting of 1 million realistic short texts translated across 52 languages and 18 domains. I used the dev split of the dataset, which consists of 2033 texts translated into each of the languages. The dataset is available on HuggingFace and is licensed under the CC BY 4.0 license.
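Because the texts are parallel, token counts can be grouped by language and summarized directly. The following is a minimal sketch of that aggregation, not the Space's actual code: the `(locale, text)` record shape, the function name `median_tokens_by_language`, and the byte-level stand-in tokenizer are all assumptions, and the three records are placeholders rather than rows from MASSIVE.

```python
from collections import defaultdict
from statistics import median


def utf8_bytes(text):
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def median_tokens_by_language(records, tokenize):
    """Group token counts by locale and return the median per locale."""
    counts = defaultdict(list)
    for locale, text in records:
        counts[locale].append(len(tokenize(text)))
    return {locale: median(values) for locale, values in counts.items()}


# Placeholder records for illustration; NOT taken from the dataset.
records = [
    ("en-US", "hello"),
    ("en-US", "hey"),
    ("am-ET", "\u1230\u120b\u121d"),  # three Ethiopic-script characters
]

result = median_tokens_by_language(records, utf8_bytes)
for locale, value in sorted(result.items()):
    print(locale, value)
```

Swapping `utf8_bytes` for a real tokenizer's encode function and feeding in the dev-split texts would give one summary value per language for each tokenizer.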

Word cloud of the word "hey" translated into 51 languages, from the MASSIVE dataset
