LLM Tokenizers in Multiple Languages

title	emoji	colorFrom	colorTo	sdk	sdk_version	app_file	pinned	license
Tokenizers Languages	🐠	pink	green	streamlit	1.19.0	app.py	false	cc

This is the repo for the HuggingFace Space that accompanies the article "All languages are NOT created (tokenized) equal".

The Space compares how many tokens various LLM tokenizers produce for the same text in many different languages.

Introduction to the project

Large language models such as ChatGPT process and generate text sequences by first splitting the text into smaller units called tokens. This process of tokenization is not uniform across languages, leading to disparities in the number of tokens produced for equivalent expressions in different languages. For example, a sentence in Burmese or Amharic may require 10x more tokens than a similar message in English.
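One contributing factor can be illustrated without any tokenizer at all. Byte-level tokenizers start from the UTF-8 bytes of the text, and scripts such as Ethiopic (used for Amharic) need three bytes per character where basic Latin letters need one. The sketch below only counts UTF-8 bytes, which is a rough proxy and not one of the tokenizers compared in the Space; the function name `utf8_byte_count` and the sample greetings are illustrative assumptions, not taken from the repo or the dataset.

```python
def utf8_byte_count(text: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of `text`.

    This is only a proxy: a byte-level tokenizer with no useful learned
    merges for a script would emit roughly one token per byte.
    """
    return len(text.encode("utf-8"))


# Illustrative greetings (assumption: not taken from the MASSIVE dataset).
samples = {
    "English": "hello",
    "Amharic": "ሰላም",
}

for language, text in samples.items():
    print(f"{language}: {len(text)} characters, {utf8_byte_count(text)} UTF-8 bytes")
```

Real tokenizers merge frequent byte sequences into single tokens, so actual token counts depend on how well each language was represented when the tokenizer was trained.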

Dataset

MASSIVE is a parallel dataset introduced by Amazon, consisting of 1 million realistic short texts translated across 52 languages and 18 domains. I used the dev split of the dataset, which consists of 2033 texts translated into each of the languages. The dataset is available on HuggingFace and is licensed under the CC BY 4.0 license.
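Because the texts are parallel, token counts can be compared language by language. The following is a minimal sketch of such a comparison, not the code in app.py: it groups texts by language, takes the median token count, and expresses each language relative to English. The function names, the choice of the median, the locale-style language codes, and the tiny in-memory sample are all assumptions made for illustration, and `placeholder_count` stands in for a real tokenizer.

```python
from collections import defaultdict
from statistics import median
from typing import Callable, Dict, Iterable, Tuple


def median_tokens_by_language(
    rows: Iterable[Tuple[str, str]],
    count_tokens: Callable[[str], int],
) -> Dict[str, float]:
    """Group (language, text) pairs by language and return the median token count."""
    counts = defaultdict(list)
    for language, text in rows:
        counts[language].append(count_tokens(text))
    return {language: median(values) for language, values in counts.items()}


def ratio_to_baseline(medians: Dict[str, float], baseline: str = "en-US") -> Dict[str, float]:
    """Express each language's median as a multiple of the baseline language's median."""
    base = medians[baseline]
    return {language: value / base for language, value in medians.items()}


def placeholder_count(text: str) -> int:
    """Stand-in for a real tokenizer (assumption): counts UTF-8 bytes instead of tokens."""
    return len(text.encode("utf-8"))


# Hypothetical parallel sample; these rows are not taken from the MASSIVE dev split.
rows = [
    ("en-US", "hello"),
    ("en-US", "thank you"),
    ("am-ET", "ሰላም"),
    ("am-ET", "አመሰግናለሁ"),
]

medians = median_tokens_by_language(rows, placeholder_count)
ratios = ratio_to_baseline(medians)
for language, ratio in ratios.items():
    print(f"{language}: {ratio:.2f}x the English median")
```

Any callable that maps a string to a token count can be passed as `count_tokens`, so the placeholder can be swapped for an actual tokenizer without changing the grouping logic.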

yenniejun/tokenizers-languages
