alasdairforsythe/capcode

capcode

Lossless normalization of text for the purpose of tokenization.

Fully UTF-8 Compliant

  • Supports Unicode 13.0.0 (newer versions may be supported, depending on the implementation)
  • Works correctly with UTF-8 text in any Unicode normalization form (NFC, NFD, etc.)
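
To illustrate why the normalization form matters for case handling, here is a small Python sketch using only the standard library. It is a general Unicode illustration, not capcode's own code: the helper `code_points` is a hypothetical name introduced for this example. The same visible letter can be stored as one code point (NFC) or as a base letter plus a combining mark (NFD), and lowercasing has to behave sensibly in both cases.

```python
import unicodedata

# The same letter "É" in two normalization forms.
upper_nfc = unicodedata.normalize("NFC", "\u00c9")  # one code point
upper_nfd = unicodedata.normalize("NFD", "\u00c9")  # "E" + combining acute accent


def code_points(text):
    """Hypothetical helper: list the code points of a string as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(upper_nfc))          # ['U+00C9']
print(code_points(upper_nfd))          # ['U+0045', 'U+0301']

# Lowercasing the NFD form only changes the base letter; the combining
# mark is carried along unchanged.
print(code_points(upper_nfd.lower()))  # ['U+0065', 'U+0301']
```

In both forms the lowercased text still represents "é", so a case normalizer that works code point by code point does not need to renormalize its input first.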

Features

  • No information is lost
  • The encoded text can be decoded exactly back to the original
  • Safe: a language model trained on the encoded text can still learn when uppercase is used, because the case information is kept in the encoding rather than discarded
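
The lossless property above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not capcode's actual encoding: the marker character `MARK` (a private-use code point), the per-character scheme, and the function names `encode` and `decode` are all hypothetical. The idea is to lowercase each uppercase character and record that fact with a marker, so the original can be rebuilt exactly.

```python
# Hypothetical marker meaning "the next character was uppercase".
# A private-use code point is assumed here; capcode's real markers may differ.
MARK = "\ue000"


def _is_reversible_upper(ch):
    """True only if lowercasing ch can be undone exactly by uppercasing."""
    low = ch.lower()
    return low != ch and len(low) == 1 and low.upper() == ch


def encode(text):
    """Lowercase reversible uppercase characters, prefixing each with MARK."""
    out = []
    for ch in text:
        if ch == MARK:
            out.append(MARK + MARK)        # escape a literal marker
        elif _is_reversible_upper(ch):
            out.append(MARK + ch.lower())
        else:
            out.append(ch)                 # left untouched, so nothing is lost
    return "".join(out)


def decode(encoded):
    """Invert encode(): restore uppercase characters and literal markers."""
    out = []
    chars = iter(encoded)
    for ch in chars:
        if ch != MARK:
            out.append(ch)
            continue
        nxt = next(chars, None)
        if nxt is None:
            raise ValueError("dangling marker at end of input")
        out.append(MARK if nxt == MARK else nxt.upper())
    return "".join(out)


sample = "Hello WORLD, \u0130stanbul and stra\u00dfe"
print(decode(encode(sample)) == sample)  # True
```

Characters whose case mapping is not a clean one-to-one round trip (for example "İ", whose lowercase form is two code points in Python) are passed through unchanged in this sketch, which keeps the round trip exact at the cost of leaving them unnormalized.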
