FEAT: Support unicode strings, not just byte sequences #492

Open
NickCrews opened this issue Jul 15, 2022 · 0 comments

As far as I understand it, currently all data is represented as byte arrays.

I understand this is a very large feature request, but would this project consider supporting Unicode strings in addition to, or instead of, byte sequences? I really like some of the implementations here, but I need to apply them to Unicode strings. For instance, a 4-byte character such as an emoji currently gets treated as 4 different tokens, but I would like it to be treated as one token.
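To make the difference concrete, here is a small Python sketch (Python is only my choice for illustration, and none of these names come from this project). It assumes UTF-8 encoding and shows the same emoji split into tokens two ways: by code point and by byte.

```python
# Illustration only: plain Python, no project APIs involved.
emoji = "\U0001F600"             # one code point (grinning face)
encoded = emoji.encode("utf-8")  # the same character as UTF-8 bytes

# Iterating a str yields code points; iterating bytes yields integers.
str_tokens = list(emoji)
byte_tokens = list(encoded)

print(len(str_tokens))   # 1
print(len(byte_tokens))  # 4
```

The behavior I am asking for corresponds to the first tokenization; the current behavior, as far as I understand it, corresponds to the second.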

I think this change could be a bit disruptive because it would introduce an inconsistency. For instance, if you supplied the 4 bytes of that emoji as a string type, they would be interpreted as one token, but if you supplied those same 4 bytes as a bytestring type, they would be treated as 4 different tokens. Not necessarily breaking, but ugly.
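A rough sketch of how that inconsistency could surface, using a generic Levenshtein distance written in plain Python. The `edit_distance` function is hypothetical and is not this project's implementation; it only assumes that tokens are whatever iterating the input yields.

```python
def edit_distance(a, b):
    """Levenshtein distance over any two sequences of comparable tokens.

    Hypothetical helper for illustration, not part of this project.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


s1 = "a\U0001F600"  # "a" followed by one emoji
s2 = "ax"

# Same text, two token models, two different answers.
as_str = edit_distance(s1, s2)
as_bytes = edit_distance(s1.encode("utf-8"), s2.encode("utf-8"))

print(as_str)    # 1: the emoji is substituted by "x"
print(as_bytes)  # 4: one substitution plus three deletions
```

So a caller would get a distance of 1 or 4 for the same text depending only on the type passed in, which is the ugliness I mean.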

This is totally unrelated to bioinformatics, so I understand if this is out of scope for you. Thank you!
