FEAT: Support unicode strings, not just byte sequences #492

NickCrews · 2022-07-15T21:03:11Z

As far as I understand it, currently all data is represented as byte arrays.

I understand this is a very large feature request, but would this project consider supporting unicode strings in addition to, or instead of, byte sequences? I really like some of the implementations in here, but I need to apply them to unicode strings. For instance, a character that takes 4 bytes in UTF-8, such as an emoji, currently gets treated as 4 different tokens, but I would like it to be treated as one token.
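To make the difference concrete, here is a minimal sketch in plain Python. It does not use this project's API; the variable names are only for illustration, and it assumes the text is UTF-8 encoded.

```python
# U+1F600 GRINNING FACE takes 4 bytes when encoded as UTF-8.
text = "a\U0001F600b"
data = text.encode("utf-8")

# Byte-level view: the emoji contributes four separate tokens.
byte_tokens = list(data)

# Code-point view: the emoji is a single token.
char_tokens = list(text)

print(len(byte_tokens))  # 6
print(len(char_tokens))  # 3
```

The byte-level view is what I understand the current behavior to be; the code-point view is what I am asking for.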

I think this change could be a bit disruptive because it would introduce inconsistencies. For instance, if you supplied the 4 bytes of that emoji as a string type, it would be interpreted as one token, but if you supplied those same 4 bytes as a bytestring type, they would be treated as 4 different tokens. Not necessarily breaking, but ugly.
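A rough sketch of that inconsistency, assuming the tokenization were dispatched on the input type. The `tokenize` function is hypothetical and is not part of this project:

```python
from typing import List, Union


def tokenize(seq: Union[str, bytes]) -> List[Union[str, int]]:
    """Hypothetical helper: split input into tokens based on its type."""
    if isinstance(seq, str):
        # One token per Unicode code point.
        return list(seq)
    # One token per byte (each token is an int in range 0..255).
    return list(bytes(seq))


emoji = "\U0001F600"
as_str = tokenize(emoji)                    # 1 token
as_bytes = tokenize(emoji.encode("utf-8"))  # 4 tokens

print(len(as_str), len(as_bytes))
```

The same underlying data yields a different number of tokens depending only on the Python type it arrives in.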

This is totally unrelated to bioinformatics, so I understand if this is out of scope for you. Thank you!
