As far as I understand it, currently all data is represented as byte arrays.
I understand this is a very large feature request, but would this project consider supporting Unicode strings in addition to, or instead of, byte arrays? I really like some of the implementations here, but I need to apply them to Unicode strings. For instance, a 4-byte character such as an emoji is currently treated as 4 separate tokens, but I would like it to be treated as one token.
I think this change could be a bit disruptive because it would introduce an inconsistency. For instance, if you supplied the 4 bytes of that emoji as a string type, they would be interpreted as one token, but if you supplied those same 4 bytes as a bytestring type, they would be treated as 4 different tokens. Not necessarily breaking, but ugly.
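To make the difference concrete, here is a minimal Python sketch of the two tokenizations I have in mind. It is not this project's API: `byte_tokens` and `code_point_tokens` are hypothetical helper names I made up for illustration, and I am assuming the input is UTF-8 encoded.

```python
# Hypothetical helpers for illustration only; neither is part of this project.

def byte_tokens(text: str) -> list:
    """Tokenize the way a byte-array API would: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def code_point_tokens(text: str) -> list:
    """Tokenize the way a Unicode-aware API might: one token per code point."""
    return list(text)


emoji = "\U0001F600"  # grinning face, encoded as 4 bytes in UTF-8

print(len(byte_tokens(emoji)))        # 4
print(len(code_point_tokens(emoji)))  # 1
```

The same 4 bytes yield 4 tokens or 1 token depending only on the type they arrive as, which is the inconsistency I mean. Note that one token per code point is itself a simplification: some emoji are sequences of several code points, so treating every user-perceived character as one token would need grapheme cluster segmentation on top of this.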
This is totally unrelated to bioinformatics, so I understand if this is out of scope for you. Thank you!