Any plan to add string iterator? #259
Comments
Can you elaborate? If you mean iterating over the code point values, we do fast transcoding to UTF-32. After UTF-32 transcoding, iteration is trivial. For large inputs, it might make sense to do block-wise UTF-32 decoding to avoid overwhelming the cache. Byte-by-byte processing is inefficient and not something we want to encourage.
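To make the block-wise idea concrete, here is a minimal sketch, not the library's implementation. `decode_block` and `for_each_code_point` are hypothetical names; `decode_block` is a plain scalar stand-in for a fast transcoder (in practice one would presumably call something like `simdutf::convert_utf8_to_utf32` there). The sketch assumes the input has already been validated as UTF-8.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical scalar stand-in for a fast UTF-8 -> UTF-32 transcoder.
// Assumes `in` is valid UTF-8 and contains only complete sequences.
// Writes the code points to `out` and returns how many were written.
inline std::size_t decode_block(std::string_view in, char32_t* out) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out[n++] = cp;
        i += len;
    }
    return n;
}

// Calls visit(code_point) for every code point in `utf8`, transcoding
// at most `block_bytes` bytes at a time so the UTF-32 buffer stays small.
template <class Visit>
void for_each_code_point(std::string_view utf8, Visit visit,
                         std::size_t block_bytes = 4096) {
    if (block_bytes < 4) block_bytes = 4;    // room for one full sequence
    std::vector<char32_t> buf(block_bytes);  // at most one code point per byte
    while (!utf8.empty()) {
        std::size_t take = utf8.size() < block_bytes ? utf8.size() : block_bytes;
        // Back up so the block does not end inside a multi-byte sequence:
        // the byte right after the block must not be a continuation byte.
        while (take < utf8.size() &&
               (static_cast<unsigned char>(utf8[take]) & 0xC0) == 0x80)
            --take;
        std::size_t n = decode_block(utf8.substr(0, take), buf.data());
        for (std::size_t i = 0; i < n; ++i) visit(buf[i]);
        utf8.remove_prefix(take);
    }
}
```

The only subtle part is the block boundary: a fixed-size cut can land in the middle of a multi-byte sequence, so the block is shortened until the next byte is a lead byte. The block size is a tuning knob; 4096 here is an arbitrary placeholder, not a measured value.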
Yes, I think what I want is to iterate over code point values or something like that.

char buf[] = "Foo © bar 𝌆 baz ☃ qux";
...
for (char8_t x: itor) {
    std::cout << x; // and we can get each UTF8 character like "©", "𝌆", "☃"
}

A simple iterator is easy to implement. However, taking advantage of features such as SIMD may be difficult with byte-by-byte iteration.
Interesting.
I think a character-wise iterator can still use the SIMD backend by silently transcoding bigger chunks of the input. The question is whether performance is key in such cases: if you want to analyse the input char-by-char, you need some extra processing of each char code anyway. In any case, we need to define what it yields. It would be good to take a look at how the Go runtime handles iteration over UTF-8.
It would be great if there was some way to iterate over something like each UTF8 character.
Are there plans to add this in the future?