Commit

string.c (mrb_utf8_strlen): handle invalid UTF-8 sequence; fix #6255
The previous SWAR version assumed valid UTF-8 when counting the code
points in a string, but we need to handle invalid sequences as well. We
now use `search_nonascii` to skip over runs of single-byte characters
for performance. The new version is even faster than the SWAR version,
probably because `search_nonascii` uses SSE2 on Intel-compatible CPUs
(which I use).
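
For illustration, the SWAR counting that the previous version relied on can be sketched roughly as follows. This is a minimal standalone sketch, not the mruby source: it assumes a 64-bit word, and the names `swar_count` and `popcount64` are hypothetical. It counts every byte that is not a continuation byte (`10xxxxxx`), which equals the number of code points only when the input is well-formed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MASK1 0x0101010101010101ULL

/* Portable bit count (hypothetical helper for this sketch). */
static unsigned popcount64(uint64_t x)
{
  unsigned n = 0;
  while (x) { x &= x - 1; n++; }
  return n;
}

/* Returns the number of bytes that are NOT UTF-8 continuation bytes. */
static size_t swar_count(const char *str, size_t byte_len)
{
  const char *p = str;
  const char *e;
  size_t cont = 0;   /* continuation bytes seen so far */

  assert(str != NULL);
  e = str + byte_len;

  /* Word-at-a-time part: keep the top two bits of each byte, add 0x40
     to each byte, and AND the results; only bytes whose top bits were
     10 keep bit 7 set. Carries only reach bit 0 of the next byte,
     which the AND with t1 masks out. */
  for (; (size_t)(e - p) >= sizeof(uint64_t); p += sizeof(uint64_t)) {
    uint64_t t0, t1, t2;
    memcpy(&t0, p, sizeof t0);
    t1 = t0 & (MASK1 * 0xc0);
    t2 = t1 + (MASK1 * 0x40);
    cont += popcount64(t1 & t2);
  }
  /* Byte-at-a-time tail. */
  for (; p < e; p++) {
    if (((unsigned char)*p & 0xc0) == 0x80) cont++;
  }
  return byte_len - cont;
}
```

With this sketch, a lone continuation byte such as `"\x80"` contributes 0 to the count, and a truncated sequence such as `"\xe3\x81"` contributes 1, because nothing checks that the sequences are complete.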
matz committed Apr 29, 2024
1 parent c9ae8df commit 714ef4c
Showing 1 changed file with 10 additions and 17 deletions.
27 changes: 10 additions & 17 deletions src/string.c
@@ -417,26 +417,19 @@ static inline uint32_t popcount(bitint x)
 mrb_int
 mrb_utf8_strlen(const char *str, mrb_int byte_len)
 {
-  mrb_int len = 0;
-
   const char *p = str;
-  const char *be = p + sizeof(bitint) * (byte_len / sizeof(bitint));
-  for (; p < be; p+=sizeof(bitint)) {
-    bitint t0;
+  const char *e = str + byte_len;
+  mrb_int len = 0;
 
-    memcpy(&t0, p, sizeof(bitint));
-    const bitint t1 = t0 & (MASK1*0xc0);
-    const bitint t2 = t1 + (MASK1*0x40);
-    const bitint t3 = t1 & t2;
-    len += popcount(t3);
-  }
-  len = sizeof(bitint) * (byte_len / sizeof(bitint)) - len;
+  while (p < e) {
+    const char *np = search_nonascii(p, e);
 
-  if (byte_len % sizeof(bitint)) {
-    const char *e = str + byte_len;
-    while (p < e) {
-      if (utf8_islead(*p)) len++;
-      p++;
+    len += np - p;
+    if (np == e) break;
+    p = np;
+    while (NOASCII(*p)) {
+      p += mrb_utf8len(p, e);
+      len++;
     }
   }
   return len;
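
The shape of the new loop can be tried outside mruby with a small standalone sketch like the one below. It is not the mruby implementation: `find_nonascii` and `seq_len` are hypothetical stand-ins for `search_nonascii` and `mrb_utf8len`, written as plain byte loops (no SSE2), and `seq_len` is assumed to return 1 for any malformed or truncated sequence. The sketch also adds a `p < e` check to the inner loop because it does not assume a NUL-terminated buffer.

```c
#include <stddef.h>

/* Stand-in for search_nonascii: first byte with the high bit set, or e. */
static const char *find_nonascii(const char *p, const char *e)
{
  while (p < e && ((unsigned char)*p & 0x80) == 0) p++;
  return p;
}

/* Stand-in for mrb_utf8len: byte length of the sequence at p, or 1 if
   the sequence is malformed or runs past e. (Assumption: overlong forms
   and surrogates are not rejected in this sketch.) */
static size_t seq_len(const char *p, const char *e)
{
  unsigned char c = (unsigned char)*p;
  size_t n, i;

  if (c < 0x80) return 1;
  else if ((c & 0xe0) == 0xc0) n = 2;
  else if ((c & 0xf0) == 0xe0) n = 3;
  else if ((c & 0xf8) == 0xf0) n = 4;
  else return 1;                       /* stray continuation or bad lead */
  if ((size_t)(e - p) < n) return 1;   /* truncated at end of buffer */
  for (i = 1; i < n; i++) {
    if (((unsigned char)p[i] & 0xc0) != 0x80) return 1;
  }
  return n;
}

/* Counts characters, treating each invalid byte as one character. */
static size_t count_chars(const char *str, size_t byte_len)
{
  const char *p = str;
  const char *e = str + byte_len;
  size_t len = 0;

  while (p < e) {
    const char *np = find_nonascii(p, e);

    len += (size_t)(np - p);   /* each ASCII byte is one character */
    if (np == e) break;
    p = np;
    while (p < e && ((unsigned char)*p & 0x80)) {
      p += seq_len(p, e);      /* advance over one sequence or one bad byte */
      len++;
    }
  }
  return len;
}
```

Under these assumptions, `count_chars("\xe3\x81", 2)` counts the two bytes of the truncated sequence separately, and a lone `"\x80"` counts as one character, while well-formed input still yields the number of code points.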