static const char lengths[] = {
    1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4
};
static const int masks[] = {0x00, 0x7f, 0x1f, 0x0f, 0x07};
static const uint32_t mins[] = {4194304, 0, 128, 2048, 65536};
static const int shiftc[] = {0, 18, 12, 6, 0};
static const int shifte[] = {0, 6, 4, 2, 0};
unsigned char *s = buf;
// Here we just grab the upper nibble (the 4 bits which determine the length).
// I kept the 0s, which appear to contribute to determining the error.
// In theory the code past this point works in a similar fashion, except for
// the error handling of the erroneous sequence of 11111.
int len = lengths[s[0] >> 4];
The extra steps are for error checking, so that invalidity is preserved
all the way into the error accumulator. The tests are thorough and trivial
to run: either "make check" or just compile and run "test/tests.c" with
your favorite C compiler. At minimum, any changes must pass the tests;
otherwise they've omitted a critical check. So I encourage you to try out
your changes in the tests!
The 5-bit, 32-element length table is because the 4-byte sequence prefix
is actually 5 bits: 11110. If that last bit is not zero, it's invalid,
which is captured by being "zero length." If only 4 bits are examined,
that won't be checked.
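To make the difference concrete, here is a small sketch comparing the two lookups. The 32-entry table below is written out from the prefix rules described above and is assumed, not confirmed here, to match the library's table. The names lengths5, lengths4, len5 and len4 are hypothetical and exist only for this illustration.

```c
/* 32 entries, indexed by the top 5 bits of the lead byte.
 * Assumption: derived from the UTF-8 prefix rules, not copied from the library. */
static const char lengths5[32] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0xxxx: ASCII        */
    0, 0, 0, 0, 0, 0, 0, 0,                         /* 10xxx: continuation */
    2, 2, 2, 2,                                     /* 110xx: 2 bytes      */
    3, 3,                                           /* 1110x: 3 bytes      */
    4,                                              /* 11110: 4 bytes      */
    0                                               /* 11111: invalid      */
};

/* 16 entries, indexed by the top 4 bits (the smaller table from the issue). */
static const char lengths4[16] = {
    1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4
};

/* Hypothetical helpers: look up the sequence length for a lead byte. */
int len5(unsigned char b) { return lengths5[b >> 3]; }
int len4(unsigned char b) { return lengths4[b >> 4]; }
```

With these, 0xF0 (11110 000) gets length 4 from both tables, while 0xF8 (11111 000) gets 0 from the 5-bit lookup but 4 from the 4-bit one, so the invalid lead byte is no longer flagged by the table.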
Without masking, a bit that's supposed to be zero may shift to a position
that turns an invalid input (e.g. an overlong encoding) into something
that looks valid. Though perhaps some masking is redundant, as it would be
caught by other checks anyway.
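One way to see the effect is with a simplified 2-byte case. This is only a sketch and not the library's decoder: decode2_masked, decode2_unmasked and is_overlong2 are hypothetical names, and the threshold of 128 is taken from the mins table entry for 2-byte sequences.

```c
#include <stdint.h>

/* Decode a 2-byte sequence, keeping only the 5 payload bits of the lead byte. */
uint32_t decode2_masked(const unsigned char *s)
{
    return ((uint32_t)(s[0] & 0x1f) << 6) | (uint32_t)(s[1] & 0x3f);
}

/* Same, but the lead byte is not masked, so its prefix bits leak into the value. */
uint32_t decode2_unmasked(const unsigned char *s)
{
    return ((uint32_t)s[0] << 6) | (uint32_t)(s[1] & 0x3f);
}

/* Overlong check for 2-byte sequences: a valid value is at least 128. */
int is_overlong2(uint32_t c)
{
    return c < 128;
}
```

For the overlong input C0 80, the masked version decodes to 0 and the overlong check fires. The unmasked version yields 0x3000, which passes the same check, so the invalid input would have to be caught somewhere else.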
Your suggestion for replacing shiftc with some arithmetic appears to be
valid. It just trips the error checks differently. Though it introduces a
multiplication, and the current implementation has none. Perhaps that's
worth doing to remove shiftc.
Your suggestion for replacing shifte works only if it's also masked, as
otherwise it shifts out error checks:
*e >>= (8 - 2*len) & 7;
Perhaps that's also worth some arithmetic to drop accessing shifte. (Of
course, that 2*len doesn't count as multiplication.)
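As a quick check of that expression against the table, here is a minimal sketch. The function names shifte_masked and shifte_unmasked are made up for this illustration.

```c
/* Table from the decoder: how far to shift the error accumulator. */
static const int shifte[] = {0, 6, 4, 2, 0};

/* The masked arithmetic form from the line above. */
int shifte_masked(int len)
{
    return (8 - 2 * len) & 7;
}

/* The same arithmetic without the mask, for comparison. */
int shifte_unmasked(int len)
{
    return 8 - 2 * len;
}
```

For len from 0 through 4 the masked form gives 0, 6, 4, 2, 0, the same as the table. The unmasked form differs only at len = 0, where it gives 8 instead of 0, and that is the case where the error bits would get shifted out.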
Found this code referenced inside imgui, and I'm not sure why the lengths array needs to contain 32 results.
The reason being that the bits that control the length of a utf8 sequence are the leading 1s at the front of a byte, followed by a terminating 0 (assuming there was software out there dealing with utf8 in a bitstream).
0 -> ascii -> 1 byte
10 -> continuation -> 1 byte
110 -> 2 byte sequence
1110 -> 3 byte sequence
11110 -> 4 byte sequence
111110 -> presumably a 5 byte sequence
1111110 -> presumably a 6 byte sequence
11111110 -> presumably a 7 byte sequence
11111111 -> presumably an 8 byte sequence
However, utf8 currently needs at most 4 bytes per code point, so while the pattern could continue, at the moment there aren't any 5 byte sequences. That means you could just drop the terminating 0 for the 4 byte sequence and work from there.
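The prefix pattern in the list can be sketched as a count of leading 1 bits. This is a plain, branchy illustration, not how the decoder does it, and both function names are hypothetical. It also reports continuation bytes as length 0, following the table's convention rather than the "1 byte" in the list, which is an assumption of this sketch.

```c
/* Count the leading 1 bits of a byte (0 through 8). */
int leading_ones(unsigned char b)
{
    int n = 0;
    while (n < 8 && (b & (0x80 >> n)))
        n++;
    return n;
}

/* Map the count to a sequence length.
 * 0 leading ones: ASCII, 1 byte.
 * 2 to 4 leading ones: lead byte of a 2 to 4 byte sequence.
 * Anything else (continuation byte, or 5+ leading ones): 0, meaning
 * "not a valid lead byte" in this sketch. */
int utf8_lead_length(unsigned char b)
{
    int n = leading_ones(b);
    if (n == 0)
        return 1;
    if (n >= 2 && n <= 4)
        return n;
    return 0;
}
```

Note that a lead byte like 0xF8 has five leading ones, so it only maps to 0 here because the bit after 1111 is inspected. Looking at just the top 4 bits cannot tell it apart from 0xF0.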
Now I'm not sure about the rest of the code dealing with errors and masks and shifting around...but presumably...an equivalent function exists, but with a smaller table.
For the remaining decoding section here:
I haven't exactly tested this...but it strikes me that the masking operation doesn't need a table.
Similar concepts could apply for the shiftc and shifte tables as these are multiples of 6 and 2.
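To illustrate the idea, here is a rough sketch of what arithmetic replacements for the masks and shiftc tables might look like. The exact expressions originally proposed aren't shown in this thread, so mask_arith and shiftc_arith below are guesses made up for this example, not code from the issue or the library.

```c
/* Tables as they appear in the decoder. */
static const int masks[]  = {0x00, 0x7f, 0x1f, 0x0f, 0x07};
static const int shiftc[] = {0, 18, 12, 6, 0};

/* Hypothetical stand-in for masks[len]: shift 0xff right by one extra bit
 * for multi-byte sequences to account for the terminating 0 of the prefix. */
int mask_arith(int len)
{
    return 0xff >> (len + (len > 1));
}

/* Hypothetical stand-in for shiftc[len]: 6 bits per unused trailing byte. */
int shiftc_arith(int len)
{
    return 6 * (4 - len);
}
```

Both expressions agree with the tables for len from 1 through 4. They differ at len = 0 (0xff instead of 0x00, and 24 instead of 0), which is the invalid lead byte case, so any change along these lines would still need to go through the error checks and the tests.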
I'm guessing you've probably written code like this already, but I was curious.