
Smaller lengths array? #10

Open
Andersama opened this issue Sep 21, 2023 · 1 comment

@Andersama
Found this code referenced inside imgui. So far as I can tell, there's no obvious reason the lengths array needs to contain 32 entries.

The reason is that the bits which control the length of a UTF-8 sequence are the leading 1s at the front of the byte, followed by a terminating 0 (presumably so software could deal with UTF-8 as a bitstream).

0 -> ascii -> 1 byte
10 -> continuation byte -> not a valid lead (maps to 0 in the table below)
110 -> 2 byte sequence
1110 -> 3 byte sequence
11110 -> 4 byte sequence

111110 -> presumably a 5 byte sequence
1111110 -> presumably a 6 byte sequence
11111110 -> presumably a 7 byte sequence
11111111 -> presumably an 8 byte sequence

However, UTF-8 currently only deals with at worst 4-byte code points, so while the pattern could continue, there are no 5-byte sequences at the moment. That means you could just drop the terminating 0 for the 4-byte sequence and work from the upper nibble alone.
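
In fact, since the length is just the count of leading 1 bits (capped at 4), you could compute it with no table at all. A rough sketch of that idea, assuming a GCC/Clang-style __builtin_clz (my own untested illustration, not code from this repo):

    #include <stdint.h>

    // Length of a UTF-8 sequence from its first byte, derived purely from the
    // leading 1 bits: 0 ones -> ASCII, 1 one -> continuation byte (invalid as
    // a lead), 2-4 ones -> that many bytes, 5+ ones -> invalid.
    static int utf8_length_from_leading_ones(unsigned char b)
    {
        // Invert the byte and move it to the top of a 32-bit word so that
        // __builtin_clz counts the original byte's leading 1s.
        uint32_t inverted = (uint32_t)(~b & 0xffu) << 24;
        int ones = inverted ? __builtin_clz(inverted) : 8; // 0xFF has 8 leading 1s
        if (ones == 0) return 1;               // ASCII
        if (ones == 1) return 0;               // continuation byte
        return (ones <= 4) ? ones : 0;         // valid 2-4 byte lead, else error
    }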

Now, I'm not sure about the rest of the code dealing with errors, masks, and shifting around, but presumably an equivalent function exists with a smaller table.

    static const char lengths[] = {
        1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4
    };
    
    static const int masks[]  = {0x00, 0x7f, 0x1f, 0x0f, 0x07};    // payload mask for the first byte, indexed by length
    static const uint32_t mins[] = {4194304, 0, 128, 2048, 65536}; // smallest code point for each length (overlong check)
    static const int shiftc[] = {0, 18, 12, 6, 0};                 // right-align the assembled code point
    static const int shifte[] = {0, 6, 4, 2, 0};                   // matching shift for the error accumulator

    unsigned char *s = buf;
    int len = lengths[s[0] >> 4]; // here we just grab the upper nibble (the 4 bits which determine the length)
    // I kept the 0s which appear to contribute to determining the error.
    // in theory the code past this point works in a similar fashion except for the error handling of the erroneous sequence of 11111
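
As a quick sanity check on that smaller table (my own brute-force comparison, not part of the decoder), you can compare the nibble lookup against the length implied by the leading 1 bits for every possible first byte; the only disagreements should be 0xF8-0xFF, the erroneous 11111 lead bytes mentioned above:

    #include <stdio.h>

    int main(void)
    {
        static const char lengths[] = {
            1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4
        };
        for (int b = 0; b < 256; b++) {
            int ones = 0;                         // count leading 1 bits of b
            for (int m = 0x80; m & b; m >>= 1)
                ones++;
            int expect = (ones == 0) ? 1          // ASCII
                       : (ones == 1) ? 0          // continuation byte: invalid lead
                       : (ones <= 4) ? ones       // 2-4 byte sequences
                       : 0;                       // 11111xxx: invalid lead
            if (lengths[b >> 4] != expect)
                printf("0x%02X: table says %d, prefix says %d\n",
                       b, lengths[b >> 4], expect);
        }
        return 0;
    }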

For the remaining decoding section here:

    *c  = (uint32_t)(s[0] & masks[len]) << 18;  // first byte's payload at the top
    *c |= (uint32_t)(s[1] & 0x3f) << 12;        // continuation payloads below it,
    *c |= (uint32_t)(s[2] & 0x3f) <<  6;        // always reading 4 bytes
    *c |= (uint32_t)(s[3] & 0x3f) <<  0;
    *c >>= shiftc[len];                         // right-align for the actual length

I haven't exactly tested this...but it strikes me that the masking operation doesn't need a table.

    *c  = (uint32_t)((unsigned char)(s[0] << len) >> len) << 18; // cast back to 8 bits so the shifted-out prefix bits are actually discarded
    // alternatively
    *c  = (uint32_t)(s[0] & (0xff >> len)) << 18;
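
To convince myself those are equivalent to the table for every valid lead byte, a throwaway check (again mine, untested against the real decoder) along these lines should pass: the extra bits that 0xff >> len keeps compared to masks[len] are exactly the terminating-0 bit(s), which are 0 in any well-formed lead byte.

    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        static const char lengths[] = {1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4};
        static const int  masks[]   = {0x00, 0x7f, 0x1f, 0x0f, 0x07};
        for (int b = 0; b < 256; b++) {
            int len = lengths[b >> 4];
            // Skip length 0 (errors) and 0xF8-0xFF, which aren't valid lead bytes.
            if (len == 0 || (len == 4 && b >= 0xf8))
                continue;
            assert((b & (0xff >> len)) == (b & masks[len]));
        }
        puts("shift-based mask matches masks[] for all valid lead bytes");
        return 0;
    }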

Similar concepts could apply to the shiftc and shifte tables, since for len = 1..4 their entries are just 24 - 6*len and 8 - 2*len (only the len = 0 error slot comes out different, and that's already the error path).

    //*c >>= shiftc[len];
    *c >>= 24 - 6*len;
    //*e >>= shifte[len];
    *e >>= 8 - 2*len;
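
Putting the pieces together, a sketch of what the simplified decode step might look like (untested, same caveat as above; it ignores the error accumulator entirely and assumes, like the original, that the buffer is padded so s[1..3] are always readable):

    #include <stdint.h>

    // Decode one code point using the 16-entry table and the arithmetic
    // replacements for masks[] and shiftc[]. Error detection is omitted here.
    static unsigned char *utf8_decode_sketch(unsigned char *s, uint32_t *c)
    {
        static const char lengths[] = {
            1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4
        };
        int len = lengths[s[0] >> 4];

        *c  = (uint32_t)(s[0] & (0xff >> len)) << 18;  // mask via shift instead of masks[len]
        *c |= (uint32_t)(s[1] & 0x3f) << 12;
        *c |= (uint32_t)(s[2] & 0x3f) <<  6;
        *c |= (uint32_t)(s[3] & 0x3f) <<  0;
        *c >>= 24 - 6*len;                             // replaces shiftc[len]

        return s + len + !len;                         // always advance at least one byte
    }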

I'm guessing you've probably written code like this already, but I was curious.

@skeeto
Owner

skeeto commented Sep 23, 2023 via email
