
Decoding from uint16_t[] #15

Open
mscdex opened this issue Oct 8, 2016 · 16 comments

Comments

mscdex (Contributor) commented Oct 8, 2016

First off, thank you for providing this open source library (and under the BSD license).

My question is: would it be possible to see versions of the decoding functions that support uint16_t[] strings in addition to const char[] strings? I tried looking at just the ssse3 implementation (since that is what is used on my dev machine) to see how that might be implemented, but the numbers in the comments have me confused.

My particular use case is that I get a uint16_t[] from the V8 JS engine, which is how it internally stores its strings. Being able to use that value type directly would be more efficient than having to make a copy to an 8-bit array first.

I'd like to get uint16_t[] supported for each codec, since we run on a variety of platforms. Any help you can give would be much appreciated.

aklomp (Owner) commented Oct 9, 2016

I'm a bit unclear on what exactly you want to do, and with which data type. Do you want to base64-decode a uint16_t array that you get from V8? In that case, couldn't you just typecast the array to a series of 8-bit bytes to circumvent the type system? Something like this:

// `src` is an array of uint16_t.
// `srclen` is the length of `src` counted in 16-bit characters.
int ret = base64_decode((char *)src, srclen * 2, out, &outlen, 0);

We cast src to char * and double the length count, because srclen counts 16-bit characters, each of which contains two octets.

I have a feeling that's not what you mean though. Is it something more like this? V8 is giving you a base64-encoded string where each 16-bit character only has the lower byte set. Your plain ASCII strings (including your encoded base64 strings) look like this:

0x0048 | 0x0045 | 0x004C | 0x004C | 0x004F
    H        E        L        L        O

So in order to do the base64 decoding on plain, unpadded 8-bit input, you would need to preprocess the input to squeeze out every second byte. Conversely, to encode a plain 8-bit bytestream, you would have to "upsample" the output from 8-bit to this 16-bit representation by adding padding bytes. Is that what you mean?

BurningEnlightenment (Contributor) commented

@aklomp FYI, V8 uses UTF-16 to represent strings and abuses uint16_t as its raw UTF-16 code unit type. So while the high byte is always zero for valid base64 input, it isn't sufficient to just "squeeze out every second byte": invalid input (i.e. non-base64 characters) might be silently converted into valid input, which could be dangerous in certain contexts.

mscdex (Contributor, Author) commented Oct 9, 2016

Yes, what @BurningEnlightenment said. The example of simply casting to a char array won't work because of the leading zero bytes causing early termination.

Now that I've thought about it some more, it makes more sense to keep the length the same and just add a check on the upper byte and break early if it's non-zero.

aklomp (Owner) commented Oct 9, 2016

So the proposed solution would be to duplicate the API and the underlying logic to add first-class support for these wide chars (as I'll call them).

I'm reluctant to do this, because of separation of concerns. In my view, this library does one simple thing, but does it in the best possible way, and that is encoding and decoding of flat base64 strings. I see any enlargement of that scope as being principally undesirable.

So I'm reserved about making the big changes that would be necessary to support wide chars in the streaming API. I'd need to duplicate the API, add flags into all the codecs, and probably duplicate code. I'd really rather not, though it might be interesting to set up a development branch to test how feasible all this is.

The good news is that for the case where you don't need streaming conversions (i.e. you have a fixed buffer of known length), you can get along with implementing simple input and output wrappers around the non-streaming API functions (base64_encode() and base64_decode()). The input/output filters are a good example of separating the concerns into a module of their own.

The input filter for decoding would (1) check that none of the high bytes are set, and (2) squeeze out every second byte. It would then feed the modified buffer to base64_decode() (the non-streaming function). Both (1) and (2) can be done super efficiently with SIMD instructions. ((1) could be done by or'ing all the input characters together and checking whether the resulting high byte is nonzero; if so, return an error. (2) could be done with byte shuffling.)

The output filter for encoding would simply pad the output with zero bytes. This too can be done very efficiently with SIMD instructions. Of course, SIMD instructions work on many bytes at the same time, so you'd also need to handle individual bytes at the tail, and so on. But on the whole it's not infeasible.
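
For illustration, a plain scalar version of those two wrapper filters could look roughly like this (an untested sketch with made-up names; the SIMD versions would do the same thing sixteen bytes at a time):

#include <stddef.h>
#include <stdint.h>

// Input filter for decoding: narrow 16-bit characters down to bytes,
// returning an error if any character has its high byte set.
static int
narrow16 (const uint16_t *src, size_t srclen, uint8_t *dst)
{
    for (size_t i = 0; i < srclen; i++) {
        if (src[i] > 0xFF)
            return -1;
        dst[i] = (uint8_t) src[i];
    }
    return 0;
}

// Output filter for encoding: widen each encoded byte to a 16-bit character
// with a zero high byte.
static void
widen16 (const uint8_t *src, size_t srclen, uint16_t *dst)
{
    for (size_t i = 0; i < srclen; i++)
        dst[i] = src[i];
}

You'd call narrow16() on the uint16_t buffer, then hand the narrowed buffer to base64_decode() as usual; widen16() would run on the output of base64_encode().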

Also, if you're not interested in duplication of the whole API and what not, and if you're fine with using a development branch, you could "hack" the streaming codecs to do the byte shuffling on the fly. Was that the kind of adaptation you were alluding to when you said that you looked at the SSSE3 implementation but were confused by the numbers? I could perhaps help you there.

mscdex (Contributor, Author) commented Oct 9, 2016

Oops, guess I shouldn't think before coffee.... the non-plain codec changes won't be that trivial. I really only need 16-bit decoding, so that makes it a little easier.

I'm still a little stumped as to what the non-plain codec changes would need to be (as far as adjusting the calculations and such), as I'm not very familiar with the intrinsics. After some brief research it does look like there are 16-bit intrinsics that could be used, but there are other (less obvious, to me) changes that I suspect would also need to be made, such as a 16-bit-compatible dec_reshuffle().

EDIT: I don't mind duplicating the decode functions locally, but figuring out the necessary (efficient) intrinsics changes is what I guess I'd be struggling with.

EDIT 2: I should also note, I'm currently not using the streaming functionality, just base64_decode() for decoding.

aklomp (Owner) commented Oct 9, 2016

If you're not using the streaming functions, why not use a two-stage pipeline and preprocess the input string using SSSE3 functions before sending it to the decoder? That should be roughly as fast as doing the conversion on the fly (we're doing the same amount of work), but wouldn't require any changes to the base64 library. You'd never have to touch the decoder code. You can do the conversion in-place, because it's guaranteed that the output will always be shorter than the input.

As for the preprocessor function, it would contain a loop that branches on a condition: if the remaining input size is >= 16 bytes, then use the SSSE3 code, else use the fallback code (which works byte-by-byte). The fallback code is pretty trivial. The SSSE3 code can look something like this (off-the-cuff pseudocode and not tested):

// `in` is an __m128i containing 8 16-bit characters.
// `o` is a uint8_t* pointing to the output buffer.

// Compare with zero.
// `iszero` will contain 0xFF where input byte was zero, 0x00 where it was not.
__m128i iszero = _mm_cmpeq_epi8(in, _mm_setzero_si128());

// Extract a bitmask from the iszero mask.
// Bits will be 1 if the corresponding byte in `in` was equal to zero:
int mask = _mm_movemask_epi8(iszero);

// Check that the high byte of every 16-bit character was zero; return an
// error if not.
// FIXME: depending on byte order, the mask to test against may need to be
// 0xAAAA rather than 0x5555.
if ((mask & 0x5555) != 0x5555)
    return -1;

// Squeeze out every high byte, so that only the low bytes remain:
// FIXME: not sure of the byte mask pattern.
__m128i shuf = _mm_shuffle_epi8(in, _mm_set_epi8(
    -1, -1, -1, -1,
    -1, -1, -1, -1, 
    0, 2, 4, 6,
    8, 10, 12, 14));

// Write back 16 bytes, of which 8 bytes are output:
_mm_storeu_si128((__m128i *)o, shuf);

// Move output pointer 8 bytes:
o += 8;
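
The byte-by-byte fallback for the tail really is trivial; something like this, where `src` is a const uint16_t * into the remaining input, `nleft` is the number of 16-bit characters left, and `o` is the output pointer as above (also untested):

// Fallback: handle the remaining 16-bit characters one at a time.
while (nleft-- > 0) {
    if (*src > 0xFF)
        return -1;      // high byte set: cannot be valid base64
    *o++ = (uint8_t) *src++;
}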

mscdex (Contributor, Author) commented Oct 15, 2016

Thanks, that got me started in the right direction. FWIW the byte mask pattern ended up needing to be (at least for little endian?):

-1, -1, -1, -1,
-1, -1, -1, -1,
14, 12, 10, 8,
6, 4, 2, 0

The high byte check isn't working, but I haven't really looked at that yet.

One other problem I have when decoding is that I need to ignore invalid characters instead of returning early on such characters. Currently I'm doing something like this inside dec/ssse3.c to ignore invalid characters:

int invalid = _mm_movemask_epi8(CMPEQ(delta, 0));
if (invalid) {
  // Pack valid bytes together
  uint8_t pos = 0;
  #define CHECK_BYTE(n) \
  if (!(invalid & 1<<n)) \
    c[pos++] = c[n];
  CHECK_BYTE(0);
  CHECK_BYTE(1);
  CHECK_BYTE(2);
  CHECK_BYTE(3);
  CHECK_BYTE(4);
  CHECK_BYTE(5);
  CHECK_BYTE(6);
  CHECK_BYTE(7);
  CHECK_BYTE(8);
  CHECK_BYTE(9);
  CHECK_BYTE(10);
  CHECK_BYTE(11);
  CHECK_BYTE(12);
  CHECK_BYTE(13);
  CHECK_BYTE(14);
  CHECK_BYTE(15);

  // Shift remaining bytes to fill in gap created by invalid bytes
  memmove(c + pos, c + 16, srclen - (16 - pos));

  // Retry current position with the newly shifted in data, with the source
  // length adjusted appropriately
  srclen -= (16 - pos);
  continue;
}

This seems to work, but do you know if there is a more efficient way to do this? The only other way I could think of would be to load the position changes into a new __m128i, shuffle the bytes, store the result back into the source, and then memmove() and continue. However, it seemed like that would be slower/take more instructions than just manually copying the bytes to where they need to be?

aklomp (Owner) commented Oct 16, 2016

The reason to ignore invalid bytes on decode is to skip newlines and carriage returns, right? I've been thinking about adding an option to the library for that actually. As you've found, it's not trivial in pure SSE to remove those bytes from the input stream and shift over the rest. But there must be a way, there's usually a crafty left-field approach if you think about it hard enough. I'll see if I can come up with something.

In the meantime, you could perhaps get away with a slight alteration in the current library. If the SSE code now encounters an invalid input char, it falls back to the bytewise, per-character handling in lib/dec/tail.c. This rechecks the input character. It does a lookup of that character in base64_table_dec[], and if the result is 255, it returns an error. You could patch the code to not return an error but silently skip the character.

This starts to get slow if you have a lot of invalid characters, since you're doing bytewise decoding on each one of them, but if there's occasionally a character here or there, it shouldn't matter too much.
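
Paraphrasing (this is not the literal tail.c code, just the shape of the change), the per-character handling would go from "return an error on 255" to something like:

// Sketch only: look up the character as tail.c does, but instead of
// returning an error on an invalid (255) entry, skip the character.
unsigned char q = base64_table_dec[*c++];
if (q == 255) {
    srclen--;       // one less input byte to account for
    continue;       // silently ignore the invalid character
}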

mscdex (Contributor, Author) commented Oct 16, 2016

The main purpose is to skip whitespace, but all other invalid characters should be ignored as well. The reason being I need it for backwards compatibility with an existing base64 decoder.

aklomp (Owner) commented Oct 16, 2016

Here's the scheme I've come up with so far. It's probably not optimal, and I didn't benchmark or even compile it.

Say we have these nine valid bytes of input. We pick all nine of them for the output:

in:       A  B  C  D  E  F  G  H  I
shuffle:  0  1  2  3  4  5  6  7  8
-------------------------------------
out:      A  B  C  D  E  F  G  H  I

Now we have the same thing but with two invalid characters. Each time we encounter an invalid character, we skip over its index when building the shuffle mask:

in:       A  B  C  _  D  E  _  F  G
shuffle:  0  1  2  4  5  7  8 -1 -1
-------------------------------------
out:      A  B  C  D  E  F  G  ?  ?

Also, in this example the total valid length becomes two characters shorter. We must account for that when we calculate outlen and increment the output pointer.

This is probably not the fastest possible method, but we can construct the shuffle mask on-the-fly from the bitmask. In code, it might look something like this (untested):

int i = 0;

// Invalid character bitmask:
uint16_t mask = _mm_movemask_epi8(CMPEQ(delta, 0));

// Create temporary array to construct shuffle mask:
uint8_t shm[16] __attribute__((aligned(16)));

// Create shuffle mask:
for (int j = 0; j < 16; j++) 
        if (!(mask & (1U << j)))
                shm[i++] = j;

// This is not strictly necessary:
while (i < 16)
        shm[i++] = -1;

// Shuffle input string:
str = _mm_shuffle_epi8(str, _mm_load_si128((__m128i *)shm));

// Number of error characters encountered
// (the string length should be decremented by this amount):
int errchars = __builtin_popcount(mask);

I didn't test this code, so maybe the endianness is wrong or something. Just to give an idea.

EDIT: the first version of the loop was incorrect.

mayeut (Contributor) commented Oct 20, 2016

Regarding the uint16_t to char conversion as a preprocessing step, why not just use _mm_packus_epi16? There's no need to check for invalid characters in that preprocessing stage; the check will be done as usual downstream. "Negative" values (those with the highest bit set) will be converted to 0, which is invalid, and values greater than 255 will be saturated to 255, which is also invalid. This should be much faster than the SSSE3 method proposed in the earlier comment.
The first stage should also be limited to consuming only as much data as fits in the L1 cache before passing it to the second stage, looping as necessary.
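
Off the top of my head, the inner step of that stage would be something like this (untested; `src` is a const uint16_t * input pointer and `o` a uint8_t * output pointer; requires SSE2 / <emmintrin.h>):

// Pack 16 input characters (two loads of 8 x uint16_t each) into 16 bytes.
// _mm_packus_epi16 saturates: negative values become 0 and values above 255
// become 255, both of which the regular decoder already rejects as invalid.
__m128i lo = _mm_loadu_si128((const __m128i *) src);
__m128i hi = _mm_loadu_si128((const __m128i *)(src + 8));
_mm_storeu_si128((__m128i *) o, _mm_packus_epi16(lo, hi));
src += 16;      // consumed 16 16-bit characters
o   += 16;      // produced 16 bytes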

mayeut added a commit to mayeut/base64 that referenced this issue Oct 22, 2016
mayeut (Contributor) commented Oct 22, 2016

Without taking into account the whitespace skipping (which would also apply to the char version), I made a small POC of what could be done to support uint16_t[] decoding.

I tested 2 things:

  1. Using 2 stages in a loop as mentioned before (to keep data in L1 cache)
  2. Adding a decode16 function for all arches (except neon32/neon64 at the moment)

What I like about the first one is that it requires less work and would allow other preprocessing stages to be implemented the same way. But... it's slower than the second one.
What I like about the second one is that it's faster. But... it requires adding an extra codec function for each arch, and this approach might not work on some architectures (there's no uint32_t/uint64_t generic path here; I still need to test NEON, and even then we don't know what the future will bring us).

The second implementation didn't require that much code duplication after all. Overall, it's 15% faster:
AVX2: +9%
AVX: +19%
plain: -7% (no uint64_t decode16)
SSSE3: +17%
SSE41: +17%
SSE42: +18%

Here are the benchmark results (decoding only; buffer size is the output buffer size):

2 stages implementation:

Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 100 * 1
AVX2   decode    7071.56 MB/sec
AVX2   decode16  3873.48 MB/sec
plain  decode    1557.58 MB/sec
plain  decode16   695.27 MB/sec
SSSE3  decode    3875.35 MB/sec
SSSE3  decode16  2712.32 MB/sec
SSE41  decode    3876.45 MB/sec
SSE41  decode16  2706.70 MB/sec
SSE42  decode    5097.40 MB/sec
SSE42  decode16  3228.69 MB/sec
AVX    decode    5284.70 MB/sec
AVX    decode16  3300.12 MB/sec
Testing with buffer size 1 MB, fastest of 100 * 10
AVX2   decode    7054.40 MB/sec
AVX2   decode16  4806.09 MB/sec
plain  decode    1562.27 MB/sec
plain  decode16   699.32 MB/sec
SSSE3  decode    3887.29 MB/sec
SSSE3  decode16  3161.14 MB/sec
SSE41  decode    3887.03 MB/sec
SSE41  decode16  3115.06 MB/sec
SSE42  decode    5108.83 MB/sec
SSE42  decode16  3916.78 MB/sec
AVX    decode    5297.71 MB/sec
AVX    decode16  4019.34 MB/sec
Testing with buffer size 100 KB, fastest of 100 * 100
AVX2   decode    7034.43 MB/sec
AVX2   decode16  5078.66 MB/sec
plain  decode    1555.46 MB/sec
plain  decode16   695.94 MB/sec
SSSE3  decode    3881.58 MB/sec
SSSE3  decode16  3242.03 MB/sec
SSE41  decode    3694.30 MB/sec
SSE41  decode16  3227.33 MB/sec
SSE42  decode    5103.54 MB/sec
SSE42  decode16  4040.98 MB/sec
AVX    decode    5290.70 MB/sec
AVX    decode16  4137.45 MB/sec
Testing with buffer size 10 KB, fastest of 1000 * 100
AVX2   decode    6919.64 MB/sec
AVX2   decode16  5463.62 MB/sec
plain  decode    1489.76 MB/sec
plain  decode16   708.64 MB/sec
SSSE3  decode    3902.22 MB/sec
SSSE3  decode16  3418.27 MB/sec
SSE41  decode    3892.04 MB/sec
SSE41  decode16  3413.28 MB/sec
SSE42  decode    4933.65 MB/sec
SSE42  decode16  4176.20 MB/sec
AVX    decode    5262.62 MB/sec
AVX    decode16  4389.26 MB/sec
Testing with buffer size 1 KB, fastest of 1000 * 1000
AVX2   decode    5658.08 MB/sec
AVX2   decode16  4749.03 MB/sec
plain  decode    1469.45 MB/sec
plain  decode16   693.57 MB/sec
SSSE3  decode    3593.95 MB/sec
SSSE3  decode16  3201.35 MB/sec
SSE41  decode    3575.33 MB/sec
SSE41  decode16  3176.95 MB/sec
SSE42  decode    4510.35 MB/sec
SSE42  decode16  3890.36 MB/sec
AVX    decode    4657.02 MB/sec
AVX    decode16  4075.72 MB/sec

Direct decoding:

Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 100 * 1
AVX2   decode    7054.01 MB/sec
AVX2   decode16  5387.99 MB/sec
plain  decode    1476.16 MB/sec
plain  decode16   649.31 MB/sec
SSSE3  decode    3873.82 MB/sec
SSSE3  decode16  3752.94 MB/sec
SSE41  decode    3888.66 MB/sec
SSE41  decode16  3753.32 MB/sec
SSE42  decode    5093.97 MB/sec
SSE42  decode16  4744.95 MB/sec
AVX    decode    5278.66 MB/sec
AVX    decode16  4872.50 MB/sec
Testing with buffer size 1 MB, fastest of 100 * 10
AVX2   decode    7094.56 MB/sec
AVX2   decode16  4468.44 MB/sec
plain  decode    1480.75 MB/sec
plain  decode16   651.41 MB/sec
SSSE3  decode    3674.16 MB/sec
SSSE3  decode16  3771.94 MB/sec
SSE41  decode    3886.23 MB/sec
SSE41  decode16  3767.31 MB/sec
SSE42  decode    4794.23 MB/sec
SSE42  decode16  4772.17 MB/sec
AVX    decode    5297.40 MB/sec
AVX    decode16  4908.21 MB/sec
Testing with buffer size 100 KB, fastest of 100 * 100
AVX2   decode    7028.21 MB/sec
AVX2   decode16  5949.40 MB/sec
plain  decode    1479.39 MB/sec
plain  decode16   651.18 MB/sec
SSSE3  decode    3885.90 MB/sec
SSSE3  decode16  3764.85 MB/sec
SSE41  decode    3886.18 MB/sec
SSE41  decode16  3764.73 MB/sec
SSE42  decode    5102.43 MB/sec
SSE42  decode16  4479.49 MB/sec
AVX    decode    5291.07 MB/sec
AVX    decode16  4892.08 MB/sec
Testing with buffer size 10 KB, fastest of 1000 * 100
AVX2   decode    6911.86 MB/sec
AVX2   decode16  5836.29 MB/sec
plain  decode    1476.84 MB/sec
plain  decode16   650.35 MB/sec
SSSE3  decode    3911.21 MB/sec
SSSE3  decode16  3749.75 MB/sec
SSE41  decode    3897.19 MB/sec
SSE41  decode16  3757.06 MB/sec
SSE42  decode    5071.97 MB/sec
SSE42  decode16  4717.31 MB/sec
AVX    decode    5275.38 MB/sec
AVX    decode16  4852.05 MB/sec
Testing with buffer size 1 KB, fastest of 1000 * 1000
AVX2   decode    5657.92 MB/sec
AVX2   decode16  4551.91 MB/sec
plain  decode    1438.35 MB/sec
plain  decode16   645.72 MB/sec
SSSE3  decode    3585.06 MB/sec
SSSE3  decode16  3345.42 MB/sec
SSE41  decode    3570.98 MB/sec
SSE41  decode16  3349.45 MB/sec
SSE42  decode    4459.54 MB/sec
SSE42  decode16  4073.10 MB/sec
AVX    decode    4641.85 MB/sec
AVX    decode16  4183.66 MB/sec

mscdex (Contributor, Author) commented Oct 22, 2016

I had a gut feeling that having a separate decode16 would be faster, but I never got around to verifying that.

Some questions:

  1. @mayeut I've started looking at your current POC and while adding extra characters to the SSSE3 decoder implementation was straightforward (pretty much copy and paste), I'm a bit confused as to how I'd go about making such adjustments to the SSE4.2 decoder implementation for example. I specifically need to add - and _ as allowed characters. In the case of SSE4.2, is it just a matter of replacing the two 0 positions in lut and replacing two sets of the '+' entries in range?
  2. What do you think about my invalid character removal/ignore strategy? Does it seem optimal or do you know of a better way (perhaps even avoiding memmove()/memcpy()), using (additional) intrinsics or not?

mayeut (Contributor) commented Oct 23, 2016

Yes, having a direct decode16 is faster, but is the speed-up worth adding it compared to the 2-stage approach? Maybe not (and that's for @aklomp to decide, given he'll be the one supporting this).

  1. I'm not sure I understand what you're trying to achieve with those two additional characters. If you want them to be treated as valid then yes, you can add those entries to range. For the lut part it's a bit more complicated, but it should be possible given there's still room for 2 characters (yes, the 0 positions in lut). You will still need to update the indices logic for those new values.
  2. I don't think it will be any faster than handling this in dec_tail.c. At least for sparse occurrences of invalid characters.

mscdex (Contributor, Author) commented Oct 24, 2016

@mayeut Thanks. I ended up using a blend to first replace instances of _ with /, since _ has a much higher value and doesn't really work with the initial indices computation.

I also noticed codec->dec16 = .... was missing for all of the non-neon codecs in your 16-bit decoder branch. After adding those and re-activating the codec->dec16 conditional in the main base64 decoding function, I saw a further performance boost similar to what you previously showed.

It would be nice to see the built-in 16-bit decoding merged in the main repo here, but I understand if you feel like it may not be common enough to warrant its inclusion.

At any rate, I want to thank you all for your input, guidance, and patience :-)

aklomp (Owner) commented Oct 24, 2016

@mayeut while I think your proof-of-concept is remarkable (how are you so damn fast? :) I'm hesitant to merge something like it for the reasons mentioned above. In my opinion, Base64 is an algorithm that works on octets and produces octets. Limited scope. uint16_t strings are not octets, so they must be preprocessed before running regular Base64. I see the preprocessing as a separate concern that the component that runs regular Base64 should not have to address.

(It's easier to argue this when the code and API changes involved to support uint16_t as a first-class datatype are quite extensive, as the POC demonstrates.)

On the other hand, I think that skipping whitespace and newlines is a good candidate for inclusion into this library, because invalid characters are a runtime property of the data instead of a compile-time property of the data's type. Skipping invalid characters is also a feature that has been talked about in the past.
