-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WT-12733 Decode JSON strings into the original code points #10551
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What happens is there are errors when decoding the data as UTF8? |
The decoder function will fail with an exception, which will include The way the PR is using this function, the failure should not happen, barring a bug in the JSON library. The library re-encodes every non-UTF character via UTF8, which guarantees that we see only valid UTF8 strings as input. (And even in that case, this is just within a debugging tool!) |
/* Extract the byte-long code points into an array. */ | ||
std::vector<char> decoded(decoded_size, 0); | ||
for (size_t i = 0; i < decoded_size; i++) { | ||
if (dest[i] > 0xFF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this logic, there are multibyte UTF-8 sequences that do not include 0xff, like UTF-8 2 byte characters: byte 1 = \xc0-\xdf, byte 2 = \x80-\xbf.
From what I can see, what does distinguish them is the high bit is set in both bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dest
array is the output of iconv
's conversion from UTF8 to wchar_t
, so it already contains the decoded characters.
The use case for this function is to get a char array from wt printlog -u
, which encodes non-printable bytes in JSON using \uXXXX
. For example, if a key has value 1, WT packs it as one byte 0x81, which printlog
writes to the JSON as key: "\u0081"
. The JSON parser would interpret this as two bytes C2 81, which will become the src
array in this function. iconv
will then produce dest
array of length 1 with 0x81, and this for
loop then copies this value into a char
array. The output of this function is later passed to data_value::unpack
, which decodes this back into integer 1.
printlog
only encodes individual bytes using the \u
notation, so every decoded value that we get at this point will be between 0 and 0xFF. This if
statement checks that we didn't get any unexpected value outside of this range (and I'm pretty sure that wchar_t
is unsigned, so no need to check for < 0).
I hope that makes sense... let me know if you have any more questions. Would it make sense for me to include this example somewhere as a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! I was completely missing dest
is wchar_t. I appreciate you taking the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad that this helps :)
I added the following big comment to the top of the function in 7c0fa41:
/*
* The main use case for this function is to recover the exact byte sequence specified in a JSON
* document produced by "wt printlog -u", which uses the \u notation to encode non-printable
* bytes. For example, WiredTiger packs key 1 as byte 81 (hex), which results in the following
* line in the output of printlog:
*
* "key": "\u0081"
*
* Our JSON parser encodes these Unicode characters using UTF-8, so reading this key results in
* a two-byte string C2 81. This function decodes this string back to 81, which the caller can
* then use in data_value::unpack to get the original integer 1.
*
* We do this in two steps:
* 1. Convert the UTF-8 string to a WCHAR_T array.
* 2. Convert the WCHAR_T array to a char array.
*/
The JSON library uses UTF-8 encoding, which has the unfortunate consequence of encoding all keys and values from the debug log JSON into UTF-8. As a result, we need to decode keys and values from UTF-8 back to their original bytes before unpacking them via
data_value::unpack
.I used the
iconv
library to do the work, because it is standard on POSIX systems. I would have preferred to use something likecodecvt
if I could, but it just got deprecated, and I didn't want to add a dependency on a non-standard library. And no, I didn't find a way to get our JSON library to use a different encoding; it looks like UTF-8 is hard-coded in.