Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WT-12733 Decode JSON strings into the original code points #10551

Merged
merged 4 commits into from
May 20, 2024

Conversation

pmacko86
Copy link
Contributor

@pmacko86 pmacko86 commented May 3, 2024

The JSON library uses UTF-8 encoding, which has the unfortunate consequence of encoding all keys and values from the debug log JSON into UTF-8. As a result, we need to decode keys and values from UTF-8 back to their original bytes before unpacking them via data_value::unpack.

I used the iconv library to do the work, because it is standard on POSIX systems. I would have preferred to use something like codecvt if I could, but it just got deprecated, and I didn't want to add a dependency on a non-standard library. And no, I didn't find a way to get our JSON library to use a different encoding; it looks like UTF-8 is hard-coded in.

Copy link
Contributor

@tod-johnson-mongodb tod-johnson-mongodb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

test/model/src/core/util.cpp Outdated Show resolved Hide resolved
test/model/src/core/util.cpp Outdated Show resolved Hide resolved
test/model/src/core/util.cpp Show resolved Hide resolved
test/model/src/core/util.cpp Outdated Show resolved Hide resolved
@pmacko86 pmacko86 requested a review from marcbutler May 7, 2024 23:40
@ershov
Copy link
Contributor

ershov commented May 8, 2024

What happens is there are errors when decoding the data as UTF8?

@pmacko86
Copy link
Contributor Author

pmacko86 commented May 8, 2024

What happens is there are errors when decoding the data as UTF8?

The decoder function will fail with an exception, which will include EILSEQ ("An invalid multibyte sequence has been encountered in the input."). As a consequence, the loading of the JSON file will fail.

The way the PR is using this function, the failure should not happen, barring a bug in the JSON library. The library re-encodes every non-UTF character via UTF8, which guarantees that we see only valid UTF8 strings as input. (And even in that case, this is just within a debugging tool!)

/* Extract the byte-long code points into an array. */
std::vector<char> decoded(decoded_size, 0);
for (size_t i = 0; i < decoded_size; i++) {
if (dest[i] > 0xFF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this logic, there are multibyte UTF-8 sequences that do not include 0xff, like UTF-8 2 byte characters: byte 1 = \xc0-\xdf, byte 2 = \x80-\xbf.

From what I can see, what does distinguish them is the high bit is set in both bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dest array is the output of iconv's conversion from UTF8 to wchar_t, so it already contains the decoded characters.

The use case for this function is to get a char array from wt printlog -u, which encodes non-printable bytes in JSON using \uXXXX. For example, if a key has value 1, WT packs it as one byte 0x81, which printlog writes to the JSON as key: "\u0081". The JSON parser would interpret this as two bytes C2 81, which will become the src array in this function. iconv will then produce dest array of length 1 with 0x81, and this for loop then copies this value into a char array. The output of this function is later passed to data_value::unpack, which decodes this back into integer 1.

printlog only encodes individual bytes using the \u notation, so every decoded value that we get at this point will be between 0 and 0xFF. This if statement checks that we didn't get any unexpected value outside of this range (and I'm pretty sure that wchar_t is unsigned, so no need to check for < 0).

I hope that makes sense... let me know if you have any more questions. Would it make sense for me to include this example somewhere as a comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I was completely missing dest is wchar_t. I appreciate you taking the time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad that this helps :)

I added the following big comment to the top of the function in 7c0fa41:

    /*
     * The main use case for this function is to recover the exact byte sequence specified in a JSON
     * document produced by "wt printlog -u", which uses the \u notation to encode non-printable
     * bytes. For example, WiredTiger packs key 1 as byte 81 (hex), which results in the following
     * line in the output of printlog:
     *
     *     "key": "\u0081"
     *
     * Our JSON parser encodes these Unicode characters using UTF-8, so reading this key results in
     * a two-byte string C2 81. This function decodes this string back to 81, which the caller can
     * then use in data_value::unpack to get the original integer 1.
     *
     * We do this in two steps:
     *   1. Convert the UTF-8 string to a WCHAR_T array.
     *   2. Convert the WCHAR_T array to a char array.
     */

@pmacko86 pmacko86 added this pull request to the merge queue May 20, 2024
Merged via the queue into develop with commit 8ebfeed May 20, 2024
7 checks passed
@pmacko86 pmacko86 deleted the wt-12733-model-json-parse branch May 20, 2024 02:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants