WT-12733 Decode JSON strings into the original code points #10551

pmacko86 · 2024-05-03T03:35:57Z

The JSON library uses UTF-8 encoding, which has the unfortunate consequence of encoding all keys and values from the debug log JSON into UTF-8. As a result, we need to decode keys and values from UTF-8 back to their original bytes before unpacking them via data_value::unpack.

I used the iconv library to do the work, because it is standard on POSIX systems. I would have preferred to use something like codecvt if I could, but it just got deprecated, and I didn't want to add a dependency on a non-standard library. And no, I didn't find a way to get our JSON library to use a different encoding; it looks like UTF-8 is hard-coded in.

tod-johnson-mongodb

LGTM

test/model/src/core/util.cpp

… to UTF-8.

ershov · 2024-05-08T01:08:44Z

What happens is there are errors when decoding the data as UTF8?

pmacko86 · 2024-05-08T02:06:43Z

What happens is there are errors when decoding the data as UTF8?

The decoder function will fail with an exception, which will include EILSEQ ("An invalid multibyte sequence has been encountered in the input."). As a consequence, the loading of the JSON file will fail.

The way the PR is using this function, the failure should not happen, barring a bug in the JSON library. The library re-encodes every non-UTF character via UTF8, which guarantees that we see only valid UTF8 strings as input. (And even in that case, this is just within a debugging tool!)

marcbutler · 2024-05-09T11:56:50Z

test/model/src/core/util.cpp

+    /* Extract the byte-long code points into an array. */
+    std::vector<char> decoded(decoded_size, 0);
+    for (size_t i = 0; i < decoded_size; i++) {
+        if (dest[i] > 0xFF)


I don't understand this logic, there are multibyte UTF-8 sequences that do not include 0xff, like UTF-8 2 byte characters: byte 1 = \xc0-\xdf, byte 2 = \x80-\xbf.

From what I can see, what does distinguish them is the high bit is set in both bytes.

The dest array is the output of iconv's conversion from UTF8 to wchar_t, so it already contains the decoded characters.

The use case for this function is to get a char array from wt printlog -u, which encodes non-printable bytes in JSON using \uXXXX. For example, if a key has value 1, WT packs it as one byte 0x81, which printlog writes to the JSON as key: "\u0081". The JSON parser would interpret this as two bytes C2 81, which will become the src array in this function. iconv will then produce dest array of length 1 with 0x81, and this for loop then copies this value into a char array. The output of this function is later passed to data_value::unpack, which decodes this back into integer 1.

printlog only encodes individual bytes using the \u notation, so every decoded value that we get at this point will be between 0 and 0xFF. This if statement checks that we didn't get any unexpected value outside of this range (and I'm pretty sure that wchar_t is unsigned, so no need to check for < 0).

I hope that makes sense... let me know if you have any more questions. Would it make sense for me to include this example somewhere as a comment?

Yes! I was completely missing dest is wchar_t. I appreciate you taking the time.

I'm glad that this helps :)

I added the following big comment to the top of the function in 7c0fa41:

/* * The main use case for this function is to recover the exact byte sequence specified in a JSON * document produced by "wt printlog -u", which uses the \u notation to encode non-printable * bytes. For example, WiredTiger packs key 1 as byte 81 (hex), which results in the following * line in the output of printlog: * * "key": "\u0081" * * Our JSON parser encodes these Unicode characters using UTF-8, so reading this key results in * a two-byte string C2 81. This function decodes this string back to 81, which the caller can * then use in data_value::unpack to get the original integer 1. * * We do this in two steps: * 1. Convert the UTF-8 string to a WCHAR_T array. * 2. Convert the WCHAR_T array to a char array. */

WT-12733 Decode JSON strings into the original code points

58fd6b0

pmacko86 requested review from etienneptl and tod-johnson-mongodb May 3, 2024 03:35

etienneptl requested a review from marcbutler May 3, 2024 05:41

Use C++ style memory allocation.

20178b3

tod-johnson-mongodb approved these changes May 4, 2024

View reviewed changes

marcbutler reviewed May 5, 2024

View reviewed changes

test/model/src/core/util.cpp Outdated Show resolved Hide resolved

test/model/src/core/util.cpp Outdated Show resolved Hide resolved

test/model/src/core/util.cpp Show resolved Hide resolved

test/model/src/core/util.cpp Outdated Show resolved Hide resolved

Use std::vector instead of arrays. Specialize the conversion function…

11095d9

… to UTF-8.

etienneptl approved these changes May 7, 2024

View reviewed changes

pmacko86 requested a review from marcbutler May 7, 2024 23:40

marcbutler reviewed May 9, 2024

View reviewed changes

marcbutler approved these changes May 9, 2024

View reviewed changes

Added a big comment.

7c0fa41

pmacko86 added this pull request to the merge queue May 20, 2024

Merged via the queue into develop with commit 8ebfeed May 20, 2024
7 checks passed

pmacko86 deleted the wt-12733-model-json-parse branch May 20, 2024 02:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WT-12733 Decode JSON strings into the original code points #10551

WT-12733 Decode JSON strings into the original code points #10551

pmacko86 commented May 3, 2024

tod-johnson-mongodb left a comment

ershov commented May 8, 2024

pmacko86 commented May 8, 2024

marcbutler May 9, 2024

pmacko86 May 9, 2024

marcbutler May 9, 2024

pmacko86 May 10, 2024

WT-12733 Decode JSON strings into the original code points #10551

WT-12733 Decode JSON strings into the original code points #10551

Conversation

pmacko86 commented May 3, 2024

tod-johnson-mongodb left a comment

Choose a reason for hiding this comment

ershov commented May 8, 2024

pmacko86 commented May 8, 2024

marcbutler May 9, 2024

Choose a reason for hiding this comment

pmacko86 May 9, 2024

Choose a reason for hiding this comment

marcbutler May 9, 2024

Choose a reason for hiding this comment

pmacko86 May 10, 2024

Choose a reason for hiding this comment