wast2json double-encodes UTF-8 symbols #1086
Comments
Yes, this is pretty strange :-) I had to do a little git diving to find out why it works this way! It looks like I originally wrote the code to do this around Oct 2016. At that time, there were no requirements on imported or exported names; they were just byte strings. It wasn't until Feb 2017 that we started discussing making UTF-8 a requirement for all names. Now that it's a requirement, we could simplify the wast2json names to use UTF-8. We will want to test invalid encodings (see, for example, utf8-invalid-encoding.wast), but those should never be needed in the JSON file.
Yeah, because symbols in the JSON files are JSON strings (and thus Unicode abstract strings), there isn't an obvious way to even represent invalid encodings (short of substituting an array of byte values and updating all the parsers to handle that). The current invalid-encoding tests that use generated wasm files are sufficient, as far as I can tell.
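The point about invalid encodings can be illustrated with a short sketch (Python here purely for illustration; the byte values are a generic invalid-UTF-8 example, not taken from wabt's test suite):

```python
import json

# A byte string that is NOT valid UTF-8 (0xFF never appears in UTF-8).
raw = b"\xff\xfe"

# It cannot become a JSON string, because JSON strings are Unicode text:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("not representable as a JSON string:", err)

# The alternative mentioned above: substitute an array of byte values,
# which any JSON parser can carry, at the cost of a new convention.
print(json.dumps(list(raw)))  # -> [255, 254]
```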
Unicode symbols are currently mangled by the `wast2json` tool, which I noticed when trying to write a program that consumes and runs the JSON output.

For example, the `names.wast` file from the spec includes a test case that uses a byte-order mark as a symbol. When processing that with `wast2json` (tested on a3a7a93), we get a file containing the following line:

That field name is awfully suspicious, being composed of three Unicode codepoints (not bytes) in the range 128-255. Feeding the BOM `\uFEFF` through a UTF-8 encoder shows that, indeed, the UTF-8 encoding is `\xEF\xBB\xBF`. The JSON file is pretending each of these bytes is a codepoint, effectively double-encoding the symbol.

I suspect the culprit is this line here, which is treating a C++ `string_view` (which contains an unspecified 8-bit encoding) as though it contains Unicode characters -- effectively (and I suspect inadvertently) treating any symbol as ISO-8859-1, while the wast reader seems to be using UTF-8.

As a workaround, in case this bug sits around until someone else tries to run these files, I've implemented this routine in my test runner:
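The routine itself isn't shown above; a minimal sketch of the workaround described (in Python; the author's actual test-runner code may differ) is to reverse the accidental ISO-8859-1 step, mapping each codepoint back to a byte and then decoding the bytes as UTF-8:

```python
def fix_symbol(mangled: str) -> str:
    """Undo the double encoding: each codepoint in the mangled string
    is really one byte of the symbol's UTF-8 encoding.

    Encoding as Latin-1 (ISO-8859-1) maps codepoints U+0000..U+00FF back
    to bytes 0x00..0xFF one-to-one; decoding those bytes as UTF-8 then
    recovers the original symbol.
    """
    return mangled.encode("latin-1").decode("utf-8")

# The BOM example from the report: U+00EF U+00BB U+00BF -> U+FEFF.
assert fix_symbol("\u00ef\u00bb\u00bf") == "\ufeff"
```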