Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wast2json double-encodes UTF-8 symbols #1086

Closed
cbiffle opened this issue May 28, 2019 · 2 comments
Closed

wast2json double-encodes UTF-8 symbols #1086

cbiffle opened this issue May 28, 2019 · 2 comments

Comments

@cbiffle
Copy link

cbiffle commented May 28, 2019

Unicode symbols are currently mangled by the wast2json tool, which I noticed when trying to write a program that consumes and runs the JSON output.

For example, The names.wast file from the spec includes a test case that uses a byte-order marker as a symbol.

When processing that with wast2json (tested on a3a7a93), we get a file containing the following line:

{"type": "assert_return", "line": 623, "action": {"type": "invoke", "field": "\u00ef\u00bb\u00bf", "args": []}, "expected": [{"type": "i32", "value": "15"}]}, 

That field name is awfully suspicious, being composed of three unicode codepoints (not bytes) in the range 128-255. Feeding the BOM \uFEFF through a UTF-8 encoder shows that, indeed, the UTF-8 encoding is \xEF\xBB\xBF. The JSON file is pretending each of these bytes is a codepoint, effectively double-encoding the symbol.

I suspect the culprit is this line here, which is treating a C++ string_view (which contains an unspecified 8-bit encoding) as though it contains unicode characters -- effectively (and I suspect inadvertently) treating any symbol as ISO-8859-1, while the wast reader seems to be using UTF-8.

As a workaround, in case this bug sits around until someone else tries to run these files, I've implemented this routine in my test runner:

fn destupify_string(input: &str) -> String {
    let bytes: Vec<u8> = input.chars().map(|c| {
        assert!(c < '\u{0100}');
        c as u8
    }).collect();
    String::from_utf8(bytes).expect("warning: string is not doubly-encoded UTF8")
}
@binji
Copy link
Member

binji commented May 28, 2019

Yes, this is pretty strange :-) I had to do a little git diving to find why it works this way! It looks like I originally wrote the code to do this in around Oct 2016.

At that time, there were no requirements on imported or exported names, they were just byte strings. It wasn't until Feb 2017 that we started to discuss making UTF-8 a requirement for all names. Now that it's a requirement, we could simplify the wast2json names to use UTF-8. We will want to test invalid encodings (see for example utf8-invalid-encoding.wast), but those should never be needed in the JSON file.

@cbiffle
Copy link
Author

cbiffle commented Jun 4, 2019

Yeah, because symbols in the JSON files are JSON strings (and thus Unicode abstract strings), there isn't an obvious way to even represent invalid encodings. (Short of substituting an array of byte values and updating all the parsers to handle that.) The current invalid encoding tests that use generated wasm files are sufficient, as far as I can tell.

binji added a commit that referenced this issue Jun 5, 2019
binji added a commit that referenced this issue Jun 7, 2019
@binji binji closed this as completed Jun 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants