wast2json double-encodes UTF-8 symbols #1086
Comments
Yes, this is pretty strange :-) I had to do a little git diving to find out why it works this way! It looks like I originally wrote the code to do this around Oct 2016. At that time, there were no requirements on imported or exported names; they were just byte strings. It wasn't until Feb 2017 that we started discussing making UTF-8 a requirement for all names. Now that it's a requirement, we could simplify the wast2json names to use UTF-8. We will want to test invalid encodings (see, for example, utf8-invalid-encoding.wast), but those should never be needed in the JSON file.
Yeah, because symbols in the JSON files are JSON strings (and thus Unicode abstract strings), there isn't an obvious way to even represent invalid encodings (short of substituting an array of byte values and updating all the parsers to handle that). The current invalid-encoding tests that use generated wasm files are sufficient, as far as I can tell.
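The point about invalid encodings can be illustrated with a short sketch (Python here purely for illustration; the byte values are a generic invalid-UTF-8 example, not taken from wabt's test suite):

```python
import json

# A byte string that is NOT valid UTF-8 (0xFF never appears in UTF-8).
raw = b"\xff\xfe"

# It cannot become a JSON string, because JSON strings are Unicode text:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("not representable as a JSON string:", err)

# The alternative mentioned above: substitute an array of byte values,
# which any JSON parser can carry, at the cost of a new convention.
print(json.dumps(list(raw)))  # -> [255, 254]
```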
Unicode symbols are currently mangled by the `wast2json` tool, which I noticed when trying to write a program that consumes and runs the JSON output.

For example, the `names.wast` file from the spec includes a test case that uses a byte-order mark as a symbol. When processing that with `wast2json` (tested on a3a7a93), we get a file containing the following line:

That field name is awfully suspicious, being composed of three Unicode codepoints (not bytes) in the range 128-255. Feeding the BOM `\uFEFF` through a UTF-8 encoder shows that, indeed, the UTF-8 encoding is `\xEF\xBB\xBF`. The JSON file is pretending each of these bytes is a codepoint, effectively double-encoding the symbol.

I suspect the culprit is this line here, which is treating a C++ `string_view` (which contains an unspecified 8-bit encoding) as though it contains Unicode characters -- effectively (and I suspect inadvertently) treating any symbol as ISO-8859-1, while the wast reader seems to be using UTF-8.

As a workaround, in case this bug sits around until someone else tries to run these files, I've implemented this routine in my test runner:
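The routine itself isn't shown above; a minimal sketch of the workaround described (in Python; the author's actual test-runner code may differ) is to reverse the accidental ISO-8859-1 step, mapping each codepoint back to a byte and then decoding the bytes as UTF-8:

```python
def fix_symbol(mangled: str) -> str:
    """Undo the double encoding: each codepoint in the mangled string
    is really one byte of the symbol's UTF-8 encoding.

    Encoding as Latin-1 (ISO-8859-1) maps codepoints U+0000..U+00FF back
    to bytes 0x00..0xFF one-to-one; decoding those bytes as UTF-8 then
    recovers the original symbol.
    """
    return mangled.encode("latin-1").decode("utf-8")

# The BOM example from the report: U+00EF U+00BB U+00BF -> U+FEFF.
assert fix_symbol("\u00ef\u00bb\u00bf") == "\ufeff"
```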