Require import/export names to be UTF-8. #1016

sunfishcode · 2017-03-14T17:00:49Z

This implements the UTF-8 proposal described in
#989 (comment).

This does not currently rename "name" to "utf8-name", because if UTF-8 is
required for import/export names, there's a greater appeal to just saying
that all strings are UTF-8, though this is debatable.

jfbastien · 2017-03-14T17:09:03Z

BinaryEncoding.md

@@ -253,9 +253,9 @@ The import section declares all imports that will be used in the module.
 | Field | Type | Description |
 | ----- | ---- | ----------- |
 | module_len | `varuint32` | module string length |
-| module_str | `bytes` | module string of `module_len` bytes |
+| module_str | `bytes` | module name: `module_len` bytes holding valid utf8 string |


"UTF-8" here and elsewhere.

jfbastien · 2017-03-14T17:11:02Z

Modules.md

@@ -48,7 +48,8 @@ In the future, other kinds of imports may be added. Imports are designed to
 allow modules to share code and data while still allowing separate compilation
 and caching.

-All imports include two opaque names: a *module name* and an *export name*. The
+All imports include two opaque names: a *module name* and an *export name*,


In JS.md, which type of exception should occur if import or export names are invalid UTF-8?

JS.md already seems to specify WebAssembly.CompileError in this case.

It'd be a validation requirement, so WebAssembly.validate would return false, and APIs that throw would throw WebAssembly.CompileError. The design docs don't specify the details of validation, so there doesn't seem to be a clear place to specify this; perhaps we could just handle this in an eventual spec PR?

jfbastien

lgtm, would be good to get input from @annevk / @tabatkins.

annevk

LGTM apart from this minor nit.

annevk · 2017-03-14T18:15:32Z

BinaryEncoding.md

@@ -253,9 +253,9 @@ The import section declares all imports that will be used in the module.
 | Field | Type | Description |
 | ----- | ---- | ----------- |
 | module_len | `varuint32` | module string length |
-| module_str | `bytes` | module string of `module_len` bytes |
+| module_str | `bytes` | module name: `module_len` bytes holding valid UTF-8 string |


This would be a valid UTF-8 byte sequence. A string is what you get after you decode.

Thanks for the correction! This is now fixed.

This document is describing the encoded bytes, rather than the string which one gets from decoding them. Also, make the descriptions of the byte sequence length fields more precise.

jfbastien · 2017-03-14T20:02:50Z

BinaryEncoding.md

-| field_len | `varuint32` | field name length |
-| field_str | `bytes` | field name: `field_len` bytes holding valid UTF-8 string |
+| module_len | `varuint32` | length of `module_str` in bytes |
+| module_str | `bytes` | module name: valid UTF-8 byte sequnce |
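
For readers unfamiliar with the layout in the rows above, a length-prefixed name could be read roughly as follows. This is a minimal sketch in C, assuming `varuint32` is an unsigned LEB128 value of at most five bytes; `read_varuint32` and `read_name` are hypothetical helper names used for illustration, not taken from any engine.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode an unsigned LEB128 varuint32 from [*p, end).
   Returns false if the input is truncated or does not fit in 32 bits. */
static bool read_varuint32(const uint8_t **p, const uint8_t *end, uint32_t *out) {
    uint32_t result = 0;
    for (unsigned shift = 0; shift < 35; shift += 7) {
        if (*p == end) return false;              /* truncated input */
        uint8_t byte = *(*p)++;
        /* The fifth byte may only supply the top 4 bits and must end the value. */
        if (shift == 28 && (byte & 0xF0) != 0) return false;
        result |= (uint32_t)(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) { *out = result; return true; }
    }
    return false;  /* not reached: the fifth byte always terminates or fails */
}

/* Hypothetical helper: read `len` followed by `len` bytes, returning a view
   of the name bytes without copying. UTF-8 validation is a separate step. */
static bool read_name(const uint8_t **p, const uint8_t *end,
                      const uint8_t **name, uint32_t *len) {
    if (!read_varuint32(p, end, len)) return false;
    if ((size_t)(end - *p) < *len) return false;  /* name runs past the input */
    *name = *p;
    *p += *len;
    return true;
}
```

The bounds check happens before the cursor advances, so a declared length larger than the remaining input is rejected rather than read out of bounds.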


rossberg · 2017-03-15T07:38:13Z

I still think this is the wrong place to impose such a requirement, for all the reasons stated.

However, I realised that under such a spec implementations could still be allowed to restrict the range of code points they accept (and thereby limit to ASCII in particular) by the same token that they are allowed to impose other implementation restrictions, such as on the number of local variables or sizes of functions, etc.

So as long as there is agreement that it is legal for engines to implement such restrictions -- and we'll include it in the previously discussed (offline) but yet-to-be-written list of allowable implementation restrictions -- I would be fine with the change.

sunfishcode · 2017-03-20T21:53:41Z

@rossberg-chromium Yes, allowing embedders to impose additional constraints in this space would be fine with me.

Are there any other comments on this PR?

wanderer · 2017-03-21T20:20:59Z

Fewer validation rules in the base spec === a smaller code base. I think it should be left to the upper layers to decide on the string format.

sunfishcode · 2017-03-23T08:15:48Z

@wanderer The amount of code needed is quite small. Some implementations will already have a Unicode library linked in for other purposes, and for those that don't, here's a simple standalone implementation in C, for example:

https://gist.github.com/sunfishcode/c050d4f60633c49ae6e54a3d45385031

In my experiment adding this to a production wasm decoder, the performance impact was negligible.
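
To give a sense of the size involved, here is a minimal sketch of such a check in C. It is written for illustration and is not the contents of the linked gist; `is_valid_utf8` is a hypothetical name. It assumes the usual definition of well-formed UTF-8: no overlong encodings, no surrogate code points (U+D800 to U+DFFF), and nothing above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if bytes[0..len) is a well-formed UTF-8 byte sequence. */
static bool is_valid_utf8(const uint8_t *bytes, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b0 = bytes[i];
        size_t need;    /* number of continuation bytes expected */
        uint32_t min;   /* smallest code point allowed for this length */
        uint32_t cp;    /* code point being assembled */

        if (b0 < 0x80) { i++; continue; }                      /* ASCII */
        else if ((b0 & 0xE0) == 0xC0) { need = 1; min = 0x80;    cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { need = 2; min = 0x800;   cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { need = 3; min = 0x10000; cp = b0 & 0x07; }
        else return false;      /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < need) return false;                  /* truncated */
        for (size_t k = 1; k <= need; k++) {
            uint8_t b = bytes[i + k];
            if ((b & 0xC0) != 0x80) return false;              /* not 10xxxxxx */
            cp = (cp << 6) | (uint32_t)(b & 0x3F);
        }
        if (cp < min) return false;                            /* overlong */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;        /* surrogate */
        if (cp > 0x10FFFF) return false;                       /* out of range */
        i += need + 1;
    }
    return true;
}
```

A decoder would call this once per import or export name, on the raw bytes, and treat a `false` result as a validation failure.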

An implementation which only accepted ASCII strings, as mentioned above, could be even simpler -- just check that no byte has the MSB set.
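
As a sketch of that ASCII-only variant in C (`is_ascii_name` is a hypothetical name, used here only for illustration):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if no byte in bytes[0..len) has its most significant bit set. */
static bool is_ascii_name(const uint8_t *bytes, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (bytes[i] & 0x80) return false;
    }
    return true;
}
```

Every byte sequence this accepts is also valid UTF-8, so an engine using it would reject a subset of otherwise valid names rather than accept invalid ones.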

RyanLamansky · 2017-03-28T14:01:18Z

Anything left to discuss before this can be merged?

sunfishcode · 2017-03-30T16:33:02Z

I believe all the concerns raised have been answered.

sunfishcode mentioned this pull request Mar 14, 2017

UTF-8 for all string encodings #989

Closed

jfbastien suggested changes Mar 14, 2017

s/utf8/UTF-8/g

9d260e7

jfbastien approved these changes Mar 14, 2017

annevk reviewed Mar 14, 2017

Say "UTF-8 byte sequence" rather than "UTF-8 string".

2f30dde

jfbastien reviewed Mar 14, 2017

Fix typo.

6f13ecf

binji mentioned this pull request Mar 29, 2017

Verify the sizes of the sub-sections within the name section WebAssembly/wabt#375

Merged

sunfishcode merged commit 8e5ecc3 into master Mar 30, 2017

sunfishcode deleted the require-utf8 branch March 30, 2017 16:33

pipcet mentioned this pull request Mar 30, 2017

Extensible name section WebAssembly/binaryen#933

Merged

This was referenced Mar 30, 2017

String encoding is often unspecified #968

Closed

Clarify import/export identifier validation on the Web. #1028

Merged

UTF-8 decoding of import/export names in JS #970

Closed

Test that invalid UTF-8 byte sequences are rejected. WebAssembly/spec#450

Closed

Cellule mentioned this pull request Apr 6, 2017

WASM - utf8 strings (2d) chakra-core/ChakraCore#2793

Closed

sunfishcode added a commit to sunfishcode/wasm-reference-manual that referenced this pull request Apr 8, 2017

Update to the requirement that names be UTF-8.

4838b7a

See WebAssembly/design#1016.

lukewagner mentioned this pull request May 26, 2017

[spec] Implementation restrictions WebAssembly/spec#483

Merged

rossberg mentioned this pull request May 29, 2017

[spec] Allow impls to limit code point range WebAssembly/spec#488

Merged

sunfishcode mentioned this pull request Jun 4, 2017

Binary function names bytecodealliance/cranelift#47

Closed

lukewagner mentioned this pull request Aug 24, 2017

Are names and UTF-8 validation web-only? WebAssembly/spec#550

Closed
