Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Require import/export names to be UTF-8. #1016

Merged
merged 4 commits into from Mar 30, 2017
Merged

Require import/export names to be UTF-8. #1016

merged 4 commits into from Mar 30, 2017

Conversation

sunfishcode
Copy link
Member

This implements the UTF-8 proposal described in
#989 (comment).

This does not currently rename "name" to "utf8-name", because if UTF-8 is
required for import/export names, there's a greater appeal to just saying
that all strings are UTF-8, though this is debatable.

This implements the UTF-8 proposal described in
#989 (comment).

This does not currently rename "name" to "utf8-name", because if UTF-8 is
required for import/export names, there's a greater appeal to just saying
that all strings are UTF-8, though this is debatable.
@@ -253,9 +253,9 @@ The import section declares all imports that will be used in the module.
| Field | Type | Description |
| ----- | ---- | ----------- |
| module_len | `varuint32` | module string length |
| module_str | `bytes` | module string of `module_len` bytes |
| module_str | `bytes` | module name: `module_len` bytes holding valid utf8 string |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"UTF-8" here and elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -48,7 +48,8 @@ In the future, other kinds of imports may be added. Imports are designed to
allow modules to share code and data while still allowing separate compilation
and caching.

All imports include two opaque names: a *module name* and an *export name*. The
All imports include two opaque names: a *module name* and an *export name*,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In JS.md, which type of exception should occur if import or export are invalid UTF-8 strings?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JS.md already seems to specify WebAssembly.CompileError in this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be a validation requirement, so WebAssembly.validate would return false, and APIs that throw would throw WebAssembly.CompileError. The design docs don't specify the details of validation, so there doesn't seem to be a clear place to specify this; perhaps we could just handle this in eventual spec PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm

Copy link
Member

@jfbastien jfbastien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, would be good to get input from @annevk / @tabatkins.

Copy link
Member

@annevk annevk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM apart from this minor nit.

@@ -253,9 +253,9 @@ The import section declares all imports that will be used in the module.
| Field | Type | Description |
| ----- | ---- | ----------- |
| module_len | `varuint32` | module string length |
| module_str | `bytes` | module string of `module_len` bytes |
| module_str | `bytes` | module name: `module_len` bytes holding valid UTF-8 string |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a valid UTF-8 byte sequence. A string is what you get after you decode.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the correction! This is now fixed.

This document is describing the encoded bytes, rather than the string which
one gets from decoding them.

Also, make the descriptions of the byte sequence length fields more precise.
| field_len | `varuint32` | field name length |
| field_str | `bytes` | field name: `field_len` bytes holding valid UTF-8 string |
| module_len | `varuint32` | length of `module_str` in bytes |
| module_str | `bytes` | module name: valid UTF-8 byte sequnce |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

@rossberg
Copy link
Member

I still think this is the wrong place to impose such a requirement, for all the reasons stated.

However, I realised that under such a spec implementations could still be allowed to restrict the range of code points they accept (and thereby limit to ASCII in particular) by the same token that they are allowed to impose other implementation restrictions, such as on the number of local variables or sizes of functions, etc.

So as long as there is agreement that it is legal for engines to implement such restrictions -- and we'll include it in the previously discussed (offline) but yet-to-be-written list on allowable implementation restrictions -- I would be fine with the change.

@sunfishcode
Copy link
Member Author

@rossberg-chromium Yes, allowing embedders to impose additional constraints in this space would be fine with me.

Are there any other comments on this PR?

@wanderer
Copy link
Contributor

less validation rules on the base spec === smaller code base. I think it should be left to the upper layers to decided on the string format

@sunfishcode
Copy link
Member Author

@wanderer The amount of code needed is quite small. Some implementations will already have a Unicode library linked in for other purposes, and for those that don't, here's a simple standalone implementation in C, for example:

https://gist.github.com/sunfishcode/c050d4f60633c49ae6e54a3d45385031

In my experiment adding this to a production wasm decoder, the performance impact was negligible.

An implementation which only accepted ASCII strings, as mentioned above, could be even simpler -- just check that no byte has the MSB set.

@RyanLamansky
Copy link

Anything left to discuss before this can be merged?

@sunfishcode
Copy link
Member Author

I believe all the concerns raised have been answered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants