
UTF-8 for all string encodings #989

Closed
jfbastien opened this issue Feb 15, 2017 · 80 comments

@jfbastien
Member

Currently:

  • We use var[u]int for most of WebAssembly's binary integer encoding. Consistency is good.
  • We use length + bytes for all "strings" such as import / export names, and we let the embedder apply extra restrictions as they see fit (and JS.md does). Separation of concerns, and leeway for embedders, are good.

#984 opens a can of worms w.r.t. using UTF-8 for strings. We could either:

  • Do varuint for length + UTF-8 for each byte; or
  • Do varuint for number of codepoints + UTF-8 for each codepoint.

I'm not opposed to it—UTF-8 is super simple and doesn't imply Unicode—but I want the discussion to be a stand-alone thing. This issue is that discussion.

Let's discuss arguments for / against UTF-8 for all strings (not Unicode) in this issue, and vote 👍 or 👎 on the issue for general sentiment.
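
For concreteness, here is a minimal sketch (in TypeScript, with illustrative helper names) of how a decoder would read a string under the first option above, assuming the same LEB128 varuint32 encoding the binary format already uses elsewhere; the byte-length prefix lets a consumer that doesn't care about the contents skip the string without decoding it:

```ts
// Read a LEB128-encoded varuint32 starting at `offset`.
function readVarUint32(bytes: Uint8Array, offset: number): { value: number; next: number } {
  let value = 0;
  let shift = 0;
  for (;;) {
    const b = bytes[offset++];
    value |= (b & 0x7f) << shift;
    if ((b & 0x80) === 0) break;
    shift += 7;
  }
  return { value: value >>> 0, next: offset };
}

// Option 1: a varuint32 byte length followed by that many UTF-8 bytes.
// The raw bytes can be sliced out (or skipped entirely) with no UTF-8 processing.
function readString(bytes: Uint8Array, offset: number): { utf8: Uint8Array; next: number } {
  const { value: len, next } = readVarUint32(bytes, offset);
  return { utf8: bytes.subarray(next, next + len), next: next + len };
}
```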

@jfbastien
Member Author

Argument for UTF-8: it's very simple; an encoder and a decoder are easy to write in JavaScript. Again, UTF-8 is not Unicode.
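
To illustrate how small that is, here is a rough sketch (mine, in TypeScript rather than JavaScript) of the encoding side, treating a codepoint as nothing more than an integer:

```ts
// Encode one "codepoint", treated purely as an integer in 0 .. 0x10FFFF,
// into its UTF-8 byte sequence. No Unicode tables or semantics are involved;
// a strict encoder would additionally reject surrogates (0xD800 .. 0xDFFF).
function encodeCodepoint(cp: number): number[] {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  if (cp < 0x10000) {
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}
```

For example, encodeCodepoint(0x20AC) yields [0xE2, 0x82, 0xAC], the three bytes that encode "€".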

@jfbastien
Member Author

Argument against UTF-8: it's ever so slightly more complicated than length + bytes, leading to potential implementation divergences.

@tabatkins

Again, UTF-8 is not Unicode.

What are you even saying? This is a nonsense sentence.

I think you're trying to say that there's no need to pull in an internationalization library. This is true - mandating that strings are encoded in UTF-8 has nothing to do with all the more complicated parts of Unicode, like canonicalization. Those are useful tools when you're doing string work that interfaces with humans, but in the same way that a trig library is useful to people doing math, and not relevant when deciding how to encode integers.

But UTF-8 is literally a Unicode encoding; your statement is meaningless as written. ^_^

@jfbastien
Member Author

jfbastien commented Feb 15, 2017

But UTF-8 is literally a Unicode encoding; your statement is meaningless as written. ^_^

Yes, I'm specifically referring to the codepoint encoding that UTF-8 describes, not the treatment of codepoints proper (for the purpose of this proposal, a codepoint is an opaque integer). Put in wasm-isms, UTF-8 is similar to var[u]int, but more appropriate to characters. Further, UTF-8 isn't the only Unicode encoding, and it can be used to encode non-Unicode integers. So, UTF-8 isn't Unicode.

A further proposal would look at individual codepoints and do something with them. This is not that proposal.

@tabatkins

And there would be no reason to. No Web API has found the need to introspect on the codepoints beyond strict equality comparison and sorting, unless it's literally an i18n API.

@RyanLamansky

Another option is byte length + UTF-8 for each code point ( @jfbastien unless this is what you meant when you said UTF-8 for each byte, which I admit didn't make sense to me). I don't think this would make things any more difficult for a primitive parser that doesn't really care, while allowing a sophisticated Unicode library to take a byte array, offset, and length as input and return a string.

I agree with the definition as "UTF-8 code points", which are just integers. The binary spec should leave it at that. Individual embedders can define rules around allowed code points, normalization and other nuances. Analysis tools could provide warnings for potential compatibility issues.

I think error handling decisions should also be left to the embedders. A system that accesses WASM functions by index rather than by name has no need for the names to be valid (and they'd be easy to skip over with a byte-length prefix).

@sunfishcode
Member

sunfishcode commented Feb 16, 2017

Here's an attempt at summarizing the underlying issues and their reasons. Corrections and additions are most welcome.

Should wasm require module import/export identifiers be valid UTF-8?

My understanding of the reasons against is:

  • Processing imports and exports is on the critical path for application startup, and there's a desire to avoid anything which would slow it down.
  • The broad invariant that "the core wasm spec does not interpret strings": string interpretation is complex in general, and there's a desire to encapsulate it and keep broad invariants and boundaries that one can reason about at a high level.
  • WebAssembly decoders are often security-sensitive, so there's a general desire to minimize the amount of code involved.
  • Some WebAssembly producers may want to embed arbitrary data in these identifiers, and it's more convenient for them to encode the data however they want instead of mangling it into string form.

Should wasm recommend UTF-8 in areas where it doesn't require it?

The reason for would be that even if we can't require it, mentioning UTF-8 may discourage needless incompatibilities among the ecosystem.

My understanding of the reason against is that even mentioning UTF-8 would compromise the conceptual encapsulation of string interpretation concerns.

Should wasm specify UTF-8 for name-section names?

The reason for is: The entire purpose of these names is to be converted into strings for display, which is not possible without an encoding, so we should just specify UTF-8 so that tools don't have to guess.

My understanding of the reason against is: If wasm has other string-like things in other areas that don't have a designated encoding (i.e. imports/exports as discussed above), then for consistency's sake it shouldn't designate encodings for any strings.

@rossberg
Member

@sunfishcode provides a good summary, but I want to add three crucial points.

@jfbastien, it would be the most pointless of all alternatives to restrict binary syntax (an encoding) but not semantics (a character set) for strings. So for all practical purposes, UTF-8 implies Unicode. And again, this is not just about engines. If you define names to be Unicode, then you are forcing that on all Wasm ecosystems in all environments. And that pretty much means that all environments would be required to have some Unicode support.

@tabatkins, I think there is a domain error underlying your argument. None of the strings we are talking about are user-facing. They are dev-facing names. Many/most programming languages do not support Unicode identifiers, nor do tools. Can e.g. gdb handle Unicode source identifiers? I don't think so. So it is quite optimistic (or rather, unrealistic) to assume that all consumers have converged on Unicode in this space.

And finally, the disagreement is not whether Wasm on the Web should assume UTF-8, but where we specify that.

@tabatkins

I think there is a domain error underlying your argument. None of the strings we are talking about are user-facing. They are dev-facing names. Many/most programming languages do not support Unicode identifiers, nor do tools. Can e.g. gdb handle Unicode source identifiers? I don't think so. So it is quite optimistic (or rather, unrealistic) to assume that all consumers have converged on Unicode in this space.

"dev-facing" means "arbitrary toolchain-facing", which means you need to agree on encoding up-front, or else the tools will have to do encoding "detection" (that is to say, guessing, which is especially bad when applied to short values) or have out-of-band information. Devs are still users. ^_^

If you think a lot of toolchains aren't going to understand Unicode, then I'm unsure why you think they'd understand any other arbitrary binary encoding. If that's your limitation, then just specify and require ASCII, which is 100% supported everywhere. If you're not willing to limit yourself to ASCII, tho, then you need to accept that there's a single accepted non-ASCII encoding scheme - UTF-8.

Saying "eh, most things probably only support ASCII, but we'll let devs put whatever they want in there just in case" is the worst of both worlds.

@rossberg
Member

Saying "eh, most things probably only support ASCII, but we'll let devs put whatever they want in there just in case" is the worst of both worlds.

@tabatkins, nobody is proposing the above. As I said, the question isn't whether but where to define such platform/environment-specific matters. Wasm is supposed to be embeddable in the broadest and most heterogeneous range of environments, some much richer than others (for example, JS does support Unicode identifiers). Consequently, you want to allow choosing on a per-platform basis. Hence it belongs in platform API specs, not the core spec.

@tabatkins

There's no choice to make, tho! If your embedding environment doesn't support non-ASCII, you just don't use non-ASCII in your strings. (And if this is the case, you still need encoding assurance - UTF-16 isn't ASCII-compatible, for example!)

If your environment does support non-ASCII, you need to know what encoding to use, and the correct choice in all situations is UTF-8.

What environment are you imagining where it's a benefit to not know the encoding of your strings?

@tabatkins

it would be the most pointless of all alternatives to restrict binary syntax (an encoding) but not semantics (a character set) for strings. So for all practical purposes, UTF-8 implies Unicode.

No, it absolutely doesn't. For example, it's perfectly reasonable to simultaneously (a) restrict a string to the ASCII characters, and (b) dictate that it's encoded in UTF-8. Using ASCII characters doesn't imply an encoding, or else all encodings would be ASCII-compatible! (For example, UTF-16 is not.) So you still have to specify something; UTF-8, being "ASCII-compatible", is fine for this.

Again, if you are okay with restricting these names to ASCII-only, then it's reasonable to mandate the encoding be US-ASCII. If you want it to be possible to go beyond ASCII, then it's reasonable to mandate the encoding be UTF-8. Mandating anything else, or not mandating anything at all (and forcing all consumers to guess or use out-of-band information), are the only unreasonable possibilities.

And again, this is not just about engines. If you define names to be Unicode, then you are forcing that on all Wasm ecosystems in all environments. And that pretty much means that all environments would be required to have some Unicode support.

Again, this looks like you're talking about internationalization libraries. What we're discussing is solely how to decode byte sequences back into strings; that requires just knowledge of how to decode UTF-8, which is extremely trivial and extremely fast.

Unless you're doing human-friendly string manipulation, all you need is the ability to compare strings by codepoint, and possibly sort strings by codepoint, neither of which require any "Unicode support". This is all that existing Web tech uses, for example, and I don't see any reason Wasm environments would, in general, need to do anything more complicated than this.
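
To make that last point concrete, here is a sketch (assuming the names have already been validated as UTF-8): a handy property of valid UTF-8 is that byte-wise lexicographic comparison agrees with codepoint-wise comparison, so equality and sorting need nothing more than a plain byte loop:

```ts
// Compare two valid UTF-8 byte sequences. For valid UTF-8, byte-wise
// lexicographic order coincides with codepoint-wise order, so this is all
// an engine needs for equality checks and deterministic sorting.
function compareUtf8(a: Uint8Array, b: Uint8Array): number {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return a.length - b.length;
}
```

In particular, checking two validated names for equality reduces to comparing their byte sequences.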

@lukewagner
Member

I'm in favor of mandating utf8 for All The Strings. Pure utf8 decoding/encoding seems like a pretty low impl burden (compared to everything else) for non-Web environments. Also, from what I've seen, time spent validating utf8 for imports/names will be insignificant compared to time spent on everything else, so I don't think there's a performance argument here.

Practically speaking, even if we didn't mandate utf8 in the core wasm spec, you'd have a Bad Time interoperating with anything if your custom toolchain didn't also use utf8, unless you're a total island, in which case maybe you just say "screw it" and do your own non-utf8 thing anyway... because then who cares.

What I'd realllly like to do, though, is resolve #984, which seems to block on this...

@jfbastien
Member Author

@lukewagner I don't think #984 is blocked on this. 😄

@lukewagner
Member

I guess you're right.

@rossberg
Member

What environment are you imagining where it's a benefit to not know the encoding of your strings?

@tabatkins, it seems I've still not been clear enough. I don't imagine such an environment. However, I imagine a wide spectrum of environments with incompatible requirements. Not everything is a subset of UTF-8, e.g. Latin1 is still in fairly widespread use. You might not care, but it is not the job of the core Wasm spec to put needless stones in the way of environment diversity.

you'd have a Bad Time interoperating with anything if your custom toolchain didn't also use utf8 unless you're a total island

@lukewagner, I indeed expect that Wasm will be used across a variety of "continents" that potentially have little overlap. And where they do overlap, you can specify interop (in practice, name encodings are likely gonna be the least problem for sharing modules between different platforms -- it's host libraries). Even total islands are not unrealistic, especially wrt embedded systems (which also tend to have little use for Unicode).

@MI3Guy

MI3Guy commented Feb 17, 2017

One of the most difficult parts of implementing a non-browser-based WebAssembly engine is making things work the way they do in the browser (mainly the JS parts). I expect that if the encoding doesn't get standardized, we will end up with a de facto standard where everyone copies what is done for the web target. This will just result in it being harder to find information on how to decode these strings.

There may be value in allowing some environments to further restrict the allowed content, but not requiring UTF-8 will just result in more difficulty.

@rossberg
Member

@MI3Guy, the counter proposal is to specify UTF-8 encoding as part of the JS API. So if you are building a JS embedding then it's defined to be UTF-8 either way and makes no difference for you. (However, we also want to allow for other embedder APIs that are neither Web nor JavaScript.)

@MI3Guy

MI3Guy commented Feb 17, 2017

Right. My point is if you are not doing a JS embedding, you are forced to emulate a lot of what the JS embedder does in order to use the WebAssembly toolchain.

@pipcet

pipcet commented Feb 17, 2017

Do varuint for number of codepoints + UTF-8 for each codepoint.

I'd just like to speak out against this option. It complicates things, doesn't and cannot apply to user-specific sections, and provides no benefit that I can see—in order to know the number of codepoints in a UTF-8 string, in practice you always end up scanning the string for invalid encodings, so you might as well count codepoints while you're at it.

@tabatkins

Not everything is a subset of UTF-8, e.g. Latin1 is still in fairly widespread use. You might not care, but it is not the job of the core Wasm spec to put needless stones in the way of environment diversity.

Correct; UTF-8 differs from virtually every encoding once you leave the ASCII range. I'm unsure what your point is with this, tho. Actually using the Latin-1 encoding is bad precisely because there are lots of other encodings that look the same but encode different letters. If you tried to use the name "æther" in your Wasm code and encoded it in Latin-1, then when someone else (justifiably) tries to read the name with a UTF-8 toolchain, they'll get a decoding error. Or maybe the other person was making a similar mistake, but used the Windows-1250 encoding instead (intended for Central/Eastern European languages) - they'd get the nonsense word "ćther".

I'm really not sure what kind of "diversity" you're trying to protect here. There is literally no benefit to using any other encoding, and tons of downside. Every character you can encode in another encoding is present in Unicode and can be encoded in UTF-8, but the reverse is almost never true. There are no relevant tools today that can't handle UTF-8; the technology is literally two decades old.

I keep telling you that web standards settled this question years ago, not because Wasm is a web spec that needs to follow web rules, but because text encoding is an ecosystem problem that pretty much everyone has the same problems with, and the web already dealt with the pain of getting this wrong, and has learned how to do it right. There's no virtue in getting it wrong again in Wasm; every environment that has to encode text either goes straight to UTF-8 from the beginning, or makes the same mistakes and suffers the same pain that everyone else does, and then eventually settles on UTF-8. (Or, in rare cases, develops a sufficiently isolated environment that they can standardize on a different encoding, and only rarely pays the price of communicating with the outside environment. But they standardize on an encoding, which is the point of all this.)

@tabatkins

So if you are building a JS embedding then it's defined to be UTF-8 either way and makes no difference for you. (However, we also want to allow for other embedder APIs that are neither Web nor JavaScript.)

This issue has nothing to do with the Web or JS. Every part of the ecosystem wants a known, consistent text encoding, and there's a single one that is widely agreed upon across programming environments, countries, and languages: UTF-8.

@qwertie

qwertie commented Feb 19, 2017

I vote for 'Do varuint for length (in bytes) + UTF-8 for each byte'. Assuming that's not a controversial choice - pretty much every string implementation stores strings as "number of code units" rather than "number of code points", because it's simpler - then isn't the real question "should validation fail if a string is not valid UTF-8"?

As I pointed out in #970, invalid UTF-8 can be round-tripped to UTF-16, so if invalid UTF-8 is allowed, software that doesn't want to store the original bytes doesn't have to. On the other hand, checking if UTF-8 is valid isn't hard (though we must answer - should overlong sequences be accepted? surrogate characters?)

On the whole I'm inclined to say let's mandate UTF-8. In the weird case that someone has bytes they can't translate to UTF-8 (perhaps because the encoding is unknown), arbitrary bytes can be transliterated to UTF-8.
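
For reference, here is a sketch of a strict validator that answers both parenthetical questions the way the Unicode definition of UTF-8 (and the WHATWG decoder) does, namely by rejecting overlong forms and surrogate codepoints (illustrative code, not proposed spec text):

```ts
// Strict UTF-8 validation: rejects invalid lead bytes, truncated sequences,
// overlong encodings, surrogates (U+D800..U+DFFF), and values above U+10FFFF.
function isValidUtf8(bytes: Uint8Array): boolean {
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i];
    if (b0 < 0x80) { i++; continue; }               // ASCII fast path
    let len = 0, cp = 0, min = 0;
    if ((b0 & 0xe0) === 0xc0)      { len = 2; cp = b0 & 0x1f; min = 0x80; }
    else if ((b0 & 0xf0) === 0xe0) { len = 3; cp = b0 & 0x0f; min = 0x800; }
    else if ((b0 & 0xf8) === 0xf0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return false;                              // stray continuation or invalid lead byte
    if (i + len > bytes.length) return false;       // truncated sequence
    for (let j = 1; j < len; j++) {
      const b = bytes[i + j];
      if ((b & 0xc0) !== 0x80) return false;        // not a continuation byte
      cp = (cp << 6) | (b & 0x3f);
    }
    if (cp < min) return false;                     // overlong encoding
    if (cp >= 0xd800 && cp <= 0xdfff) return false; // surrogate
    if (cp > 0x10ffff) return false;                // outside the codepoint range
    i += len;
  }
  return true;
}
```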

@rossberg
Member

rossberg commented Feb 20, 2017

I'm really not sure what kind of "diversity" you're trying to protect here.

@tabatkins, yes, that seems to be the core of the misunderstanding.

It is important to realise that WebAssembly, despite its name, is not limited to the web. We are very careful to define it in suitable layers, such that each layer is as widely usable as possible.

Most notably, its core is not actually a web technology at all. Instead, try to think of it as a virtual ISA. Such an abstraction is useful in a broad spectrum of different environments, from very rich (the web) to very rudimentary (embedded systems), that do not necessarily have anything to do with each other, may be largely incompatible, and have conflicting constraints (that Wasm is in no position to change).

As such, it makes no more sense to impose Unicode on core Wasm than it would, say, to impose Unicode on all string literals in the C programming language. You'd only coerce some potential clients into violating this bit of the standard. What's the gain?

There will, however, be additional spec layers on top of this core spec that define its embedding and API in concrete environments (such as JavaScript). It makes perfect sense to fix string encodings on that level, and by all means, we should.

@rossberg
Member

PS: A slogan that defines the scope of Wasm is that it's an abstraction over common hardware, not an abstraction over common programming languages. And hardware is agnostic to software concerns like string encodings. That's what ABIs are for.

@jfbastien
Member Author

@rossberg-chromium

As such, it makes no more sense to impose Unicode on core Wasm than it would, say, to impose Unicode on all string literals in the C programming language. You'd only coerce some potential clients into violating this bit of the standard. What's the gain?

I agree 100%. This issue isn't about Unicode though, it's purely about UTF-8, an encoding for integers, without mandating that the integers be interpreted as Unicode.

I don't understand if we agree on that. Could you clarify: are you OK with UTF-8, and if not why?

@rossberg
Member

@jfbastien, would it be any more productive to require UTF-8 conformance for all C string literals?

As I noted earlier, it makes no sense to me to restrict the encoding but not the character set. That's like defining syntax without semantics. Why would you possibly do that? You gain zero in terms of interop but still erect artificial hurdles for environments that do not use UTF-8 (which only Unicode environments do anyway).

@jfbastien
Member Author

@jfbastien, would it be any more productive to require UTF-8 conformance for all C string literals?

I don't understand, can you clarify?

As I noted earlier, it makes no sense to me to restrict the encoding but not the character set. That's like defining syntax without semantics. Why would you possibly do that? You gain zero in terms of interop but still erect artificial hurdles for environments that do not use UTF-8 (which only Unicode environments do anyway).

I think that's the crux of the discussion.

@tabatkins touched on precedents to exactly this:

Again, this looks like you're talking about internationalization libraries. What we're discussing is solely how to decode byte sequences back into strings; that requires just knowledge of how to decode UTF-8, which is extremely trivial and extremely fast.

Unless you're doing human-friendly string manipulation, all you need is the ability to compare strings by codepoint, and possibly sort strings by codepoint, neither of which require any "Unicode support". This is all that existing Web tech uses, for example, and I don't see any reason Wasm environments would, in general, need to do anything more complicated than this.

So I agree: this proposal is, in your words, "defining syntax without semantics". That's a very common thing to do. In fact, WebAssembly's current length + bytes specification already does this!

I'd like to understand what the hurdle is. I don't really see one.

@tabatkins

It is important to realise that WebAssembly, despite its name, is not limited to the web.

I just stated in the immediately preceding comment that this has nothing to do with the web. You keep trying to use this argument, and it's really confusing me. What I'm saying has nothing to do with the web; I'm merely pointing to the web's experience as an important example of lessons learned.

As such, it makes no more sense to impose Unicode on core Wasm than it would, say, to impose Unicode on all string literals in the C programming language. You'd only coerce some potential clients into violating this bit of the standard. What's the gain?

You're not making the point you think you're making - C does have a built-in encoding, as string literals use the ASCII encoding. (If you want anything else you have to do it by hand by escaping the appropriate byte sequences.) In more current C++ you can have UTF-16 and UTF-8 string literals, and while you can still put arbitrary bytes into the string with \x escapes, the \u escapes at least verify that the value is a valid codepoint.

All of this is required, because there is no inherent mapping from characters to bytes. That's what an encoding does. Again, not having a specified encoding just means that users of the language, when they receive byte sequences from other parties, have to guess at the encoding to turn them back into text.

You gain zero in terms of interop but still erect artificial hurdles for environments that do not use UTF-8 (which only Unicode environments do anyway).

Can you please point to an environment in existence that uses characters that aren't included in Unicode? You keep trying to defend this position from a theoretical purity / environment diversity standpoint, but literally the entire point of Unicode is to include all of the characters. It's the only character set that can make a remotely credible argument for doing so, and when you're using the Unicode character set, UTF-8 is the preferred universal encoding.

What diversity are you attempting to protect? It would be great to see even a single example. :/

@rossberg
Member

rossberg commented Feb 27, 2017 via email

@annevk
Member

annevk commented Feb 27, 2017

Moreover, they would have to deal with the problem of characters that cannot be mapped (in either direction), so you'd still have compatibility issues in general; you'd just have kicked the can down the road.

Is this a theoretical concern?

@tabatkins

And if it's a reasonable concern, we must once again weigh the (occurrence * cost) of dealing with that against the cost of virtually every other user of Wasm in the world not being able to depend on an encoding, and having to deal with the same encoding-hell the web platform had to go thru, and eventually fixed as well as it could.

@rocallahan

Non-Unicode platforms would be forced to perform transcoding to actually handle their strings.

In what cases do Wasm strings need to interoperate with platform strings, though? As far as I can tell we're only talking about the encoding of strings in the Wasm metadata, not the encoding of strings manipulated by actual module code. (If that's wrong, I apologize...) Then I can only think of a few possible cases where interop/transcoding might be required:

  • A Wasm module imports a platform identifier
  • The platform imports a Wasm identifier
  • You want to extract Wasm names and print them or save them using platform strings, e.g. to dump a stack trace.

Right?

For hypothetical non-Unicode embedded systems, for the first two cases, the advice is simple: limit identifiers imported across the platform boundary to ASCII, then the required transcoding is trivial. Wasm modules could still use full Unicode names internally and for linking to each other.

For the third issue --- if you have a closed world of Wasm modules, you can limit their identifiers to ASCII. If not, then in practice you'll encounter UTF8 identifiers and you'd better be able to transcode them, and you'll be glad the spec mandated UTF8!

@rocallahan

implying that somebody's non-ASCII characters are irrelevant a priori

That is a straw-man argument. The position here is "if you want non-ASCII identifiers, use Unicode or implement transcoding to/from Unicode", and it has not attracted criticism as "culturally questionable" in other specs, AFAIK.

@rossberg
Member

rossberg commented Feb 28, 2017 via email

@rossberg
Member

rossberg commented Feb 28, 2017 via email

@rocallahan

every embedding spec will specify an encoding and character set. On every platform you can rely on this. You'd only ever run into encoding questions if you tried to interoperate between two unrelated ecosystems -- which will already be incompatible for deeper reasons than strings.

What about Wasm processing tools such as disassemblers? Wouldn't it be valuable to be able to write a disassembler that works with any Wasm module regardless of "embedding spec" variants?

Under the proposal you would not be allowed to limit anything to ASCII!

Under the proposal, Wasm modules would not be limited to ASCII, but if an implementer chose to make all their identifiers defined outside Wasm modules ASCII (e.g. as pretty much all system libraries actually do!), that would be outside the scope of the Wasm spec.

If an implementer chose to print only ASCII characters in a stack trace and replace all non-ASCII Unicode characters with ? or similar, that has to be allowed by the spec, since in practice there always exist Unicode characters you don't have a font for anyway.

Having said all that, defining a subset of Wasm in which all Wasm names are ASCII would be fairly harmless since such Wasm modules would be processed correctly by tools that treat Wasm names as UTF8.

@tabatkins

You are software engineers. As such I assume you understand and appreciate the value of modularisation and layering, to separate concerns and maximise reuse. That applies to specs as well.

Yes, I'm a software engineer. I'm also a spec engineer, so I understand the value of consistency and establishing norms that make the ecosystem work better. Character sets and encodings are one of the subjects where the value of allowing modularization and choice is vastly outweighed by the value of consistency and predictability. We have literal decades of evidence of this. This is why I keep repeating myself - you're ignoring history and the recommendation of many experts, several of whom have shown up in this very thread, and many more whose opinions I'm representing, when you insist that we need to allow freedom in this regard.

@titzer

titzer commented Mar 1, 2017

After reading this whole (long) thread, I think the only way to resolve this discussion is to explicitly specify that the names section we are describing in the binary format and are enhancing in #984 uses a UTF-8 encoding, and I would propose that we simply call that section "utf8-names". That makes the encoding explicit, and almost certainly all tools that want to manipulate WASM binaries on all relevant platforms today want to speak UTF-8 anyway. They could be forgiven for speaking only UTF-8.

I am sensitive to @rossberg-chromium's concerns for other platforms, and to some extent, I agree. However, this is easily fixable. As someone suggested earlier in the thread, those systems are more than welcome to add a non-standard "ascii-names" section, or one for any other encoding that their ecosystem uses. With explicit names, it becomes obvious which tools work with which sections. For modules that only work on DOS, this would become obvious from the presence of DOS-specific sections. IMO it would be a disaster to interpret these binaries' names as having a different encoding.

(By the way, this is informed from war stories about a system that accidentally lost the encodings of the strings for user-uploaded content, and could never recover them. The system died a horrific, spasmic death. Literally, millions of dollars were lost.)

@titzer

titzer commented Mar 1, 2017

We could even adopt a naming standard for names sections (heh), so that they are all "<encoding>-names" so that generic tools could process and manipulate any kind of names section, regardless of the actual encoding of the strings inside.

@RyanLamansky

@titzer Yeah, custom sections are the solution here for exotic or specialized platforms that want nothing to do with UTF8. I'd be hesitant to prescribe it in the spec, though: if a platform is so specific in its mode of operation that it can't even be bothered to map UTF-8 code points to their native preference, they may want to do a lot more with custom sections than just supply names in their preferred encoding.

I recommend putting a greater emphasis on using custom sections for platform-specific details in the spec, and let the platform's own specifications define those details. Common WASM toolchains could support them via some kind of plug-in architecture.

@lukewagner
Member

@titzer Switching to utf8-names sounds fine. As a bonus, it would smooth the transition since browsers could easily support both "names" (in the old format) and "utf8-names" (in the #984 format) for a release or two before dropping "names" which in turn removes a lot of urgency to get this deployed.

Sorry if this was already decided on above but, to be clear: is there any proposed change to the import/export names from what's in BinaryEncoding.md now?

@jfbastien
Member Author

utf8-names sounds fine.

Same question as @lukewagner on import/export.

@titzer

titzer commented Mar 2, 2017

@lukewagner @jfbastien Good question. I didn't see a decision above. I think above all we don't want to change the binary format from what we have now. So it's really just whatever mental contortions we have to go through to convince ourselves what we did is rational :-)

AFAICT we currently assume that strings in import/exports are uninterpreted sequences of bytes. That's fine. I think it's reasonable to consider the encoding of strings used for import/export to be solely defined by the embedder in a way that the names section is not; e.g. JS always uses UTF-8. The names section comes with an explicit encoding in the name of the names section.

Short version: the encoding of names in import/export declarations is a property of the embedding environment, the encoding of names in the names section is explicit by the string used to identify the user section (e.g. "utf8-names").

WDYT?
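
As a rough sketch of that convention (illustrative helper names, assuming the custom-section name has already been read from the section header), a generic tool could recover the declared encoding directly from the section name:

```ts
// Dispatch on the "<encoding>-names" convention for custom sections.
// `sectionName` is assumed to have been parsed out of the section header already.
function namesEncodingOf(sectionName: string): string | null {
  if (!sectionName.endsWith("-names")) return null;  // not a names section
  // "utf8-names" -> "utf8", "ascii-names" -> "ascii", etc. A generic tool can
  // decode the encodings it knows and pass the rest through verbatim as raw bytes.
  return sectionName.slice(0, -"-names".length);
}

// Example: a tool that only understands UTF-8 names.
function canDecodeNames(sectionName: string): boolean {
  return namesEncodingOf(sectionName) === "utf8";
}
```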

@lukewagner
Member

lukewagner commented Mar 2, 2017

That's fine with me and matches what we had before #984 merged (modulo names=>utf8-names).

@jfbastien
Member Author

I think the names section isn't as important as import/export, which are where the true compatibility issues occur:

  • Load a mojibaked names section and you get funky Error.stack and debugging.
  • Load a mojibaked import/export and nothing works.

I don't think this is truly a binary format change since the embeddings we all implement already assume this.

I'd lean on the recommendation of people who know better than I do about this topic before closing.

@annevk
Member

annevk commented Mar 3, 2017

You'll need to decide on how you decode UTF-8. Do you replace erroneous sequences with U+FFFD or halt on the first error? That is, you either want https://encoding.spec.whatwg.org/#utf-8-decode-without-bom or https://encoding.spec.whatwg.org/#utf-8-decode-without-bom-or-fail. Either way loading will likely fail, unless the resource happened to use U+FFFD in its name.
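
In Web-embedding terms, those two algorithms roughly correspond to the two TextDecoder configurations below (a sketch with made-up name bytes): fatal: true throws on the first error, matching utf-8 decode without BOM or fail, while the default replaces erroneous sequences with U+FFFD; ignoreBOM: true keeps a leading BOM from being stripped.

```ts
const nameBytes = new Uint8Array([0x66, 0x6f, 0x6f, 0xff]); // "foo" plus an invalid byte

// utf-8 decode without BOM or fail: throws a TypeError on the first error.
try {
  new TextDecoder("utf-8", { fatal: true, ignoreBOM: true }).decode(nameBytes);
} catch (e) {
  // the module (or name) would be rejected here
}

// utf-8 decode without BOM: erroneous sequences become U+FFFD.
const lossy = new TextDecoder("utf-8", { ignoreBOM: true }).decode(nameBytes); // "foo\uFFFD"
```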

@lukewagner
Member

lukewagner commented Mar 3, 2017

The way it's currently described we throw an exception if the import/export name byte array fails to decode as UTF-8 into a JS string. After that, you have a JS string and import lookup is defined in terms of Get.

@sunfishcode
Member

To check my understanding, if we did https://encoding.spec.whatwg.org/#utf-8-decode-without-bom-or-fail, would that mean that, after successful validation, checking for codepoint-sequence equality would be equivalent to checking for byte-sequence equality?

@annevk
Member

annevk commented Mar 3, 2017

Yes.

@sunfishcode
Member

After the discussion above, I support validating UTF-8 for import/export names in the core spec.

Specifically, this would be utf-8-decode-without-bom-or-fail, and codepoint-sequence equality (so engines can do byte-sequence equality), so engines would avoid the scary and expensive parts of Unicode and internationalization. And, this is consistent with the Web embedding. I've experimented with this and found the main overhead negligible.

  • Re: Hardware ISAs are agnostic to encoding: The hardware we're talking about here doesn't have imports/exports as such, so the analogy doesn't directly apply. The one place I'm aware of where such hardware uses byte-sequence identifiers of any kind, x86's cpuid, does specify a specific character encoding: UTF-8.

  • Re: Layering: As software engineers, we also know that layering and modularisation are means, not ends in themselves. For example, we could cleanly factor out LEB128 from the core spec. That would provide greater layering and modularisation. LEB128 is arguably biased toward Web use cases.

  • Re: "Embedded systems": An example given is DOS, but what would be an example of something that a UTF-8 requirement for import/export names would require a DOS system to do that would be expensive or impractical for it to do?

  • Re: Islands: WebAssembly also specifies a specific endianness, requires floating-point support and 8-bit address units, and makes other choices, even though there are real settings where those would be needless burdens. WebAssembly makes choices like those when it expects they'll strengthen the common platform that many people can share.

  • Re: Arbitrary data structures in import/export names: this is theoretically useful, but it can also be done via mangling data into strings. Mangling is less convenient, but not difficult. So there's a tradeoff there, but not a big one (and arguably, if there's a general need for attaching metadata to imports/exports, it'd be nicer to have an explicit mechanism than saddling identifiers with additional purposes.)

  • Re: Binary compatibility: I also agree with JF that this change is still feasible. utf-8-decode-without-bom-or-fail would mean no silent behavior changes, and at this time, all known wasm producers keep their output compatible with the Web embedding (even if they also support other embeddings), so they're already staying within UTF-8.

sunfishcode added a commit that referenced this issue Mar 14, 2017
This implements the UTF-8 proposal described in
#989 (comment).

This does not currently rename "name" to "utf8-name", because if UTF-8 is
required for import/export names, there's a greater appeal to just saying
that all strings are UTF-8, though this is debatable.
@sunfishcode
Member

sunfishcode commented Mar 14, 2017

A PR making a specific proposal for UTF-8 names is now posted as #1016.

sunfishcode added a commit that referenced this issue Mar 30, 2017
* Require import/export names to be UTF-8.

This implements the UTF-8 proposal described in
#989 (comment).

This does not currently rename "name" to "utf8-name", because if UTF-8 is
required for import/export names, there's a greater appeal to just saying
that all strings are UTF-8, though this is debatable.

* s/utf8/UTF-8/g

* Say "UTF-8 byte sequence" rather than "UTF-8 string".

This document is describing the encoded bytes, rather than the string which
one gets from decoding them.

Also, make the descriptions of the byte sequence length fields more precise.

* Fix typo.
@sunfishcode
Member

With #1016, this is now fixed.
