
Can WebAssembly.Instance recompile an already-compiled Module? #838

Closed
jfbastien opened this issue Oct 27, 2016 · 94 comments

@jfbastien (Member)

Say I have this code:

let descriptor = /* ... */;
let memories = Array.apply(null, {length: 1024}).map(() => new WebAssembly.Memory(descriptor));
let instance = fetch('foo.wasm')
  .then(response => response.arrayBuffer())
  .then(buffer => WebAssembly.compile(buffer))
  .then(module => new WebAssembly.Instance(module, { memory: memories[1023] }));

Is WebAssembly.Instance allowed to block for a substantial amount of time? Could it, for example, recompile the WebAssembly.Module?

In most cases I'd say no, but what if the already-compiled code doesn't particularly like the memory it receives? Say, because that memory is a slow-mode memory and the code was compiled assuming fast-mode? Maybe memories[0] was a fast-mode memory, but memories[1023] sure won't be.

What about this code instead:

let instances = [0,1,2,3,4,5,6,7].map(v => fetch(`foo${v}.wasm`)
  .then(response => response.arrayBuffer())
  .then(buffer => WebAssembly.compile(buffer))
  .then(module => new WebAssembly.Instance(module)));

Are those calls to WebAssembly.Instance allowed to cause recompilation?

Assuming the above makes sense, here are a few related questions:

  • Do we want a promise-returning async function which can compile and instantiate (a hypothetical sketch follows this list)? I'm not saying that we should drop any of the synchronous and asynchronous APIs that we already have; I'm proposing a new asynchronous API.
  • How does a browser expose that compiled code in a WebAssembly.Module is fast, and that a WebAssembly.Memory instance is suitable for such fast code? Right now the answer seems to be "try it and see if you can notice".
  • How does a user know how many WebAssembly.Memory instances they're allowed before they get slow code (counting the implicit ones, e.g. as created by the second example)?
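
A hypothetical shape for that first bullet (the name compileAndInstantiate and the flat import object are placeholders for illustration, not an existing API):

let instance = fetch('foo.wasm')
  .then(response => response.arrayBuffer())
  // One async step: the engine sees the memory before generating any code.
  .then(buffer => WebAssembly.compileAndInstantiate(buffer, { memory: memories[1023] }));
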
@mbebenita (Contributor)

It would be nice for WebAssembly.Instance to sometimes cause recompilation; that way, immutable global vars could be constant-folded into the generated code. For instance, Emscripten generates relocatable code by offsetting all pointers to static data. The offset is passed in as an immutable global var when the module is instantiated. If WebAssembly.Instance can recompile, it could specialize the generated code.

@rossberg (Member)

The specification doesn't define what "compilation" is, nor would it make sense for it to do so, because implementation approaches may differ wildly (including interpreters). So it cannot have any normative say about this either way. The best we could do is add a note that WebAssembly.Instance is expected to be "fast".


@lukewagner (Member)

Agreed this would be a non-normative note at most.

In SM, we are also currently intending for instantiation to never recompile so that there is a predictable compilation cost model for devs (in particular, so that devs can use WebAssembly.compile and IDB to control when they take the compilation hit). Instantiation-time recompilation from within the synchronous Instance constructor would certainly break that cost model and could lead to major jank.
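
As a sketch of that cost-model pattern (compile once, cache the Module explicitly; the IDB object-store setup is assumed, and this relies on Module being structured-cloneable):

// Assumes `db` is an open IDBDatabase with an object store named 'wasm'.
function compileAndCache(db, url) {
  return fetch(url)
    .then(response => response.arrayBuffer())
    .then(buffer => WebAssembly.compile(buffer))
    .then(module => {
      // Store the compiled Module so later visits skip the compilation hit.
      db.transaction('wasm', 'readwrite').objectStore('wasm').put(module, url);
      return module;
    });
}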

But I do appreciate that separate compilation is fundamentally at odds with a variety of optimizations one might want to do to specialize generated code to ambient parameters. Fusing compilation and instantiation into one async op makes sense and is something we've considered in the past. The downside, of course, is that this inhibits explicit caching (there is no Module), so the developer has to make an unpleasant tradeoff. Some options:

  • The impl could do implicit content-addressed caching (which could include ambient parameters in the key), like we do with asm.js currently in FF. This would be kind of a pain and has all the predictability/heuristic problems of any implicit cache.
  • We could create a new API (e.g., a new WebAssembly.Cache) where you pass in bytecode and instantiation parameters and get back a Promise<Instance>.

The latter intrigues me and could provide a much nicer developer experience than using IDB and perhaps a chance to even further optimize caching (since the cache is specialized to purpose), but it's certainly a big feature and something we'd want to take some time to consider.
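
Purely as a strawman, that Cache idea might look like this (none of these names exist; the instantiation parameters would be part of the cache key):

const cache = new WebAssembly.Cache();                // hypothetical
cache.instantiate(bytecode, { memory })               // Promise<Instance>
  .then(instance => instance.exports.main());         // 'main' export assumed
// The cache may recompile internally when the stored code doesn't match
// the supplied Memory, and then cache the new version too.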

@jfbastien (Member, Author)

@rossberg-chromium I seem to have explained my purpose badly: I don't want to quibble over what the spec says. I'm trying to point out what seems like a serious surprise for developers, hiding under the API. A developer won't expect .compile's result to be re-compiled. That seems like a design flaw to me.

@lukewagner even with implicit or explicit caching we may have the same issue: how many WebAssembly.Memory can be created in the same address-space / origin is a browser limitation. I like what you're suggesting, but I think it's orthogonal to the issue. Let me know if I've misunderstood what you suggest.

Maybe .compile and Module could be given a Memory, and Instance has a .memory property which can be passed to other compilations / instantiations?

I'm not trying to eliminate the possibility of re-compilation. Rather, I think we want a common, idiomatic API usage which has perfect information w.r.t. Memory at first-compile time (or at cache-retrieval time), so that compilation emits bounds checks only when needed.

@lukewagner (Member)

@jfbastien With implicit/explicit caching that was provided the particular instantiation parameters (so Memory), I don't see how there would be any need for recompilation.

@jfbastien (Member, Author)

@jfbastien With implicit/explicit caching that was provided the particular instantiation parameters (so Memory), I don't see how there would be any need for recompilation.

There may be:

  1. Create many Memorys.
  2. Compile code, with explicit (slow) bounds checks because there were too many Memorys.
  3. Cache that code.
  4. Leave page.
  5. Load page again.
  6. Allocate only one Memory, which gets the fast version.
  7. Get from the cache.
  8. Receive slow code Instance.

At this point I agree you don't need recompilation, but we're being a bit silly if we do slow bounds checks when we don't need to.

As I said: I like this Cache API you propose, I think it makes WebAssembly more usable, but I think the problem is still there. 😢

@lukewagner (Member)

Well that's my point about having an enhanced cache that accepts instantiation parameters and bytecode: the cache is free to recompile if what it has cached doesn't match the instantiation parameters. So the steps would just be:

  1. create many Memorys
  2. request an Instance from the cache, passing one of those (slow) Memorys
  3. slow-code is compiled, cached and returned as an Instance
  4. leave page
  5. load page again
  6. allocate only one Memory
  7. request an Instance from the cache, passing the fast Memory
  8. fast-code is compiled, cached and returned as an Instance

and after step 8, all future page loads will get cached fast or slow code.

@pizlonator (Contributor)

@lukewagner First of all, you're proposing a mitigation that flies in the face of WebAssembly's stated goal of providing deterministic performance. The difference between slow and fast was last quoted as around 20%, so it would really stink if a spec that painstakingly aims for deterministic perf drops it on the floor because of an API quirk. I don't buy that the browser having a content-addressed cache is the right solution, because the spec already goes to a lot of trouble elsewhere to obviate the need for profile-recompile-cache optimizations. For example, we promisify compilation precisely so that the app can get reasonable behavior even if the code is not cached. If the way this is spec'd necessitates that all of us implement caches or other mitigations, then we will have failed our goal of giving people a reasonably portable cost model.

To me the issue is just this: one of the optimizations that we will all effectively have to do for competitive reasons (the 4GB virtual memory bounds checking, which I'll just call the 4GB hack) cannot be done in the current spec without sacrificing one of these things:

  • You can get away with it if you always allocate 4GB of virtual memory for any wasm memory. This will discourage people from using WebAssembly for small modules, because you will hit virtual memory allocation limits, virtual memory fragmentation, or other problems if you allocate many of these. I also fear that if you allow allocating a lot of them then you will reduce the efficacy of security mitigations like ASLR. Note that existing APIs don't share this danger, since they commit the memory they allocate and they will either OOM or crash before letting you allocate much more than what physical memory allows.
  • You can get away with it if you allow for a recompile when you find a mismatch during instantiation (compiled code wants 4GB hack but the memory doesn't have that virtual memory allocation). You could also get away with it if instantiation moved the memory into a 4GB region, but see the previous point. So, probably anytime this happens, it'll be a P1 bug for the browser that encountered it.

I think that this means that the spec will encourage vendors to converge to just allowing 4GB reservations anytime wasm memory is allocated, or to have cache/lazy-compile/profile optimizations to detect this.

Finally, I don't understand the point about making any of this non-normative. This can be normative, because we could make the API preclude the possibility of the browser having to compile something without knowing what kind of memory it will have. I imagine that there are many ways to do this. For example, instantiating could return a promise and we could remove the separate compile step. This would make it clear that instantiation is the step that could take a while, which strongly implies to the client that this is the step that does the compilation. In such an API, the compiler always knows if the memory it's compiling for has the 4GB hack or not.

It's sad that we're only noticing this now, but I'm surprised that you guys don't see this is a bigger issue. Is there some mitigation other than caching that I'm overlooking?

@mtrofin (Contributor)

mtrofin commented Oct 27, 2016

@jfbastien in your motivating scenario, you pointed out that the module was authored to prefer fast memory. I'm assuming you're primarily after enabling the fast-memory optimization when a particular module wants it, and might be OK with not doing it when the module doesn't want it (nothing bad about opportunistically stumbling upon it in that case, too; just trying to tease apart priorities).

If so, how would these alternatives to caching or to async instantiation feel:

  1. Module author must require 4GB as min/max memory
  2. A variant of compile (async at least, maybe also sync) that produces an instance accepting only fast memory.

@kripken (Member)

kripken commented Oct 27, 2016

For the issue of the "4GB hack" and mismatches between memory using it and code expecting it, would it make sense for compilation to internally emit two versions of the code? (Obviously this would use more memory, which is sad, but hopefully compile time wouldn't be much worse, since the writer could generate both at once?)

@jfbastien (Member, Author)

@mtrofin I don't think it makes sense to ask for 4GiB if you don't intend to use it. The virtual allocation is separate from the use intent, so I think we'd need to separate both.

On 2.: it still isn't super helpful to the developer: if they use that variant and it fails, then what?

@kripken I don't think double compilation is a good idea.

@pizlonator (Contributor)

@kripken I think that's what we would do without any other resolution to this issue.

I want WebAssembly to be great in the case of casual browsing: you tell me about a cool thingy, send me the URL, I click it, and I amuse myself for a few minutes. That's what makes the web cool. But that means that many compiles will be of code that isn't cached, so compile time will play a big part in a user's battery life. So, double-compile makes me sad.

@pizlonator (Contributor)

@mtrofin

Module author must require 4GB as min/max memory

That's not really practical, since many devices don't have 4GB of physical memory. Also, that's hard to spec.

A variant of compile (async at least, maybe also sync) that produces an instance accepting only fast memory.

I don't think we want double compiles.

@lukewagner (Member)

lukewagner commented Oct 27, 2016

@pizlonator Thus far, we haven't considered designs which required different modes of codegen: we've just always allocated 4gb regions on 64-bit and observed this to succeed for many many thousands of memories on Linux, OSX and Windows. We have a conservative upper bound to prevent trivial total exhaustion of available address space which I expect will be sufficient to support the many-small-libraries use case. So I think the new constraint we're addressing here is that iOS has some virtual address space limitations which could reduce the number of 4gb allocations.

So one observation is that a large portion of the bounds-check elimination allowed by the 4gb hack can be achieved by just having a small guard region at the end of wasm memory. Our initial experiments show that basic analyses (nothing to do with loops, just eliminating checks on loads/stores w/ the same base pointer) can already eliminate roughly half of bounds checks. And probably this could get better. So the 4gb hack would be a more modest, and less necessary, speedup.

Another idea I had earlier would be to pessimistically compile code with bounds checks (using elimination based on the guard page) and then nop them out when instantiating with a fast-mode-memory. Combined, the overhead could be pretty small compared to idealized fast-mode code.

@pizlonator (Contributor)

@lukewagner

Thus far, we haven't considered designs which required different modes of codegen: we've just always allocated 4gb regions on 64-bit and observed this to succeed for many many thousands of memories on Linux, OSX and Windows. We have a conservative total number to prevent trivial total exhaustion of available address space which I expect will be sufficient to support the many-small-libraries use case. So I think the new constraint we're addressing here is that iOS has some virtual address space limitations which could reduce the number of 4gb allocations.

This isn't an iOS-specific problem. The issue is that if you allow a lot of such allocations then it poses a security risk because each such allocation reduces the efficacy of ASLR. So, I think that the VM should have the option of setting a very low limit for the number of 4GB spaces it allocates, but that implies that the fall-back path should not be too expensive (i.e. it shouldn't require recompile).

What limit do you have on the number of 4GB memories that you would allocate? What do you do when you hit this limit - give up entirely, or recompile on instantiation?

So one observation is that a large portion of the bounds-check elimination allowed by the 4gb hack can be achieved by just having a small guard region at the end of wasm memory. Our initial experiments show that basic analyses (nothing to do with loops, just eliminating checks on loads/stores w/ the same base pointer) can already eliminate roughly half of bounds checks. And probably this could get better. So the 4gb hack would be a more modest, and less necessary, speedup.

I agree that analysis allows us to eliminate more checks, but the 4GB hack is the way to go if you want peak perf. Everyone wants peak perf, and I think it would be great to make it possible to get peak perf without also causing security problems, resource problems, and unexpected recompiles.

Another idea I had earlier would be to pessimistically compile code with bounds checks (using elimination based on the guard page) and then nop them out when instantiating with a fast-mode-memory. Combined, the overhead could be pretty small compared to idealized fast-mode code.

Code that has bounds checks is best off pinning a register for the memory size and pinning a register for memory base.

Code that uses the 4GB hack only needs to pin a register for memory base.

So, this isn't a great solution.

Besides the annoyance of having to wrangle the spec and implementations, what are the downsides of combining compilation and instantiation into one promisified action?

@lukewagner (Member)

The issue is that if you allow a lot of such allocations then it poses a security risk because each such allocation reduces the efficacy of ASLR.

I'm not an expert on ASLR but, iiuc, even if we didn't have a conservative bound (that is, if we allowed you to keep allocating until mmap failed because the kernel hit its number-of-address-ranges max), only a small fraction of the entire 47-bit addressable space would be consumed, so code placement would continue to be highly random over this 47-bit space. IIUC, ASLR code placement isn't completely random either; just enough to make it hard to predict where anything will be.

What limit do you have on the number of 4GB memories that you would allocate? What do you do when you hit this limit - give up entirely, or recompile on instantiation?

Well, since it's from asm.js days, only 1000. Then the memory allocation just throws. Maybe we'll need to bump this, but even with many super-modularized apps (with many separate wasm modules each) sharing the same process, I can't imagine we'd need too much more. I think Memory is different than plain old ArrayBuffers in that apps won't naturally want to create thousands.

Besides the annoyance of having to wrangle the spec and implementations, what are the downsides of combining compilation and instantiation into one promisified action?

As I mentioned above, adding a Promise<Instance> eval(bytecode, importObj) API is fine, but now it places the developer in a tough spot because now they have to choose between a perf boost on some platforms vs. being able to cache their compiled code on all platforms. It seems we need a solution that integrates with caching and that's what I was brainstorming above with the explicit Cache API.

@lukewagner (Member)

New idea: what if we added an async version of new Instance, say WebAssembly.instantiate and, as with WebAssembly.compile, we say that everyone is supposed to use the async version? This is something I've been considering anyway since instantiation can take a few ms if patching is used. Then we say in the spec that the engine can do expensive work in either compile or instantiate (or neither, if an engine does lazy validation/compilation!).

That still leaves the question of what to do when a compiled Module is stored in IDB, but that's just a hard question when there are multiple codegen modes anyway. One idea is that Modules that are stored-to or retrieved-from IDB hold onto a handle to their IDB entry and add new compiled code to this entry. In that way, the IDB entry would lazily accumulate one or more compiled versions of its module and be able to provide whichever was needed during instantiation.

The IDB part is a bit more work, but that seems pretty close to ideal, performance-wise. WDYT?
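
In use, the proposed async pair might look like this (a sketch; the flat import object and 'main' export are illustrative):

WebAssembly.compile(buffer)                                     // async; may be expensive
  .then(module => WebAssembly.instantiate(module, { memory })) // also async
  .then(instance => instance.exports.main());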

@jfbastien (Member, Author)

I think adding async instantiate makes sense, but I'd also add a Memory parameter to compile. If you pass a different memory to instantiate then you can get a recompile; otherwise you've already "bound" the memory when compiling.
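
Hypothetically (compile does not actually take a Memory; this just illustrates the binding):

WebAssembly.compile(buffer, memory)                   // memory "bound" at compile time
  .then(module => new WebAssembly.Instance(module, { memory }));  // no recompile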

I haven't thought about the caching enough to have a fully-formed opinion yet.

@pizlonator (Contributor)

@lukewagner

I'm not an expert on ASLR but, iiuc, even if we didn't have a conservative bound (that is, if we allowed you to keep allocating until mmap failed because the kernel hit its number-of-address-ranges max), only small fraction of the entire 47-bit addressable space would be consumed so code placement would continue to be highly random over this 47-bit space. IIUC, ASLR code placement isn't completely random either; just enough to make it hard to predict where anything will be.

ASLR affects both code and data. The point is to make it more expensive for an attacker to weasel his way into a data structure without chasing a pointer to it. If the attacker can exhaust memory, he definitely has more leverage.

Well, since it's from asm.js days, only 1000. Then the memory allocation just throws. Maybe we'll need to bump this, but even with many super-modularized apps (with many seprate wasm modules each) sharing the same process, I can't imagine we'd need too much more. I think Memory is different than plain old ArrayBuffers in that apps won't naturally want to create thousands.

1000 seems like a sensible limit. I'll ask around with security folks.

As I mentioned above, adding a Promise<Instance> eval(bytecode, importObj) API is fine, but now it places the developer in a tough spot because now they have to choose between a perf boost on some platforms vs. being able to cache their compiled code on all platforms. It seems we need a solution that integrates with caching and that's what I was brainstorming above with the explicit Cache API.

Right. I can see a few ways that such an API could be made to work. A cheesy but practical API would be to overload eval:

  1. instancePromise = eval(bytecode, importObj)
  2. instancePromise = eval(module, importObj)

and then Instance has a getter:

module = instance.module

Where module is structured-cloneable.

What do you think of this?
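
Spelled out as a sketch (the eval name is the strawman above; saveToCache is a placeholder for whatever storage the app uses):

WebAssembly.eval(bytecode, importObj)                 // Promise<Instance>; compiles
  .then(instance => {
    saveToCache(instance.module);                     // Module is structured-cloneable
    instance.exports.main();                          // 'main' export assumed
  });

WebAssembly.eval(cachedModule, importObj)             // Promise<Instance>; reuses code
  .then(instance => instance.exports.main());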

New idea: what if we added an async version of new Instance, say WebAssembly.instantiate and, as with WebAssembly.compile, we say that everyone is supposed to use the async version? This is something I've been considering anyway since instantiation can take a few ms if patching is used. Then we say in the spec that the engine can do expensive work in either compile or instantiate (or neither, if an engine does lazy validation/compilation!).

That still leaves the question of what to do when a compiled Module is stored in IDB, but that's just a hard question when there are multiple codegen modes anyway. One idea is that Modules that are stored-to or retrieved-from IDB hold onto a handle to their IDB entry and they add new compiled code to this entry. In that way, the IDB entry would lazily accumulate one or more more compiled versions of its module and be able to provide whichever was needed during instantiation.

The IDB part is a bit more work, but that seems pretty close to ideal, performance-wise. WDYT?

Intriguing. Relative to my idea above:

Pro: yours is an easy to understand abstraction that is conceptually similar to what we say now.
Con: yours does not lead to as much synergy between what the user does and what the engine does as my proposal allows.

There are three areas where your proposal doesn't give the user as much control as mine:

  1. The expensive work could happen in one of two places, so the user has to plan for either of them being expensive. We will probably have web content that behaves badly if one of them is expensive, because it was tuned for cases where it happened to be cheap. My proposal has one place where expensive things happen, leading to more uniformity between implementations.
  2. There's no clearly guaranteed path for all versions of the compiled code to be cached. On the other hand, my use of threading the module through the API means that the VM can build up the module with more stuff each time, while still allowing the user to manage the cache. So, if the first time around we do 4GB then this is what we will cache, but if we fail to do 4GB the second time, we will be able to potentially cache both (if the user caches instance.module after every compile).
  3. Unusual corner cases in the browser or other issues could sometimes lead to a double compile in your scheme, because we'd compile one thing in the compile step but then realize we need another thing in the instantiation step. My version never requires a double compile.

So, I like mine better. That said, I think your proposal is a progression, so it definitely sounds good to me.

@titzer

titzer commented Oct 27, 2016

This issue rests on how often fragmentation makes allocation of fast memory fail (btw, you'll need 4GB + the maximum supported offset, so 8GB). If the probability is way less than 1%, then it might not be entirely unreasonable to have that be an OOM situation.

In the scenario where the user is browsing around the web and using lots of little WASM modules in quick succession, presumably they aren't all live at once. In that case, a small cache of reserved 4GB chunks would mitigate the issue.

Another possible strategy is to generate one version of the code with bounds checks, and if fast memory is available, just overwrite the bounds checks with nops. That's ugly, but it's a heck of a lot faster than a recompile, and takes less space than two compiles.


@jfbastien (Member, Author)

jfbastien commented Oct 27, 2016

It's not just ASLR: it's also pagetable / allocator / etc. pollution. We all need to talk to our security folks as well as our kernel / systems folks. Or we can be up-front with developers about the limits each engine imposes on "fast" Memory, and make it idiomatic in the API so it's hard to use it wrong.

There's all these crutches we can use, such as nop or double compilation, but why even have crutches?

@lukewagner (Member)

@jfbastien I don't think PROT_NONE memory costs page table entries; I think there is a separate data structure that holds mappings from which the page table is lazily populated.

@pizlonator I like that idea, and I can see that being what we encourage everyone to use by default in tutorials, toolchain, etc. It's also more succinct and easier to teach if you can simply ignore Module. This could also address @s3ththompson's concern about discouraging use of the sync APIs by making the nicest API the async one.

However, I think we shouldn't take away WebAssembly.compile and the Module constructor: I'm imagining scenarios where you have a "code server" (providing cross-origin code caching via IDB + postMessage; this has been specifically discussed with some users already) that wants to compile and cache code without having to "fake up" instantiation parameters. (There could also be some needless overhead (garbage, patching, etc) for an unnecessary instantiation.) And, for the same corner cases that want synchronous compilation (via new Module), we would need to keep new Instance.

So if agreed on that, then what this boils down to is a purely additive proposal of the two WebAssembly.eval overloads you mention. Yes?

One tweak, though: I think we shouldn't have a module getter since this would require the Instance to keep some internal data around (viz., bytecode) for the lifetime of the Instance; right now Module can usually be GCed immediately after instantiation. This would suggest either a data property (that the user can remove, although they will probably forget to), or maybe a third version of eval that returns an {instance, module} pair...
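
That third version might look like this (a sketch; cacheModule is a placeholder):

WebAssembly.eval(bytecode, importObj)
  .then(({ instance, module }) => {
    cacheModule(module);              // keep the Module only if you want to cache it
    instance.exports.main();          // 'main' export assumed
  });

(For what it's worth, this {instance, module} pair is essentially the shape that the final WebAssembly.instantiate(bytes, importObj) ended up resolving to.)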

@flagxor (Member)

flagxor commented Oct 27, 2016

Having an async one-step API for the typical monolithic app makes sense as the recommended pattern.

Agreed with @lukewagner that the all-sync (inline compile) case covered by new Module + new Instance is useful.
The background-compile (async) server with sync instantiate also seems useful.

Adding the two eval variants proposed seems an OK way to introduce this.

However, I don't like the name, because it will be conflated in (security) folks' minds with JS eval (which it resembles in one way, but not in terms of scope capture).
How about WebAssembly.instantiate?

@lukewagner (Member)

Hah, good point, eval does have a bit of a rep. +1 to WebAssembly.instantiate.

@mtrofin (Contributor)

mtrofin commented Oct 27, 2016

What would the guideline to the developer be wrt when to use the async instantiate?

@lukewagner (Member)

@mtrofin To use WebAssembly.instantiate by default unless they had some special code-sharing/loading scheme that required compiling Modules independently of any particular use.

@pizlonator (Contributor)

@lukewagner This seems reasonable.

Hah, good point, eval does have a bit of a rep. +1 to WebAssembly.instantiate.

Agreed.

So if agreed on that, then what this boils down to is a purely additive proposal of the two WebAssembly.eval overloads you mention. Yes?

That's what it sounds like.

I think we shouldn't have a module getter since this would require the Instance to keep some internal data around (viz., bytecode) for the lifetime of the Instance; right now Module can usually be GCed immediately after instantiation. This would suggest either a data property (that the user can remove, although they will probably forget to), or maybe a third version of eval that returns an {instance, module} pair...

Sure feels like a data property is better. Or having WebAssembly.instantiate always return an {instance, module} pair.

@mtrofin (Contributor)

mtrofin commented Oct 27, 2016

Is this correct: suppose you call WebAssembly.instantiate with the goal of getting a fast-memory module variant. You now get the module and structured-clone it. Now, this module is bound to being instantiated with Memorys supporting fast memory.

@lukewagner (Member)

@pizlonator Yeah, I can bikeshed it in my head multiple ways. I think I like returning the pair a little better since it'll probably lead to fewer people accidentally entraining an unused Module.

@mtrofin Recompilation can still be necessary when you pluck the Module off one instantiate call and instantiate with new imports; I think the point of this API addition is that it won't be the common case and it will only happen when it's fundamentally necessary (i.e., you have 1 module accessing two kinds of memories).

@jfbastien (Member, Author)

This thread is getting long. It looks like it's converging, but to be 100% sure we need to write up the code that we expect different users to write:

  1. Async instantiation of a single module.
  2. Async instantiation of a module, with memory sharing with other modules.
  3. Synchronous instantiation of a single module (I don't think synchronous multi-module is useful?).
  4. Caching for all of these (both putting into the cache, as well as retrieving and instantiating, with memory).
  5. Update of a single .wasm module, and cached loads of the other modules.

Anything else? It sounds like @lukewagner has ideas around imports which I don't fully grok.

@kripken (Member)

kripken commented Dec 22, 2016

Oh, I didn't understand that having the imports available before the wasm XHR even begins is key here. That's a lot trickier, then. So most of what I said before is wrong.

For us to have the imports, we need to have downloaded, parsed and run all the JS glue. If we do all that before the wasm XHR even begins then it's a very different loading scheme and set of tradeoffs than we have now. In particular, for small to medium projects, maybe this wouldn't even be a speedup; I guess we'd have to measure this, if we haven't already?

This would not be something simple for Unity to do. It would require significant changes to the code emcc emits, so that the JS glue can be run before the compiled code is ready.

Perhaps we'd want to consider a new model of emitted JS: one JS file for the imports, one JS file for the rest? And it would be opt-in, so we wouldn't break anyone. Anyhow, there's a lot to consider, and without measurements it's hard to guess what is optimal.

@lukewagner (Member)

lukewagner commented Dec 22, 2016

@kripken Not before the XHR but after it completes, and some time before we start running the script that wants synchronous access to the exports object. I'd expect it could be as simple as putting that exports-using code in some callback called when the instantiation resolves.

@kripken (Member)

kripken commented Dec 22, 2016

Hmm, I guess I still don't fully understand this then, sorry (in particular, I don't understand why the asyncness interacts with getting the imports - couldn't the benefit of having the imports also work synchronously?). But it seems like the issues I mentioned above are still relevant even if this is after the XHR completes - that is, we have a single script now that generates the imports and also receives the exports. Splitting that up may lead to tradeoffs - we'll just need to measure when we have time.

About putting the exports-using code in a callback, sadly it won't just work for the reasons mentioned earlier, but we could investigate as part of those measurements adding some new compilation mode.

@lukewagner (Member)

lukewagner commented Dec 22, 2016

The signature of instantiate is WebAssembly.instantiate(bytes, importObj), so to fire off the async compile+instantiate, you need to pass importObj.

@kripken (Member)

kripken commented Dec 28, 2016

Talking offline, @lukewagner impressed upon me that the main issue here is having the Memory while compiling the module. Overall then, it seems that there are three issues that relate to how easy it is to use this new API in practice:

  1. Providing the Memory when compiling.
  2. Providing all the imports when compiling (superset of the previous).
  3. Doing all this asynchronously.

Given how the toolchain currently works, doing 2 + 3 is hard, as described above, because it will break existing users. We can do it, but we may not want it on by default, at least not initially - we'd need to consult community members, etc. And to do it really well - with no new overhead - will require time and work (doing it quickly can be done by adding a layer of indirection on either the imports or exports).

But some other options are trivially easy to use:

  • 1 + 3 would require a new API like instantiate(bytes, Memory) -> Promise. The reason this is easy to use is that the Memory can be created early anyhow (while almost all the other imports are JS functions, which we can't have early on).
  • 2 by itself, without 3, i.e., new Instance(bytes, imports). That is, synchronous compilation on binary data + imports. The reason this is easy to use is that our current code does this: instance = new WebAssembly.Instance(new WebAssembly.Module(getBinary()), imports) so we would just fold that into 1 API call.

I think the last option makes sense. It basically means adding a sync version of the new API, making things more symmetrical with our existing synchronous compilation option. And I don't see a reason to tie the new optimization of knowing the Memory at compile time to asynchronous compilation? (async is great, but a separate issue?)
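
Concretely, that last option would fold the current two-step form into one call (hypothetical overload; today's sync Instance constructor requires an explicit Module):

// current two-step synchronous form:
var instance = new WebAssembly.Instance(new WebAssembly.Module(getBinary()), imports);
// proposed one-step form; same behavior, one API call:
var instance = new WebAssembly.Instance(getBinary(), imports);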

@ghost

ghost commented Dec 28, 2016

I suggest meta objects that describe the imports (the memory) to break the constraint that the imports need to be allocated before compiling. Then the instantiate method could throw an error if the supplied memory did not match what was expected, and would not be delayed by re-compiling.

It would also be very useful to be able to compile before allocating the linear memory on a memory constrained device, also to be able to optimize compilation for memory allocated at zero in the linear address space, and perhaps for memory with guard zones, and to optimize for memory with a fixed maximum size that can be a power of two plus a spill zone. This meta object could also be used by a wasm user decoder/re-writer to emit code optimized for the negotiated memory characteristics.

This is something that was suggested as necessary around the start of this project, many years ago now!

@lukewagner (Member)

@kripken While the sync APIs are, I believe, technically necessary, they definitely should be the not-recommended path or else we'll be introducing tons of unnecessary main-thread jank, a regression compared to asm.js + <script async>.

Having a compilation API that takes the bytes and Memory and resolves to a Module seems like a more confusing API in contrast with the current instantiate which is a simplification of what you'd otherwise write as two steps. Thus, it doesn't seem like what people would naturally use by default.

I think it totally makes sense to start with a new clean slate that is oriented toward the async APIs + caching. This could offer opportunities to fix some of the other inefficiencies we've noticed where load-time steps are executed sequentially when they could be done in parallel. It seems fine to have a compiler flag that emits output equivalent to the old sync asm.js style for the handful of users who depended on this precise output format.

@s3ththompson (Member)

s3ththompson commented Dec 28, 2016

I've been meaning to provide a better example for why we need to explicitly discourage (or, as I've argued in the past, even disallow) synchronous compilation/instantiation. Hopefully this provides additional fodder for discussion.

Let's zoom out and talk about user experience for a second.


Here's a comparison between loading the AngryBots demo asynchronously (via asm.js), left, and synchronously (via WebAssembly in V8), right.

[Animated comparison: asm.js loading asynchronously (left) vs. WebAssembly loading synchronously in V8 (right)]

Synchronous compilation provides a terrible user experience and breaks progress bars or any other visual indication of application loading.

Whatever the effort, I think we need to work together to make sure that Emscripten, Unity, and other tools export asynchronously loading WebAssembly by default. cc @jechter @juj Not only that, I think we need to make it prohibitively difficult for developers to fall into the trap of writing synchronous loading code.

We need to explicitly encourage WebAssembly.compile & WebAssembly.instantiate and discourage new WebAssembly.Module & new WebAssembly.Instance.
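
For concreteness, the two patterns side by side (a sketch; bytes and imports are assumed to be in scope):

// Encouraged: async, keeps the main thread responsive.
WebAssembly.compile(bytes)
  .then(module => WebAssembly.instantiate(module, imports))
  .then(instance => instance.exports.main());          // 'main' export assumed

// Discouraged: sync, janks the main thread for the whole compile + instantiate.
var instance = new WebAssembly.Instance(new WebAssembly.Module(bytes), imports);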


Let's dive a little deeper and see how bad that progress bar on the right is, for WebAssembly.

Here's a trace captured with the DevTools performance panel:

[Screenshot: DevTools performance trace of the full page load]

It takes 30 seconds on my MacBook Air to show any progress bar at all.

What takes that time? Let's zoom into the ~20s after the wasm module has downloaded when the main thread is totally locked up:

[Screenshot: zoomed trace of the main-thread work after the wasm download completes]

Compilation is taking ~20s and instantiation is taking ~2s.

Note that in this case, the Unity loader is already calling the asynchronous WebAssembly.compile if it's supported, so the 20s is because V8 still does a synchronous compile under the hood. cc @titzer @flagxor this is something V8 really needs to fix.

The 2s instantiation jank is due to the Unity loader code calling the synchronous new WebAssembly.Instance. This is what we really need to fix in both Unity code and Emscripten.


I think it's also worth mentioning that this is not merely a matter of a developer shooting themselves in the foot. The average web page includes dozens of third-party scripts: those scripts all have the potential to include top-document-janking WebAssembly.

(If you'd like to explore the trace in more detail you can view the full timeline here: https://chromedevtools.github.io/timeline-viewer/?loadTimelineFromURL=https://www.dropbox.com/s/noqjwql0pq6c1wy/wasm?dl=0)

@kripken (Member)

kripken commented Dec 29, 2016

I agree 100% that async is great :) But I'm saying it's orthogonal to the optimization of getting the Memory at the same time as compiling the Module. And that we can't trivially get async+that optimization - we can get that combo, but either at the cost of some overhead, or of time if we introduce a new flag for it and find a way to phase it in to be default.

So if we want that Memory/compilation optimization now, on by default, and at full efficiency, then tying it to being also async is a problem. But if it either isn't urgent, or we are ok with it not being on by default, or we are ok with some new overhead, then no problem.

@lukewagner (Member)

Since developers are opting into wasm, I think it makes sense to take the opportunity to change the top-level structure a bit with a new default (with a way to opt-in to the old behavior if necessary). People using sync <script> tags after a sync instantiation script seems like an antipattern that we should migrate to something async anyway.

@jfbastien (Member, Author)

@kripken I think you come to different conclusions than I, @lukewagner, and @s3ththompson do because, to you, the most important part is offering a seamless transition from asm.js for existing developers.

Is this correct?

I agree this is a valid concern. I weigh it lower because IMO with WebAssembly we're bringing way more developers in than asm.js had. I'd like to avoid burning early adopters, but IIUC existing developers are pretty flexible and want asynchronous compilation.

If we assume that they want asynchronous compilation, and are willing to refactor some to get it, then all that's left is having the memory at compilation time, which is what this API provides. It's highly desirable because it avoids a pitfall.

Earlier you expressed a concern at the amount of work involved on the Emscripten side to support this. Do you still think this is an issue?

@jechter

jechter commented Dec 29, 2016 via email

@kripken (Member)

kripken commented Dec 30, 2016

@jfbastien Maybe I don't understand what you mean by "conclusions". I am mostly just presenting options here. I don't have a conclusion myself.

If we want the Memory-at-Compile-time opt right now, I proposed a few options that can let us have it ASAP, trivially modifying our current sync compilation.

Or if we want that opt now in an async way, I proposed an option ("1 + 3", i.e., not receiving all the imports at compile time, just the Memory) that can let us have it ASAP as well.

Or if we want to focus now on an async version of that opt with the current API (not "1 + 3") then we can get that too, it will just take some careful planning because it would be a breaking change. I don't think it's an option for us here to decide to break existing users, so we would need to consult the community. Possibly there will be few concerns and we can just do it, in which case, how much effort depends on how much overhead we are willing to tolerate - if we cannot accept any overhead at all, then it might be a lot. Or possibly there will be concerns, which is my personal intuition - any breaking change must be done gradually - in that case it will take longer, we'd probably start with a new option and eventually plan to make it the default.

Again, no conclusions from me. All the above are options. It really depends what you care about more: Memory-at-Compile-time, or async, or both; and the issue of tolerating new overhead vs not; and whether you want this urgently or can wait, etc. etc. I'm happy to help on the emscripten side with any of the above, however people decide.

@jfbastien (Member, Author)

Or if we want to focus now on an async version of that opt with the current API (not "1 + 3") then we can get that too, it will just take some careful planning because it would be a breaking change. I don't think it's an option for us here to decide to break existing users, so we would need to consult the community. Possibly there will be few concerns and we can just do it, in which case, how much effort depends on how much overhead we are willing to tolerate - if we cannot accept any overhead at all, then it might be a lot. Or possibly there will be concerns, which is my personal intuition - any breaking change must be done gradually - in that case it will take longer, we'd probably start with a new option and eventually plan to make it the default.

To be sure I understand:

  1. What are the overheads?
  • Are these overheads inherent to Async + Memory-at-compile-time, or are they implementation constraints which can be fixed later? IMO if they're inherent to A+M then we should fix the API ASAP.
  • IIUC the overheads are a temporary thing on either imports or exports, and aren't inherent to A+M, is that correct? Could you detail this a bit more?
  2. Which users are we talking about breaking? asm.js users who want the same code to also target wasm?

Put another way: what's the ideal end-state for WebAssembly if we had infinite time and no "legacy"? I think we've designed for that ideal end-state, and you're pointing out hurdles on the way, but I'm still not sure that's the case. If so, then I'm not sure I have enough information to figure out if WebAssembly needs a stopgap to ease transition, or if temporary burden is acceptable for a subset of Emscripten users (and which users).

@kripken (Member)

kripken commented Dec 30, 2016

About the overheads in some of the options: As described above the tricky thing is we must send the imports and receive the exports synchronously currently, and that we don't have the JS imports until the JS is ready (but we do have the Memory!). One way to get around that is to add a layer of indirection on either the imports or exports. For example, we could provide thunks instead of the true imports very early (in the HTML, before the JS), so we appear to provide them synchronously (but it's just the thunks, which are trivial to create even before we have the JS). Then later when we have the JS, we can update the thunks to point to the right place. This would add overhead of another JS call and a JS object lookup for each import call. Also some overhead in code size and startup.
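
A minimal sketch of that thunk indirection (all names illustrative):

// Early, in the HTML, before the real JS glue has loaded:
var realImports = { js_log: null };                    // filled in later
var importObject = {
  env: { js_log: function (x) { return realImports.js_log(x); } },  // thunk
};
WebAssembly.instantiate(wasmBytes, importObject);      // can start right away

// Later, once the JS glue has run:
realImports.js_log = console.log;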

About breaking users: We want existing emscripten users to be able to flip a flag and get working wasm. Many users and partners are happy about that. It lets them compare asm.js and wasm and it lets projects they spent effort on porting benefit from wasm with no new porting effort. And it avoids the risk that if they also need to asyncify their startup code while porting to wasm, and something breaks, they might blame wasm unnecessarily.

@lukewagner (Member)

that we don't have the JS imports until the JS is ready (but we do have the Memory!)

That seems incidental and not something we should bake in forever with an API. Furthermore, for the reasons @s3ththompson explained, even ignoring the memory-at-compile-time issue, we need instantiation to be async anyway. So I agree with @jfbastien that we have our ideal end-state API here and that should not be what we compromise on here.

About breaking users: if we offer a simple migration path (like an "onload" event/promise), it should be simple for users to migrate, with the carrot being all the major perf wins. I don't think we have any evidence that exact asm.js source backcompat is a hard requirement for the default configuration; we have many examples of quite the opposite: users willing to do whatever to achieve the optimal perf since that is, of course, the whole point.

@kripken (Member)

kripken commented Dec 30, 2016

Talking to @lukewagner offline, we think the best thing is to start with enabling instantiate for wasm in the default recommended setting in emscripten, by adding an indirection layer (as described above, but on the exports). It will probably have negligible overhead for most use cases. So this gets us the benefits we all want here.

Eventually, we can get rid of that small overhead, but it will be a breaking change that will require a bunch more planning and mailing list discussion etc., as there isn't an obviously good/right way to do it, we realized. But since we think the overhead of the indirection layer is very low, this part isn't urgent.

@jfbastien (Member, Author)

@kripken great! It would be helpful to have a diagram showing different instances / JS / imports / exports / memory / tables. Do you want to move this discussion elsewhere (Emscripten / binaryen repo)? I have a mental picture of how I think C++ code should be organized, but it's pretty obvious by now that you don't have the same picture! You've got more experience there, so I'd love to learn from it and help however I can.

@kripken (Member)

kripken commented Dec 31, 2016

@jfbastien: sure. I'm not clear yet on what you're looking for in a diagram, but yeah, maybe another place is better. For the implementation of this in emscripten there is the issue mentioned before, emscripten-core/emscripten#4711, also feel free to open another if that doesn't cover what you want.

@jfbastien (Member, Author)

IIUC Emscripten uses this by default now. Closing.

@kgryte

kgryte commented Mar 11, 2017

To follow up on @s3ththompson's comment, note that synchronous compilation and instantiation are useful. Notably, I recently ran into a situation where I wanted to synchronously load and compile a WebAssembly module in Node.js (v7.7.2). If only promise-returning APIs were available, I would be unable to provide a synchronous export.

When deciding whether to provide either async, sync, or both APIs, remember that a browser context is not the only environment in which people want to use WebAssembly. A number of people, myself included, are interested in WebAssembly's potential as an efficient compile target combined with the Node.js runtime, akin to the JVM. While async import/exports may land in Node.js, synchronous import and export will remain the dominant pattern for the foreseeable future. In which case, the ability to load a WebAssembly module and synchronously compile and instantiate that module is highly useful.
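
For example, a synchronous Node.js loader along these lines (a sketch; the file path and empty import object are assumptions, and the module is assumed to need no imports):

var fs = require('fs');
var path = require('path');

var bytes = fs.readFileSync(path.join(__dirname, 'module.wasm'));
var wasmModule = new WebAssembly.Module(bytes);            // synchronous compile
var instance = new WebAssembly.Instance(wasmModule, {});   // synchronous instantiate

module.exports = instance.exports;                         // synchronous export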

@s3ththompson (Member)

@kgryte I should have clarified that my comment pertained primarily to the browser as an execution context. We've landed on an API surface that still exposes the synchronous APIs. Browsers may impose a size limit on modules passed to the synchronous APIs (Chrome already does, for example), but that limit is configurable by the embedder and should not need to apply to Node.

@kgryte

kgryte commented Mar 13, 2017

@s3ththompson Thanks for the clarification.
