What are Web Assembly threads? #104

Closed
jfbastien opened this issue Jun 2, 2015 · 44 comments

Comments

@jfbastien
Member

Are Web Assembly threads specified in terms of WebWorkers, or are they something different? If different, what are the differences? Or are WebWorkers just an implementation detail?

Maybe the polyfill is OK with WebWorkers, and Web Assembly does its own thing that's closer to pthreads.

We need to ensure that Web Assembly can work outside of a browser.

@jfbastien jfbastien added this to the Public Announcement milestone Jun 2, 2015
@jfbastien
Member Author

It sounds like there is a version of emscripten-fastcomp which uses WebWorkers to implement threads:
https://github.com/juj/emscripten-fastcomp/tree/pthreads

@sunfishcode
Member

As threads are currently listed as a post-v1 feature, do we need to sort this out before the public announcement?

@jfbastien jfbastien removed this from the Public Announcement milestone Jun 2, 2015
@jfbastien
Member Author

Agreed, moving to no milestone.

Related to this issue, in #103 I suggest we look at forward progress guarantees as they are being defined by the C++ standards committee.

@lukewagner
Member

This is an important question. In general, I've taken the stance that we should work hard to avoid duplicating other parts of the Web platform by eventually allowing direct Web API access from WebAssembly. That being said, threads are a core part of execution semantics, making them more like SIMD (obvious builtin) than WebGL (obvious external API).

During an initial discussion with @ncbray about workers vs. a pure-WebAssembly version of threads, I viewed the situation as either-or and sided with workers. More recently, though, I realized that the two can be quite complementary. First let me describe both independently before describing how they compose:

With worker-based threads, to create a thread, a wasm app would create a worker (in v.1, by calling out to JS; after WebIDL integration, by importing the Worker API and creating the worker directly). To share a wasm heap/global-state between workers, the wasm module object itself would be postMessage()ed directly to the destination worker, symmetric with how one shares a SharedArrayBuffer (imports of the WebAssembly module would be re-imported at the destination).

Pros:

  • symmetric with the obvious pthreads-to-asm.js+SAB mapping that Jukka has working and passing the pthread test suite.
  • any number of wasm modules can be shared between any number of workers allowing very expressive configurations in the not-whole-app-port use cases of wasm.
  • No real change for the web platform or Web APIs compared to JS+SAB.

Cons:

  • Workers are currently rather memory-fat (due to each including an independent JS execution context (JSRuntime/Isolate, etc)).
  • While it is definitely possible for a worker to avoid JS context creation if no JS is imported in the worker (once we have wasm+WebIDL integration), this won't be a simple impl task (due to accidental dependencies and the cross-cutting nature of workers). Also, there is the case of dynamic import of JS (in a previously JS-free worker) and non-JS overhead that might be harder to eradicate. Definitely interested to hear what other browsers have to say on this. But I worry threads wouldn't reliably slim down across all browsers for many years.

So, an alternative is pure-WebAssembly-defined threads. Basically, the spec would define how you create, destroy, join, etc. threads. Threads would be logically "inside" a module, and all threads would have access to the same globals, imports, etc. What happens when a non-main thread calls an import that (in a browser) is JS? (This whole idea is from @ncbray.) The calling thread would block and the call would be executed on the main thread of the module (the one on which the module was loaded, which has a well-defined Realm, etc.), as if by setTimeout(importedFun.bind(args to import call)).

Pros:

  • By construction, each WebAssembly-created thread could be nothing but an OS thread. No special work required to slim down which means OS-cost threads from day 1 in all browsers implementing the feature.
  • We may end up eventually wanting some operations on threads not wanted on workers (say, pthread_kill...).

Cons:

  • Increased latency compared to a synchronous call, and a serialization bottleneck on the main thread. A workaround here is that, when we get WebIDL bindings, we could define an opt-in mechanism (e.g., a new WebIDL method attribute) by which APIs declare that they can be called synchronously from wasm threads.

Not viewing these as mutually exclusive suggests an exciting hybrid:

  • A module can be shared with many workers (as described above) and, within each worker, can fork any number of pure-wasm threads contained by that worker+module.
    • The containing worker would be the "main thread" for the purpose of import calls on pure-wasm threads.
  • This allows a wasm application to precisely control how many event loops and JS contexts it creates.
    • For example, I'd expect a game to have one OffscreenCanvas worker (no pure-wasm threads, just rendering on the worker thread w/o interruption), one IDB worker (maybe a few pure-wasm threads), and one "all other background threads" worker (w/ all the other threads as pure-wasm threads).
    • In the limit, after all important Web APIs have been marked callable directly from pure-wasm threads, an ideal app would exclusively create pure-wasm threads.

From a Web platform POV, I think neither of these fundamentally change the model as long as we carefully define things like the state of the stack of script settings objects (again, though, analogous to setTimeout). Also, the pure-wasm threads are describable and polyfillable in terms of asm.js+SAB: the calling non-main thread uses shared memory to enqueue work for the main thread and then futexWait()s for a response.
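The polyfill strategy described in the previous paragraph can be sketched in C++, with std::packaged_task and std::condition_variable standing in for the shared-memory queue and futexWait()/futex wake. This is a minimal sketch under stated assumptions: MainThreadProxy, callOnMain and pumpOne are hypothetical names, an "import" is modeled as a closure returning int, and the main thread pumps its queue explicitly rather than through a browser event loop.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical proxy: non-main threads hand "import calls" to the main thread.
class MainThreadProxy {
 public:
  // Called from a non-main thread: enqueue the call, then block for its result.
  int callOnMain(std::function<int()> importFn) {
    std::packaged_task<int()> task(std::move(importFn));
    std::future<int> result = task.get_future();
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push_back(std::move(task));
    }
    cv_.notify_one();     // wake the main thread (the futex-wake role)
    return result.get();  // block until the main thread ran the call (the futexWait() role)
  }

  // Called on the main thread: run one pending call, waiting if none is queued.
  void pumpOne() {
    std::packaged_task<int()> task;
    {
      std::unique_lock<std::mutex> lock(mu_);
      cv_.wait(lock, [this] { return !queue_.empty(); });
      task = std::move(queue_.front());
      queue_.pop_front();
    }
    task();  // the "import" executes on the main thread and wakes the caller
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::packaged_task<int()>> queue_;
};

// Usage: a worker thread calls an "import" that must execute on the main thread.
int demoProxy() {
  MainThreadProxy proxy;
  int fromWorker = 0;
  std::thread worker([&] { fromWorker = proxy.callOnMain([] { return 42; }); });
  proxy.pumpOne();  // the main thread services exactly one proxied call
  worker.join();
  return fromWorker;
}
```

The caller stays blocked for the whole round trip, which is exactly the latency and main-thread serialization cost listed among the cons of pure-wasm threads.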

From a non-Web environment POV: only pure-wasm threads would exist according to the spec (though the spec would specify what happens when some "other" (read: worker) thread calls a function out of the blue). This situation would be symmetric with module imports where the spec only talks about the semantics of a single module, leaving what can be imported up to the host environment. In particular, this means that, with pure-wasm threads, you'd be able to easily write a threaded module w/o any host dependencies.

What I especially like is that the hybrid model avoids the requirement for all browsers to slim down their workers (which may take years and require cross-org collaboration). I am, however, quite interested to hear if other browser vendors think that this isn't a big deal.

@jfbastien
Member Author

We can also evolve the web platform in two ways:

  • Allow some web APIs such as GL to be used off the main thread, but still only from one thread.
  • This approach doesn't work for many filesystem use cases, though: applications expect to be able to read/write from multiple threads without doing implicit hops, and there's not really a great reason to forbid this. We can spec some APIs to work from multiple wasm threads (with some restrictions).

We also want to allow some form of lightweight user-mode threading. Other languages, such as Go, will perform better with this, and C++ itself will gain such features in C++17 or soon after. Let's make sure our approach makes this possible.
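Lightweight user-mode threading usually means multiplexing many cheap tasks onto a few OS threads (M:N scheduling). The C++ sketch below shows only the multiplexing half, assuming tasks run to completion; Go's goroutines can also suspend mid-task, which needs stack switching or coroutines that this sketch deliberately omits. TaskPool is a hypothetical name.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Simplified M:N sketch: many run-to-completion tasks share a few OS threads.
class TaskPool {
 public:
  void submit(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mu_);
    tasks_.push(std::move(task));
  }

  // Drain the queue using `numThreads` OS threads, then return.
  void runAll(unsigned numThreads) {
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < numThreads; ++i) {
      workers.emplace_back([this] {
        for (;;) {
          std::function<void()> task;
          {
            std::lock_guard<std::mutex> lock(mu_);
            if (tasks_.empty()) return;  // no more work for this OS thread
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();  // run outside the lock so other workers can proceed
        }
      });
    }
    for (auto& w : workers) w.join();
  }

 private:
  std::mutex mu_;
  std::queue<std::function<void()>> tasks_;
};

// Usage: 1000 lightweight tasks multiplexed onto 4 OS threads.
int demoPool() {
  TaskPool pool;
  std::atomic<int> done{0};
  for (int i = 0; i < 1000; ++i) pool.submit([&done] { done.fetch_add(1); });
  pool.runAll(4);
  return done.load();
}
```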

@ncbray

ncbray commented Jun 9, 2015

Since we talked I’ve been trying to unravel an unclear, hairy ball of interacting issues and unarticulated starting points (re: api surface + ffi + threads) and parcel them into smaller chunks we can talk through. Hopefully I’ll start posting some issues soon; consider this a hand-waving, ill-supported stub until then.

In general, I think we’re starting to see things from a similar perspective, but there are a lot of related issues that need to be worked through until the details for this issue click into place.

Having read through the Web Worker spec, it’s very JS-centric. It doesn’t buy you a whole lot unless a thread has an implicit (JS-style) event loop and a thread-local JS isolate. In that case, it may make sense to treat it as a worker. (But these kinds of threads may not exist, depending on other design choices.) I have some other concerns about how worker lifetimes are specified and the fact that the app “origin” could differ per thread, but I think those issues can be deferred for the moment.

I don’t believe we want workers to become “threads out of nowhere” by calling into whatever WASM code they please. What pthreads ID do they get? How does TLS work? Unless thread management is hermetic to WASM, there are going to be some tough questions to answer.

I do like WASM code interacting with arbitrary workers, however. Something along the line of message ports + bulk data copies, if nothing else? (Yes, this seems like a step back, I’ll try to justify it elsewhere.)

Re-importing the associated ES6 modules on postMessage sort of scares me. It solves a few nasty issues but also seems like a big hammer that will inevitably smash something else. I’ll need to think through the consequences.

Note: at least in Chrome, I know that many worker APIs are implemented by bouncing through the main thread. So explicitly bottlenecking through the main thread may not hurt performance much, in the short term?

Note: even with SAB, the only real way to enqueue a task on a thread with an implicit (JS-style) event loop is postMessage. Alternatively, we could create some sort of “event on futex wake” type functionality, but that might be better suited for an explicit (native-style, reentrant, pumped by the program) event loop.

Note: the implicit storage mutex seems like a deadlock waiting to happen, although I cannot find any APIs that actually acquire it in a worker...

@titzer

titzer commented Jun 9, 2015

On Tue, Jun 9, 2015 at 2:05 AM, Nick Bray notifications@github.com wrote:


> Having read through the Web Worker spec, it’s very JS-centric. It doesn’t buy you a whole lot unless a thread has an implicit (JS-style) event loop and a thread-local JS isolate. In that case, it may make sense to treat it as a worker. (But these kind of threads may not exist, depending on other design choices.) I have some other concerns about how worker lifetimes are specified and the fact that the app “origin” could differ per thread, but I think those issues can be deferred for the moment.

Event loops might make sense outside of JS, as well. E.g. there are a couple of models that might be interesting to wasm, such as promises, async IO events, etc. I'm not proposing anything concrete here, but maybe the wasm use case will prompt a generalization.

> I don’t believe we want workers to become “threads out of nowhere” by calling into whatever WASM code they please. What pthreads ID do they get? How does TLS work? Unless thread management is hermetic to WASM, there are going to be some tough questions to answer.

The question is how a worker would get into wasm code in the first place. It would have to get a reference to the wasm module somehow, presumably by having the module postMessage()'d to it first.

@lukewagner
Member

Agree with @titzer that we may want to support async-style programming by allowing wasm to directly participate in the event loop (which isn't even a stretch of the imagination). At a high level, I think we're going to see:

  1. Apps with a portable C++ POV that don't want any async: they want pure threads, and many of them, and they could confine themselves to pure-wasm threads (using workers only to avoid bottlenecks until all the important Web APIs are callable directly from pure-wasm threads).
  2. Apps with a web POV that are composed of JS modules, wasm modules, are based on popular web frameworks, use tons of asynchrony. This is where we definitely need the full power of shared-wasm-in-workers and good integration with the event loop, promises, etc.

and I think both use cases will matter a lot for the foreseeable future.

@jfbastien
Member Author

From a C++ perspective we could map the 2 POVs @lukewagner proposes into:

  1. Codebase that exports _start.
  2. Codebase that exports something like a select or epoll handler.

An interpreter (Lua, Python, ...) could fit nicely in 2, and implementations could mux JS event loop processing with processing of the wasm module's events, including some file-descriptor and pipe handling that JS typically can't do but a wasm module could. To be clear, I'm not saying we expose epoll as-is, but something close to it that still fits nicely in the web platform.
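A rough C++ illustration of the two program shapes, using hypothetical names throughout: runToCompletion plays the role of a _start-style entry point that owns its control flow, while Interpreter::onEvent is a handler the host calls once per ready event, so the host can interleave the module's events with its own event loop. hostLoop only simulates that host; it is not a proposed API.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Shape 1: a module with a _start-style entry point that owns its control flow
// and runs to completion (a real one might spawn and join threads here).
int runToCompletion() {
  int work = 0;
  for (int i = 1; i <= 10; ++i) work += i;
  return work;
}

// Shape 2: a module that exports an event handler; the host decides when to call it.
struct Event {
  int fd;            // hypothetical: which descriptor or pipe became ready
  std::string data;  // payload that arrived on it
};

struct Interpreter {
  std::size_t bytesSeen = 0;
  // Handle exactly one event, then return control to the host.
  void onEvent(const Event& e) { bytesSeen += e.data.size(); }
};

// Stand-in for the host: it interleaves the module's events with its own work.
std::size_t hostLoop(std::deque<Event> pending, Interpreter& interp) {
  while (!pending.empty()) {
    // ...the host could run its own (JS) tasks between module events here...
    interp.onEvent(pending.front());
    pending.pop_front();
  }
  return interp.bytesSeen;
}
```

The key difference is who owns the loop: in shape 2 the module never blocks, so the host can keep servicing its own event queue.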

@lukewagner
Member

IIUC, select/epoll would amount to synchronously blocking on a limited set of events of the event loop. That has long been proposed independently for workers in JS, and we should probably push that forward (it is a cross-cutting extension to the platform that likely has broad semantic consequences, not something we can just do locally in wasm), but I think if we want to integrate with the existing web platform, these are not the logical primitives to do it: we have to allow returning to the event loop which means having wasm register itself as the worker's onmessage handler (which could be expressed in a wasm-y way) and participating with the other forms of micro/nano/pico-task queues already defined in the HTML5 spec. That is, if you can have the browser call JS, you should be able to have it call wasm (which is vacuously possible in the MVP since you can just have the JS call the wasm, but we're theorizing a world where the worker (or even main thread) only contains wasm). (Exception: I'm fine leaving inline event handlers as JS-only forever :)

@trevnorris

If I may offer a few thoughts from the server-side of things (specifically node). There are three main use cases for threads (there are probably "proper" names for these, but you'll get the idea):

  • Computationally heavy native work. These have no JS stack. Values are passed, a callback is set and when the numbers have been crunched the callback receives the return value.
  • Immediately exiting thread. A single file is passed, along with optional arguments, which runs to completion, during which time return values can be messaged back to the parent thread. This one has two variations.
    • No I/O access thread. The spawned thread is essentially synchronous, but to further clamp down on the scope of work these threads can do they are not allowed to do any sync I/O operations either.
    • I/O thread. Allowed to require other modules and perform sync I/O.
  • Event loop thread. These are essentially just like a running process. Except in the way they communicate, and have the ability to share memory.
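The first use case (computation with no JS stack, with the result delivered to a callback) maps onto standard C++ threading roughly as follows. This is a sketch with hypothetical helper names offloadSum and whenReady; std::async stands in for whatever thread or pool the runtime provides.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <future>

// Crunch numbers on another thread; no JS (or any script) stack is involved.
std::future<std::uint64_t> offloadSum(std::uint64_t n) {
  return std::async(std::launch::async, [n] {
    std::uint64_t total = 0;
    for (std::uint64_t i = 1; i <= n; ++i) total += i;
    return total;
  });
}

// Deliver the result to a callback on the calling thread. This sketch simply
// blocks in get(); node would instead invoke the callback from its event loop.
void whenReady(std::future<std::uint64_t> result,
               const std::function<void(std::uint64_t)>& callback) {
  callback(result.get());
}
```

Usage: `whenReady(offloadSum(100), [](std::uint64_t v) { /* v == 5050 */ });`.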

These have variations in how developers want them implemented. For example: does the immediately exiting thread create a new JS stack every time, or reuse an existing one? Is the thread closed when the operation is complete, or reused? And is a thread, sync or async, joined, causing the main thread to halt until the spawned thread is complete?

Sorry if these were too many examples. What I'm getting at is that it seems like wasm's threading support will either not be low-level enough or not extensive enough to fit these needs (I don't expect it could be, concerning the first example), leaving server applications needing to use their own bindings. Is this correct?

@sunfishcode
Member

I don't have all the answers, but one thing I can comment on is that WebAssembly threads are going to have access to a pthreads-level API, so applications will have quite a lot of control. Decisions like when and how to use pthread_join are largely determined by the application (or by a library linked into the application).
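As a rough illustration of application-controlled joining with a pthreads-level API, here is a C++ sketch calling POSIX pthread_create and pthread_join directly. Slice, sumSlice and parallelSum are hypothetical names; the point is only that the application, not the platform, decides when to join.

```cpp
#include <pthread.h>
#include <cassert>

// Work description for one thread: sum data[begin, end).
struct Slice {
  const int* data;
  int begin;
  int end;
  long sum;
};

void* sumSlice(void* arg) {
  Slice* s = static_cast<Slice*>(arg);
  s->sum = 0;
  for (int i = s->begin; i < s->end; ++i) s->sum += s->data[i];
  return nullptr;
}

// Spawn one thread per slice, then join them all before combining results.
long parallelSum(const int* data, int n) {
  const int kThreads = 4;
  pthread_t threads[kThreads];
  Slice slices[kThreads];
  bool started[kThreads];
  for (int t = 0; t < kThreads; ++t) {
    slices[t] = Slice{data, n * t / kThreads, n * (t + 1) / kThreads, 0};
    started[t] = pthread_create(&threads[t], nullptr, sumSlice, &slices[t]) == 0;
    if (!started[t]) sumSlice(&slices[t]);  // fall back to running it inline
  }
  long total = 0;
  for (int t = 0; t < kThreads; ++t) {
    if (started[t]) pthread_join(threads[t], nullptr);  // block until thread t exits
    total += slices[t].sum;
  }
  return total;
}
```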

Also, WebAssembly itself is being designed to be independent of JS, and it will be possible to have WebAssembly threads with no JS code on the stack.

@trevnorris

Thank you. That's excellent to hear.

@sunfishcode
Member

What things here do we need to decide on for the MVP, and what things can wait until we actually introduce shared memory?

@lukewagner
Member

I don't see it directly blocking anything in the MVP. That being said, for something so central, it seems like we should have a pretty clear idea about the feature (and some experimental experience) before the MVP is actually finalized/released. I don't see "blocks_binary", though.

@jfbastien
Member Author

A few issues on WebWorkers detailed here: https://github.com/lars-t-hansen/ecmascript_sharedmem/issues/2

@lukewagner
Member

FWIW, both of those are impl issues and ones that we're expecting to address in FF.

@lars-t-hansen
Contributor

@lukewagner, they are not entirely implementation issues. The worker startup semantics are allowed by the worker spec. The limitation on the number of threads is a consequence of wanting to prevent DOS attacks, but it's very crude and something better, with observability (ie exception thrown on failure to create a worker), would be welcome by many applications, but probably requires a spec change too. Additionally, so far as I can tell there's no way to detect if a worker has gone away, which is a bizarre spec hole, but I've not found a mechanism for it.

@lukewagner
Member

@lars-t-hansen As for workers not starting before returning to the event loop, when you say "allowed by the spec", do you just mean by the spec not specifying when progress is made or does it specifically mention this case? As for the limitation on number of threads, you're right, what is needed (in addition to a higher quota) is some visible error to indicate the quota is exceeded.

@jfbastien
Member Author

I think a forward-progress guarantee is what we want here, in line with Torvald Riegel's N4439 paper to the C++ standards committee.

@lars-t-hansen
Contributor

@lukewagner, The service worker spec (https://slightlyoff.github.io/ServiceWorker/spec/service_worker/) provides minimal justification for the "kill a worker" behavior (indeed, lack of UI for a slow script dialog, see the section "Lifetime") but no actual guidance on how to avoid being gunned down. For a computational worker in a SAB context that license to kill is particularly troublesome, as the default mode for such a worker will be that it waits on a condition variable for more work, not that it returns to its event loop.

@lars-t-hansen
Contributor

Bug filed against the WHATWG spec here: https://www.w3.org/Bugs/Public/show_bug.cgi?id=29039.

@jfbastien
Member Author

@slightlyoff can probably chime in on web worker + service worker and the "license to kill".

@emanuelpalm

I guess running webasm in strictly single-process environments would be doable even if the spec would require the presence of a pthreads-level API? As parallelism, as opposed to concurrency, cannot be guaranteed, an implementation would be free to simply treat created "pthreads" as blocks to schedule arbitrarily?

Will the spec guarantee anything stricter than soft realtime?

@jfbastien
Member Author

@emanuelpalm agreed, a valid implementation could emulate a single-processor system.
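To illustrate single-processor emulation, here is a minimal C++ sketch that runs several "threads" on one OS thread, assuming each thread can be written as a resumable step function (a real implementation would save and restore stacks instead). Step and runSingleCore are hypothetical names; any interleaving would be a valid schedule, and this one is plain round-robin.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Each emulated thread is a step function: it does a slice of work and
// returns true while it still has more to do.
using Step = std::function<bool()>;

// Run all emulated threads on the current OS thread, round-robin.
void runSingleCore(std::deque<Step> runnable) {
  while (!runnable.empty()) {
    Step step = std::move(runnable.front());
    runnable.pop_front();
    if (step()) runnable.push_back(std::move(step));  // re-queue unfinished work
  }
}

// Usage: two "threads" with 3 and 2 slices of work, interleaved on one core.
std::string demoInterleave() {
  std::string trace;
  int a = 0, b = 0;
  runSingleCore({[&] { trace += 'A'; return ++a < 3; },
                 [&] { trace += 'B'; return ++b < 2; }});
  return trace;  // "ABABA"
}
```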

Hard realtime isn't something I think we can guarantee in general, because WebAssembly doesn't know a priori which machine it'll execute on. I see this limitation as similar to promising constant-time algorithms: the .wasm file doesn't know how it'll actually get lowered and exactly what guarantees it can expect from the JIT and machine it'll execute on.

@jbondc
Contributor

jbondc commented Dec 18, 2015

I'm really excited about this:
http://images.nvidia.com/events/sc15/SC5105-open-source-cuda-compiler.html
https://developer.nvidia.com/CUDA-LLVM-Compiler

Any thoughts if implementing a forall parallel loop would be feasible in Web Assembly?

What could the pipeline look like as a concrete way to experiment?
web assembly -> wasm binary -> chrome -> gpu

Seems more convenient to have something like:
web assembly -> binary (jit?) -> chrome -> gpu

@jfbastien
Member Author

@jbondc I think your thinking is closer to the C++ parallelism TS, which requires runtime support. It would be possible, but WebAssembly currently doesn't have ongoing work for GPUs. It's important, but not the primary focus for MVP. The work by Robert's team is pretty different from what WebAssembly would do (though it could benefit from that work).

@jbondc
Contributor

jbondc commented Dec 18, 2015

Yes this looks right:
https://github.com/cplusplus/parallelism-ts/blob/master/parallelism-ts.pdf

But I'm more interested in writing my own language that compiles down to Web Assembly (and has parallelism support).

@lukewagner
Member

I think the best thing wasm can and should do is to provide the raw hardware primitives for parallelism (threads and SIMD) within the general-purpose CPU model assumed by WebAssembly to the language/library/tool authors, leaving existing and future Web APIs to access hardware outside the general-purpose CPU (like WebGL for the GPU). Then the language/library/tool authors can build up abstractions that target a specific model of parallelism. (This is basically just an Extensible Web argument.) Based on this, I would think the libraries in the C++ parallelism TS would be provided as libraries on top of the abovementioned primitives of the web platform, not as builtin functions (at least, not in the near-term).
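To illustrate the layering argument: a forall-style loop can live in a library on top of plain threads rather than being a builtin. This C++ sketch assumes a hypothetical parallelFor that splits [0, n) into one contiguous chunk per thread; the algorithms in the C++ parallelism TS are considerably richer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Library-level "forall": call body(i) for every i in [0, n) across threads.
// body must be safe to call concurrently for distinct indices.
void parallelFor(std::size_t n, unsigned numThreads,
                 const std::function<void(std::size_t)>& body) {
  numThreads = std::max(1u, numThreads);
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < numThreads; ++t) {
    std::size_t begin = n * t / numThreads;
    std::size_t end = n * (t + 1) / numThreads;
    threads.emplace_back([begin, end, &body] {
      for (std::size_t i = begin; i < end; ++i) body(i);
    });
  }
  for (auto& th : threads) th.join();  // the loop is done only when all chunks are
}
```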

@jbondc
Contributor

jbondc commented Dec 20, 2015

@ghost

ghost commented Dec 20, 2015

@jbondc It's a higher level abstraction and a framework that I expected could be implemented in wasm with good performance. If someone tried porting this and found some show stoppers then this might suggest extra low level threading support to be considered for wasm?

My understanding of the reasons for avoiding such high-level abstractions is that they generally have niche uses, and they also have inherent assumptions and limitations that do not map well to the hardware, so there might not be enough general interest to warrant direct support. But perhaps someone can rally support for some other parallel computation models.

@jbondc
Contributor

jbondc commented Dec 21, 2015

It's not a higher-level abstraction. The "Actor Model" (https://en.wikipedia.org/wiki/Actor_model) is a different theoretical model for thinking about computation (vs. a finite Turing machine).

Based on hardware (e.g. GPUs by Nvidia), you can implement in a VM some message-passing actor-like thing. The way I see it, if Chrome, Chakra, Firefox or WebKit implemented pattern matching plus an actor-like thing, then we'd get C++ parallelism for free, memory sharing + threads, and pretty much any other concurrent model.

@jbondc
Contributor

jbondc commented Dec 21, 2015

Related, in case someone wants to hack something together:
https://github.com/solodon4/Mach7

@ghost

ghost commented Dec 22, 2015

@jbondc It's not SMP, and I doubt it could implement SMP code with any level of efficiency. Consider how it would implement atomically updated stacks, queues, caches, etc. in code written for shared memory. The planned wasm support should allow some of these 'Actor Model' programming languages to be implemented in wasm, and the atomic operations might be used to implement fast message queues and memory management to keep the actors apart, etc.
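As one example of atomics underpinning a fast message queue between two actors, here is a minimal single-producer/single-consumer ring buffer in C++. It is a sketch, not a wasm API: SpscQueue is a hypothetical name, each queue links exactly one sending thread to one receiving thread, and only acquire/release atomic loads and stores are needed (no locks, no compare-and-swap).

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// One slot is always left empty so "full" and "empty" can be told apart.
template <typename T, std::size_t Capacity>
class SpscQueue {
 public:
  bool tryPush(const T& value) {  // producer thread only
    std::size_t head = head_.load(std::memory_order_relaxed);
    std::size_t next = (head + 1) % Capacity;
    if (next == tail_.load(std::memory_order_acquire)) return false;  // full
    buffer_[head] = value;
    head_.store(next, std::memory_order_release);  // publish the filled slot
    return true;
  }

  bool tryPop(T& out) {  // consumer thread only
    std::size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
    out = buffer_[tail];
    tail_.store((tail + 1) % Capacity, std::memory_order_release);  // free the slot
    return true;
  }

 private:
  std::array<T, Capacity> buffer_{};
  std::atomic<std::size_t> head_{0};  // next slot the producer writes
  std::atomic<std::size_t> tail_{0};  // next slot the consumer reads
};

// Usage: one actor sends 1..1000, the other sums what it receives.
long demoSpsc() {
  SpscQueue<int, 64> q;
  std::thread producer([&q] {
    for (int i = 1; i <= 1000; ++i)
      while (!q.tryPush(i)) std::this_thread::yield();
  });
  long sum = 0;
  for (int received = 0; received < 1000;) {
    int v;
    if (q.tryPop(v)) { sum += v; ++received; }
    else std::this_thread::yield();
  }
  producer.join();
  return sum;
}
```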

@jbondc
Contributor

jbondc commented Dec 24, 2015

@JSStats I'm more trying to look at good building blocks. Better hardware is already here.

There's a good discussion about hardware here:
http://amturing.acm.org/acm_tcc_webcasts.cfm (Computer Architecture)

The threads / shared memory design is bad.
Hewitt's Actor Model or some variation of it is the way to go.

As a concrete example, this is good work:
http://2014.rtss.org/wp-content/uploads/2014/12/RTSS2014-sifakis-distr.pdf

Page 34. Threads bad.
Page 48. Good model.

Page 59-60. A ~'binary jit' could apply here: Distributed BIP generator.

I'll try some experiments in the coming months, hopefully others will join.

@sunfishcode
Member

Moving this to the Post-MVP milestone, so that the MVP milestone focuses on actionable tasks for the MVP. It's obviously important to plan for threads, but it's my impression that there's sufficient awareness already, so we don't need this issue to remind us. If people disagree, I'm fine moving it back.

@jbondc If you are planning to do some experiments, please open a new issue to report your findings. Thanks!

@jbondc
Contributor

jbondc commented Sep 23, 2016

@sunfishcode Will do, but I've largely given up.

For anyone interested in BIP:
http://www-verimag.imag.fr/New-BIP-tools.html

Is the thinking then with WASM threads, that code like this would compile into atomic load/store ops?

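#include <atomic>
#include <mutex>

class FastBlockableMutex {
 protected:
  std::atomic<bool> mIsLocked{false};  // atomic flag
  std::atomic<int> mWaiters{0};        // atomic counters
  std::atomic<int> mBlockers{0};
  std::mutex mMutex;                   // ordinary (blocking) mutex
};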

@jfbastien
Member Author

@jbondc it would need instructions to lower into ops ;-)
Yes, atomic accesses would as long as the size is lock-free. Mutex would likely go to an atomic and then some futex-like thing. I expect the model to be close to this.
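A rough sketch of what "an atomic and then some futex-like thing" might look like, written in portable C++: the uncontended path is a single compare-and-swap on one atomic word, and std::this_thread::yield() is a placeholder where a real implementation would block on a futex-like wait. AtomicMutex is a hypothetical name; this is not how any particular toolchain lowers std::mutex.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class AtomicMutex {
 public:
  void lock() {
    bool expected = false;
    // Fast path: one compare-and-swap when the lock is free.
    while (!locked_.compare_exchange_weak(expected, true,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed)) {
      expected = false;
      std::this_thread::yield();  // placeholder for a hypothetical futex_wait(&locked_, true)
    }
  }

  void unlock() {
    locked_.store(false, std::memory_order_release);
    // A futex-based version would also wake one waiter here.
  }

 private:
  std::atomic<bool> locked_{false};
};

// Usage: four threads increment a plain int under the lock.
int demoMutex() {
  AtomicMutex m;
  int counter = 0;
  std::vector<std::thread> threads;
  for (int t = 0; t < 4; ++t) {
    threads.emplace_back([&] {
      for (int i = 0; i < 10000; ++i) {
        m.lock();
        ++counter;
        m.unlock();
      }
    });
  }
  for (auto& th : threads) th.join();
  return counter;
}
```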

@malthe

malthe commented Nov 3, 2016

@jbondc isn't the idea to provide threads and compare-and-swap etc so that one can build a lock-free actor implementation on top of it?

@jfbastien
Member Author

@binji will address in threads proposal. Closing.
