
Typed continuations to model stacks #1359

Open
rossberg opened this issue Jul 29, 2020 · 68 comments
@rossberg (Member) commented Jul 29, 2020

Motivation

Wasm currently lacks any support for multiple stacks of execution and switching between them.
That prevents it from efficiently representing most control flow abstractions that appear in many modern programming languages, such as

  • lightweight (“green”) threads
  • coroutines
  • generators
  • async/await
  • effect handlers
  • call/cc

In the spirit of Wasm as a low-level engine, it is highly undesirable to add a multitude of ad-hoc features for each of these mechanisms individually.
Instead, it's preferable to identify a sufficiently general mechanism for managing execution stacks that allows expressing all of them.
More concretely, Wasm needs a way to represent, create, and switch between stacks.

Challenges

  1. In order to be able to express features like green threads and others, switching stacks must be possible at any stack depth, not just at the bottom of the stack (“deep coroutines”).

  2. In order to be able to express features like generators, it must be possible to pass values (back and forth!) when switching to a different stack.

  3. In order to be able to validate the usage of stacks, both stacks and the values being passed when switching must be properly typed.

  4. In order to inform the semantics of the interaction with other control flow, especially exceptions, it is desirable to have a semantic framework that can explain all such constructs.

  5. In order for multiple of the above features to be usable at the same time, they must be expressible in a composable fashion that allows arbitrary nesting.

  6. The mechanism must not require copying or moving stack frames, since some engines could not easily do that due to the heterogeneous nature of their stacks.

  7. It must not be possible to create cyclic references to stacks, since that would require GC.

Proposal

One known way to address the above challenges is to view stacks slightly more abstractly, through the notion of a delimited continuation.
Broadly speaking, that represents “the rest of” a computation (a continuation) “up to” a certain point (delimited).
Concretely, that certain point is the bottom of a specific stack, while the stack frames on the stack essentially describe the remaining tasks to be completed.

In particular, we can apply known techniques for typing such continuations, a problem that otherwise has no immediately obvious solution.
One specific way of typing continuations and the values communicated back and forth follows the approach of so-called effect handlers, a modern way of representing delimited continuations. Effect handlers are essentially exception handlers with resumption, where both throw and resume can carry arguments.

Like with asymmetric coroutines, switching between stacks then occurs through a pair of resume and suspend instructions.
The suspend instruction “throws” a value akin to an exception — we call it an event below — which the resume handles.
The event's tag, like a regular exception tag, determines the types of values passed from suspend to resume point, but also those passed back.
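As a rough analogy (an editor's sketch, not the proposed Wasm semantics), the suspend/resume protocol can be modeled with Python generators: yield plays the role of cont.suspend, send plays the role of cont.resume, and the yielded (tag, payload) pair plays the role of an event. All names here are illustrative only.

```python
# Editor's sketch: modeling suspend/resume with Python generators.

def child():
    # suspend with event tag "ask" and payload 21; the value sent back
    # by the handler becomes the result of the yield expression
    reply = yield ("ask", 21)
    return reply * 2               # final result of the continuation

def run(gen):
    event = gen.send(None)         # first resume: start the stack
    while True:
        tag, payload = event
        if tag == "ask":           # handler label for the "ask" event
            try:
                event = gen.send(payload + 1)   # resume with an answer
            except StopIteration as done:
                return done.value  # continuation terminated
```

Here run(child()) answers the single "ask" event with 22, so the continuation terminates with 44, illustrating how the tag determines the types flowing in both directions.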

Details

The first central addition is a new form of resumable exception, a.k.a. event.
Such an event is declared with a respective new form of exception definition
(in the binary format this is an exception definition with a certain flag bit):

(event $evt (param tp*) (result tr*))

Unlike a regular exception, events can have return values.

The other central addition is a new (reference) type of continuations:

(cont $ft)

where $ft is a type index denoting a function type [t1*] -> [t2*].
Intuitively, this describes a suspended stack that is resumable with values of types t1* and will ultimately terminate with values of types t2*.

A new stack is created with the instruction

cont.new : [(ref $ft)] -> [(cont $ft)]

which takes a function reference to run on the new stack.
Execution does not start until the continuation is first resumed.

The instruction

cont.resume (event $evt $handler)* : [t1* (cont $ft)] -> [t2*]

resumes such a continuation, by passing it the expected arguments t1*.
Once the computation terminates, it returns with t2*.
If the computation suspends before termination (see next instruction), with one of the event tags listed, then execution branches to the respective label $handler.
This happens without unwinding the stack.
The label receives the event arguments tp* and a continuation of type (cont $ft'), where $ft' is the function type [tr*] -> [t2*] of the next continuation.

Continuations are single-shot, that is, they cannot be resumed more than once.
An attempt to resume an already resumed continuation traps.
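The single-shot rule can be illustrated with a small plain-Python sketch (hypothetical, for intuition only): each continuation value is good for exactly one resume, and every suspension hands back a fresh continuation for the remainder of the stack.

```python
# Editor's sketch of the single-shot rule (not the Wasm encoding).

class Cont:
    def __init__(self, gen):
        self._gen = gen
        self._consumed = False

    def resume(self, value=None):
        if self._consumed:
            raise RuntimeError("trap: continuation already resumed")
        self._consumed = True
        try:
            event = self._gen.send(value)
            # a suspension hands back a *fresh* one-shot continuation
            return ("suspended", event, Cont(self._gen))
        except StopIteration as done:
            return ("done", done.value, None)

def task():
    yield "evt"        # suspend once
    return 3

k1 = Cont(task())
status, event, k2 = k1.resume()   # runs to the suspension point
```

Resuming k1 a second time raises the "trap", while the fresh k2 can be resumed once to run the task to completion.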

A computation running on its own stack (i.e., one initiated as a continuation) can suspend itself with the following instruction:

cont.suspend $evt : [tp*] -> [tr*]

where $evt is an event of respective type [tp*] -> [tr*].
Essentially, this passes control back to the parent stack.
That is, like throw, the instruction takes arguments tp* and transfers control to the innermost enclosing cont.resume, or rather, to its handler label for $evt.
But unlike regular exceptions, execution can also be resumed, which returns here with values of type tr*.

As described above, the innermost active handler matching the event tag receives the event's tp* and a new continuation.
Resuming this continuation (with cont.resume) will therefore pass back tr* to the originating cont.suspend.

Resumption can nest, i.e., a continuation may itself resume another continuation.
Consequently, an event may generally pass through multiple parent stacks before hitting a matching resume handler.
The same chain of stacks is maintained when resuming.

Note how the exception tag ties together the types of the values passed in both direction between a matching suspend and resume.
Different suspension points within the same computation can use different exception tags and thereby pass different types of values back and forth between parent and child stack.

Continuations are first-class values that are allowed to escape a handler.
That way, a stack can be kept live to be resumed at a later point.
The stack’s lifetime ends when the final resume terminates regularly (or with an exception, see below).
Managing the lifetime of continuations/stacks and evtrefs generally requires reference counting in the VM.
However, it’s impossible for them to form cycles, so general garbage collection is not needed.

Finally,

cont.throw $exn : [te* (cont $ft)] -> [t2*]

aborts a continuation by injecting the (regular) exception $exn with arguments of type te* at the suspension point.
This is a way to explicitly unwind the stack associated with the continuation.
In practice, a compiler should make sure that every continuation is used linearly, i.e., either resumed or aborted with a throw.
However, there is no simple way to enforce this in a type system, so Wasm validation won’t be able to check this constraint.
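Python's generator.throw gives a rough analogy (editor's sketch) for this behaviour: an exception injected at the suspension point runs cleanup code on the suspended stack as it unwinds.

```python
# Editor's sketch: generator.throw as an analogue of cont.throw.

log = []

def worker():
    try:
        yield "evt"              # suspension point
        log.append("resumed")    # skipped when aborted
    finally:
        log.append("unwound")    # cleanup runs while the stack unwinds

g = worker()
g.send(None)                     # run to the suspension point
try:
    g.throw(RuntimeError("abort"))   # cont.throw analogue
except RuntimeError:
    pass                             # exception propagates out
```

After the throw, log contains only "unwound": the suspended frame never resumes normally, but its cleanup code still ran.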

Example: A simple generator

To see how these mechanisms compose, consider encoding a generator enumerating i64 integers from 0 up until a certain condition is met.
The consumer can signal back this condition by returning a bool (i32) to the generator.
This can be expressed by the following event.

(event $enum-yield (param i64) (result i32)) 

Here is the actual generator:

(func $enum-until (param $b i32)
  (local $n i64)
  (local.set $n (i64.const -1))
  (br_if 0 (local.get $b))   ;; stop immediately if the first flag is set
  (loop $l
    (local.set $n (i64.add (local.get $n) (i64.const 1)))
    ;; yield $n; loop again while the consumer's answer is 0 ("continue")
    (br_if $l (i32.eqz (cont.suspend $enum-yield (local.get $n))))
  )
)

Here is one possible consumer, which runs the generator until a given max value is reached:

(func $run-upto (param $max i64)
  (local $n i64)
  (local $cont (cont (param i32)))
  (local.set $cont (cont.new (ref.func $enum-until)))
  (loop $l
    (block $h (result i64 (cont (param i32)))
      (cont.resume (event $enum-yield $h)
        (i64.ge_u (local.get $n) (local.get $max))  ;; "stop?" flag sent to the generator
        (local.get $cont)
      )
      (return)   ;; generator terminated normally
    )
    ;; on $enum-yield: the yielded i64 and the next continuation are on the stack
    (local.set $cont)
    (local.set $n)
    ;; ...process $n...
    (br $l)
  )
)

Note how the handler $h takes both the i64 argument produced by the generator and the next continuation.
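For intuition, the same protocol can be written in plain Python (a hypothetical translation, not generated code): the yielded i64 flows from the generator to the consumer, and the i32 stop flag flows back on each resume.

```python
# Editor's sketch of the $enum-until / $run-upto pair.

def enum_until(stop):
    n = -1
    if stop:                     # corresponds to the initial $b check
        return
    while True:
        n += 1
        stop = yield n           # cont.suspend $enum-yield
        if stop:
            return

def run_upto(maximum):
    seen = []
    gen = enum_until(False)      # cont.new; the argument plays $b
    try:
        n = gen.send(None)       # first cont.resume
        while True:
            seen.append(n)       # ...process $n...
            n = gen.send(n >= maximum)   # resume with the "stop?" flag
    except StopIteration:
        return seen
```

For example, run_upto(3) collects [0, 1, 2, 3]: the generator stops as soon as the consumer answers the yield of 3 with a true flag.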

Example: A simple thread scheduler

For a more interesting example, consider a simple scheduler for green threads.
We define two events with which a thread can signal the scheduler to either yield or spawn a new thread:

(event $yield)
(event $spawn (param (ref $proc)))

where

(type $proc (func))

We further assume that there is a global variable encoding a thread queue in the form of a list of continuations:

(global $queue (list-of (cont $proc)) ...)

The exact representation does not matter, so list-of is just pseudo code.
All we need is three obvious operations on this queue, ignoring their details:

(func $enqueue (param (cont $proc)) …)
(func $dequeue (result (cont $proc)) …)
(func $queue_empty (result i32) …)

Given these prerequisites, a simple scheduler loop can be implemented as follows:

(func $scheduler (param $main (ref $proc))
  (cont.new (local.get $main)) (call $enqueue)
  (loop $l
    (if (call $queue_empty) (then (return)))   ;; program is done
    (block $on_yield (result (cont $proc))
      (block $on_spawn (result (ref $proc) (cont $proc))
        (call $dequeue)
        (cont.resume (event $yield $on_yield) (event $spawn $on_spawn))
        (br $l)                                ;; thread terminated
      )
      ;; on $spawn, proc and cont on stack
      (call $enqueue)                          ;; continuation of old thread
      (cont.new) (call $enqueue)               ;; new thread
      (br $l)
    )
    ;; on $yield, cont on stack
    (call $enqueue)
    (br $l)
  )
)
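The scheduler's structure translates almost directly into plain Python (editor's sketch; generators stand in for continuations, a deque for $queue, and tuples for the $yield and $spawn events).

```python
# Editor's sketch of the green-thread scheduler.

from collections import deque

trace = []   # records execution order for illustration

def worker(name, steps):
    for i in range(steps):
        trace.append((name, i))
        yield ("yield",)          # cont.suspend $yield

def main():
    yield ("spawn", lambda: worker("a", 2))   # cont.suspend $spawn
    yield ("spawn", lambda: worker("b", 2))

def scheduler(entry):
    queue = deque([entry()])                  # enqueue the main thread
    while queue:                              # $queue_empty check
        thread = queue.popleft()              # $dequeue
        try:
            event = thread.send(None)         # cont.resume
        except StopIteration:
            continue                          # thread terminated
        if event[0] == "spawn":
            queue.append(thread)              # continuation of old thread
            queue.append(event[1]())          # new thread
        else:                                 # $yield
            queue.append(thread)

scheduler(main)
```

Running this interleaves the two workers round-robin, just as the wat scheduler would: a0, b0, a1, b1.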

TODO: More examples.

Interaction with regular exceptions

Regular exceptions and events are disjoint.
An exception handler will not catch events and vice versa.

If resuming a continuation throws a regular exception that is uncaught on its stack then it propagates through the active resume, unwinding the stack encapsulated in the continuation and ending its lifetime.

Yielding an event on the other hand does not trigger exception handlers between the suspend and the resume point.
They become active again when the continuation resumes and execution thereby switches back to their stack.

Note that this dichotomy could, in principle, be used to implement exceptions with two-phase unwind: the first phase is represented by an event, whose handler then injects back a proper exception to unwind. However, such an implementation would be fairly expensive, since every exception handler would have to run on its own stack.
It is still useful as a semantic explanation of what ought to happen, and how two-phase unwind would interact with single-phase exceptions.

@kripken (Member) commented Jul 29, 2020

Is this identical to what you presented in the past @rossberg ? (If not a summary of the changes might be useful.)

@rossberg (Member, Author)

@kripken, yes, up to some renaming and minor tweaks.

@kripken (Member) commented Jul 29, 2020

Got it, thanks @rossberg .

@RossTate

Is there another proposal on the way on how to support applications of stack inspection, e.g. two-phase exception handling, resumable exceptions, C#/Python-style generators, stack tracing, linear-memory garbage collection, stack duplication?

@taralx commented Jul 29, 2020

cont.throw doesn't take an evtref, so it's not possible to respond to an unknown suspension with an exception. Is that intentional?

The new function type $ft' : [tr*] -> [t2*] isn't a type that the module specifies, but one that is synthesized from other types. Doesn't that make for significantly more work at validation-time, as the synthesized type has to be checked for compatibility with a (later) specified function type?

@taralx commented Jul 29, 2020

@RossTate Doesn't this support those, if indirectly, since the exceptions are resumable?

@tlively (Member) commented Jul 29, 2020

@taralx I believe you're right that all of those features can be implemented in terms of this proposal. However, for many of those features, allocating an entire new stack would be overkill. Proposals for stack inspection, two-phase unwinding, etc. are essentially special cases of the feature proposed here that can be implemented more efficiently.

@RossTate

Yes, and on top of that, this proposal requires running code just to find out if an inspection site is relevant to the inspection at hand. So I'm wondering if interaction with (more efficient) stack inspection has been taken into account in this design.

@RossTate

Putting stack inspection aside, the lightweight-threads example seems to be a relatively inefficient implementation of lightweight threads. Ideally, switching between threads just involves a few instructions to switch contexts. But in this proposal, switching threads involves a stack walk to find the relevant br_on_evt, which is not necessarily cheap.

@rossberg (Member, Author)

@taralx:

cont.throw doesn't take an evtref, so it's not possible to respond to an unknown suspension with an exception. Is that intentional?

That's a good point. But you cannot actually get an evtref for an unknown suspension either, since each handler is guarded by a list of known events. The idea here was that a handler has no business interacting with an unknown suspension -- that would destroy composition. But maybe there are use cases for a handle-all.

The new function type $ft' : [tr*] -> [t2*] isn't a type that the module specifies, but one that is synthesized from other types. Doesn't that make for significantly more work at validation-time, as the synthesized type has to be checked for compatibility with a (later) specified function type?

I don't think so, since validation already has to be able to treat multiple definitions of the same function type as compatible. Function types are structural, even in the MVP.

FWIW, there is a similar question regarding func.bind and other such instructions. Currently, it takes a target type as immediate to side-step the issue, but it would actually be enough (and cheaper to check) if it just had a numeric immediate indicating the number of parameters to bind, synthesising the resulting function type.

So generally speaking, we have to decide whether we want to avoid synthesising, at the price of more type annotations and checking on instructions like these, or if we're fine with synthesising such types (which may require slightly more work in the spec primarily).

@RossTate Doesn't this support those, if indirectly, since the exceptions are resumable?

It does, and more importantly, it does so in a composable fashion -- an important point I forgot to include (added). I.e., you can use multiple of these features together at the same time, even if they are implemented separately.

@tlively is correct however that there is one special case where creating a new stack can be overkill, namely when the continuation does not escape the handler. In practice, though, handling some of these cases without a stack would require the introduction of either fragmented stack frames, display chains, or scope restrictions that essentially make it equivalent to just calling another function. It is worth investigating if adding those mechanisms is a relevant win over implementing them in user space.

@RossTate:

Putting stack inspection aside, the lightweight-threads example seems to be a relatively inefficient implementation of lightweight threads. Ideally, switching between threads just involves a few instructions to switch contexts. But in this proposal, switching threads involves a stack walk to find the relevant br_on_evt, which is not necessarily cheap.

That is not correct. Handlers are associated with resume points, not branches. Consequently, a handler is only present "at the bottom" of each non-suspended stack, and there is no need to walk an actual stack. Moreover, each resume declares a list of tags it handles. So transferring control to a handler can e.g. be implemented by an indirect jump through a (thread-local) side table associating each event tag with its active handler.
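The side-table strategy mentioned here can be sketched in plain Python (a hypothetical model, not part of the proposal): installing a handler at a resume point is a table update, and dispatching a suspend is a single lookup rather than a stack walk.

```python
# Editor's sketch of the tag -> active-handler side table.

handler_table = {}   # event tag -> stack of active handlers (innermost last)

def install(tags, handler):
    # on entering a cont.resume: register its handled tags
    for tag in tags:
        handler_table.setdefault(tag, []).append(handler)

def uninstall(tags):
    # on leaving the resume: restore the previous handlers
    for tag in tags:
        handler_table[tag].pop()

def dispatch(tag, payload):
    # constant-time transfer to the innermost matching handler
    return handler_table[tag][-1](payload)

install(["yield", "spawn"], lambda p: ("outer", p))
install(["yield"], lambda p: ("inner", p))
```

With both handlers installed, a "yield" dispatch reaches the inner one and a "spawn" dispatch the outer one; popping the inner entry restores the outer handler for "yield".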

@RossTate

It does, and more importantly, it does so in a composable fashion -- an important point I forgot to include (added). I.e., you can use multiple of these features together at the same time, even if they are implemented separately.

Yes, and the same is true for every other suggestion that has been made for directly supporting these features. Only the suggestion of implementing shadow stacks within WebAssembly without direct support is not composable.

however that there is one special case

Let's work through the first such "special case" that was discussed four months ago: linear-memory garbage collection. Runtimes and programs with their own (linear-)memory management generally have some way to inspect the stack to find and later update the GC roots on demand. These programs now want to compile to WebAssembly and want some way to inspect the stack on demand to find and update these roots (albeit in a safer manner per WebAssembly's security requirements).

Your proposal is to allocate a new stack at every point someone would want to perform such inspection. For these applications, nearly every stack frame has such a root and therefore such a need. So your proposal is essentially to have these applications allocate every stack frame on its own separate stack. @aardappel, as someone who has been particularly advocating for this feature, do you think this would serve your needs?

Now the second "special case" that was discussed was stack tracing. All the most-used garbage-collected languages have some means for collecting/examining the stack trace (sometimes even lazily using an interactive stack walk) during execution even in production mode (not just debug mode). Again, nearly every frame would need to provide stack-trace information, so again your proposal would have all these languages allocate every stack frame on its own separate stack.

Moreover, each resume declares a list of tags it handles.

Ah, missed this change. There's a bug though. cont.forward also needs to specify a (handler) label and a list of events.

Oh, and another (unrelated) bug: cont.throw needs to consider that the continuation might catch the thrown exception. So it has an output type, and it needs to specify a (handler) label and a list of events.

Consequently, a handler is only present "at the bottom" of each non-suspended stack, and there is no need to walk an actual stack. Moreover, each resume declares a list of tags it handles. So transferring control to a handler can e.g. be implemented by an indirect jump through a (thread-local) side table associating each event tag with its active handler.

This assumes a particular implementation strategy and infrastructure, essentially baking in decisions that would often be made at the application level (e.g. should this dynamically scoped variable be implemented using a stack walk or using thread-local storage). If we're going to assume there is always the infrastructure for thread-local storage, it might be a good idea to incorporate the feature directly into WebAssembly, since it's useful for a lot more than just continuations. Regardless, this means that in the examples above every call and return (in the surface language) would have to update thread-local storage. In general, every suspend/resume, forward, and throw will have to update thread-local storage, which common implementations of lightweight threads do not need.

@RossTate commented Jul 30, 2020

When you presented this proposal six months ago, it was really great. You provided a lot of great ideas and garnered a lot of interest for an unconventional yet useful feature. That sparked a lot of productive conversations within the CG. Those conversations explored the design space, identifying weaknesses in the specifics you presented and then searching for alterations and extensions that would address those weaknesses. That is exactly what the effect of a good presentation should be. But this design has barely changed to incorporate those many related conversations within the group over the months since.

@aardappel

This design seems especially well suited to implement first-class stack-full asymmetric co-routines, as it maps relatively 1:1 onto their implementation needs.

One place where it is more general than such a co-routine feature is the use of event tags, which essentially allow one to yield not just from the current co-routine, but from any of the parents, presumably suspending a chain of co-routines in the process. It is not quite clear to me when this is useful, and I could imagine it having an implementation cost. I guess this gives us the flexibility of continuations where each stack frame has its own lifetime.

The machinery also seems quite heavy-weight for typical uses of "generators", which tend to function more like fancy for-loops than anything else. There would have to be some kind of guarantee that it is easy to collapse this functionality (no dynamic allocations, re-use of the same stack etc etc) for this to be feasible. Leaving this up to the cleverness of the engine makes it useless.

That holds even more if this is to be useful for stack inspection like @RossTate mentions. Even if we can specify that in simple cases these instructions translate to something close to no-ops, it seems like it would generate much bigger binaries than the simpler stack inspection proposal. To me, the whole point of having something like a stack inspection feature is that it is more efficient than a shadow stack, so if there's even a small chance that it isn't (e.g. requires a dynamic allocation) then it will not be used.

So as it stands, this feature seems limited in its applicability, unless its generality can be subdivided into semantically more restrictive uses with different implementation requirements. That is counter-intuitive, maybe: the more general a feature is, the more things it can represent. But potentially the fewer features it can represent efficiently, which is what matters the most for Wasm, I feel.

@RossTate

Another consideration is how to inspect (efficiency aside) the delimited continuations. For example, a garbage-collecting lightweight-thread manager needs to collect (and later update) the roots across all the lightweight threads. Given a particular cont representing a yielded lightweight thread, it's not clear how the thread manager can reasonably collect the cont's roots using this approach.

@lukewagner (Member)

I think, for fundamental performance/implementation reasons, we should expect that there are two features we need:

  1. the ability for wasm to cheaply create, suspend and resume a lightweight stack (green thread)
  2. the ability for wasm to iterate over and inspect (and possibly mutate) opt-in locals of frames on the stack without having had to significantly change the representation of the stack

I think this proposal speaks to 1, whereas stack iteration proposals speak to 2. They should obviously be designed to compose, but as far as I can tell from a perf/impl POV, I don't think it's practical to assume that one proposal is going to satisfy all our use cases and achieve all of our goals here. I also think the relative priorities of these features are different: there are several high-priority use cases for 1, whereas at least one wasm engine has specifically voiced an intention to delay 2 until we have wasm GC in place.

@RossTate

Yeah, a common thread across feedback we got for #1340 was to prioritize first-class stacks over stack inspection. My presentation last week was to gauge whether stack inspection should still be a feature in the pipeline, and based on online and offline feedback the answer seems to be likely yes (provided it comes out after GC). This is important to know for designing first-class stacks because, just as stack inspection can be combined with unwinding to support two-phase exception handling, stack inspection can be combined with (restricted) detaching or redirecting to support (one-shot) delimited continuations. So if you want a "unifying view on stack-related mechanisms", you need to take stack inspection into account, which this proposal has not.

@rossberg (Member, Author)

@RossTate:

It does, and more importantly, it does so in a composable fashion -- an important point I forgot to include (added). I.e., you can use multiple of these features together at the same time, even if they are implemented separately.

Yes, and the same is true for every other suggestion that has been made for directly supporting these features.

No, it is not. It is one of the main innovations of effect handlers, and the reason PL folks got excited about them: they are a compositional mechanism for defining effects, including control effects.

Your proposal is to allocate a new stack at every point someone would want to perform such inspection.

I have not been proposing that. The only thing I have mentioned is that effect handlers can express this form of stack inspection (and they immediately explain how it would interact with other control features). But obviously, they would only be suitable as an actual strategy if we were able to reliably optimise the special case I mentioned, which may or may not be possible in this setup. If not, there needs to be a more explicit alternative that can be.

There's a bug though. cont.forward also needs to specify a (handler) label and a list of events.

Why? I don't think it does.

Oh, and another (unrelated) bug: cont.throw needs to consider that the continuation might catch the thrown exception. So it has an output type

Ah thanks, you're right, I mixed up signatures. Fixed.

and it needs to specify a (handler) label and a list of events.

We could do that. But it's not clear whether it would be all that useful. The idea is that cont.throw is used to "destroy" a stack, when it is not supposed to be completed anymore. At least with the use cases listed, it shouldn't throw back to the handler anymore in that situation. But maybe there are some odd use cases? This is one of the details that needs figuring out.

This assumes a particular implementation strategy and infrastructure, essentially baking in decisions that would often be made at the application level

It doesn't assume, it just points out one possible implementation strategy. It may not even be the best one, as fast implementations have used other strategies.

When you presented this proposal six months ago, it was really great. You provided a lot of great ideas and garnered a lot of interest for an unconventional yet useful feature. That sparked a lot of productive conversations within the CG. Those conversations explored the design space, identifying weaknesses in the specifics you presented and then searching for alterations and extensions that would address those weaknesses. That is exactly what the effect of a good presentation should be. But this design has barely changed to incorporate those many related conversations within the group over the months since.

I honestly don't know what you mean. AFAICT, there hasn't been much discussion of this proposal. There have been various discussions about various ideas, and occasionally I pointed out the connection. Correct me if I'm wrong, but none of these recent ideas has been presented as a substitute for this one, so its utility still stands. Whether it is sufficient to also subsume some of the others (efficiently) is a separate discussion to be had.

So if you want a "unifying view on stack-related mechanisms", you need to take stack inspection into account, which this proposal has not.

Yes, well, the nice thing about this proposal is that it can express it, even if not efficiently, but thereby at least suggesting a canonical semantics for integrating such a feature.

@aardappel:

This design seems especially well suited to implement first class stack-full asymmetric co-routines, as it maps relative 1:1 on its implementation needs.

Ah, it may look like it at first glance, but it's not at all specific to or optimised for co-routines. Its heritage is (algebraic) effect handlers, which are a principled way of expressing and composing almost any sort of state or control effect.

One place where it is more general than such a co-routine feature is the use of event tags, which essentially allow one to yield not just from the current co-routine, but from any of the parents, presumably suspending a chain of co-routines in the process. It is not quite clear to me when this is useful, and I could imagine it having an implementation cost. I guess this gives us the flexibility of continuations where each stack frame has its own lifetime.

Ah, the ability to nest resumption is what makes this mechanism composable. It is what allows each control abstraction to be implemented separately, e.g., you can run an async function in a green thread and still yield to the scheduler without worrying about intermediate handlers. That significantly simplifies implementations, as well as enabling the use of control abstractions in a heterogeneous setting with multiple interacting languages.

The machinery also seems quite heavy-weight for typical uses of "generators", which tend to function more like fancy for-loops than anything else. There would have to be some kind of guarantee that it is easy to collapse this functionality (no dynamic allocations, re-use of the same stack etc etc) for this to be feasible. Leaving this up to the cleverness of the engine makes it useless.

It is much less heavyweight than you might think. We have reason to believe that implementing generators this way can be just as or even more efficient on average than the common approach of constructing a state machine, which often involves additional local control flow to reenter the previous state and requires copying live locals to the heap at every yield, or allocating them there in the first place. Plus, it immediately allows deep generators (yielding from a nested call) without any additional overhead.

@slindley

The design outlined above is similar to that of effect handlers in Multicore OCaml. In practice, the overhead of implementing a range of abstractions from generators to lightweight threads to async/await on top of effect handlers in Multicore OCaml is close to zero.

@kayceesrk

Another consideration is how to inspect (efficiency aside) the delimited continuations. For example, a garbage-collecting lightweight-thread manager needs to collect (and later update) the roots across all the lightweight threads. Given a particular cont representing a yielded lightweight thread, it's not clear how the thread manager can reasonably collect the cont's roots using this approach.

There is a standard way of implementing this, which Multicore OCaml uses, as do Go, GHC Haskell, Fibers in OpenJDK, and other languages with lightweight threads. Essentially, the continuations (blocked and ready-to-run goroutines in Go, blocked lightweight threads in GHC) become a distinguished object in the GC, and the GC knows how to scan the stack associated with the continuation. The suspended continuations are marked as regular objects would be. For any suspended continuation reachable from the GC roots, the GC scans the stack associated with the continuation in the same way it would scan the current program stack.

@RossTate
Copy link

@kayceesrk That comment was meant to be in the context of modules that implement their own garbage collector in linear memory, though it's my fault for not making that clear. So my point was that a module-implemented garbage collector would not be able to perform the scans you describe.

@RossTate
Copy link

It is one of the main innovations of effect handlers, and the reason why PL folks got excited about them, that they are a compositional mechanism for defining effects, including control effects.

Effect handlers are a surface-level language feature. Depending on the specifics of the handlers, they can be compiled in a variety of ways. While continuations are always a viable option for any handler, they are generally by far the least efficient option. Even the body of work you are basing this on has significant discussion of analyses and optimizations for identifying common special cases and eliminating their overhead, but many of those optimized forms are not expressible in this proposal. So what's problematic about this proposal is that it's based on high-level concepts but does not provide the corresponding low-level primitives that those concepts are often implemented with.

Correct me if I'm wrong, but none of these recent ideas has been presented as a substitute for this one, so its utility still stands. Whether it is sufficient to also subsume some of the others (efficiently) is a separate discussion to be had.

WebAssembly/exception-handling#105 was written up to describe these low-level primitives. You were explicitly informed here that it could express algebraic effects, but you never engaged in that conversation. #1340 was a Phase 0 proposal to explore those ideas. But in that conversation concerns were raised about security, prioritization, efficiency, and complexity. The presentation I gave last week was following up on these concerns regarding stack inspection, which we needed to gauge in order to determine how best to address the concerns regarding first-class stacks.

The only thing I have mentioned is that effect handlers can express this form of stack inspection (and they immediately explain how that would interact with other control features). But obviously, they would only be suitable as an actual strategy if we were able to reliably optimise the special case I mentioned -- which may or may not be the case in this setup. If not, there needs to be a more explicit alternative that can be.

This more explicit alternative has already been discussed a fair amount. It is stack inspection, e.g. #1340 and #1356. Notice that your cont.suspend instruction has the same type as call_stack in #1356. The difference is that call_stack does not require an answer to perform stack allocation just to find out basic information or make basic updates to the current stack. But if you combine it with stack-allocation and stack-redirection primitives, an answer can choose to use continuations if it needs to in order to support the surface-level feature at hand, as we show in #1360. That is, cont.suspend breaks down into a bunch of smaller primitive operations and is not a primitive operation itself.

So this proposal is two iterations behind others that were heavily inspired by it, but which have since broken its concepts down into smaller parts that can be combined more flexibly, and which have already developed answers to many of the open questions in this proposal.

@aheejin
Copy link
Member

aheejin commented Aug 1, 2020

Will this proposal be compatible with the changes I proposed in WebAssembly/exception-handling#123, namely, removal of exnref, introduction of catch_br, and split of catch and unwind?

@RossTate
Copy link

RossTate commented Aug 1, 2020

@aheejin From what I can tell, it should be compatible with those changes.

@RossTate
Copy link

RossTate commented Aug 4, 2020

soil-initiative/stacks#10 now has two translations of this proposal in terms of #1360, one using stack inspection and one using thread-local storage (assuming such a thing exists in the host and is made accessible to WebAssembly). Each implementation strategy has its strengths and weaknesses, and different effects might be better expressed with one or with the other (the translations compose so that different effects can in fact make different choices even within the same application). One of the problems with the proposal here is that it does not let the application pick its own implementation strategy; instead it bakes in a high-level feature and leaves it to the engine to decide/guess which strategy to use, much like how the JVM bakes in interface-method dispatch.

@KronicDeth
Copy link

The green threads scheduler example shown by @rossberg at the 2020-08-04 meeting appears, with the addition of exception handling to catch exceptions from cont.resume, to be sufficient to implement a scheduler for Lumen (Erlang compiler/runtime).

On x86_64, where we have native stack switching, we add yield points (here encoded as calls to @__lumen_builtin_yield()) when each Erlang process (green thread) has used up its time slice (encoded as @CURRENT_REDUCTION_COUNT reaching a static limit).

init.erl

-module(init).
-export([start/0]).
-import(erlang, [display/1]).

start() ->
  display(atom).

init.llvm.mlir

module @init {
  llvm.func @"erlang:display/1"(!llvm.i64) -> !llvm.i64
  llvm.func @__lumen_builtin_yield()
  llvm.mlir.global external local_exec @CURRENT_REDUCTION_COUNT() : !llvm.i32
  llvm.func @lumen_eh_personality(...) -> !llvm.i32
  llvm.func @"init:start/0"() -> !llvm.i64 attributes {personality = @lumen_eh_personality} {
    %0 = llvm.mlir.constant(20 : i32) : !llvm.i32
    %1 = llvm.mlir.addressof @CURRENT_REDUCTION_COUNT : !llvm<"i32*">
    %2 = llvm.load %1 : !llvm<"i32*">
    %3 = llvm.icmp "uge" %2, %0 : !llvm.i32
    llvm.cond_br %3, ^bb1, ^bb2
  ^bb1:	// pred: ^bb0
    llvm.call @__lumen_builtin_yield() : () -> ()
    llvm.br ^bb2
  ^bb2:	// 2 preds: ^bb0, ^bb1
    %4 = llvm.mlir.constant(562949953421378 : i64) : !llvm.i64
    %5 = llvm.mlir.addressof @CURRENT_REDUCTION_COUNT : !llvm<"i32*">
    %6 = llvm.mlir.constant(1 : i32) : !llvm.i32
    %7 = llvm.atomicrmw add %5, %6 monotonic  : !llvm.i32
    %8 = llvm.call @"erlang:display/1"(%4) {tail} : (!llvm.i64) -> !llvm.i64
    llvm.return %8 : !llvm.i64
  }
}

On the BEAM for Erlang, the schedulers can do work stealing. How continuations are stored was hand-waved in the presentation. @rossberg will the continuations be stealable/transferable between Web Workers to allow for a scheduler per Web Worker (and one on the main UI thread) to work steal?

@rossberg
Copy link
Member Author

rossberg commented Aug 4, 2020

Updated the OP to match the version I presented today, removing the separate evtref type and br_on_evt instruction. Also removed the cont.forward instruction, since it may not be needed and its semantics seemed confusing.

@RossTate
Copy link

RossTate commented Aug 4, 2020

@kayceesrk Thanks for the talk! It was very clear and well presented.

In your talk, you showed some assembly code on how you implemented algebraic effects in Multicore OCaml. I feel like a goal of a WebAssembly design for stacks should be to enable WebAssembly programs to essentially write those assembly-level operations themselves, provided the functionality can be provided in a way that is similarly composable. Based on your experience with Multicore OCaml, what would be the key primitives for enabling that?

In the previous CG meeting it was suggested that these continuations would be used for effects that would have a handler in roughly every stack frame. The implementation strategy you presented involved a linear walk through effect handlers, which suggests that having frequent effect handlers would affect performance. Have you run experiments in the context of such effects? If not and you would like a concrete example to try, make a pass through the source code so that every function call has a handler indicating what file and line number the function call is on (or was on in the original source code) so that effect handlers can be used to implement stack tracing (turning off any optimizations you have to eliminate continuation allocations). I'd be interested to hear what the relative performance of the original and the modified programs are.

@RossTate
Copy link

RossTate commented Aug 4, 2020

@rossberg Those were much appreciated improvements!

@rossberg
Copy link
Member Author

@RossTate, the proposed mechanism is more low-level than conventional effect handlers. In particular, stack creation is separated from installing a handler. That way, we believe it can encode both shallow and deep handlers efficiently, unlike either of the other two.

@RossTate
Copy link

The straightforward low-level implementation of shallow handlers is to allocate the stack for the handlee/child without the handler code on the stack and then to specify the handling code in the current/parent stack. That's exactly what the current proposal does. To encode deep handlers with shallow handlers, you make sure to always run the continuation with the relevant handler around it, which again is exactly how the current proposal indirectly supports deep handlers. The current proposal isn't particularly lower-level; it has simply baked in shallow handlers specifically, but it happens to do so in WebAssembly.

On the other hand, direct support for deep handlers would permit the handling code to be part of (the root of) the allocated stack.

@rossberg
Copy link
Member Author

My understanding regarding high-level effect handlers is the following (@slindley, @dhil and @kayceesrk, please correct me if I'm wrong). See Daniel & Sam's paper for more details.

  • In their high-level form, shallow handlers and deep handlers have different performance trade-offs and neither approach beats the other -- deep is a fold, shallow is a case. Either one fares better on a certain set of abstractions.

  • Expressing deep handlers in terms of shallow handlers is fairly simple: you merely need to manually reenter the handler in a loop. The reason that can be expensive is mainly because it will naively create an additional stack for every resumption. -- But that is not the case in the low-level mechanism we propose, because we make stack creation a separate operation, such that the same stack can be reused. The only cost is for switching the stack & handler, which can be cheap, as @kayceesrk's numbers showed.

  • Expressing shallow handlers in terms of deep handlers is more involved, because you have to express a case in terms of a fold. So you have to jump through some hoops to break out of the inherent recursion. That would remain somewhat clumsy and expensive, even in a low-level mechanism like the one proposed.

  • With regard to the lower-level mechanism we propose, the difference between the two approaches only materialises in the fact that resume can specify a new handler, which is more flexible. Because stack creation is separated, it can express both high-level flavours without much overhead.
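The deep-via-shallow encoding from the second bullet can be sketched in Python (an illustration of the encoding only, with all names hypothetical): a computation is modelled as a generator that yields effect requests, a shallow handler handles exactly one request, and a deep handler is obtained by manually re-entering the handler in a loop over the same (reused) computation.

```python
def step(comp, value=None):
    """One shallow handling step: resume comp up to its next effect
    request, or to completion."""
    try:
        return ("effect", comp.send(value))
    except StopIteration as done:
        return ("done", done.value)

def handle_deep(comp, handler):
    """Deep handler built from the shallow step by re-entering the
    handler in a loop; the same computation (conceptually, the same
    stack) is resumed each time, so no new stack is created."""
    tag, payload = step(comp)
    while tag == "effect":
        tag, payload = step(comp, handler(payload))
    return payload

def prog():
    """A computation performing two effects, receiving a value each time."""
    x = yield ("ask", "x")
    y = yield ("ask", "y")
    return x + y

env = {"x": 2, "y": 40}
assert handle_deep(prog(), lambda req: env[req[1]]) == 42
```

The loop in handle_deep is the "manually reenter the handler" step; in the proposed low-level mechanism the analogous resumption reuses the existing stack, which is why the encoding avoids the per-resumption stack creation that makes it expensive at the source level.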

@kayceesrk
Copy link

The only cost is for switching the stack & handler, which can be cheap, as @kayceesrk's numbers showed.

If Multicore OCaml used shallow handlers, then the only additional cost would be a single additional store to update the handler function slot in the stack at resumption. Multicore OCaml uses deep handlers since that is what you want almost all of the time for expressing concurrency (async i/o, generators). Choosing deep handlers leads to succinct high-level programs.

@RossTate
Copy link

Okay, so the answer seems to be that you opted not to directly support deep handlers because you were willing to accept the overhead of the encoding into shallow handlers.

But that is not the case in the low-level mechanism we propose, because we make stack creation a separate operation, such that the same stack can be reused.

Yes, and there's a low-level mechanism that optimizes for deep handlers but enables efficient encoding of shallow handlers. It sounds like this is a better fit for common applications of effect handlers. Such an alternative is also a better fit for languages that allow the prompt/tag to be determined dynamically. It's also easier to extend with optimizations for tail resumption and continuation elimination.

Why was such an option not considered?

@slindley
Copy link

You seem to be making a lot of claims about an alternative design. Do you have any empirical evidence for any of them?

@RossTate
Copy link

RossTate commented Aug 18, 2020

Well, no one has any empirical evidence for any of these designs. But I can say that #1360 supports optimizations for tail-resumptive effect handlers—meaning that it can express the more optimized lowered code that tail-resumptive effect handlers permit—and as such it can eliminate the 226 stack switches in @kayceesrk's example for generators depending on how the generated values are consumed. In particular, generators are very often used by for-each constructs, which are encodable as tail-resumptive effect handlers, and which consequently #1360 can implement without stack switching. In the less common case, though, where the generated values are enumerated through successive stateful function calls, #1360 can fall back on the less efficient implementation using a stack. Importantly, the functions generating values (say through yield) do not need to know which context they are executing within.

The trick is that #1360 breaks down algebraic effects into lower-level components, namely stack inspection and stack creation/switching (without a trampoline). The "perform operation" step translates to an (eager) stack inspection (call_stack), looking for the most immediate executable stack mark (answer) for that operation and invoking it, giving it direct access to the stack frame it marked. In an unoptimized implementation of deep handlers, this stack mark sits at the root of a stack the handler and handlee were run on, and it has a pointer to the parent stack. In this case, the body of the executable stack mark says to switch control to the parent stack, and then when it's switched back to with some value, it "answers" the stack inspection with whatever value it received (after storing the new parent). But in an optimized implementation of a tail-resumptive handler, there is no need to allocate a new stack and instead the current stack is marked, and the mark simply executes the handler code on the same stack and then "answers" with whatever the handler code would have resumed the continuation with—no stack allocation or stack switching necessary.
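The tail-resumptive case can be sketched in Python (a loose illustration of the idea only; Answer, perform, and the mark list are hypothetical names, not constructs from #1356 or #1360): performing an operation walks the current stack's marks and runs the matching answer directly on the current stack, so no continuation is allocated and no stack switch occurs, even for "deep" performs from nested calls.

```python
_marks = []   # dynamically scoped stack marks, innermost last

class Answer:
    """Install a tail-resumptive answer (stack mark) for an operation
    over a dynamic extent."""
    def __init__(self, op, fn):
        self.op, self.fn = op, fn
    def __enter__(self):
        _marks.append(self)
        return self
    def __exit__(self, *exc):
        _marks.pop()
        return False

def perform(op, arg):
    """Eager stack inspection: find the innermost answer for op and run
    it on the current stack; its return value 'answers' the inspection."""
    for m in reversed(_marks):
        if m.op == op:
            return m.fn(arg)
    raise RuntimeError(f"unhandled operation {op!r}")

def producer():
    for i in range(5):
        perform("yield", i)        # deep yield from a nested call

def consumer():
    total = 0
    def on_yield(v):               # tail-resumptive: always resumes at once
        nonlocal total
        total += v
    with Answer("yield", on_yield):
        producer()
    return total

assert consumer() == 10
```

This is the "for each" shape: the handler closes over the consumer's frame and every yield is an ordinary dynamic lookup plus a direct call, which is the optimized lowering being contrasted with continuation allocation.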

#1360 also allows optimizations for more mixed patterns. For example, some effect handlers are tail-resumptive on some control-flow paths but not all. In these cases, #1360 lets you allocate a new stack in case the continuation escapes the handler. But #1360 also lets the stack mark at the root of the allocated stack execute the handler before any stack switch is performed. If control goes down a tail-resumptive branch, then the mark simply "answers" with whatever the handler provides. Only when control goes down a non-tail-resumptive branch is control switched to the parent stack (extended to execute the remaining computation).

The proposal here does not support these optimizations. Furthermore, notice that it's important, especially in the final case with mixed tail resumption, for the handler to be executed on the newly allocated stack rather than the parent stack, like in deep handlers. So the orientation of the current proposal around shallow handlers makes it difficult to add support for such optimizations in the future.

@RossTate
Copy link

Note: I realized that we might be using the term "generator" differently, and after looking at the documentation for Multicore OCaml I indeed found that was the case. So I edited the above comment to give a more nuanced and detailed answer that bridges the difference in terminology.

@RossTate
Copy link

The example of generators in today's presentation appeared to be precisely the "for each" pattern for which all stack allocation and switching can be eliminated (even with "deep" yields, i.e. no need to employ an automaton transformation). This proposal cannot express that optimized implementation strategy though.

Can you explain why you do not want the optimized implementation strategy to be expressible?

@sabine
Copy link

sabine commented Aug 24, 2020

To discuss the finer points of this proposal and #1360, I think it would be helpful, if both could be formulated in terms of a "global model of the world" that includes a notion of suspended stacks, the currently executing stack, and how they relate to each other.

It looks to me like both proposals show formal semantics only from the "viewpoint" of a single thread. The consequence is that, for the most interesting instructions, we need to fall back to informal language, because the semantics cannot be expressed completely in the single-threaded execution model.

@RossTate
Copy link

@sabine The semantics of both proposals are expressible and implementable in a single-threaded execution model. Could you clarify why you believe they aren't? I feel like that will uproot a miscommunication that hopefully will provide the insight you're looking for.

@KronicDeth
Copy link

@RossTate even if the semantics can be expressed single-threaded, shouldn't it be called out that continuations are not expected to transfer between threads? That would make it explicit that work-stealing schedulers on different threads won't be able to grab a continuation. It was mentioned in the meetings, but it probably should be more explicit than just the meeting notes. It precludes the work stealing that Erlang's BEAM does, so we won't be able to do that in Lumen.

@RossTate
Copy link

Certainly the CG should be made aware that the applications of both proposals to multi-threading are still limited by the multi-threading model of WebAssembly and the embedder.

@sabine
Copy link

sabine commented Aug 24, 2020

@RossTate certainly, the semantics must be implementable in a single-threaded execution model, and thus, it must be expressible in a single-threaded execution model.

I think what I'm having trouble with is the use of big-step semantics and the use of informal explanations to describe the core parts of stack switch semantics (and it's very likely that I also need to read some papers that you are all familiar with). For example,

  • here in Typed continuations to model stacks #1359, with cont.resume (event $evt $handler)* : [t1* (cont $ft)] -> [t2*], we have big-step semantics in the sense that evaluating a cont.resume instruction results in a new stack only after the continuation has been executed completely. All the other effects of cont.resume are described informally (though I think I get the general idea).
  • in Proposal: First-Class Stacks #1360, we have stack.switch $event : [t* stackref] -> unreachable where event $event : [t* stackref], which, I assume, is what the stack switch looks like from the viewpoint of the current thread. In a global model with small-step semantics, however, we should see the stack we just switched to?

I can see that both proposals follow the regular WebAssembly specification conventions from https://webassembly.github.io/spec/core/exec/conventions.html, where it says "The execution rules also assume the presence of an implicit stack that is modified by pushing or popping values, labels, and frames.".

I have this feeling that stack switch semantics would need less informal explanations if we could use a formalism that has an explicit model of the stack (which we could then reuse to express what a suspended stack looks like).

Using small-step semantics with an explicit stack, we might be able to formulate mathematically what happens on suspend, resume, and the more expressible primitives from #1360.

I might be wrong, though, and expressing what happens in small-step semantics is not trivial or is not as helpful as I think it would be.

@rossberg
Copy link
Member Author

@sabine, the semantics of continuations can certainly be described in a small-step semantics. We actually had that written down for an earlier version of this proposal, but haven't adapted it yet to the current incarnation.

Roughly, a continuation/stack is represented by an evaluation context E, and there would be a new administrative instruction for representing handlers on the stack. It's very similar to how you model exceptions, with the continuation just capturing and reifying the evaluation context you have thrown across, and resume reentering it.
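For the curious, the standard shape of such rules can be sketched as follows (the notation here is a generic reconstruction, not taken verbatim from the proposal's earlier formalisation): a handler frame is an administrative instruction, and a continuation reifies the evaluation context between a suspend and its nearest matching handler.

```latex
\begin{align*}
&\mathsf{handle}_{H}\,\big[\,E[\mathsf{suspend}\;\mathit{evt}\;v]\,\big]
  \;\hookrightarrow\; H(\mathit{evt})\,\big(v,\ \mathsf{cont}\,E\big)
  &&\text{if } \mathit{evt} \text{ is not handled within } E \\
&\mathsf{handle}_{H}\,\big[\,v\,\big] \;\hookrightarrow\; v \\
&\mathsf{resume}_{H'}\;(\mathsf{cont}\,E)\;v \;\hookrightarrow\;
  \mathsf{handle}_{H'}\,\big[\,E[v]\,\big]
\end{align*}
```

The third rule reflects the point made earlier in the thread that resume can supply a new handler $H'$, which is how the mechanism accommodates both shallow and deep flavours.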

Not sure that would be too helpful to most people at the current stage, though.

@kayceesrk
Copy link

One standard way of explicitly reifying stacks is using a CEK machine. @dhil has a WIP extension of the Wasm semantics with effect handlers which I assume is modelled as a CEK machine.

@sabine
Copy link

sabine commented Aug 25, 2020

@rossberg is the earlier (outdated) version of small-step semantics available somewhere?

@rossberg
Copy link
Member Author

@sabine, you might be able to access it here, but beware, it's more than 2 years old, still modelling everything as an extension to exceptions, and it's in fancy markdown.

@RossTate
Copy link

Ah, thanks for the clarification @sabine! The link @rossberg posted above should be useful for understanding the evaluation-context approach. @kayceesrk mentions the CEK machine, which is another approach. It makes the stack an explicit part of the machine state, much like the heap. So if the stack of the machine state looks like (cons K K'), i.e. a stack K' with a pointer to another stack K on top, then a stack switch changes the stack of the machine state to look like (cons K' K). The stack.switch instruction then does something similar, but further incorporating exceptions to manage untyped communication. Hope that helps.

@KronicDeth
Copy link

Saw this OOPSLA '20 paper from a retweet by @yaahc of @AndrewCMyers about bidirectional effect handlers. I'm not good at reading the formal semantics in PL papers yet, so going off just the prose, these sections caught my interest in relation to WASM's proposed usage of effect handlers:

Multicore OCAML's async/await + exception interaction

Prior encodings. The ability to encode promise-based async–await [Dolan et al. 2017; Leijen 1
2017a,b]speakstotheexpressivepowerofalgebraiceffects, butencodingsinexistinglanguage designs compromise on how they accommodate exceptional computations. Koka [Leijen 2017a,b] supports structured asynchrony via algebraic effects, but uses an either monad for possible exceptional outcomes of the await operation—but encoding exceptional outcomes into monadic values is a pattern that algebraic effects in Koka are intended to help avoid! Unlike Koka, Multicore OCaml [Dolan et al. 2017] does not check algebraic effects statically. To notify user programs about asynchronously raised exceptions, the language adds a special discontinue construct. It is our goal to treat async–await and asynchronous exceptions in a more unified way: both are statically checked algebraic effects.

The part about Multicore OCaml seems relevant since the WASM Stacks Group has been using Multicore OCaml as an example working implementation. So will we also need a special discontinue construct to mesh together async/await and exceptions if we don't support something like the bidirectional effect handlers this paper proposes?

Using normal effect handlers to emulate bidirectional

The paper also includes implementation test cases showing that, if the effect handlers also need to produce effects (such as exceptions from async/await), then yes, they can be implemented with normal effect handlers and closures, but there will be a performance hit.

We performed two hand translations of the ping-pong communication example (Section 3.3, with ponger implemented as in Section 3.3.2) into μC++, with one of the two using callbacks. This program was chosen because it exercises high-frequency bidirectional control transfer. We ran the hand-translated code using the modified μC++ implementation with the extra run-time check disabled, and measured the running time on a 3.2GHz Intel Xeon Gold processor, averaging 500 runs. The translation relying on callbacks incurred a 2.1× slowdown: 42.6 ms vs. 19.8 ms. This result argues for bidirectional handlers as a first-class language feature: obtaining bidirectionality via a desugaring into callbacks is less efficient. (In the other way around, compiling callbacks away into efficient bidirectionality would involve a complex interprocedural analysis.)

Accidental handling

Preventing accidental handling. Accidental handling of algebraic effects in the presence of effect polymorphism is a known problem. Tunneling (lexically scoped, lifetime-bounded handlers) as a way to avoid accidental handling of exceptions was introduced by Zhang et al. [2016]; follow-on work adapted it to explicit effect polymorphism and proved parametricity [Zhang and Myers 2019]. Brachthäuser et al. [2018, 2020] implement tunneled algebraic effects in a Scala library, called Effekt, that encodes lifetime effects through Scala’s intersection types and path-dependent types.

The paper proves a lot of goodness of bidirectionality because of its type system, so I'm not sure it actually would apply to ours, since it seems that, like the Exception proposal, we will only have some sort of tag system for effect handlers and won't actually be able to show that all effects are handled when mixing different languages together.

The paper did make me wonder how we handle the effect handlers not swallowing the wrong effects without all the types the paper outlines, which also involves lifetimes like in Rust too.

Does this even apply to us?

Again, I don't normally read PL papers, so I could be completely misunderstanding how this paper relates to WASM's effect handlers, but I thought I should bring it up since it seemed on-topic.

@RossTate
Copy link

RossTate commented Oct 18, 2020

Oh cool! Glad to see Yizhou has kept this line of work going at his new position. There's a lot to digest from this paper. To keep from fragmenting the discussion too much, I'll focus just on the second item you raised.

Interestingly, the measurements they provided are more relevant to Stack Inspection (#1356). The reason is that both evaluations are done on an implementation of tail-resumptive algebraic effects that does no stack switching, and whereas this proposal (#1359) does not support such an implementation strategy, #1356 does with its answer construct. An answer has two parts: a stack mark (which is akin to dynamic scoping), and a closure (over the stack frame) that is essentially a tail-resumptive effect handler. One way to implement an answer is to directly use the stack frame as its closure's environment, i.e. a stack-allocated closure. Another way is to allocate the closure environment in the heap and copy to and from the closure environment as necessary. What their measurement indicates is that just the cost of heap allocation rather than reusing the stack frame could make stack-inspection-intensive programs 2.1x slower. (For comparison, these are the same authors who found that eager stack tracing makes exception-intensive programs 6x slower.)

@RossTate
Copy link

Following up, the second big thing in this paper is its use of lexical scoping. This proposal, as well as Stack Inspection, use dynamic scoping—one walks up the stack to find the handler for a given effectful operation. This has its advantages—in particular, it works well in untyped (in terms of effects) settings—but also its disadvantages—such as accidentally finding a different handler than what the programmer intended, as described in the paper. To implement lexically scoped handlers, you pass the handler to the effectful function as an explicit low-level argument. The value you are handing off is in general a closure. But the stuff they do with lifetimes ensures that the closure's lifetime never exceeds the lifetime of its stack frame. This enables them to use stack-allocated closures rather than heap-allocated closures. This implementation technique corresponds to the higher-order parameters discussed in WebAssembly/exception-handling#105.
