Proposal: Add inter-language type bindings #1274

Closed

PoignardAzur opened this issue May 1, 2019 · 61 comments

PoignardAzur commented May 1, 2019

WebAssembly is currently very good at executing code written in arbitrary languages from a given (usually JS) interpreter, but it lacks several key features when it comes to combining multiple arbitrary languages together.

One of these features is a language-agnostic type system. I would like to propose that one or several such system(s) be added to WebAssembly.

As an aside, in previous feature discussions, some contributors have expressed that language interoperability shouldn't be a design goal of WebAssembly. While I agree that it shouldn't necessarily be a high-priority goal, I think it is a goal worth striving for in the long term. So before I go into design goals, I'm going to lay out the reasons why I think language interoperability is worth the effort.

Why care about language interoperability?

The benefits of lower language-to-language barriers include:

  • More libraries for wasm users: This goes without saying, but improving language interoperability means that users can use existing libraries more often, even if the library is written in a different language than they're using.

  • Easier adoption of small languages: In the current marketplace, it's often difficult for languages without corporate support to get traction. New languages (and even languages like D with years of refinement) have to compete with languages with large ecosystems, and suffer from their own lack of libraries. Language interoperability would allow them to use existing ecosystems like Python's or Java's.

  • Better language-agnostic toolchains: Right now, most languages have their own library loading scheme and package manager (or, in the case of C/C++, several unofficial ones). Writing a language-agnostic project builder is hard, because these languages often have subtle dependencies, and ABI incompatibilities, that require a monolithic project-wide solution to resolve. A robust inter-language type system would make it easier for projects to be split into smaller modules that can be handled by an npm-like solution.

Overall, I think the first point is the most important, by a wide margin. A better type system means better access to other languages, which means more opportunities to reuse code instead of writing it from scratch. I can't overstate how important that is.

Requirements

With that in mind, I want to outline the requirements an inter-language type system would need to meet.

I'm writing under the assumption that the type system would be strictly used to annotate functions passed between modules, and would not check how languages use their own linear or managed memory in any way.

To be truly useful in a wasm setting, such a type system would need:

1 - Safety

  • Type-safe: The callee must only have access to data specified by the caller, object-capabilities-style.
  • Memory should be "forgotten" at the end of a call. A callee shouldn't be able to gain access to a caller's data, return, and then access that data again in any form.

2 - Overhead

  • Developers should be comfortable making inter-module calls regularly, eg, in a render loop.
  • Zero-copy: The type system should be expressive enough to allow interpreters to implement zero-copy strategies if they want to, and expressive enough for these implementers to know when zero-copy is optimal.

3 - Struct graphs

  • The type system should include structures, optional pointers, variable-length arrays, slices, etc.
  • Ideally, the caller should be able to send an object graph scattered in memory while respecting requirements 1 and 2.

4 - Reference types

  • Modules should be able to exchange reference types nested deep within structure graphs.

5 - Bridge between memory layouts

  • This is a very important point. Different categories of languages have different requirements. Languages relying on linear memory would want to pass slices of memory, whereas languages relying on GC would want to pass GC references.
  • An ideal type system should express semantic types, and let languages decide how to interpret them in memory. While passing data between languages with incompatible memory layouts will always incur some overhead, passing data between similar languages should ideally be cheap (eg, embedders should avoid serialization-deserialization steps if a memcpy can do the same job).
  • Additional bindings may also allow for caching and other optimization strategies.
  • The conversion work when passing data between two modules should be transparent to the developer, as long as the semantic types are compatible.

6 - Compile-time error handling

  • Any error related to invalid function call arguments should be detectable and expressible at compile-time, unlike in, eg, JS, where TypeErrors are thrown at runtime when trying to evaluate the argument.
  • Ideally, language compilers themselves should detect type errors when importing wasm modules, and output expressive, idiomatic errors to the user. What form this error-checking should take would need to be detailed in the tool-conventions repository.
  • This means that an IDL with existing converters to other languages would be a plus.

7 - Provide a Schelling point for inter-language interaction

  • This is easier said than done, but I think wasm should send a signal to all compiler writers that the standard way to interoperate between languages is X. For obvious reasons, having multiple competing standards for language interoperability isn't desirable.

Proposed implementation

What I propose is for bindings to the Cap'n Proto IDL by @kentonv to be added to WebAssembly.

They would work in a similar fashion to WebIDL bindings: wasm modules would export functions, and use special instructions to bind them to typed signatures; other modules would import these signatures, and bind them to their own functions.

The following pseudo-syntax is meant to give an idea of what these bindings would look like; it's approximate and heavily inspired by the WebIDL proposal, and focuses more on the technical challenges than on providing exhaustive lists of instructions.

Cap'n Proto binding instructions would all be stored in a new Cap'n Proto bindings section.

Cap'n Proto types

The standard would need an internal representation of capnproto's schema language. As an example, the following Cap'n Proto type:

struct Person {
  name @0 :Text;
  birthdate @3 :Date;

  email @1 :Text;
  phones @2 :List(PhoneNumber);

  struct PhoneNumber {
    number @0 :Text;
    type @1 :Type;

    enum Type {
      mobile @0;
      home @1;
      work @2;
    }
  }
}

struct Date {
  year @0 :Int16;
  month @1 :UInt8;
  day @2 :UInt8;
}

might be represented as

(@capnproto type $Date (struct
    (field "year" Int16)
    (field "month" UInt8)
    (field "day" UInt8)
))
(@capnproto type $Person_PhoneNumber_Type (enum 0 1 2))
(@capnproto type $Person_PhoneNumber (struct
    (field "number" Text)
    (field "type" $Person_PhoneNumber_Type)
))
(@capnproto type $Person (struct
    (field "name" Text)
    (field "email" Text)
    (field "phones" (generic List $Person_PhoneNumber))
    (field "birthdate" $Date)
))

Serializing from linear memory

Capnproto messages pass two types of data: segments (raw bytes), and capabilities.

These roughly map to WebAssembly's linear memory and tables. As such, the simplest possible way for WebAssembly to create capnproto messages would be to pass an offset and length into linear memory for segments, and an offset and length into a table for capabilities.

(A better approach could be devised for capabilities, to avoid runtime type checks.)

Note that the actual serialization computations would take place in the glue code, if at all (see Generating the glue code).
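
For concreteness, here is a minimal sketch of what the module side of a segment binding might look like, written in Rust compiled to wasm. Everything here is hypothetical (the export names and how the message gets serialized are invented, not part of the proposal): the module keeps a serialized message in its linear memory and exposes its offset and length as two i32 values that a segment binding operator could consume.

// Sketch only: hypothetical exports a toolchain might emit so that a
// segment binding can locate a serialized Person in linear memory.
use std::cell::RefCell;

thread_local! {
    static PERSON_MESSAGE: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

#[no_mangle]
pub extern "C" fn person_message_offset() -> u32 {
    // Offset of the segment within this module's linear memory.
    PERSON_MESSAGE.with(|m| m.borrow().as_ptr() as u32)
}

#[no_mangle]
pub extern "C" fn person_message_len() -> u32 {
    // Length of the segment in bytes.
    PERSON_MESSAGE.with(|m| m.borrow().len() as u32)
}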

Binding operators

  • segment (immediates: off-idx, len-idx): Takes the off-idx'th and len-idx'th wasm values of the source tuple, which must both be i32s, as the offset and length of a slice of linear memory in which a segment is stored.
  • captable (immediates: off-idx, len-idx): Takes the off-idx'th and len-idx'th wasm values of the source tuple, which must both be i32s, as the offset and length of a slice of a table in which the capability table is stored.
  • message (immediates: capnproto-type; children: capability-table, segments): Creates a capnproto message with the format capnproto-type, using the provided capability table and segments.

Serializing from managed memory

It's difficult to pin down specific behavior before the GC proposal lands. But the general idea is that capnproto bindings would use a single conversion operator to get capnproto types from GC types.

The conversion rules for low-level types would be fairly straightforward: i8 converts to Int8, UInt8 and bool, i16 converts to Int16, etc. High-level types would convert to their capnproto equivalents: structure and array references convert to pointers, opaque references convert to capabilities.

A more complete proposal would need to define a strategy for enums and unions.

Binding operators

  • as (immediates: capnproto-type, idx): Takes the idx'th wasm value of the source tuple, which must be a reference, and produces a capnproto value of capnproto-type.

Deserializing to linear memory

Deserializing to linear memory is mostly similar to serializing from it, with one added caveat: the wasm code often doesn't know in advance how much memory the capnproto type will take, and needs to provide the host with some sort of dynamic memory management method.

In the WebIDL bindings proposal, the proposed solution is to pass allocator callbacks to the host function. For capnproto bindings, this method would be insufficient, because dynamic allocations need to happen both on the caller side and the callee side.

Another solution would be to allow incoming binding maps to bind to two incoming binding expressions (and thus two functions): one that allocates the memory for the capnproto data, and one that actually takes the data.
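
A rough sketch of that two-function shape, in Rust with hypothetical names (an illustration of the idea, not a proposed ABI): the callee exports an allocator the glue code can call to reserve space in its linear memory, plus an entry point that receives the location of the written data.

// Sketch only: a callee-side allocator plus an entry point that is handed
// the (offset, length) of the deserialized capnproto data.
use std::alloc::{alloc, Layout};

#[no_mangle]
pub extern "C" fn bindings_alloc(len: u32) -> u32 {
    // Reserve space in this module's linear memory for the incoming message.
    let size = (len as usize).max(1);
    let layout = Layout::from_size_align(size, 8).unwrap();
    unsafe { alloc(layout) as u32 }
}

#[no_mangle]
pub extern "C" fn receive_person(offset: u32, len: u32) {
    // The glue code wrote the Person data at `offset`; from here the module
    // parses it with its own capnproto reader.
    let _bytes = unsafe {
        std::slice::from_raw_parts(offset as *const u8, len as usize)
    };
}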

Deserializing to managed memory

Deserializing to managed memory would use the same kind of conversion operator as the opposite direction.

Generating the glue code

When linking two wasm modules together (whether statically or dynamically), the embedder should list all capnproto types common to both modules and the bindings between function types and capnproto types, and generate glue code between every different pair of function types.

The glue code would depend on the types of the bound data. Glue code between linear memory bindings would boil down to memcpy calls. Glue code between managed memory bindings would boil down to passing references. On the other hand, glue code between linear and managed memory would involve more complicated nested conversion operations.

For instance, a Java module could export a function, taking its arguments as GC types, and bind that function to a typed signature; the interpreter should allow a Python module and a C++ module to import that type signature; the C++ binding would pass data from linear memory, whereas the Python binding would pass data from GC memory. The necessary conversions would be transparent to the Java, Python and C++ compilers.
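
To make the "boils down to memcpy calls" case concrete, here is a sketch of what host-side glue between two linear-memory modules could look like. All names and the calling convention are invented for illustration; a real embedder would generate something equivalent from the bindings.

// Sketch only: glue for a linear-to-linear call. The caller's segment is
// copied into space the callee allocates, then the callee's bound function
// is invoked with the new (offset, length).
fn glue_linear_to_linear(
    caller_mem: &[u8],
    seg_off: usize,
    seg_len: usize,
    callee_mem: &mut [u8],
    callee_alloc: impl Fn(usize) -> usize,
    callee_entry: impl Fn(usize, usize),
) {
    let dst = callee_alloc(seg_len); // e.g. the callee's exported allocator
    callee_mem[dst..dst + seg_len]
        .copy_from_slice(&caller_mem[seg_off..seg_off + seg_len]); // the "memcpy" glue
    callee_entry(dst, seg_len); // hand the callee its own view of the data
}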

Alternate solutions

In this section, I'll examine alternate ways to exchange data, and how they rate on the metrics defined in the Requirements section.

Exchange JSON messages

It's the brute-force solution. I'm not going to spend too much time on that one, because its flaws are fairly obvious. It fails to meet requirements 2, 4 and 6.

Send raw bytes encoded in a serialization format

It's a partial solution. Define a way for wasm modules to pass slices of linear memory and tables to other modules, and module writers can then use a serialization format (capnproto, protobuf or some other) to encode a structured graph into a sequence of bytes, pass the bytes, and use the same format to decode it.

It passes 1 and 3, and it can pass 2 and 4 with some tweaking (eg pass the references as indices to a table). It can pass 6 if the user makes sure to export the serialization type to a type definition in the caller's language.

However, it fails at requirements 5 and 7. It's impractical when binding between two GC implementations; for instance, a Python module calling a Java library through Protobuf would need to serialize a dictionary into linear memory, pass that slice of memory, and then deserialize it as a Java object, instead of making a few hashtable lookups that could be optimized away in a JIT implementation.

And it encourages each library writer to use their own serialization format (JSON, Protobuf, FlatBuffer, Cap'n Proto, SBE), which isn't ideal for interoperability; although that could be alleviated by defining a canonical serialization format in tool-conventions.

However, adding the possibility to pass arbitrary slices of linear memory would be a good first step.

Send GC objects

It would be possible to rely on modules sending each other GC objects.

The solution has some advantages: the GC proposal is already underway; it passes 1, 3, 4 and 7. GC-managed data is expensive to allocate, but cheap to pass around.

However, that solution is not ideal for C-like languages. For instance, a D module passing data to a Rust module would need to serialize its data into a GC graph, pass the graph to the Rust function, which would deserialize it into its linear memory. This process allocates GC nodes which are immediately discarded, for a lot of unnecessary overhead.

That aside, the current GC proposal has no built-in support for enums and unions; and error handling would either be at link time or run time instead of compile time, unless the compiler can read and understand wasm GC types.

Use other encodings

Any serialization library that defines a type system could work for wasm.

Capnproto seems most appropriate, because of its emphasis on zero-copy, and its built-in object capabilities which map neatly to reference types.

Remaining work

The following concepts would need to be fleshed out to turn this bare-bones proposal into a document that can be submitted to the Community Group.

  • Binding operators
  • GC type equivalences
  • Object capabilities
  • Bool arrays
  • Arrays
  • Constants
  • Generics
  • Type evolution
  • Add a third "getters and setters" binding type.
  • Possible caching strategies
  • Support for multiple tables and linear memories

In the meantime, any feedback on what I've already written would be welcome. The scope here is pretty vast, so I'd appreciate help narrowing down what questions this proposal needs to answer.

@bitwalker

This is really interesting! I've only read through quickly, and just have some initial thoughts, but my first and foremost question would be to ask why the existing FFI mechanisms that most languages already provide/use are not sufficient for WebAssembly. Virtually every language I'm familiar with has some form of C FFI, and is thus already capable of interoperating today. Many of those languages are able to do static type checking based on those bindings as well. Furthermore, there is already a great deal of tooling around these interfaces (for example, the bindgen crate for Rust, erl_nif for Erlang/BEAM, etc.). C FFI already addresses the most important requirements, and has the key benefit of already being proven and used in practice widely.

5 - Bridge between memory layouts

An ideal type system should express semantic types, and let languages decide how to interpret them in memory. While passing data between languages with incompatible memory layouts will always incur some overhead, passing data between similar languages should ideally be cheap (eg, embedders should avoid serialization-deserialization steps if a memcpy can do the same job).

The conversion work when passing data between two modules should be transparent to the developer, as long as the semantic types are compatible.

The transparent translation of one layout to another when passing data across the FFI barrier really seems like a job for compiler backends or language runtimes to me, and likely not desirable at all in performance-sensitive languages like C/C++/Rust/etc. In particular, for things you plan on passing back and forth across FFI, it would seem to me to always be preferable to use a common ABI, rather than do any kind of translation, as the translation would likely incur too high of a cost. The benefit of choosing a layout other than the common ABI of the platform is unlikely to be worth it, but I'll readily admit I may be misunderstanding what you mean by alternative layouts.

As an aside, putting the burden of solid FFI tooling on compilers/runtimes has an additional benefit, in that any improvements made are applicable on other platforms, and vice versa, as improvements to FFI for non-Wasm platforms benefit Wasm. I think the argument has to be really compelling to essentially start from square one and build a new FFI mechanism.

Apologies if I've misunderstood the purpose of the proposal or missed something critical; as I mentioned above, I need to read through again more carefully, but I felt like I needed to raise my initial questions while I had some time.

@KronicDeth

Apache Arrow exists for this too, but is more focused on high performance applications.

@lukewagner
Member

I think I agree with the general motivation here and it basically lines up with discussions we've had with how Web IDL Bindings could be generalized in the future. Indeed, earlier drafts of the explainer contained an FAQ entry mentioning this inter-language use case.

My main concern (and reason for omitting that FAQ entry) is scope: the general problem of binding N languages seems likely to generate a lot of open-ended (and possibly non-terminating) discussion, especially given that no one is doing it already (which of course is a chicken-and-egg problem). In contrast, the problems addressed by Web IDL Bindings are fairly concrete and readily demonstrated with Rust/C++ today, allowing us to motivate the (non-trivial) effort to standardize/implement and also eagerly prototype/validate the proposed solution.

But my hope is that Web IDL Bindings allows us to break this chicken-and-egg problem and start getting some experience with inter-language binding that could motivate a next wave of extension or something new and not Web IDL specific. (Note that, as currently proposed, if two wasm modules using compatible Web IDL Bindings call each other, an optimizing impl can do the optimizations you're mentioning here; just without the full expressivity of Cap'n Proto.)

@fgmccabe

fgmccabe commented May 1, 2019

I should state up front that I have not yet had the time to fully grok the proposal.
The reason for that is that I believe the task to be impossible. There are two fundamental reasons for this:
a. Different languages have different semantics that are not necessarily captured in a type annotation. Case in point, Prolog's evaluation is radically different to C++'s evaluation: to the point where the languages are essentially not interoperable. (For Prolog you can substitute a number of other languages)

b. By definition, any LCD of type systems is not guaranteed to capture all of a given language's type language. That leaves the language implementer with a deeply uncomfortable choice: support their own language or forgo the benefits of their language's type system. Case in point: Haskell has 'type classes'. Any implementation of Haskell which involved not supporting type classes would effectively gut it and make it unusable.
Another example: C++'s support of generics requires compile-time elimination of the genericity; on the other hand, ML, Java (and a bunch of other languages) use a form of universal representation -- that is not compatible with the approach taken by C++.

On the other hand having two expressions of an exported/imported type seems to bring up its own issues: is the language system supposed to verify that the two expressions are consistent in some sense? Whose responsibility is it to do this work?

fgmccabe closed this as completed May 1, 2019
fgmccabe reopened this May 1, 2019
@bitwalker

@lukewagner Thanks for the links! Definitely glad I got a chance to read that document!

It seems to me like there are two things kinda blended together in this particular discussion - some of what is below is written out so I can have my understanding double checked, so feel free to point out anything I may have misunderstood or missed:

  1. Efficient host bindings
    • Basically the problem WebIDL is intended to solve, at least for browser environments - an interface description that maps from module->host and host->module, essentially delegating the work of translating from one to the other to the host engine. This translation isn't necessarily guaranteed to be ideal, or even optimized at all, but optimizing engines can make use of it to do so. However, even optimized, translation is still performed to some degree, but this is acceptable because the alternative is still translation, just slower.
  2. Efficient heterogenous module-to-module bindings.
    • In other words, given two modules, one written in source and the other in dest, sharing types between them, calling from source->dest and/or dest->source
    • If no common FFI is available, and given something like WebIDL, i.e. piggy-backing on 1, the unoptimized path would be to translate through some common denominator type provided by the host environment when calling across language barriers, e.g. source type -> common type -> dest type.
      • An optimizing engine could theoretically make this translation direct from source to dest without the go-between, but still imposes translation overhead.
    • If a common FFI is available, i.e. source and dest share an ABI (e.g. C ABI), then source and dest can call each other directly with no overhead at all, via the FFI. This is probably the most likely scenario in practice.

So my take is that there are definitely benefits to leveraging WebIDL, or something like it (i.e. a superset that supports a broader set of host APIs/environments), but it is really only a solution to the problem outlined in 1, and the subset of 2 which deals with inter-language bindings where no FFI is available. The subset of 2 where FFI is available, is clearly preferable to the alternatives, since it incurs no overhead per se.

Are there good reasons for using an IDL even when FFI is an option? To be clear, I definitely agree with using an IDL for the other use cases mentioned, but I'm specifically asking in the context of language interoperability, not host bindings.

A couple additional questions I have, if both C FFI (as an example, since it is most common) and IDL are used/present at the same time:

  • If both source and dest languages provide different type definitions for a shared type with the same underlying in-memory representation according to their common ABI (for example, a common representation for a variable-length array) - will the host engine try to perform a translation between those types just because the IDL directives are present, even though they could safely call each other using their standard FFI?
    • If not, and it is opt-in, that seems like the ideal scenario, since you can add IDL to support interop with languages without FFI, while supporting languages with FFI at the same time. I'm not sure how a host engine would make that work though. I haven't thought it through completely, so I'm probably missing something
    • If so, how does the host engine unify types?:
      • If the engine only cares about layout, then how can static analysis detect when a caller provides incorrect argument types to a callee? If that kind of analysis is not a goal, then it would seem that IDL is really only ideally suited for host bindings, and less so cross-language.
      • If the engine cares about more than layout, in other words the type system requires both nominal and structural compatibility:
        • Who defines the authoritative type for some function? How do I even reference the authoritative type from some language? For example, let's say I'm calling a shared library written in another language which defines an add/2 function, and add/2 expects two arguments of some type size_t. My language doesn't necessarily know about size_t nominally; it has its own ABI-compatible representation of machine-width unsigned integers, usize, so the FFI bindings for that function in my language use my language's types. Given that, how can my compiler know to generate IDL that maps usize to size_t?
  • Are there examples of IDL interfaces used to call between modules in a program, where FFI is available but explicitly left unused in favor of the IDL-described interface? Specifically something not WebAssembly, mostly interested to study the benefits in those cases.

I'll admit I'm still trying to dig through the full details of WebIDL and its predecessors, how all this fits in with the different hosts (browser vs non-browser) and so on, definitely let me know if I've overlooked something.

@PoignardAzur
Author

PoignardAzur commented May 1, 2019

@bitwalker

This is really interesting!

Glad you liked it!

but my first and foremost question would be to ask why the existing FFI mechanism that most languages already provide/use are not sufficient for WebAssembly.

The C type system has a few problems as an inter-language IDL:

  • It operates under the assumption of a shared address space, which is unsafe and deliberately doesn't hold in WebAssembly. (my own experience with a JS-to-C FFI suggests that implementations tend to just trade safety for speed)

  • It doesn't have native support for dynamic length arrays, tagged unions, default values, generics, etc.

  • There isn't a direct equivalent to reference types.

C++ solves some of these problems (not the biggest one, shared address space), but adds a bunch of concepts that aren't really useful in IPC. Of course, you could always use a superset of C or a subset of C++ as your IDL and then devise binding rules around it, but at that point you're getting almost no benefits from existing code, so you may as well use an existing IDL.

In particular, for things you plan on passing back and forth across FFI

I'm not quite sure how you mean that, but to be clear: I don't think passing mutable data back and forth between modules is possible in the general case. This proposal tries to outline a way to send immutable data and get immutable data in return, between modules that don't have any information on how the other stores its data.

The benefit of choosing a layout other than the common ABI of the platform is unlikely to be worth it, but I'll readily admit I may be misunderstanding what you mean by alternative layouts.

The thing is, right now, the common ABI is a slice of bytes stored in linear memory. But in the future, when the GC proposal is implemented, some languages (Java, C#, Python) will store very little to nothing in linear memory. Instead, they will store all their data in GC structures. If two of these languages try to communicate, serializing these structures to a stream of bytes only to immediately deserialize them would be unnecessary overhead.


@KronicDeth Thanks, I'll look into it.

Although, from skimming the doc, this seems to be a superset of Flatbuffers, specifically intended to improve performance? Either way, what are its qualities that can uniquely help WebAssembly module interoperability, compared to Flatbuffers or Capnproto?


@lukewagner

But my hope is that Web IDL Bindings allows us to break this chicken-and-egg problem and start getting some experience with inter-language binding that could motivate a next wave of extension or something new and not Web IDL specific.

Agreed. My assumption when writing this proposal was that any capnproto bindings implementation would be based on feedback from implementing the WebIDL proposal.

My main concern (and reason for omitting that FAQ entry) is scope: the general problem of binding N languages seems likely to generate a lot of open-ended (and possibly non-terminating) discussion, especially given that no one is doing it already (which of course is a chicken-and-egg problem).

I think discussing a capnproto implementation does have value, though, even this early.

In particular, I tried to outline what requirements the implementation should/could try to fulfill. I think it would also be useful to list common use cases that an inter-language type system might try to address.

Regarding the N-to-N problem, I'm focusing on these solutions:

  • Only worry about RPC-style data transfer. Don't try to pass shared mutable data, classes, pointer lifetimes, or any other type of information more complicated than "a vector has three fields: 'x', 'y', and 'z', which are all floats".

  • Try to group languages and use cases into "clusters" of data-handling strategies. Establish strategies at the center of these clusters; language compilers bind to a given strategy, and the interpreter does the rest of the NxN work.


@fgmccabe

The reason for that is that I believe the task to be impossible. There are two fundamental reasons for this:
a. Different languages have different semantics that are not necessarily captured in a type annotation. Case in point, Prolog's evaluation is radically different to C++'s evaluation: to the point where the languages are essentially not interoperable. (For Prolog you can substitute a number of other languages)

Any implementation of Haskell which involved not supporting type classes would effectively gut it and make it unusable.

Yeah, the idea isn't to define a perfect "easily compatible with all languages" abstraction.

That said, I think most languages have some similarities in how they structure their data (eg, they have a way to say "every person has a name, an email, and an age", or "every group has a list of people of arbitrary size").

I think it's possible to tap into these similarities to significantly reduce friction between modules. (see also my answer to lukewagner)

b. By definition, any LCD of type systems is not guaranteed to capture all of a given language's type language. That leaves the language implementer with a deeply uncomfortable choice: support their own language or forgo the benefits of their language's type system.

Yeah. I think the rule of thumb here is "If it's a shared library boundary, make it a capnproto type, otherwise, use your native types".

On the other hand having two expressions of an exported/imported type seems to bring up its own issues: is the language system supposed to verify that the two expressions are consistent in some sense? Whose responsibility is it to do this work?

Yeah, I initially wanted to include a section about invariant-checking, and another about type compatibility, but I lost courage.

The answer to "whose responsibility is it" is usually "the callee" (because they must assume any data they receive is suspect), but the checks could be elided if the interpreter can prove that the caller respects the type invariants.

@bitwalker

The C type system has a few problems as an inter-language IDL

Just to be clear, I'm not suggesting it as an IDL. Rather I'm suggesting that the binary interface (the C ABI) already exists, is well-defined, and has extensive language support already. The implication then is that WebAssembly doesn't need to provide another solution unless the problem being solved goes beyond cross-language interop.

It operates under the assumption of a shared address space, which is unsafe and deliberately doesn't hold in WebAssembly.

So I think I see part of the misunderstanding here. There are two classes of FFI that we're talking about here, one which involves sharing linear memory (more traditional shared memory FFI), and one which does not (more traditional IPC/RPC). I've been talking about the former, and I think you are more focused on the latter.

Sharing memory between modules when you are in control of them (such as the case where you are linking together multiple independent modules as part of an overall application) is desirable for efficiency, but does sacrifice security. On the other hand, it is possible to share a designated linear memory specifically for FFI, though I don't know how practical that is with the default tooling out there today.

Cross-module interop that doesn't use shared memory FFI, i.e. IPC/RPC, definitely seems like a good match for WebIDL, capnproto or one of the other suggestions in that vein, since that is their bread-and-butter.

The part I'm not sure about then is how to blend the two categories in such a way that you don't sacrifice the benefits of either, since the choice to go one way or the other is heavily dependent on use case. At least as stated it seems we could only have one or the other, if it is possible to support both, I think that would be ideal.

It doesn't have native support for dynamic length arrays, tagged unions, default values, generics, etc.

I think this probably isn't relevant now that I realize we were talking about two different things, but just for posterity: the ABI certainly has a representation for variable-length arrays and tagged unions, and you are right that C has a weak type system, but that's not really the point; languages aren't targeting C FFI for the C type system. The reason why the C ABI is useful is that it provides a common denominator that languages are able to use to communicate with others that may have no concept of the type system they are interacting with. The lack of higher-level type system features is not ideal, and limits the kind of things you can express via FFI, but the limitations are also part of why it is so successful at what it does: pretty much any language can find a way to represent the things exposed to it via that interface, and vice versa.
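
(As a side note, a minimal sketch of that shared-ABI style of interop, with hypothetical Rust names: both modules agree on a #[repr(C)] layout, so a variable-length array crosses the boundary without any translation step.)

// Sketch only: a C-ABI "fat pointer" both sides agree on.
#[repr(C)]
pub struct ByteSlice {
    pub ptr: *const u8,
    pub len: usize,
}

#[no_mangle]
pub extern "C" fn consume_bytes(bytes: ByteSlice) -> usize {
    // Read the shared data directly; no serialization step is involved.
    unsafe { std::slice::from_raw_parts(bytes.ptr, bytes.len).len() }
}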

C++ solves some of these problems (not the biggest one, shared address space), but adds a bunch of concepts that aren't really useful in IPC. Of course, you could always use a superset of C or a subset of C++ as your IDL and then devise binding rules around it, but at that point you're getting almost no benefits from existing code, so you may as well use an existing IDL.

Agreed, for IPC/RPC, C is a terrible language for defining interfaces.

The thing is, right now, the common ABI is a slice of bytes stored in linear memory.

That's certainly the primitive we're working with, but the C ABI defines a lot on top of that.

But in the future, when the GC proposal is implemented, some languages (Java, C#, Python) will store very little to nothing in linear memory. Instead, they will store all their data in GC structures. If two of these languages try to communicate, serializing these structures to a stream of bytes only to immediately deserialize them would be unnecessary overhead.

I'm not convinced that those languages will jump on deferring GC to the host, but that's just speculation on my part. In any case, languages that understand the host's GC-managed structures could just decide on a common representation for those structures using the C ABI just as easily as they could be represented using capnproto; the only difference is where the specification of that representation lives. That said, I have only a very tenuous grasp of the details of the GC proposal and how that ties in to the host bindings proposal, so if I'm way off the mark here, feel free to disregard.

TL;DR: I think we agree with regard to module interop where shared linear memory is not in play. But I think shared memory is important to support, and the C ABI is the sanest choice for that use case due to existing language support. My hope would be that this proposal as it evolves would support both.

@aardappel

aardappel commented May 2, 2019

What we need is simply a maximally efficient way to exchange buffers of bytes, and a way for languages to agree on the format. There is no need to fix this to one particular serialization system. If Cap'n Proto is the most suitable for this purpose, it can arise as a common default organically, rather than being mandated by wasm.

I am of course biased, as I made FlatBuffers, which is similar to Cap'n Proto in efficiency, but more flexible and more widely supported. I however would not recommend this format to be mandated by wasm either.

There are many other formats that could be preferable to these two given certain use cases.

Note that both Cap'n Proto and FlatBuffers are zero copy, random access, and are efficient at nesting formats (meaning a format wrapped in another is not less efficient than not being wrapped), which are the real properties to consider for inter-language communication. You could imagine an IDL that allows you to specify very precise byte layouts for a buffer, including "the following bytes are Cap'n Proto schema X".

While I am subtly self-promoting, I might point people at FlexBuffers which is kind of like schema-less FlatBuffers. It has the same desirable zero-copy, random access and cheap nesting properties, but can allow languages to communicate without agreeing on a schema, without doing codegen, similar to how JSON is used.

@PoignardAzur
Author

@aardappel

What we need is simply a maximally efficient way to exchange buffers of bytes, and a way for languages to agree on the format. There is no need to fix this to one particular serialization system. If Cap'n Proto is the most suitable for this purpose, it can arise as a common default organically, rather than being mandated by wasm.

I understand the implicit point, that wasm shouldn't be used as a way to impose one standard over its competitors, and I'm personally indifferent to which IDL gets picked.

That said, when all is said and done, the rubber needs to meet the road at some point. If wasm wants to facilitate inter-language communication (which, granted, isn't an assumption everyone shares), then it needs a standard format that can express more than "these bytes make up numbers". That format can be capnproto, C structures, flatbuffers or even something specific to wasm, but it can't be a subset of all of these at the same time, for the reasons @fgmccabe outlined.

While I am subtly self-promoting, I might point people at FlexBuffers which is kind of like schema-less FlatBuffers. It has the same desirable zero-copy, random access and cheap nesting properties, but can allow languages to communicate without agreeing on a schema, without doing codegen, similar to how JSON is used.

I see the appeal, but I don't think this is what you want most of the time when writing a library. The problem with JSON (aside from the terrible parse time) is that when you import a JSON object somewhere in your code, you end up writing lots of sanitizing code before you can use your data, eg:

assert(myObj.foo);
assert(isJsonObject(myObj.foo));
assert(myObj.foo.bar);
assert(isString(myObj.foo.bar));
loadUrl(myObj.foo.bar);

with potential security vulnerabilities if you don't.

See also 6 - Compile-time error handling above.


@bitwalker

Right, I didn't really consider the possibility of shared linear memory. I'd need someone more familiar with WebAssembly design than me (@lukewagner ?) to discuss how feasible it is, and whether it's a good way to achieve inter-module calls; it would also depend on how many assumptions FFIs rely on that are invalidated by wasm's memory layout.

For instance, FFIs will often rely on the fact that their host language uses the C library, and give native libraries access to the malloc function directly. How well can that strategy be translated to wasm, in the context of two mutually suspicious modules?

@kentonv

kentonv commented May 4, 2019

I guess I should say something on this thread, as the creator of Cap'n Proto, but weirdly enough, I haven't found that I have much of an opinion. Let me express a few adjacent thoughts that may or may not be interesting.

I am also the tech lead of Cloudflare Workers, a "serverless" environment that runs JavaScript and WASM.

We've been considering supporting Cap'n Proto RPC as a protocol for workers to talk to each other. Currently, they are limited to HTTP, so the bar is set quite low. :)

In Workers, when one Worker calls another, it is very commonly the case that both run on the same machine, even in the same process. For that reason, a zero-copy serialization like Cap'n Proto obviously makes a lot of sense, especially for WASM Workers since they operate on linear memory that could, in theory, be physically shared between them.

A second, less-well-known reason we think this is a good fit is the RPC system. Cap'n Proto features a full object capability RPC protocol with promise pipelining, modeled after CapTP. This makes it easy to express rich, object-oriented interactions in a secure and performant way. Cap'n Proto RPC is not just a point-to-point protocol, but rather models interactions between any number of networked parties, which we think will be a pretty big deal.

Meanwhile in WASM land, WASI is introducing a capability-based API. It seems like there could be some interesting "synergy" here.

With all that said, several design goals of Cap'n Proto may not make sense for the specific use case of FFI:

  • Cap'n Proto messages are designed to be position-independent and contiguous so that they can be transmitted and shared between address spaces. Pointers are relative, and all objects in a message need to be allocated in contiguous memory, or at least a small number of segments. This significantly complicates the usage model compared to native objects. When using FFI within the same linear memory space, this overhead is wasted, as you could be passing native pointers to loose heap objects just fine.
  • Cap'n Proto messages are designed to be forwards- and backwards-compatible between schema versions, including the ability to copy objects and sub-trees losslessly without knowing the schema. This requires some light type information be stored directly in the content, which Cap'n Proto encodes as metadata on every pointer. If two modules communicating over an FFI are compiled at the same time, then there is no need for this metadata.
  • Cap'n Proto RPC's promise pipelining, path-shortening, and ordering guarantees make sense when there is non-negligible latency between a caller and a callee. FFI on a single CPU has no such latency, in which case the promise pipelining machinery probably just wastes cycles.

In short, I think when you have independently-deployed modules in separate sandboxes talking to each other, Cap'n Proto makes tons of sense. But for simultaneously-deployed modules in a single sandbox, it's probably overkill.

@PoignardAzur
Author

Thanks for the feedback!

Pointers are relative, and all objects in a message need to be allocated in contiguous memory, or at least a small number of segments. This significantly complicates the usage model compared to native objects. When using FFI within the same linear memory space, this overhead is wasted, as you could be passing native pointers to loose heap objects just fine.

I don't know how feasible a shared linear memory approach is for wasm (see above).

That said, either way, I don't think the overhead from relative pointers would be that bad. WebAssembly already uses offsets relative to the start of linear memory, and implementations have tricks to optimize the ADD instructions away in most cases (I think), so the overhead of using relative pointers could probably be optimized away as well.

Cap'n Proto messages are designed to be forwards- and backwards-compatible between schema versions, including the ability to copy objects and sub-trees losslessly without knowing the schema. [...] If two modules communicating over an FFI are compiled at the same time, then there is no need for this metadata.

I don't think that's true. Having a way for modules to define backwards-compatible types at their boundaries allows wasm to use a dependency tree model, while mostly avoiding Haskell's dependency diamond problem.

A bigger source of pointless overhead would be the way capnproto xors its variables against their default values, which is useful when zero-bytes are compressed away, but counter-productive in zero-copy workflows.

@kentonv

kentonv commented May 5, 2019

I don't know how feasible a shared linear memory approach is for wasm (see above).

Ah, TBH I don't think I have enough context to follow that part of the discussion. If you don't have a shared address space then, yes, Cap'n Proto starts to make a lot of sense.

I'm happy to provide advice on how to design formats like this. FWIW there's a few little things I'd change in Cap'n Proto if I didn't care about compatibility with apps that already exist today... it's mostly, like, low-level pointer encoding details, though.

A bigger source of pointless overhead would be the way capnproto xors its variables against their default values, which is useful when zero-bytes are compressed away, but counter-productive in zero-copy workflows.

A bit off-topic, but the XOR thing is an optimization, not overhead, even in the zero-copy case. It ensures that all structures are zero-initialized, which means you don't have to do any initialization on object allocation if the buffer is already zero'd (which it often would be anyway). An XOR against a compile-time constant probably costs 1 cycle whereas any kind of memory access will cost much more.
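
(To make the trade-off concrete, here is a toy version of the XOR-against-default scheme being discussed; an illustration only, not Cap'n Proto's actual encoding code.)

// Toy illustration: a field whose schema default is 42 is stored as
// value ^ 42, so freshly zeroed memory already decodes to the default.
fn encode(value: u16, default: u16) -> u16 { value ^ default }
fn decode(stored: u16, default: u16) -> u16 { stored ^ default }

fn main() {
    let default = 42;
    assert_eq!(decode(0, default), 42); // untouched zeroed memory reads as the default
    assert_eq!(decode(encode(7, default), default), 7); // real values round-trip
}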

@PoignardAzur
Author

@lukewagner Any thoughts on the "sharing linear memory" part?

@lukewagner
Member

I think there are use cases for both sharing and not sharing linear memory and ultimately tools need to support both:

Sharing makes sense where a native app today would use static or dynamic linking: when all the code being combined is fully trusted and its combination is all using either the same toolchain or a rigorously-defined ABI. It's a more fragile software-composition model, though.

Not sharing memory makes sense for a more-loosely coupled collection of modules, where classic Unix-style design would put the code into separate processes connected by pipes. Personally, I think this is the more exciting/futuristic direction for a more compositional software ecosystem and so I've advocated for this to be the default for any toolchain aimed at participating in the ESM/npm ecosystem via ESM-integration (and indeed that is the case today with Rust's wasm-pack/wasm-bindgen). Using a mechanism in the general vicinity of Web IDL Bindings or the extension you've proposed makes a lot of sense to me as a form of efficient, ergonomic, typed (sync or async) RPC.

@jgravelle-google

Having finally read this in full, it sounds a lot like my thinking in this area (which this comment box is too short to contain?).

In particular I've been thinking about the inter-module communication problem as being best described with a schema. Which is to say, we don't need the Cap'nProto serialization format, we can just use the schema. I have no opinion about Cap'nProto's schema language specifically at this time.

From the WASI/ESM+npm perspective, a solution of this form makes the most sense to me. It's an abstraction over ABIs, without depending on a shared ABI. It essentially allows one to describe an interface with a schema-lang API, and call across these language boundaries with native-seeming ABIs on both ends, letting the host handle translation.

In particular, this does not subsume the use case for having more coordination with another module: if you know for sure that you can share an ABI, you can in fact just use an ABI, any ABI, whether that be C or Haskell. If you control and compile all the wasm in question, that's a much easier problem to solve. It's only when you get into the npm case where you're loading arbitrary unknown code and you don't know its source language, that something like having schema-level interop between modules becomes incredibly attractive. Because we can either use the LCD of wasm itself - which I predict will follow a similar arc to native libraries, and use the C ABI - or we can use the LCD of languages, encoded in the schema language. And the schema can be more flexible by making requirement 2) a soft requirement, e.g. it should be possible to convert from C to Rust to Nim efficiently, but C to Haskell having more overhead isn't a dealbreaker.

@rossberg
Member

In particular I've been thinking about the inter-module communication problem as being best described with a schema. Which is to say, we don't need [a] serialization format, we can just use the schema.

I tend to agree with the former, but I'm not sure that the latter follows. Who implements the schema? Even if the host does the transporting, at some point you have to define what Wasm values/bytes are actually consumed/produced on both ends, and each module has to bring its own data into a form the host understands. There may even be multiple forms available, but still that isn't dissimilar from a serialisation format, just slightly more high-level.

it should be possible to convert from C to Rust to Nim efficiently, C to Haskell having more overhead isn't a dealbreaker.

Perhaps not, but you have to be aware of the implications. Privileging C-like languages means that Haskell wouldn't use this abstraction for Haskell modules, because of the overhead induced. That in turn means that it wouldn't participate in the same "npm" ecosystem for its own libraries.

And "Haskell" here is just a stand-in for pretty much every high-level language. The vast majority of languages are not C-like.

I don't claim to have a better solution, but I think we have to stay realistic about how efficient and attractive any single ABI or schema abstraction can be for the general population of languages, beyond the usual FFI-style of oneway interoperability. In particular, I'm not convinced that a pan-linguistic package ecosystem is an overly realistic outcome.

@PoignardAzur
Author

Privileging C-like languages means that Haskell wouldn't use this abstraction for Haskell modules, because of the overhead induced. That in turn means that it wouldn't participate in the same "npm" ecosystem for its own libraries.

And "Haskell" here is just a stand-in for pretty much every high-level language. The vast majority of languages are not C-like.

Could you give some specific use cases? Ideally, existing libraries in Haskell or some other language that would be awkward to translate into a serialization schema?

I suspect that it will mostly come down to utility libraries vs business libraries. Eg containers, sorting algorithms, and other utilities relying on the language's generics won't translate well to wasm, but parsers, gui widgets, and filesystem tools will.

@rossberg
Member

rossberg commented May 21, 2019

@PoignardAzur, it's not difficult to translate them, but it requires them to copy (serialise/deserialise) all arguments/results on both ends of each cross-module call. Clearly, you don't want to pay that cost for every language-internal library call.

In Haskell specifically you also have the additional problem that copying is incompatible with the semantics of laziness. In other languages it may be incompatible with stateful data.

@jgravelle-google

Who implements the schema? Even if the host does the transporting, at some point you have to define what Wasm values/bytes are actually consumed/produced on both ends, and each module has to bring its own data into a form the host understands. There may even be multiple forms available, but still that isn't dissimilar from a serialisation format, just slightly more high-level.

The host implements the schema. The schema doesn't describe bytes at all, and lets that be an implementation detail. This is borrowing from the design of the WebIDL Bindings proposal, in which the interesting bit is in the conversions from C structs to WebIDL types. This sort of a design uses Wasm Abstract Interface Types (I suggest the acronym: WAIT) instead of WebIDL types. In the WebIDL proposal we don't need or want to mandate a binary representation of data when it's been "translated to WebIDL", because we want to be able to go straight from wasm to browser APIs without a stop in between.

Privileging C-like languages means that Haskell wouldn't use this abstraction for Haskell modules, because of the overhead induced.

Oh, agree 100%. I should have finished the example to make that more clear: Meanwhile, Haskell to Elm to C# can be similarly efficient (assuming they use wasm gc types), but C# to Rust may have overhead. I don't think there's a way to avoid overhead when jumping across language paradigms.

I think your observation is correct that we need to try avoiding privileging any languages, because if we fail to be sufficiently ergonomic + performant for a given language, they will not see as much value in using the interface, and thus not participate in the ecosystem.

I believe that by abstracting over the types and not specifying a wire format, we're able to give much more leeway to hosts to optimize. I think a non-goal is to say "C-style strings are efficient", but it is a goal to say "languages that [want to] reason about C-style strings can do so efficiently". Or, no one format should be blessed, but certain compatible call chains should be efficient, and all call chains should be possible.

By call chains I mean:

  1. C -> Rust -> Zig -> Fortran, efficient
  2. Haskell -> C# -> Haskell, efficient
  3. C -> Haskell -> Rust -> Scheme, inefficient
  4. Java -> Rust, inefficient

And "Haskell" here is just a stand-in for pretty much every high-level language. The vast majority of languages are not C-like.

Yes, that was my intent behind using Haskell as a concrete language. (Although Nim was probably a bad example of a C-like language because it makes heavy use of GC too)

--

Another way I've been thinking about the abstract types is as an IR. In the same way that LLVM describes a many-to-one-to-many relationship (many languages -> one IR -> many targets), wasm abstract types can mediate a many-to-many mapping, of languages+hosts -> languages+hosts. Something in this design space takes the N^2 mapping problem and turns it into an N+N one.

@rossberg
Member

rossberg commented May 22, 2019

The host implements the schema.

Well, that can't be enough, each module has to implement something so that the host can find the data. If the host expects C layout then you have to define this C layout, and every client has to marshal/unmarshal to/from that internally. That isn't all that different from a serialisation format.

Even if we did that, it's still useful to define a serialisation format, e.g., for applications that need to transfer data beyond a single engine, e.g. via networking or file-based persistence.

@jgravelle-google

Well, that can't be enough, each module has to implement something so that the host can find the data. If the host expects C layout then you have to define this C layout

The host shouldn't expect anything, but needs to support everything. More concretely, using the webidl-bindings proposal as an illustrative example, we have utf8-cstr and utf8-str, which take i32 (ptr) and i32 (ptr), i32 (len) respectively. There's no need to mandate in the spec "the host internally represents this as C-strings" to be able to concretely map between them.
So, each module implements something, yes, but the representation of the data doesn't need to be expressed in the abstract data/schema layer, which is how this gives us the property of abstracting over that data layout.
Additionally, this is extensible at the bindings layer that maps between concrete wasm types and abstract intermediate types. To add Haskell support (which models strings as both cons lists of chars and arrays of chars), we can add utf8-cons-str and utf8-array-str bindings, which expect (and validate) wasm types of (using current gc proposal syntax) (type $haskellString (struct (field i8) (field (ref $haskellString)))) and (type $haskellText (array i8)).

Which is to say, each module decides how the data originates. The abstract types + bindings allow for conversions between how modules view the same data, without blessing a single representation as being somehow canonical.
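
For a rough illustration of what this looks like from the producing module's side (a sketch only: the function names are made up, and a real binding would be declared in the module's binding section rather than inferred from exports), a linear-memory language can satisfy either string binding with plain exports:

#include <cstdint>
#include <cstring>

// Hypothetical example: a string living in this module's linear memory.
static const char kGreeting[] = "hello from linear memory";

// The shape a utf8-cstr binding consumes: a single pointer (an i32 offset on
// wasm32) to a NUL-terminated byte sequence.
extern "C" const char* greeting_cstr() {
  return kGreeting;
}

// The shape a utf8-str binding consumes: a pointer plus an explicit byte length.
extern "C" void greeting_str(const char** out_ptr, uint32_t* out_len) {
  *out_ptr = kGreeting;
  *out_len = static_cast<uint32_t>(std::strlen(kGreeting));
}

A GC-language module would instead hand the binding layer a reference shaped like $haskellString or $haskellText above; the point is that neither shape is canonical, and the conversion happens at the boundary.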

A serialization format for (a subset of) the abstract types would be useful, but it can be implemented as a consumer of the schema format, and I believe it is an orthogonal concern. FIDL, I believe, has a serialization format for the subset of types that can be transferred across the network; it disallows materializing opaque handles over the wire, while permitting opaque handles to be transferred within a system (IPC yes, RPC no).

@PoignardAzur
Author

PoignardAzur commented May 22, 2019

What you're describing is pretty close to what I had in mind, with one big caveat: the schema must have a small, fixed number of possible representations. Bridging between different representations is an N*N problem, which means the number of representations should be kept small to avoid overburdening VM writers.

So adding Haskell support would require using existing bindings, not adding custom bindings.

Some possible representations:

  • C-style structs and pointers.
  • Actual capnproto bytes.
  • GC classes.
  • Closures serving as getters and setters.
  • Python-style dictionaries.

The idea being that while each language is different, and there are some extreme outliers, you can fit a fairly large number of languages in a fairly small number of categories.

@jgravelle-google

So adding Haskell support would require using existing bindings, not adding custom bindings.

Depends on the level of granularity of existing bindings you're thinking of. N languages each encoding a binding to every other language is 2*N*N, but N<->IR is 2*N, and further if you say N<->[common binding styles]<->IR, where the number of common formats is k, you're talking 2*k bindings, where k < N.

In particular, with the scheme I describe, you get Scheme for free (it would reuse utf8-cons-str). If Java models strings as char arrays as well, that's a utf8-array-str binding. If Nim uses string_views under the hood, utf8-str. If Zig conforms to the C ABI, utf8-cstr. (I don't know Java/Nim/Zig's ABIs, so I didn't mention them as concrete examples earlier)

So, yes, we do not want to add a binding for each possible language, but we can add a few bindings per IR type to cover the vast majority of languages. I think the space for disagreement here is: how many bindings is "a few"? What's the sweet spot? How strict should the criteria be for whether we support a language's ABI?
I don't have specific answers to these questions. I'm trying to give lots of concrete examples to better illustrate the design space.

Also, I would assert that we absolutely want to specify multiple bindings per abstract type, to avoid privileging any one style of data. If the only binding we expose for strings is utf8-cstr, then all languages without a C ABI have to deal with that mismatch. I'm ok with increasing VM writing complexity by some not-small factor.
The total work in the ecosystem is O(VM effort + language implementation effort), and both of those terms scale in some way with N = number of languages. Let M = number of embedders, k = number of bindings, and a = the average number of bindings a given language needs to implement, with a <= k. At a minimum we have M+N separate wasm implementations.
Naive approach, with each of the N languages independently implementing an FFI with every other language: O(M + N*N). This is what we have on native systems, which is a strong signal that anything O(N*N) will lead to results no different from native systems.
Second naive approach, where every VM needs to implement all N*N bindings: O(M*N*N + N), which is clearly even worse.
What we're trying to propose is that we have k bindings that map to an abstract intermediate layer, which maps back to all languages. This implies k work for each VM. For each language, we only need to implement a subset of the bindings. The total work is M*k + N*a, which is O(M*k + N*k). Note that in the event that k=N, the VM side is "only" M*N, so for any given VM it's "only" linear in the number of languages. Clearly, though, we want k << N, because otherwise this is still O(N*N), no better than the first solution.
Still, O(M*k + N*k) is much more palatable. If k is O(1), that makes the whole ecosystem linear in the number of implementations, which is our lower bound on the amount of effort involved. A more likely bound is k being O(log(N)), which I'm still pretty satisfied with.

Which is a long way of saying, I'm completely ok with increasing the VM complexity for this feature by some constant factor.
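
For a rough sense of scale, here are the same formulas with illustrative (entirely made-up) numbers plugged in:

#include <cstdio>

// Made-up sizes: M embedders, N languages, k binding styles,
// a = average number of bindings a single language implements.
int main() {
  const int M = 4, N = 20, k = 5, a = 2;
  std::printf("pairwise FFIs, no shared IR:  %d\n", M + N * N);      // O(M + N*N)
  std::printf("every VM implements N*N:      %d\n", M * N * N + N);  // O(M*N*N + N)
  std::printf("k bindings through an IR:     %d\n", M * k + N * a);  // O(M*k + N*a)
  return 0;
}

Even with these generous numbers, the shared-binding total comes out well below either naive approach.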

@rossberg
Member

we can add a few bindings per IR type to cover the vast majority of languages.

This is the crucial underlying assumption which I believe is simply not true. My experience is that there are (at least!) as many representation choices as there are language implementations. And they can be arbitrarily complicated.

Take V8, which alone has a few dozen(!) representations for strings, including different encodings, heterogeneous ropes, etc.

The Haskell case is far more complicated than you describe, because lists in Haskell are lazy, which means that for every single character in a string you might need to invoke a thunk.

Other languages use funny representations for the length of a string, or don't store it explicitly but require it to be computed.

These two examples already show that a declarative data layout doesn't cut it, you'd often need to be able to invoke runtime code, which in turn might have its own calling conventions.

And that's just strings, which are a fairly simple datatype conceptually. I don't even want to think about the infinite number of ways in which languages represent product types (tuples/structs/objects).

And then there is the receiving side, where you'd have to be able to create all these data structures!

So I think it is entirely unrealistic that we would ever get even remotely close to supporting the "vast majority of languages". Instead, we would start to privilege a few, while already growing a large zoo of arbitrary stuff. That seems fatal on multiple levels.

@aardappel

My experience is that there are (at least!) as many representation choices as there are language implementations. And they can be arbitrarily complicated.

I completely agree. I think trying to design types that will somehow cover most language's internal representations of data is simply not tractable, and will make the eco-system overly complicated.

In the end there is only one lowest common denominator between languages when it comes to data: that of the "buffer". All languages can read and construct these. They're efficient and simple. Yes, they favor languages that are able to directly address their contents, but I don't think that is an inequality that is solved by promoting (lazy) cons cells to the same level of support somehow.

In fact, you can get very far with just a single data type: the pointer + len pair. Then you just need a "schema" that says what is in those bytes. Do they promise to conform to UTF-8? Is the last byte guaranteed to always be 0? Are the first 4/8 bytes length/capacity fields? Are all these bytes little-endian floats that can be sent straight to WebGL? Are these bytes perhaps in an existing serialization format's schema X? etc.

I'd propose a very simple schema specification that can answer all these questions (not an existing serialization format, but something more low level, simpler, and specific to wasm). It then becomes the burden of each language to efficiently read and write these buffers in the format specified. Layers in between can then pass around the buffers blindly without processing, either by copy, or where possible, by reference/view.
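
A minimal sketch of what such a schema entry might carry, with entirely hypothetical field names, just to make the idea concrete:

#include <cstdint>

// Hypothetical sketch: the only value that crosses the language boundary is a
// (pointer, length) pair, plus a small description of what the bytes promise.
struct BufferView {
  uint32_t ptr;  // offset into the producing module's linear memory
  uint32_t len;  // length in bytes
};

struct BufferSchema {
  bool utf8;               // bytes are valid UTF-8
  bool nul_terminated;     // the last byte is guaranteed to be 0
  bool length_prefixed;    // the first 4/8 bytes are length/capacity fields
  bool little_endian_f32;  // packed little-endian floats, e.g. ready for WebGL
  uint32_t nested_format;  // 0 = none, otherwise the id of an existing serialization schema
};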

@jgravelle-google

This is the crucial underlying assumption which I believe is simply not true. My experience is that there are (at least!) as many representation choices as there are language implementations. And they can be arbitrarily complicated.

I agree that this is the crucial underlying assumption. I disagree that it is not true, though I think because of a semantic nuance that I don't think I've made clear.

The bindings are not meant to map to all language representations perfectly, they only need to map to all languages well enough.

That is a crucial underlying assumption that makes this at all tractable, at all, regardless of encoding. @aardappel's proposal of going in the other direction and actually reifying the bytes into a buffer that is decodable is also built on the assumption that it is a lossy encoding of any given program's semantics, some more lossy than others.

The Haskell case is far more complicated than you describe, because lists in Haskell are lazy, which means that for every single character in a string you might need to invoke a thunk.

I had actually forgotten that, but I don't think it matters. The goal is not to represent Haskell Strings while preserving all of their semantic nuances across a module boundary. The goal is to convert a Haskell String to an IR String, by value. This necessarily involves computing the whole string.

These two examples already show that a declarative data layout doesn't cut it, you'd often need to be able to invoke runtime code, which in turn might have its own calling conventions.

The way to model that, regardless of how we specify bindings (or even IF we specify anything for bindings), is to handle that in userland. If a language's representation of a type does not map to a binding directly, it will need to convert to a representation that does. For example if Haskell's Strings are really represented as (type $haskellString (struct (field i8) (field (func (result (ref $haskellString)))))), it will either need to convert to a strict string and use a Scheme-like binding, or to a Text array and use a Java-like binding, or to a CFFIString and use a C-like binding. The value proposition of having multiple imperfect binding types is that some of those are less awkward for Haskell than others, and it's possible to construct Wasm-FFI types without needing to modify the compiler.

And that's just strings, which are a fairly simple datatype conceptually. I don't even want to think about the infinite number of ways in which languages represent product types (tuples/structs/objects).
And then there is the receiving side, where you'd have to be able to create all these data structures!

I'm confused: I see that as saying "binding between languages is completely impossible, so we shouldn't try at all," but I believe what you've been saying has been more along the lines of "I don't believe the approach described here is tractable," which seems much more reasonable. In particular, my objection to this line of argument is that it does not describe a path forward. Given that this problem is "very hard", what DO we do?

Instead, we would start to privilege a few

Almost-surely. The question is one of, what is the degree to which the few optimally-supported languages are privileged? How much leeway do the underprivileged languages have in finding an ergonomic solution?

while already growing a large zoo of arbitrary stuff.

I'm not sure what you mean by that. My interpretation of what's arbitrary is "which languages do we support", but that's the same as "privileging a few", which would be double-counting. And thus this would only be fatally flawed on that one level, rather than multiple :D

@kripken
Member

kripken commented May 29, 2019

I don't think wasm has a greater need for an inter-language solution than other platforms

I think it does. Being used on the web by default means that code can and will be run in different contexts. In the same way that one might <script src="http://some.other.site/jquery.js">, I would love to see people combining wasm libraries in a cross-origin way. Because of the ephemerality and composability properties that the web provides, the value-add of being able to interface with a foreign module is higher than it's ever been on native systems.

I don't quite see why being ephemeral etc. is directly relevant? Overall, the need for an inter-language solution is everywhere, simply because there is good code in many languages. For example, I often want to use the nice Python string utilities from my C++ code - and on native I don't even need to download Python, it's already there. Aside from this general need, there are tons of specific needs for cross-language support in native apps - for example, scripting languages in game engines (for productivity), Rust/C/C++/JS in Firefox (for security, legacy, and other reasons), etc. etc.

This isn't just on native platforms, either. Likewise, on .NET and the JVM the interest in mixing languages has been very strong for many years, but there it often ends up being solved by using the native types, like IronPython using C# strings instead of Python ones. Also, the need for mixing languages is not even new on the Web; it's been an issue for a very long time - for example, to compare to IronPython, pyjamas uses JS strings and even JS numbers.

Overall I don't think wasm adds more motivation here, since the motivation has always been huge. Maybe wasm adds more languages to the Web specifically, but tons of languages already compiled to JS before, and even more languages compile natively.

@aardappel

The conversions need to be implemented in userland. The binding layer just prescribes a format that the user code has to produce/consume.

That's a good summary for me.

@PoignardAzur
Author

While I'm writing a new draft, I have an open question for everybody on this thread:

Is there some existing library that, if it were compiled to WebAssembly, you would want to be able to use from any language?

I'm essentially looking for potential use cases to base the design on. I have my own (specifically, React, the Bullet engine, and plugin systems), but I'd like more examples to work with.

@KronicDeth

@PoignardAzur A lot of languages use the same C Perl-Compatible Regular Expression (PCRE) libraries, but in the browser embedding they should probably use JS's regex API.

@kentonv

kentonv commented Jun 1, 2019

@PoignardAzur BoringSSL and libsodium come to mind.

Also the Cap'n Proto RPC implementation, but this is a weird one: Cap'n Proto's serialization layer realistically must be implemented independently in each language since most of it is a wide-but-shallow API layer that needs to be idiomatic and inline-friendly. The RPC layer, OTOH, is narrow but deep. In principle it should be possible to use the C++ RPC implementation behind any arbitrary language's serialization implementation by passing capnp-encoded byte array references over the FFI boundary...

@scherrey

I think doing what is proposed would ultimately require some fairly invasive changes to WebAssembly itself as it exists today - but it may arguably be worth it.

I would note that the Smalltalk world had some positive experience with such an effort that could be informative: the development of their State Replication Protocol (SRP), a serialization protocol that can represent any type of any size rather efficiently. I've considered making it a native memory layout for a VM or even an FPGA but haven't gotten around to trying it. I do know that it was ported to at least one other language, Squeak, with good results. Certainly something to read up on, as it has strong overlap with the issues, challenges and experiences of this proposal.

@Hywan

Hywan commented Jun 24, 2019

I understand why Web IDL was the default proposal as a binding language: it is the historical and somewhat mature binding language for the Web. I greatly support that decision, and it's very likely I would have made the same. Nonetheless, we may recognize that it barely fits other contexts (that is, other hosts/languages). Wasm is designed to be host-agnostic, or language/platform-agnostic. I like the idea of using mature Web technologies and finding a use case for them in non-Web scenarios, but in the case of Web IDL, it seems really tied to the Web. Which is the reason I'm very closely following the conversations here.

I've opened WebAssembly/interface-types#40, which led me to ask a question here since I've seen no mention of it (or I missed it).

In the whole binding story, it's not clear “who” is responsible for generating the bindings:

  • Is it a compiler (that transforms a program into a Wasm module)?
  • Is it a program author (and so, are the bindings hand-written)?

I think both are valid. And in the case of Web IDL, it seems to show some limitations (see the link above). Maybe I just missed an important step in the process; if so, please disregard my message.

Even if the aim is to “refocus” Web IDL to be less Web-centric, right now it is very Web-centric. And proposals are arising to offer alternatives, hence this thread. Therefore, I'm concerned about potential fragmentation. Ideally (and this is how Wasm was designed so far), given a Wasm module including its bindings, it is possible to run it anywhere as is. With bindings written in Web IDL, Cap'n Proto, FlatBuffers, whatever, I'm pretty sure not all compilers or program authors will write the same bindings in different syntaxes to be truly cross-platform. Funnily, this is an argument in favor of hand-written bindings: people can contribute to a program by writing bindings for platform P. But let's admit this won't be ideal at all.

So to sum up: I'm concerned about possible fragmentation between Web and non-Web bindings. If a non-Web binding language is adopted, would it realistically be implemented by Web browsers? They would have to write bindings “Wasm ⟶ binding language B ⟶ Web IDL”. Note that this is the same scenario for all hosts: Wasm ⟶ binding language B ⟶ host API.

For those who are curious, I work at Wasmer and I'm the author of the PHP, Python, Ruby and Go Wasm integrations. We're starting to have a nice playground to hack on different bindings for very different hosts. If anyone wants me to integrate different solutions, collect feedback, or run experiments, we are all open and ready to collaborate and put more resources on it.

@fgmccabe

fgmccabe commented Jun 24, 2019 via email

@Pauan

Pauan commented Jun 24, 2019

We can already see this with the difficulty of integrating ref types into the tool chain.

I can't speak for other languages, but Rust + wasm-bindgen already has support for ref types.

So I'm curious: what difficulties are you referring to?

@jgravelle-google

My understanding is that the difficulties are more on the C++ end. Rust has sufficiently powerful metaprogramming to make this more reasonable on the language end, but userland C++ has a harder time reasoning about anyrefs, for example.

@kripken
Member

kripken commented Jun 24, 2019

I'd be curious to hear more about the C++ specific issues here. (Are they C++ specific or LLVM specific?)

@fgmccabe

fgmccabe commented Jun 24, 2019 via email

@kripken
Member

kripken commented Jun 24, 2019

Talking to @fgmccabe offline, the point is that it's true both C++ and Rust can't directly store a ref type in a structure, since the structure will be stored in linear memory. Both C++ and Rust can of course handle ref types indirectly, the same way they handle file descriptors, OpenGL textures, etc. - with integer handles. I think his point is that neither of those two languages can handle ref types "well"/"natively" (correct me if I'm wrong!), which I agree with - those languages are always going to be at a disadvantage on ref type operations, compared to GC languages.

I remain curious about whether there is anything specific to C++ here. I don't think there is?

@jgravelle-google

My understanding of what makes C++ hard here is if you have, say:

struct Anyref; // opaque declaration
void console_log(Anyref* item); // declaration of ref-taking external JS API
Anyref* document_getElementById(const char* str);

void wellBehaved() {
  // This should work
  Anyref* elem = document_getElementById("foo");
  console_log(elem);
}

void notSoWellBehaved() {
  // ????
  Anyref* elem = document_getElementById("bar");
  Anyref* what = (Anyref*)((unsigned int)elem + 1);
  console_log(what);
}

The good news is that the latter example is UB I believe (invalid pointers are UB as soon as they're created), but how do we attempt to model that in LLVM IR?

@kripken
Member

kripken commented Jun 24, 2019

@jgravelle-google I think even struct Anyref; presupposes it's something that makes sense in linear memory. Instead, why not model it with an integer handle as mentioned earlier, as OpenGL textures, file handles, and so forth?

using Anyref = uint32_t; // handle declaration
void console_log(Anyref item); // declaration of ref-taking external JS API
Anyref document_getElementById(const char* str);

void wellBehaved() {
  // This should work
  Anyref elem = document_getElementById("foo");
  console_log(elem);
}

The integer handle must be looked up in the table when it is to be used - again, this is just a downside of languages that use linear memory, like C++ and Rust. However, it could definitely be optimized at least locally - if not by LLVM, then at the wasm level.

@jgravelle-google

That will work, but then you need to make sure to call table_free(elem) or you'll keep a reference to it forever. Which isn't -that- weird for C++, granted.

It's something of an odd mapping because it doesn't layer nicely, I think? It feels like a library a la OpenGL, but it relies on compiler magic to provide - I don't think you can build an anyref.h in C++, even with inline wasm, if you depend on declaring a separate table.

Anyway I think it's all doable/tractable, but not-straightforward is all.
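
To make the shape of that concrete, here is a minimal sketch of such a handle shim on the C++ side; the imports (table_free and the slot-returning DOM calls) are hypothetical toolchain glue, not an existing API:

#include <cstdint>

// Hypothetical toolchain glue: the glue owns a table of externrefs, and the
// C++ side only ever sees integer slots into that table.
extern "C" uint32_t document_getElementById(const char* str);  // returns a table slot
extern "C" void console_log(uint32_t slot);
extern "C" void table_free(uint32_t slot);  // release the slot so the host GC can collect

// RAII wrapper so the slot is released when the handle goes out of scope,
// avoiding the "rooted forever" leak mentioned above.
class AnyrefHandle {
 public:
  explicit AnyrefHandle(uint32_t slot) : slot_(slot) {}
  ~AnyrefHandle() { table_free(slot_); }
  AnyrefHandle(const AnyrefHandle&) = delete;             // no accidental double-free
  AnyrefHandle& operator=(const AnyrefHandle&) = delete;
  uint32_t slot() const { return slot_; }
 private:
  uint32_t slot_;
};

void wellBehaved() {
  AnyrefHandle elem(document_getElementById("foo"));
  console_log(elem.slot());
}  // elem's table slot is freed here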

@Pauan

Pauan commented Jun 24, 2019

@kripken It's true that "proper" native anyref support will require some changes to LLVM (and in rustc), but that isn't actually an obstacle.

wasm-bindgen stores genuine wasm anyrefs in a wasm table, and then in linear memory it stores an integer index into the table. So it can then access the anyref by using the wasm table.get instruction.

Until wasm-gc is implemented, GC languages will need to use that exact same strategy, so Rust (et al) aren't missing out.

So what would native anyref support in LLVM gain us? Well, it would enable directly passing/returning anyref from functions, rather than needing to indirect the anyref through a wasm table. That would be useful, yeah, but that's just a performance optimization, it doesn't actually prevent using anyref.

@kripken
Member

kripken commented Jun 24, 2019

@Pauan

wasm-bindgen stores genuine wasm anyrefs in a wasm table, and then in linear memory it stores an integer index into the table. So it can then access the anyref by using the wasm table.get instruction.

Exactly, yes, that's the model I was referring to.

Until wasm-gc is implemented, GC languages will need to use that exact same strategy, so Rust (et al) aren't missing out.

Yes, right now GC languages have no advantage because we have no native wasm GC. But hopefully that will change! :) Eventually I expect GC languages to have a clear advantage here, at least if we do GC properly.

So what would native anyref support in LLVM gain us? Well, it would enable directly passing/returning anyref from functions, rather than needing to indirect the anyref through a wasm table. That would be useful, yeah, but that's just a performance optimization, it doesn't actually prevent using anyref.

Agreed, yes, this will only be a perf advantage of GC languages (eventually) over C++ and Rust etc. It doesn't prevent use.

Cycles are a bigger problem for C++ and Rust, though, since table entries act as roots. Maybe we can have a tracing API or "shadow objects" - basically some way to mirror the structure of GC links inside C++/Rust so that the outside GC can understand them. But I don't think there's an actual proposal for any of those yet.

@Pauan

Pauan commented Jun 24, 2019

Eventually I expect GC languages to have a clear advantage here, at least if we do GC properly.

I could be wrong, but I would be surprised if that was the case: GC languages would have to allocate a wasm GC struct and then the wasm engine would have to keep track of that as it flows through the program.

In comparison, Rust needs no allocation (just assign to a table), and only needs to store an integer, and the wasm engine only needs to keep track of 1 static unmoving table for GC purposes.

I suppose it's possible that anyref access might be optimizable for GC languages, since it wouldn't need to use table.get, however I expect table.get to be quite fast.

So could you explain more about how you expect a wasm-gc program to perform better than a program using a wasm table?

P.S. This is starting to get pretty far off-topic, so maybe we should move this discussion to a new thread?

@kripken
Member

kripken commented Jun 25, 2019

Really just that: avoiding table.get/table.set. With GC you should have the raw pointer right there, saving the indirections. But yeah, you're right that Rust and C++ only need to store an integer, and they are pretty fast overall, so any GC advantage might not end up mattering!

I agree we might be getting off topic, yeah. I think what is on topic is @fgmccabe's point that ref types don't fit in as naturally in linear memory-using languages. That may influence us in certain ways with the bindings (in particular cycles are worrying because C++ and Rust can't handle them, but maybe the bindings can ignore that?), so just something to be careful about I guess - both to try to make things work for as many languages as possible, and to not be overly influenced by any particular language's limitations.

@PoignardAzur
Author

PoignardAzur commented Jul 1, 2019

@kentonv

Cap'n Proto's serialization layer realistically must be implemented independently in each language since most of it is a wide-but-shallow API layer that needs to be idiomatic and inline-friendly. The RPC layer, OTOH, is narrow but deep

Which folder is it?

@kentonv

kentonv commented Jul 5, 2019

@PoignardAzur Sorry, I don't understand your question.

@PoignardAzur
Author

@kentonv I'm looking through the capnproto Github repository. Where is the serialization layer?

@kentonv

kentonv commented Jul 5, 2019

@PoignardAzur So, this gets back to my point. There isn't really a single place you can point to and say "that's the serialization layer". Mostly, Cap'n Proto's "serialization" is just pointer arithmetic around loads/stores to an underlying buffer. Given a schema file, you use the code generator to generate a header file that defines inline methods that do the right pointer arithmetic for the particular fields defined in the schema. Application code then needs to call these generated accessors every time it reads or writes any field.

This is why it wouldn't make sense to try to call an implementation written in a different language. Using a raw FFI for every field access would be extremely cumbersome, so no doubt you would end up writing a code generator that wraps the FFI in something prettier (and specific to your schema). But that generated code would be at least as complicated as the code that Cap'n Proto already implements -- probably more complicated (and much slower!). So it makes more sense to just write a code generator for the target language directly.

There are perhaps some internal helper functions within the Cap'n Proto implementation that could be shared. Specifically, layout.c++/layout.h contains all the code that interprets capnp's pointer encoding, performs bounds checking, etc. The generated code accessors call into that code when reading/writing pointer fields. So I could maybe imagine wrapping that part in an FFI to be called from multiple languages; but I'd still expect to write code generators and some amount of runtime support library in each target language.
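
To illustrate the "wide but shallow" point, a generated accessor boils down to roughly the following (a hand-waved sketch under assumed offsets, not actual capnp-generated code):

#include <cstdint>
#include <cstring>

// Hand-waved sketch: reading a field is bounds-checked pointer arithmetic over
// the underlying message buffer, inlined at every field access.
struct StructReader {
  const uint8_t* data;  // start of this struct's data section
  uint32_t dataSize;    // size of the data section in bytes
};

// A schema field like "age @0 :UInt32" would generate an accessor of this shape:
inline uint32_t getAge(const StructReader& reader) {
  const uint32_t offset = 0;  // byte offset assigned to the field by the schema
  if (offset + sizeof(uint32_t) > reader.dataSize) return 0;  // default when absent
  uint32_t value;
  std::memcpy(&value, reader.data + offset, sizeof(value));
  return value;
}

Pushing every one of those tiny accessors through a raw FFI is what would make reusing this layer from another language so cumbersome.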

@PoignardAzur
Author

Yeah, sorry, I meant the opposite ^^ (the RPC layer)

@kentonv

kentonv commented Jul 6, 2019

@PoignardAzur Ohhh, and I suppose you're specifically interested in looking at the interfaces, since you're thinking about how to wrap them in an FFI. So then you want:

  • capability.h: Abstract interfaces for representing capabilities and RPC invocation, which in theory could be backed by a variety of implementations. (This is the most important part.)
  • rpc.h: Implementation of RPC over a network.
  • rpc-twoparty.h: Transport adapter for RPC over a simple connection.

@PoignardAzur
Author

PoignardAzur commented Jul 21, 2019

This proposal is now superseded by #1291: OCAP bindings.
