Skip to content

Latest commit

 

History

History
180 lines (129 loc) · 6.96 KB

decompiler.md

File metadata and controls

180 lines (129 loc) · 6.96 KB

wasm-decompile

Decompiles binary wasm modules into a text format that is significantly more compact and familiar (for users of C-style languages).

Example:

bin/wasm-decompile test.wasm -o test.dcmp

Goals.

This tool is aimed at users that want to be able to "read" large volumes of Wasm code such as language, runtime and tool developers, or any programmers that may not have the source code of the generated wasm available, or are trying to understand what the generated code does.

The syntax has been designed to be as light-weight and as readable as possible, while still allowing one to see the underlying Wasm constructs clearly.

Non-goals.

Be a programming language.

Though compiling this output code back into a wasm module is possible, such functionality is currently not provided. The format is very low-level, much like Wasm itself, so even though it looks more high level than the .wat format, it wouldn't be any more suitable for general purpose programming.

Language guide.

This section shows some aspects of the language in terms of how they map to Wasm and/or how they might differ from a typical C-like language. It does not try to define the actual semantics of Wasm, the reader is expected to already be mostly familiar with that.

Naming.

wasm-decompile, much like wasm2wat, derives names from the name section (preferrably), or linker symbols (if available), or import/export (if not available in the other 2). For things that have no names, names are generated starting from a, b, c and so forth.

In addition, prefixes are used for things that are not arguments/locals: f_ for functions, g_ for globals, etc.

Existing names may be generated "demangled" C++ function signatures, which in the case of functions using STL types may end up several hundred characters long. Besides removing characters not typically part of an identifier, the decompiler also strips common keywords/types from these in an effort to reduce their size.

Linker symbols are typically only available in wasm .o files, though if useful for naming can be retained in fully linked wasm modules using the --emit-reloc flag to wasm.ld. This gives you names for most functions even when --strip-debug was used.

Top level declarations.

Top level items may be preceded with import or export.

Memory is declared like memory m(initial: 1, max: 0);

Globals: global my_glob:int;

Data: data d_a(offset: 0) = "Hello, World!";

Functions (see below for instructions that may appear between {}): function f(a:int, b:int):int { return a + b; }

Statements and expressions.

An expression is generated for any sequence of Wasm instructions that leave exactly 1 value on the stack.

For instructions that leave no value values on the stack, a statement is generated, which is an expression that sits on its own line in the context of a control-flow block, or the function itself. A statement may also be generated for expressions that return a value through control flow, such as a branch instruction.

Instructions that leave multiple values on the stack, or otherwise do stack operations that break the "expression order", instead force the values to be written to temporary variables (named t1, t2 etc) which the subsequent instructions can then operate upon (this does not happen with MVP-only code).

Declaration of arguments and locals.

Arguments are defined in the function signature, as shown above.

Locals are defined upon first use: var my_local:int = 1;

Types.

The decompiler uses int and long for 32-bit and 64-bit integers, and float and double for 32-bit and 64-bit floating point numbers.

Besides these, there are the types byte and ubyte (8-bit), short and ushort (16-bit), and uint, which are used exclusively with certain load/store operations.

Loads and stores.

These tend to be the hardest to "read" in Wasm code, as they've lost all context of the data structures and types the language that Wasm was compiled from was operating upon.

wasm-decompile has a few features to try and make these more readable.

The basic form looks like an array indexing operation, so o[2]:int says: read element 2 from o when seen as an array of ints. This thus accesses 4 bytes at byte-offset 8.

o is just declared as an int, since there is no such thing as a pointer type in Wasm. But wasm-decompile tries to derive them. For example, if the code is doing o[0]:int = o[1]:int + o[2]:int, then wasm-decompile assumes o points to a struct with 3 ints, and may instead compile this to:

var o:{ a:int, b:int, c:int };
o.a = o.b + o.c

The {} type is a nameless struct declaration (named ones tbd) that hints the reader at what kind of memory layout o is accessing. This seems more informative than just uncorrelated indices all over the code.

Sadly, optimized output from a compiler like LLVM often reworks memory accesses in such crazy ways that this "struct detection" fails, for example it falls back to indexing operations when there are holes or overlaps in the memory layout, or types are mixed, etc. This happens even more so when locals such as o are being re-used for unrelated things in memory.

For accesses that are not contiguous, but at least of the same type, the decompiler will change the pointer type from o:int to e.g. o:float_ptr (and similarly, it will omit the type from the actual access, o[2] instead of o[2]:int).

Additionally, wasm-decompile tried to clean up typical indexing operations. For example, when accessing any array of 32-bit elements, generated Wasm code often looks like (base + (index << 2))[0]:int, since Wasm has no built-in way to scale the index by the type of thing being loaded. wasm-decompile then transforms this into just base[index]:int, since the scaling of anything between the [] by the type size is already implied.

Control flow.

Wasm's if-then maps fairly directly to a C-like if (c) { 1; } else { 2; }. Unlike most languages, these if-thens can also be expressions, as shown in this example (wasm-decompile does not currently use the ?: ternary).

Wasm's loop becomes a loop L { ...; continue L; } structure. The inclusion of a label means nested loops can continue any of them.

Wasm's blocks are little more than a label for forward jumps, and cause excessive amounts of nesting in other text formats such as .wat, so here they are reduced to what they naturally are: a label. This label uses {} for denoting a block only when used as an expression, so typically does not indent, and thus doesn't cause endless nesting:

if (c) goto L;
...
label L:

Operator precedence.

wasm-decompile uses the following operator precedence to reduce the amount of () needed in expressions, from high (needs no ()) to low (always needs () when nested):

  • (), a, 1, a()
  • []
  • if () {} else {}
  • *, /, %
  • +, -
  • <<, >>
  • ==, !=, <, >, >=, <=
  • &, |
  • min, max
  • =

Only + and * are associative, i.e. can have multiple of them in sequence without additional ().