
High-performance Brainfuck

This is a fast virtual machine for Brainfuck. It is as useless as it sounds.

Features

  • Configurable cell size of 8, 16, 32, or 64 bits.
  • Decently fast in-place interpreter.
  • Way faster optimizing bytecode interpreter.
  • Even faster baseline JIT compiler with basic register allocation and instruction selection.
  • Optionally, an LLVM-based JIT compiler for faster execution, at the expense of significantly slower startup times.
  • Unlike many other fast virtual machines for Brainfuck, this one uses dynamically sized memory by default.
  • Optionally use statically sized memory for even faster execution.

How fast is it?

hpbf provides five different backends with different runtime performance and startup time characteristics. The baseline JIT compiler is only supported on x86_64 platforms. The LLVM-based JIT compiler is not enabled by default, so you must pass --features llvm to Cargo (e.g. cargo build --release --features llvm) if you want to build this project with LLVM support. Note that you will need LLVM 17 installed for that to work.

The following table shows the performance of the virtual machine for executing different programs that can be found in the examples/ directory. All times include the reading of the code from a file, the parsing, optimization, and compilation of the code, in addition to the actual execution of the program.

| $FLAGS              | mandel.bf | hanoi.bf | factor.bf | dbfi.bf^2(hello.bf) |
|---------------------|-----------|----------|-----------|---------------------|
| --inplace           | 17.272s   | 10.162s  | 235.852s  | 89.494s             |
| --ir-int -O0        | 5.959s    | 826ms    | 132.505s  | 84.820s             |
| --ir-int -O1        | 6.282s    | 32ms     | 51.007s   | 86.448s             |
| --ir-int -O3        | 6.381s    | 34ms     | 51.062s   | 84.531s             |
| --bc-int -O0        | 2.053s    | 175ms    | 35.887s   | 21.004s             |
| --bc-int -O1        | 1.423s    | 18ms     | 11.848s   | 20.809s             |
| --bc-int -O3        | 1.413s    | 21ms     | 11.731s   | 20.764s             |
| --bc-int --static   | 1.310s    | 19ms     | 11.761s   | 15.409s             |
| --llvm-jit -O0      | 9.245s    | 315ms    | 55.432s   | 272.589s            |
| --llvm-jit -O1      | 1.158s    | 276ms    | 4.292s    | 18.345s             |
| --llvm-jit -O2      | 1.173s    | 246ms    | 4.294s    | 18.305s             |
| --llvm-jit -O3      | 1.222s    | 379ms    | 4.351s    | 18.486s             |
| --llvm-jit -O4      | 1.481s    | 571ms    | 3.700s    | 18.719s             |
| --llvm-jit -O5      | 1.472s    | 573ms    | 3.689s    | 18.789s             |
| --llvm-jit --static | 693ms     | 237ms    | 3.442s    | 11.338s             |
| --base-jit -O0      | 1.146s    | 108ms    | 16.331s   | 18.880s             |
| --base-jit -O1      | 796ms     | 13ms     | 3.508s    | 19.193s             |
| --base-jit -O3      | 843ms     | 17ms     | 3.501s    | 19.257s             |
| --base-jit --static | 512ms     | 16ms     | 3.057s    | 10.336s             |
  • mandel.bf was run using ./target/release/hpbf --time -f examples/mandel.bf $FLAGS.
  • hanoi.bf was run using ./target/release/hpbf --time -f examples/hanoi.bf $FLAGS.
  • factor.bf was run using ./target/release/hpbf --time -f examples/factor.bf $FLAGS <<< "87178291199".
  • dbfi.bf^2(hello.bf) was run using cat examples/dbfi.bf examples/input.bf examples/hello.bf examples/input.bf | ./target/release/hpbf --time -f examples/dbfi.bf $FLAGS.

Note that all of these benchmarks are programs with a long runtime. For small programs that run for a long time, the LLVM JIT compiler performs best, as it can amortize its significant compile-time overhead. The baseline JIT compiler is a good all-around choice where it is supported. The bytecode interpreter is the default choice for platforms not supported by the baseline JIT compiler. For programs that are very large, but for which runtime does not matter, consider using -O0 to reduce the startup time.

How does it work?

The following diagram depicts the different components of this virtual machine. The function of each is explained below.

flowchart TD
    file[File] -->|Read file| code
    code[BF code] --> inplace(In-place interpreter)
    code -->|IR builder| ir
    ir[Internal IR] -->|IR optimizer| ir
    ir --> irint(IR interpreter)
    ir -->|Bytecode generator| bc
    bc[Bytecode] -->|Bytecode optimizer| bc
    bc --> bcint(Bytecode interpreter)
    bc -->|Baseline compiler| mc
    mc[Machine code] --> jit(JIT execution)
    ir -->|LLVM IR generator| llvm
    llvm[LLVM IR] -->|LLVM| llvm
    llvm -->|LLVM| mc

A Brainfuck program

Let us look at the following code snippet, which is an extract from examples/primes.bf:

[>+>+<<-]>>[<<+>>-]<<<[>>>>>+<<<[>+>+<<-]>>[<<+>>-]
<[>>[-]<<-]>>[<<<<[>+>+<<-]>>[<<+>>-]<>>>-]<<<-<<-]

After loading the code into memory, it can immediately be executed using the in-place interpreter. The interpreter does not build any tables for the loop jumps, so skipping a loop still requires iterating over all instructions inside it.
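To make this concrete, here is a minimal sketch of such an in-place interpreter (illustrative only, not the crate's actual code; the I/O instructions . and , are omitted for brevity):

// Minimal in-place interpreter sketch. Note how `[` with a zero cell has
// to scan forward over the whole loop body, counting bracket nesting,
// because no jump table is precomputed.
fn run(code: &[u8]) {
    let mut mem = vec![0u8; 30_000];
    let (mut ip, mut ptr) = (0usize, 0usize);
    while ip < code.len() {
        match code[ip] {
            b'+' => mem[ptr] = mem[ptr].wrapping_add(1),
            b'-' => mem[ptr] = mem[ptr].wrapping_sub(1),
            b'>' => ptr += 1,
            b'<' => ptr -= 1,
            b'[' if mem[ptr] == 0 => {
                // Skip the loop: scan forward to the matching `]`.
                let mut depth = 1;
                while depth > 0 {
                    ip += 1;
                    match code[ip] {
                        b'[' => depth += 1,
                        b']' => depth -= 1,
                        _ => (),
                    }
                }
            }
            b']' if mem[ptr] != 0 => {
                // Repeat the loop: scan backward to the matching `[`.
                let mut depth = 1;
                while depth > 0 {
                    ip -= 1;
                    match code[ip] {
                        b']' => depth += 1,
                        b'[' => depth -= 1,
                        _ => (),
                    }
                }
            }
            _ => (),
        }
        ip += 1;
    }
}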

Building the internal IR

The first step to improving the execution performance is to transform the Brainfuck program into a representation that is more amenable to optimization. During this translation, the implementation already coalesces increment instructions targeting the same cell, detects simple loops such as [-] or [+], and delays pointer moves until the end of a basic block. The result of the IR building step for the above program is the following:

loop [0] {
  [0] -= 1
  [1] += 1
  [2] += 1
}
loop [2] {
  [0] += 1
  [2] -= 1
}
loop [-1] {
  loop [1] {
    [1] -= 1
    [2] += 1
    [3] += 1
  }
  loop [3] {
    [1] += 1
    [3] -= 1
  }
  [4] += 1
  loop [2] {
    [4] = 0
    [2] -= 1
  }
  loop [4] {
    loop [0] {
      [0] -= 1
      [1] += 1
      [2] += 1
    }
    loop [2] {
      [0] += 1
      [2] -= 1
    }
    [4] -= 1
  }
  [-1] -= 1
  [1] -= 1
}
move -1

This code can now already be executed using the IR interpreter; however, it is still possible to optimize it significantly.
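For reference, the IR can be pictured as a small instruction tree, roughly like the following sketch (the names here are invented for illustration and are not the crate's actual types):

enum Instr {
    AddImm { dst: isize, imm: i64 },        // [dst] += imm
    Loop { cond: isize, body: Vec<Instr> }, // loop [cond] { ... }
    Move(isize),                            // move n, at the end of a block
}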

Optimizing the internal IR

Clearly there is still much room for optimization. For example, a loop like [>+<-] can be replaced by [1] += [0] followed by [0] = 0. The IR optimizer is a little more sophisticated than that, but in principle it mainly performs conditional constant propagation, loop analysis, and dead code elimination. The optimizer makes multiple passes over the IR that all perform the same operations. The first pass runs without any analysis results, and all subsequent passes use the analysis results computed in the previous passes. The optimizer is particularly effective when loops are balanced; unbalanced loops cause it to perform poorly, because they cannot easily be eliminated.
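The key step of the loop analysis can be sketched as follows: if a loop only adds constants to cells and its condition cell changes by exactly -1 per iteration, the loop runs [cond] times, so every other addition can be turned into a multiply-add. A simplified sketch of this idea (not the crate's actual pass; it ignores cell wrap-around and non-terminating loops):

// `body` lists the (cell, constant) increments of one loop iteration.
// Returns the replacement in the IR notation used in the listings here.
fn eliminate_balanced_loop(cond: isize, body: &[(isize, i64)]) -> Option<Vec<String>> {
    let step: i64 = body.iter().filter(|e| e.0 == cond).map(|e| e.1).sum();
    if step != -1 {
        return None; // counter not decremented by exactly 1; keep the loop
    }
    let mut out = Vec::new();
    for &(cell, v) in body {
        if cell != cond {
            out.push(format!("[{cell}] += [{cond}] * {v}"));
        }
    }
    out.push(format!("[{cond}] = 0"));
    Some(out)
}

For [>+<-], i.e. eliminate_balanced_loop(0, &[(1, 1), (0, -1)]), this yields [1] += [0] * 1 followed by [0] = 0.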

The program above, for example, will be optimized to the following:

[1] += [0]
[0] += [2]
[2] = 0
loop [-1] {
  [2] = [1]
  [4] += 1
  if [2] {
    [2] = 0
    [4] = 0
  }
  [1] += [3] + [0]*[4] - 1
  [3] = 0
  [4] = 0
  [-1] -= 1
}

Generating Bytecode

While the above program is reasonably optimized, the data structures of the internal IR are not suited for fast interpretation. For this reason, the IR can be transformed into a bytecode format that can then be interpreted much faster. Since some instructions in the IR require temporary variables to be computed, the bytecode operates not only on the normal memory addresses but also on a few temporaries. Further, all ifs and loops are encoded as relative branches, allowing the bytecode to be represented as one flat array of instructions. The bytecode generator also performs some optimizations, such as instruction fusion, dead store elimination, global value numbering, and register allocation.
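The overall shape of such a bytecode could look roughly like this (names invented for this sketch; the crate's actual encoding may differ):

// Operands are either cells relative to the current pointer or one of a
// few temporaries (%0, %1, ... in the listing below).
enum Loc {
    Cell(isize),
    Tmp(u8),
}

// Control flow uses relative branch offsets, so a whole program is just
// one flat Vec<Op>.
enum Op {
    Copy { dst: Loc, src: Loc },
    Add { dst: Loc, a: Loc, b: Loc },
    Mul { dst: Loc, a: Loc, b: Loc },
    Brz { cond: Loc, rel: isize },  // branch if zero
    Brnz { cond: Loc, rel: isize }, // branch if non-zero
}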

The optimized program above will generate the following bytecode:

; temps 2
; min -65
; max 65
  copy %0, [0]
  add [1], [1], %0  ; %0
  add [0], [2], %0 & copy [2], 0
  brz [-1], .L280
.L268
  copy %0, [1]
  copy [2], %0  ; %0
  add [4], [4], 1  ; %0
  brz [2], .L274  ; %0
  copy [2], 0  ; %0
  copy [4], 0  ; %0
.L274
  mul %1, [0], [4] & copy [4], 0  ; %0
  add %1, %1, 255  ; %0
  add %1, %1, [3] & copy [3], 0  ; %0
  add [1], %0, %1
  add [-1], [-1], 255
  brnz [-1], .L268
.L280

The above bytecode can then be interpreted using the bytecode interpreter. The interpreter uses direct threading, with one function for each operation, all sharing the following signature:

type Op<C> = unsafe fn(
    cxt: *mut OpsContext<C>,
    mem: *mut C,
    ip: *const OpCode<C>,
    r0: C,
    r1: C,
) -> *const OpCode<C>;

By exploiting tail call optimization, this allows for implementing a fast interpreter: each operation finishes by calling the next one, which the compiler can turn into a plain jump.
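The following self-contained miniature shows the pattern with a simplified signature (illustrative only; note that Rust does not guarantee tail calls, so an interpreter built this way relies on the optimizer actually turning the calls into jumps):

struct OpCode {
    op: fn(mem: &mut [u8], ptr: usize, ip: usize, code: &[OpCode]) -> usize,
    arg: u8,
}

fn op_add(mem: &mut [u8], ptr: usize, ip: usize, code: &[OpCode]) -> usize {
    mem[ptr] = mem[ptr].wrapping_add(code[ip].arg);
    // Tail call into the next operation: ideally compiled to an indirect
    // jump, so dispatch never returns to a central loop.
    (code[ip + 1].op)(mem, ptr, ip + 1, code)
}

fn op_halt(_mem: &mut [u8], _ptr: usize, ip: usize, _code: &[OpCode]) -> usize {
    ip
}

Executing a program is then a single indirect call, (code[0].op)(&mut mem, 0, 0, &code), with each operation dispatching directly to the next.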

Generating LLVM IR

Instead of generating bytecode from the internal IR, the virtual machine also supports translating it to LLVM IR. After the translation, LLVM's optimization and JIT compilation infrastructure can be employed to optimize, compile, and run the program.

Compiling to native machine code

While the LLVM-based JIT backend produces fast machine code, it does so at a considerable compile-time cost. This means that for large programs, it is often faster to use the bytecode interpreter instead of the LLVM JIT compiler. For this reason, the virtual machine also includes a baseline compiler that builds on the bytecode infrastructure and generates machine code directly from the bytecode program.
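To give a flavor of what such a compiler emits, here is a sketch of encoding addb $imm, disp8(%rbp), roughly how a bytecode add on a cell could be lowered when the cell pointer lives in %rbp as in the dump below (illustrative only, not the crate's actual emitter):

fn emit_add_imm8(code: &mut Vec<u8>, disp: i8, imm: u8) {
    code.push(0x80); // opcode: ADD r/m8, imm8
    code.push(0x45); // ModRM: mod=01 (disp8), reg=/0 (ADD), rm=101 (%rbp)
    code.push(disp as u8);
    code.push(imm);
}

For example, emit_add_imm8(&mut code, 4, 1) produces 80 45 04 01, i.e. addb $0x1,0x4(%rbp); in the dump below the compiler instead picks the shorter incb encoding (fe 45 04) for that operation.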

The optimized program above will generate the following machine code:

55                   push   %rbp
53                   push   %rbx
41 54                push   %r12
41 55                push   %r13
41 56                push   %r14
41 57                push   %r15
48 83 ec 18          sub    $0x18,%rsp
48 8b df             mov    %rdi,%rbx
48 8b ee             mov    %rsi,%rbp
44 0f b6 65 00       movzbl 0x0(%rbp),%r12d
44 00 65 01          add    %r12b,0x1(%rbp)
44 02 65 02          add    0x2(%rbp),%r12b
44 88 65 00          mov    %r12b,0x0(%rbp)
c6 45 02 00          movb   $0x0,0x2(%rbp)
80 7d ff 00          cmpb   $0x0,-0x1(%rbp)
0f 84 4a 00 00 00    je     0x130d
44 0f b6 6d 01       movzbl 0x1(%rbp),%r13d
44 88 6d 02          mov    %r13b,0x2(%rbp)
fe 45 04             incb   0x4(%rbp)
80 7d 02 00          cmpb   $0x0,0x2(%rbp)
0f 84 08 00 00 00    je     0x12e1
c6 45 02 00          movb   $0x0,0x2(%rbp)
c6 45 04 00          movb   $0x0,0x4(%rbp)
44 0f b6 75 04       movzbl 0x4(%rbp),%r14d
4d 0f af f4          imul   %r12,%r14
49 ff ce             dec    %r14
44 02 75 03          add    0x3(%rbp),%r14b
4d 03 ee             add    %r14,%r13
44 88 6d 01          mov    %r13b,0x1(%rbp)
c6 45 03 00          movb   $0x0,0x3(%rbp)
c6 45 04 00          movb   $0x0,0x4(%rbp)
fe 4d ff             decb   -0x1(%rbp)
80 7d ff 00          cmpb   $0x0,-0x1(%rbp)
0f 85 b6 ff ff ff    jne    0x12c3
c7 c0 01 00 00 00    mov    $0x1,%eax
eb 06                jmp    0x131b
c7 c0 00 00 00 00    mov    $0x0,%eax
48 83 c4 18          add    $0x18,%rsp
41 5f                pop    %r15
41 5e                pop    %r14
41 5d                pop    %r13
41 5c                pop    %r12
5b                   pop    %rbx
5d                   pop    %rbp
c3                   ret

Usage

Usage: hpbf [option].. [-f file].. [code]..
Options:
   -f,--file file   Read the code from the given file
   -O{0|1|2|3|4|5}  Apply different levels of optimization
   -i8              Run the code using a cell size of 8 bit
   -i16             Run the code using a cell size of 16 bit
   -i32             Run the code using a cell size of 32 bit
   -i64             Run the code using a cell size of 64 bit
   --print-ir       Print ir code to stdout and do not execute
   --print-bc       Print byte code to stdout and do not execute
   --print-jit-bc   Print byte code used in the JIT to stdout and do not execute
   --print-jit-mc   Print machine code generated by the JIT to stdout
   --print-llvm     Print LLVM IR code to stdout and do not execute
   --inplace        Use the inplace non-optimizing interpreter
   --ir-int         Use the internal IR interpreter
   --bc-int         Use the optimizing bytecode interpreter
   --base-jit       Use the baseline JIT compiler
   --llvm-jit       Use the optimizing LLVM JIT compiler
   --limit limit    Limit the number of instructions executed
   --static         Do not perform memory bounds check, static memory size
   --time           Output the time required starting after parsing this argument.
   -h,--help        Print this help text
Arguments:
   code             Execute the code given in the argument
