Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement register allocator in JIT #244

Merged
merged 1 commit into from May 20, 2024
Merged

Conversation

zherczeg
Copy link
Collaborator

This branch contains an early version of a register allocator for Walrus. It is capable of compiling some WebAssembly code now. For example, our simple fibonacci code:

(module
 (func $func1 (export "func1") (param $num i32) (result i32) (local i32 i32 i32)
  i32.const 0
  local.set 0
  i32.const 1
  local.set 1

  i32.const 0
  local.set 2

  loop
    local.get 0
    local.get 1
    i32.add
    local.get 1
    local.set 0
    local.set 1

    local.get 2
    i32.const 1
    i32.add
    local.tee 2

    i32.const 100000000
    i32.lt_u
    br_if 0
  end

  local.get 2
)
)

(assert_return (invoke "func1" (i32.const 10)) (i32.const 100000000))

Code with register allocation (R0-R3 are scratch registers):

[[[[[[[  Function   0  ]]]]]]]
1: Const32 (0x5567783b3540)
  Unused
  Result[0]: 18.I32 (I:0x5567783b3540)
2: Const32 (0x5567783b2ff0)
  Unused
  Result[0]: 19.I32 (I:0x5567783b2ff0)
3: Const32 (0x5567783b3730)
  Unused
  Result[0]: 20.I32 (I:0x5567783b3730)
4: MoveI32 (0x5567783b3600)
  Param[0]: 18.I32 (I:0x5567783b3540)
  Result[0]: 21.I32 (R0) [4-14]
5: MoveI32 (0x5567783b3780)
  Param[0]: 19.I32 (I:0x5567783b2ff0)
  Result[0]: 22.I32 (R1) [5-14]
6: MoveI32 (0x5567783b37e0)
  Param[0]: 18.I32 (I:0x5567783b3540)
  Result[0]: 23.I32 (R2) [6-15]
7: Label (0x5567783b2800)
  Jump from: 14
8: I32Add (0x5567783b3840)
  Param[0]: 21.I32 (R0) [4-14]
  Param[1]: 22.I32 (R1) [5-14]
  Result[0]: 24.I32 (R3) [8-10]
9: MoveI32 (0x5567783b38a0)
  Param[0]: 22.I32 (R1) [5-14]
  Result[0]: 21.I32 (R0) [4-14]
10: MoveI32 (0x5567783b3900)
  Param[0]: 24.I32 (R3) [8-10]
  Result[0]: 22.I32 (R1) [5-14]
11: I32Add (0x5567783b3960)
  Param[0]: 23.I32 (R2) [6-15]
  Param[1]: 19.I32 (I:0x5567783b2ff0)
  Result[0]: 27.I32 (R3) [11-13]
12: MoveI32 (0x5567783b39c0)
  Param[0]: 27.I32 (R3) [11-13]
  Result[0]: 23.I32 (R2) [6-15]
13: I32LtU (0x5567783b3a20)
  MergeCompare
  Param[0]: 27.I32 (R3) [11-13]
  Param[1]: 20.I32 (I:0x5567783b3730)
  Result[0]: (flag)
14: JumpIfTrue (0x5567783b3a80)
  Jump to: 7
  Param[0]: (flag)
15: End (0x5567783b3ae0)
  Param[0]: 23.I32 (R2) [6-15]

Release mode runtime on an x86-64 machine:

interp: 0m0.994s
base-jit: 0m0.557s (1.78 times as fast)
reg-alloc-jit: 0m0.152s (6.53 times as fast)

@zherczeg zherczeg force-pushed the reg_alloc branch 2 times, most recently from 8bcd38f to 845a9eb Compare April 30, 2024 11:38
@zherczeg
Copy link
Collaborator Author

@clover2123 the 64 bit code is mostly working now, only one test case fails. The i32.reinterpret_f32 (and similar opcodes) are implemented as moves, which confuses the type tracking of register allocation. How could we fix this? Shall we introduce new opcodes for this, or do you have another suggestion? Thank you!

The 32 bit code will require more work, that is far more complex because of register pairs (64 bit values are stored in register pairs).

@zherczeg zherczeg force-pushed the reg_alloc branch 3 times, most recently from 1d75d90 to 1c16b7c Compare May 3, 2024 09:12
@zherczeg
Copy link
Collaborator Author

zherczeg commented May 3, 2024

Register allocation results on x86-64:

       Test case       |     interpreter  |    baseline_jit  |    regalloc_jit  |
-----------------------+------------------+------------------+------------------+
               change  |           3.48s  |   1.69s (2.06x)  |   0.83s (4.20x)  |
            factorial  |           3.64s  |   1.92s (1.90x)  |   1.32s (2.77x)  |
             fannkuch  |           3.20s  |   1.20s (2.66x)  |   0.75s (4.28x)  |
            fibonacci  |           5.21s  |   3.17s (1.64x)  |   2.32s (2.24x)  |
              gregory  |           3.79s  |   1.56s (2.43x)  |   0.88s (4.33x)  |
                hanoi  |           4.72s  |   2.35s (2.00x)  |   1.69s (2.79x)  |
             heapsort  |           5.22s  |   3.09s (1.69x)  |   2.40s (2.18x)  |
              huffman  |           4.62s  |   1.59s (2.91x)  |   0.86s (5.37x)  |
          kNucleotide  |           3.98s  |   1.49s (2.67x)  |   0.67s (5.96x)  |
      mandelbrotFloat  |           3.59s  |   1.39s (2.58x)  |   0.64s (5.57x)  |
     mandelbrotDouble  |           3.60s  |   1.37s (2.63x)  |   0.83s (4.32x)  |
       matrixMultiply  |           5.74s  |   2.21s (2.60x)  |   1.44s (3.99x)  |
           miniWalrus  |           5.35s  |   2.61s (2.05x)  |   1.81s (2.95x)  |
                nbody  |           4.59s  |   1.74s (2.64x)  |   0.82s (5.56x)  |
              nqueens  |           3.97s  |   1.54s (2.59x)  |   0.98s (4.05x)  |
                prime  |           4.29s  |   1.86s (2.30x)  |   0.78s (5.50x)  |
            quickSort  |           4.41s  |   1.42s (3.09x)  |   0.83s (5.30x)  |
             redBlack  |           5.35s  |   2.39s (2.24x)  |   1.37s (3.90x)  |
                  rsa  |           3.80s  |   1.75s (2.17x)  |   0.60s (6.37x)  |
             salesman  |           4.35s  |   1.78s (2.44x)  |   1.08s (4.01x)  |
  simdMandelbrotFloat  |           4.42s  |   1.45s (3.04x)  |   0.86s (5.11x)  |
  simdMandelbrotDouble  |           5.62s  |   2.03s (2.76x)  |   1.24s (4.54x)  |
            simdNbody  |           2.82s  |   1.04s (2.71x)  |   0.63s (4.46x)  |
   simdMatrixMultiply  |          11.13s  |   4.67s (2.39x)  |   2.27s (4.90x)  |
            ticTacToe  |           4.99s  |   1.94s (2.57x)  |   1.06s (4.73x)  |
-----------------------+------------------+------------------+------------------+
    Average speedup    |                  |           2.43x  |           4.38x  |

@zherczeg
Copy link
Collaborator Author

zherczeg commented May 3, 2024

x86-32 results:

        Test case        |     interpreter  |    baseline_jit  |    regalloc_jit  |
-------------------------+------------------+------------------+------------------+
                 change  |           7.86s  |   2.03s (3.87x)  |   1.27s (6.21x)  |
              factorial  |           7.66s  |   2.36s (3.24x)  |   1.56s (4.90x)  |
               fannkuch  |           6.43s  |   1.69s (3.81x)  |   1.00s (6.45x)  |
              fibonacci  |          13.23s  |   3.77s (3.51x)  |   3.18s (4.16x)  |
                gregory  |           9.42s  |   2.57s (3.67x)  |   1.18s (7.98x)  |
                  hanoi  |          10.03s  |   3.08s (3.26x)  |   2.25s (4.45x)  |
               heapsort  |          11.60s  |   3.52s (3.30x)  |   2.46s (4.71x)  |
                huffman  |           9.85s  |   1.75s (5.64x)  |  0.94s (10.50x)  |
            kNucleotide  |           7.70s  |   1.56s (4.95x)  |   0.88s (8.77x)  |
        mandelbrotFloat  |           6.56s  |   1.47s (4.45x)  |   0.92s (7.14x)  |
       mandelbrotDouble  |           8.06s  |   1.82s (4.42x)  |   1.23s (6.53x)  |
         matrixMultiply  |          13.12s  |   2.12s (6.19x)  |   1.36s (9.65x)  |
             miniWalrus  |          10.97s  |   3.22s (3.41x)  |   2.26s (4.86x)  |
                  nbody  |           8.81s  |   1.69s (5.21x)  |   0.96s (9.15x)  |
                nqueens  |           8.56s  |   1.82s (4.72x)  |   0.97s (8.80x)  |
                  prime  |          12.11s  |   2.47s (4.90x)  |   1.48s (8.16x)  |
              quickSort  |           8.83s  |   1.63s (5.43x)  |  0.82s (10.79x)  |
               redBlack  |           9.55s  |   2.46s (3.87x)  |   1.74s (5.50x)  |
                    rsa  |          10.80s  |   2.48s (4.34x)  |   1.36s (7.95x)  |
               salesman  |           8.31s  |   1.85s (4.49x)  |   1.15s (7.21x)  |
    simdMandelbrotFloat  |           9.05s  |   1.44s (6.27x)  |  0.88s (10.32x)  |
   simdMandelbrotDouble  |          13.77s  |   2.40s (5.73x)  |  1.35s (10.23x)  |
              simdNbody  |           5.56s  |   0.89s (6.24x)  |   0.73s (7.58x)  |
     simdMatrixMultiply  |          25.78s  |   4.81s (5.36x)  |   2.67s (9.66x)  |
              ticTacToe  |           9.03s  |   2.06s (4.38x)  |   1.32s (6.82x)  |
-------------------------+------------------+------------------+------------------+
     Average speedup     |                  |           4.59x  |           7.54x  |

@zherczeg
Copy link
Collaborator Author

zherczeg commented May 7, 2024

@clover2123 may I ask your help? The only remaining issue for this patch is i32.reinterpret_f32 and similar opcodes. These are converted to moves, or omitted in some cases. This confuses the register allocator, since it cannot detect these opcodes. What should we do? I have two ideas:

  1. Introducing new interpreter operations for them. Their implementation is simple.
  2. Introduce an additional type descriptor bitset (vector<bool>) for the byte code. This is only used by the jit compiler. Small memory increase.

I am open to other ideas as well. What do you think?

@clover2123
Copy link
Collaborator

@zherczeg IMHO 1st option seems more proper than the other because its simple as you mentioned

Introducing new interpreter operations for them. Their implementation is simple.

@zherczeg zherczeg marked this pull request as ready for review May 10, 2024 15:28
@zherczeg
Copy link
Collaborator Author

The patch now passes the regression tests. I have added a --jit-no-reg-alloc (JITFlagValue::disableRegAlloc) to disable the register allocation at runtime.

@clover2123
Copy link
Collaborator

Why did you add an option (--jit-no-reg-alloc) for register allocation? Just for comparison?

@zherczeg
Copy link
Collaborator Author

And for debugging. It is not necessary, I can remove it. Or it can be removed later.

Signed-off-by: Zoltan Herczeg zherczeg.u-szeged@partner.samsung.com
Copy link
Collaborator

@clover2123 clover2123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@clover2123 clover2123 merged commit a44b113 into Samsung:main May 20, 2024
12 checks passed
@zherczeg zherczeg deleted the reg_alloc branch May 20, 2024 14:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants