Idea: CNEWI sinking across trace boundaries #251

lukego · 2019-04-09T13:55:20Z

The feature that I am most excited about supporting in raptorjit is unboxed FFI pointers and 64-bit integers (#174). I especially want to always be able to load these typed 64-bit values into local variables and perform arithmetic on them without incurring heap allocations and GC in JITed code.

I have been focused on expanding the VM native word size as the solution to this problem. Then a TValue could accommodate both a type tag and a 64-bit cdata value. This would even permit storing unboxed 64-bit values in tables. This may very well be the right solution.
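To make the "wider native word" option concrete: a RaptorJIT TValue is a single 64-bit word, so the type tag has to share those bits with the payload, and a full 64-bit integer or pointer cdata value ends up boxed on the heap. Below is a minimal C sketch of a hypothetical 16-byte value that keeps the tag next to the raw payload. `WideTValue`, the `WT_*` tags and the helper functions are illustrative names only, not RaptorJIT definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tags; not RaptorJIT's actual itype encoding. */
enum { WT_CDATA_U64 = 1, WT_CDATA_PTR = 2 };

/* Hypothetical wide value: a full 64-bit payload plus a separate tag.
   Both fields are 64 bits, so the struct is 16 bytes with no padding. */
typedef struct {
  uint64_t payload; /* raw integer or pointer bits, never boxed */
  uint64_t tag;     /* type tag stored outside the payload */
} WideTValue;

/* Store a 64-bit integer in place: no heap allocation, no GC object. */
static WideTValue wt_from_u64(uint64_t x) {
  WideTValue v = { x, WT_CDATA_U64 };
  return v;
}

/* Store a pointer in place the same way. */
static WideTValue wt_from_ptr(const void *p) {
  WideTValue v = { (uint64_t)(uintptr_t)p, WT_CDATA_PTR };
  return v;
}

/* Arithmetic on an unboxed value stays unboxed: read, add, re-tag. */
static WideTValue wt_add_u64(WideTValue a, uint64_t k) {
  assert(a.tag == WT_CDATA_U64);
  return wt_from_u64(a.payload + k);
}
```

The cost of this design is that every stack slot and table entry doubles in size, which is the trade-off the rest of this issue tries to avoid.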

However, this issue exists to explore whether there is a simpler way to support the special case of values stored in local variables in JIT code. Today we already have allocation-sinking to eliminate allocations that don't escape from a trace. How could we extend this to also eliminate allocations that escape between traces but not onto the heap or into the interpreter?

(I believe that this idea is inspired by a comment from @javierguerragiraldez but I can't immediately find the reference.)


lukego · 2019-04-09T14:09:31Z

First idea:

Suppose that we extended lua_State with a parallel "immediate cdata stack" that contains unboxed immutable cdata values, i.e. FFI pointers or 64-bit integers. This stack would have exactly the same size and structure as the Lua stack, except that most of its slots would be empty.

Then, when a trace exits and a snapshot references a sunk CNEWI instruction, we don't have to transfer that value onto the heap. Instead we:

- Write the unboxed value onto the immediate cdata stack.
- Write a special TValue into the corresponding slot of the main Lua stack to indicate that the value must be loaded from the immediate cdata stack.

This becomes complex if we always have to maintain both of these stacks in parallel and make sure that every stack access is prepared for a potential indirection onto the immediate cdata stack. However, suppose that we could restrict the circumstances under which the immediate cdata stack is actually maintained and used.

Suppose that the immediate cdata stack is only valid when branching to a root trace from another trace. That is, if your trace is going to exit and link with a root trace, then you transfer any sunk CNEWI values onto the immediate cdata stack, and when a trace loads a cdata value from the stack using an SLOAD IR instruction, it always checks for the special value saying that the value is unboxed on the immediate cdata stack.

Then we have solved the problem quite neatly?

- Sunk allocations will always stay sunk across trace boundaries.
- Only two pieces of code need to be modified: snapshot restore, to handle transferring sunk values to other traces separately from transferring them to the interpreter, and the SLOAD IR instruction, to accept values either as boxed TValues or as unboxed on the immediate cdata stack.
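A minimal C model of those two changes, assuming a toy stack of tagged slots rather than real TValues. All names here (`ToyState`, `SLOT_IMMEDIATE`, the `imm` array standing in for the immediate cdata stack, `restore_sunk_cnewi`, `sload_cdata`) are hypothetical, and heap boxing is simulated with `malloc` instead of a GC allocation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NSLOTS 8

/* Toy slot tags; real TValues encode this information differently. */
typedef enum { SLOT_NIL, SLOT_CDATA_BOXED, SLOT_IMMEDIATE } SlotTag;

typedef struct {
  SlotTag tag;
  uint64_t *box; /* heap box, only meaningful for SLOT_CDATA_BOXED */
} Slot;

typedef struct {
  Slot stack[NSLOTS];   /* the ordinary Lua stack */
  uint64_t imm[NSLOTS]; /* parallel immediate cdata stack, same indexing */
} ToyState;

/* Snapshot restore for a sunk CNEWI. Exiting into a root trace keeps the
   value unboxed; exiting to the interpreter must box it on the heap. */
static void restore_sunk_cnewi(ToyState *L, int slot, uint64_t raw,
                               int exit_to_root_trace) {
  if (exit_to_root_trace) {
    L->imm[slot] = raw;                  /* unboxed copy */
    L->stack[slot].tag = SLOT_IMMEDIATE; /* marker: value lives in imm[] */
    L->stack[slot].box = NULL;
  } else {
    uint64_t *box = malloc(sizeof *box); /* stands in for a GC allocation */
    assert(box != NULL);
    *box = raw;
    L->stack[slot].tag = SLOT_CDATA_BOXED;
    L->stack[slot].box = box;
  }
}

/* SLOAD of a cdata value in a root trace: accept either representation. */
static uint64_t sload_cdata(const ToyState *L, int slot) {
  const Slot *s = &L->stack[slot];
  if (s->tag == SLOT_IMMEDIATE)
    return L->imm[slot];
  assert(s->tag == SLOT_CDATA_BOXED); /* a real trace would guard and exit */
  return *s->box;
}
```

The indirection is confined to these two functions. Any other code that reads the slot would still see the marker, which is why the scheme depends on restricting where the immediate cdata stack is valid.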

What do I miss?

mejedi · 2019-04-11T13:30:42Z

> What do I miss?

You enter a root trace with a "dirty" "immediate cdata stack", and the trace doesn't touch the "special TValue" (hence it is not added to a snapshot). Now you are in a situation where a snapshot alone isn't enough to properly restore the Lua stack.

I am also concerned about the overhead (essentially doubling the Lua stack footprint of a function) and limited applicability (only handles 64bit things).

lukego · 2019-04-12T07:37:04Z

> Now you are in a situation where a snapshot alone isn't enough to properly restore the Lua stack.

True. This approach would require extra snapshot-like bookkeeping.
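The failure mode can be reproduced in a small hypothetical model. A snapshot only lists the slots the trace modified, so restoring it leaves an earlier immediate marker in place, and the interpreter would then see a value it cannot interpret. One possible form of the extra bookkeeping is sketched as a sweep that boxes leftover markers before returning to the interpreter. `Snapshot`, `snap_restore`, `has_immediate_marker` and `box_leftover_immediates` are illustrative names, not RaptorJIT functions, and the O(stack size) sweep is just the simplest option, not a proposal for how it should be done.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NSLOTS 8

typedef enum { SLOT_NIL, SLOT_INT, SLOT_CDATA_BOXED, SLOT_IMMEDIATE } SlotTag;

typedef struct {
  SlotTag tag;
  int64_t i;     /* payload for SLOT_INT */
  uint64_t *box; /* payload for SLOT_CDATA_BOXED */
} Slot;

typedef struct {
  Slot stack[NSLOTS];
  uint64_t imm[NSLOTS]; /* parallel immediate cdata stack */
} ToyState;

/* A snapshot entry: only slots the trace modified are listed. */
typedef struct { int slot; int64_t value; } SnapEntry;
typedef struct { const SnapEntry *entries; int n; } Snapshot;

/* Restore only what the snapshot mentions, like snapshot restore today. */
static void snap_restore(ToyState *L, const Snapshot *snap) {
  for (int k = 0; k < snap->n; k++) {
    Slot *s = &L->stack[snap->entries[k].slot];
    s->tag = SLOT_INT;
    s->i = snap->entries[k].value;
    s->box = NULL;
  }
}

/* Does any slot still carry a marker the interpreter cannot understand? */
static int has_immediate_marker(const ToyState *L) {
  for (int k = 0; k < NSLOTS; k++)
    if (L->stack[k].tag == SLOT_IMMEDIATE) return 1;
  return 0;
}

/* Extra bookkeeping: box every leftover immediate value before exiting
   to the interpreter. */
static void box_leftover_immediates(ToyState *L) {
  for (int k = 0; k < NSLOTS; k++) {
    if (L->stack[k].tag != SLOT_IMMEDIATE) continue;
    uint64_t *box = malloc(sizeof *box); /* stands in for a GC allocation */
    assert(box != NULL);
    *box = L->imm[k];
    L->stack[k].tag = SLOT_CDATA_BOXED;
    L->stack[k].box = box;
  }
}
```

In the scenario above, the root trace is entered with slot 2 holding an immediate marker, writes only slot 3, and exits: `snap_restore` fixes slot 3, but slot 2 still holds the marker until the sweep runs.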

> I am also concerned about the overhead (essentially doubling the Lua stack footprint of a function) and limited applicability (only handles 64bit things).

On the one hand, it is important to keep overhead low. On the other hand, optimizing Lua code often involves using FFI data structures, and today this can unpredictably cause a ~50x slowdown (#252). So I do want to find an efficient solution, but almost anything would be better than the status quo.

lukego · 2019-04-12T15:21:09Z

@mejedi Thanks for shooting that naïve idea down.

Could we attack this problem during compilation instead of at runtime?

Suppose that we have two linked traces T1->T2 with a CNEWI in T1 that cannot sink because its value escapes into T2 via the last snapshot. This causes an unwanted heap allocation in trace T1.

Further suppose that the value does not escape from T2 into the interpreter or onto the heap, i.e. that it would have been sunk had it been allocated in T2 instead of in T1.

Is there a way that we could "transfer the sinkage" from trace T1 to trace T2?

Then the allocation would sink in T1, and the sunk value would escape into T2, which could then sink it too, so we wouldn't have an allocation at all. This seems similar to the way that sunk values are already allowed to escape into side-traces today, with those side-traces being responsible for deciding whether to keep the value sunk or not.

Example

Here is the abbreviated IR code for the hot path in example #252:

```
---- TRACE 1 start xx.lua:20
---- TRACE 1 IR
....              SNAP   #0   [ ---- ]
0001 rbx      int SLOAD  #3    CI
0002 rax   >  cdt SLOAD  #2    T
0003          u16 FLOAD  0002  cdata.ctypeid
0004       >  int EQ     0003  +96
0005 rbp      p64 FLOAD  0002  cdata.ptr
0006 rbp    + p64 ADD    0005  +1
0007  {sink}+ cdt CNEWI  +96   0006
....              SNAP   #1   [ ---- ---- 0007 0001 ---- ---- ---- ]
[[[ Exit to side trace 2 happens here ]]]

---- TRACE 2 start 1/1 xx.lua:20
---- TRACE 2 IR
0001 rbx      int SLOAD  #3    PI
0002 rbp      p64 PVAL   #6
0003 [8]      cdt CNEWI  +96   0002
....              SNAP   #0   [ ---- ---- 0003 0001 ---- ---- ---- ]
0004       >  nil GCSTEP
0005 rbx      int ADD    0001  +1
....              SNAP   #1   [ ---- ---- 0003 ]
0006       >  int LE     0005  +1000000
0007 xmm7     num CONV   0005  num.int
....              SNAP   #2   [ ---- ---- 0003 0007 ---- ---- 0007 ]
---- TRACE 2 stop -> 1
```

So here we see that:

- Trace 1 has a CNEWI that sinks.
- Trace 2 loads the raw sunk pointer value using PVAL.
- Trace 2 has a duplicate CNEWI to make the pointer value respectable/usable.
- Trace 2 is not able to sink the CNEWI because it is referenced in the last snapshot.

The nice aspect of this is that sunk values can be passed between traces. The restriction is that those values can't just be loaded from the stack using SLOAD and must instead be wrapped in a CNEWI so that the compiler can unsink them if necessary.

So maybe a solution is that root traces would not load immediate cdata values using SLOAD but instead with a special CNEWI that loads its value from the stack and can accept either sunk or unsunk values?

So at the start of trace 1 we would replace

```diff
-0002 rax   >  cdt SLOAD  #2    T
+0002  {sink}+ cdt CNEWI  +96   #2
```

... which is a new form of CNEWI in the spirit of PVAL that will receive a sunk value (if available) instead of loading from the stack. (Handwave on the details of this for now.)

This all sounds tricky, but I am not so sure it is. The promising aspect is that when we compile the link T1->T2, we already have full snapshot information, etc., about both of the traces.

lukego mentioned this issue Apr 11, 2019 in: Demo: Over 50x slowdown on pointer arithmetic due to single branch #252 (open)

lukego mentioned this issue Apr 13, 2019 in: [wip] Allow sunk CNEWI values to be transferred to linked traces #253 (open)
