Side traces #32

lukego · 2016-12-31T13:27:15Z

I am really happy to discover this project! I am also interested in writing code that specifically exploits the strengths of LuaJIT. I find your example on the README very powerful.

There is one optimization challenge on my mind that I think we share: how to prevent unwanted side-traces. I have written a little bit about this over at LuaJIT/LuaJIT#208 and had some success with optimizations - but not really nailed the issue closed.

Here is an illustration of how it seems to affect luafun. Consider this example program that is based on the README but with multiple operations:

require("fun")()
local n = 100
print(reduce(operator.add, 0, map(function(x) return x^2 end, range(n)))) -- first
print(reduce(operator.add, 0, map(function(x) return x^2 end, range(n)))) -- second

The first call does indeed compile down to a tight loop of around 10 instructions. However, the second call compiles down to around 62 instructions - even though it is exactly the same code! The difference is that the second call is compiling to a side trace within the reduce() subroutine and on each iteration it re-enters the root trace and repeats all of the loop invariant code. So you end up with the first call compiling very tight while the subsequent ones are less efficient. There is a cumulative effect too: each call will create a new side-trace, and the side traces will be chained together, and each link in the chain will add around 9 more instructions.

Just wondering: are you already aware of this issue? If not then you may be able to solve it in a neat way using the tips from @corsix in LuaJIT/LuaJIT#208. If you are already aware of this I would be very interested in your view e.g. if you already have some recommendations that you make to users to explain the best usage and generally set expectations.

Sadly it seems like the simplest workaround would be to avoid programming in a functional style :-) because the problem only arises when many different pieces of code are all executing the same loop statement (for, while, etc) in the source code. If you would write for ... end every time then you have no problem because each for bytecode will have a separate trace complex.

The text was updated successfully, but these errors were encountered:

lukego · 2016-12-31T13:40:17Z

(One "magical" solution to give each call a separate root trace might be for reduce() to internally use the debug API to inspect the stack and determine where it is being called from. Then you could use the specialize() function from LuaJIT/LuaJIT#208 to make a table providing each call site with a separate copy of the reduce subroutine. This is a kludge but it might work and I cannot immediately think of an elegant alternative.)

lukego · 2017-01-01T12:05:46Z

Could not resist continuing this line of investigation a little bit. I made an empirical test case that repeatedly (100x) runs this same snippet of code:

reduce(operator.add, 0, map(function(x) return x^2 end, range(10e6)))

... and uses the x86 Performance Monitoring Unit to measure how many instructions are executed per iteration. On each of the 100 runs the snippet is loaded with loadstring() to simulate being 100 separate calls in a large program (rather than the same call being made 100 times.) This is the bad case I am interested in where each call in the source is compiled to a separate side-trace without a dedicated loop.

Code and results for 100 iterations is here: https://gist.github.com/lukego/5c64368bdfdcf7ffcc4dc067fc630182

Here is the summary as I see it:

First call takes 9 instructions. Trace with a loop.
Second call takes 63 instructions. Side-trace without a loop.
Calls 3..92 each take 3 more instructions that the previous one. Side-traces are forming a chain. Ends up with 333 instructions for trace 92.
Calls 93..100 each take 865 instructions. The JIT has given up at this point and fallen back to the interpreter.

Quite an undertaking to understand and optimize code with a tracing JIT!

Solution will be to find a way to give each call a separate root trace. Then they should each take only 9 instructions like the first one.

Proof-of-concept solution to issue luafun#32. Compiles the root trace for the foldl-loop separately for each calling function. This reduces excessive side tracing. For the specific test case from luafun#32 it resolves the problem perfectly and makes sure that each of the 100 traces has a nine-instruction loop. This solution is not perfect though: if a single function makes multiple calls to foldl then they will be compiled together. Could be nicer to find a way to compile each call-site separately. Just have to compute a better key to the foldl_impl table. For now this should be sufficient for evaluating whether the optimization is beneficial in practice. This change uses one very recent LuaJIT feature, the 'proto' field of debug information, so you need at least LuaJIT/LuaJIT@fb61f7c.

lukego mentioned this issue Jan 1, 2017

How to control the scope of traces? LuaJIT/LuaJIT#208

Closed

lukego mentioned this issue Jan 1, 2017

Compile separate traces for each reduce() caller #33

Open

kyukhin added this to the wishlist milestone Oct 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Side traces #32

Side traces #32

lukego commented Dec 31, 2016

lukego commented Dec 31, 2016

lukego commented Jan 1, 2017 •

edited

Side traces #32

Side traces #32

Comments

lukego commented Dec 31, 2016

lukego commented Dec 31, 2016

lukego commented Jan 1, 2017 • edited

lukego commented Jan 1, 2017 •

edited