Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Side traces #32

Open
lukego opened this issue Dec 31, 2016 · 2 comments
Open

Side traces #32

lukego opened this issue Dec 31, 2016 · 2 comments
Milestone

Comments

@lukego
Copy link

lukego commented Dec 31, 2016

I am really happy to discover this project! I am also interested in writing code that specifically exploits the strengths of LuaJIT. I find your example on the README very powerful.

There is one optimization challenge on my mind that I think we share: how to prevent unwanted side-traces. I have written a little bit about this over at LuaJIT/LuaJIT#208 and had some success with optimizations - but not really nailed the issue closed.

Here is an illustration of how it seems to affect luafun. Consider this example program that is based on the README but with multiple operations:

require("fun")()
local n = 100
print(reduce(operator.add, 0, map(function(x) return x^2 end, range(n)))) -- first
print(reduce(operator.add, 0, map(function(x) return x^2 end, range(n)))) -- second

The first call does indeed compile down to a tight loop of around 10 instructions. However, the second call compiles down to around 62 instructions - even though it is exactly the same code! The difference is that the second call is compiling to a side trace within the reduce() subroutine and on each iteration it re-enters the root trace and repeats all of the loop invariant code. So you end up with the first call compiling very tight while the subsequent ones are less efficient. There is a cumulative effect too: each call will create a new side-trace, and the side traces will be chained together, and each link in the chain will add around 9 more instructions.

Just wondering: are you already aware of this issue? If not then you may be able to solve it in a neat way using the tips from @corsix in LuaJIT/LuaJIT#208. If you are already aware of this I would be very interested in your view e.g. if you already have some recommendations that you make to users to explain the best usage and generally set expectations.

Sadly it seems like the simplest workaround would be to avoid programming in a functional style :-) because the problem only arises when many different pieces of code are all executing the same loop statement (for, while, etc) in the source code. If you would write for ... end every time then you have no problem because each for bytecode will have a separate trace complex.

@lukego
Copy link
Author

lukego commented Dec 31, 2016

(One "magical" solution to give each call a separate root trace might be for reduce() to internally use the debug API to inspect the stack and determine where it is being called from. Then you could use the specialize() function from LuaJIT/LuaJIT#208 to make a table providing each call site with a separate copy of the reduce subroutine. This is a kludge but it might work and I cannot immediately think of an elegant alternative.)

@lukego
Copy link
Author

lukego commented Jan 1, 2017

Could not resist continuing this line of investigation a little bit. I made an empirical test case that repeatedly (100x) runs this same snippet of code:

reduce(operator.add, 0, map(function(x) return x^2 end, range(10e6)))

... and uses the x86 Performance Monitoring Unit to measure how many instructions are executed per iteration. On each of the 100 runs the snippet is loaded with loadstring() to simulate being 100 separate calls in a large program (rather than the same call being made 100 times.) This is the bad case I am interested in where each call in the source is compiled to a separate side-trace without a dedicated loop.

Code and results for 100 iterations is here: https://gist.github.com/lukego/5c64368bdfdcf7ffcc4dc067fc630182

Here is the summary as I see it:

  • First call takes 9 instructions. Trace with a loop.
  • Second call takes 63 instructions. Side-trace without a loop.
  • Calls 3..92 each take 3 more instructions that the previous one. Side-traces are forming a chain. Ends up with 333 instructions for trace 92.
  • Calls 93..100 each take 865 instructions. The JIT has given up at this point and fallen back to the interpreter.

Quite an undertaking to understand and optimize code with a tracing JIT!

Solution will be to find a way to give each call a separate root trace. Then they should each take only 9 instructions like the first one.

lukego added a commit to lukego/luafun that referenced this issue Jan 1, 2017
Proof-of-concept solution to issue luafun#32. Compiles the root trace for
the foldl-loop separately for each calling function. This reduces
excessive side tracing. For the specific test case from luafun#32 it
resolves the problem perfectly and makes sure that each of the 100
traces has a nine-instruction loop.

This solution is not perfect though: if a single function makes
multiple calls to foldl then they will be compiled together. Could be
nicer to find a way to compile each call-site separately. Just have to
compute a better key to the foldl_impl table. For now this should be
sufficient for evaluating whether the optimization is beneficial in
practice.

This change uses one very recent LuaJIT feature, the 'proto' field of
debug information, so you need at least
LuaJIT/LuaJIT@fb61f7c.
@kyukhin kyukhin added this to the wishlist milestone Oct 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants