
LLVMDoubleVisitor init is slow, added example as benchmark #1612

Open · wants to merge 2 commits into master
Conversation

@ichumuh (Contributor) commented Sep 11, 2019

Hi guys,

Similar to diff on expressions with common subexpressions, LLVM init is also very slow.
Since you managed to make diff super fast for my use case, I was wondering if the same would be possible for LLVM init?
The benchmark is an example, similar to diff_cache.
My output is

$ ./benchmarks/llvm_diff_cache 
Time for jacobian with common subexpressions (with cache) : 	     0.108437747
Time for llvm init with common subexpressions             : 	    12.805364401
Time for jacobian without common subexpressions (without cache) : 	     3.684057946
Time for llvm init without common subexpressions                : 	     2.156445590

Thanks!

@isuruf-bot commented Sep 11, 2019

Hi,

I've run clang-format and found that the code formatting is good.

Thanks for fixing the formatting.

@certik (Contributor) commented Sep 11, 2019

@ichumuh thanks for writing a test case. What is the difference in the LLVM init? Is the only difference that in the first case you give it a Jacobian that was computed using a cache, and in the second case one that was not? Or is there another difference?

Why would it be 6x slower? Any ideas?

@ichumuh (Contributor, Author) commented Sep 11, 2019

I'm only using the cache in the first test and no cache in the second because it's the faster way to calculate the Jacobian; it doesn't influence the LLVM init. My goal was to highlight the similarity to #1592.
There it was also the case that a lot of common subexpressions slowed diff down. Maybe it's something similar?

@certik (Contributor) commented Sep 11, 2019

What I don't understand is that just by switching to use a cache to compute the Jacobian it somehow makes the LLVM init slower. It seems the LLVM init sees exactly the same expression in both cases. Although in the first case, some subexpressions are reused (have higher reference count).

We just need to debug this to see what is happening.

@isuruf (Member) commented Sep 11, 2019

@certik, the first 2 timings are for one expression and the last 2 are for a different expression. The llvm timings shouldn't have had "with cache" or "without cache" in the description.

@isuruf (Member) commented Sep 11, 2019

@ichumuh, can you run callgrind on just the llvmvisitor call and post the functions that take the most time?

@ichumuh (Contributor, Author) commented Sep 11, 2019

So I'm kind of a noob with callgrind.
I did this in my code (I didn't get --toggle-collect=<function> to work):

[...]
    CALLGRIND_START_INSTRUMENTATION;
    v3.init(v, {
        J.get(0,0), 
        J.get(0,1),
        J.get(0,2),
        J.get(0,3),
        J.get(0,4),
        J.get(0,5),
        J.get(0,6),
        J.get(0,7),
        J.get(0,8),
        J.get(0,9)}, symbolic_cse, opt_level);
    CALLGRIND_STOP_INSTRUMENTATION;
[...]

and ran
valgrind --tool=callgrind --instr-atstart=no ./benchmarks/llvm_diff_cache
I didn't know how to interpret the output, but Google told me how to turn it into a png.
I guess this means that llvm::MCJIT::finalizeObject() takes the most time?
(attached: callgrind call graph rendered as a png)

@isuruf (Member) commented Sep 11, 2019

I usually use kcachegrind to interpret the results.

@isuruf (Member) commented Sep 11, 2019

Looking at the png, I don't think there's much we can do. You can try reducing the optimization level, but that would affect the runtime.

@bjodah (Contributor) commented Sep 11, 2019

We're currently using essentially "-O3" here:

.setOptLevel(llvm::CodeGenOpt::Level::Aggressive)

We should probably use our "opt_level" option to set this accordingly. Perhaps also expose opt_level in the Python bindings.

But as Isuru points out, it will have an effect on runtime. (If one really wanted, one could probably build a two-tier backend to do some latency hiding by compiling an llvm callback in one thread and using a LambdaDoubleVisitor in the main thread until the compilation is done.)

@certik (Contributor) commented Sep 11, 2019

Alternatively we can precompute the Lambda and save the optimized LLVM IR, or perhaps even the compiled object code, so that one can precompute it (that will take some time), but then the second time it can get loaded immediately.

@isuruf (Member) commented Sep 12, 2019

> Alternatively we can precompute the Lambda and save the optimized LLVM IR, or perhaps even the compiled object code, so that one can precompute it (that will take some time), but then the second time it can get loaded immediately.

This is already possible with the dumps and loads functions in LLVMVisitor.

@ichumuh (Contributor, Author) commented Sep 12, 2019

> You can try reducing the optimization level, but that would affect the runtime.

Reducing the optimization level does not significantly influence the runtime.
opt_level=2

$ ./benchmarks/llvm_diff_cache 
Time for jacobian with common subexpressions (with cache) : 	     0.110041993
Time for llvm init with common subexpressions (with cache) : 	    12.312735768

opt_level=1

$ ./benchmarks/llvm_diff_cache 
Time for jacobian with common subexpressions (with cache) : 	     0.109979281
Time for llvm init with common subexpressions (with cache) : 	    12.704904674

opt_level=0

$ ./benchmarks/llvm_diff_cache 
Time for jacobian with common subexpressions (with cache) : 	     0.112378036
Time for llvm init with common subexpressions (with cache) : 	    19.536180906

> This is already possible with the dumps and loads functions in LLVMVisitor.

That's what I'm currently doing. So it seems like there is not much that can be done. Thanks for looking into it. You can close the PR if you'd like.

@bjodah (Contributor) commented Sep 12, 2019

@ichumuh I'm just curious: if you change LLVMDoubleVisitor to LambdaDoubleVisitor, what do the timings look like then? (How many evaluations do you perform compared to init?)

(also: I wouldn't expect opt_level to help in its current state, I suspect we need to have it influence llvm::CodeGenOpt::Level::* as I alluded to above)

@ichumuh (Contributor, Author) commented Sep 12, 2019

@bjodah with LambdaRealDoubleVisitor I get

Time for jacobian with common subexpressions (with cache) : 	     0.107010267
Time for llvm init with common subexpressions (with cache) : 	     3.321364823
Time for jacobian without common subexpressions (without cache) : 	     3.547528868
Time for llvm init without common subexpressions (without cache) : 	     0.325697398

But the resulting function takes too much time to evaluate.
I did a comparison a while back if you are interested: SemRoCo/giskardpy#5 (comment)
Well, it's been two years; at least the jacobian times are much better now, I don't know about the rest.
I'm doing constraint-based motion control: I have some expressions describing motion, compute the Jacobian, and evaluate that function at a high frequency. Each function is evaluated at least a few hundred times.

@isuruf (Member) commented Sep 12, 2019

@ichumuh, can you try isuruf@ac7b9bb with the opt_level?

@ichumuh (Contributor, Author) commented Sep 13, 2019

(edit: case 2, "without common subexpressions (without cache)", now also uses llvm with the opt level)

$ ./benchmarks/llvm_diff_cache 
opt_level 0
Time for jacobian with common subexpressions (with cache) : 	     0.102599152
Time for llvm init with common subexpressions (with cache) : 	     4.445941176
Time for jacobian without common subexpressions (without cache) : 	     3.636012695
Time for llvm init without common subexpressions (without cache) : 	     1.287224548
opt_level 1
Time for jacobian with common subexpressions (with cache) : 	     0.085668137
Time for llvm init with common subexpressions (with cache) : 	    12.942262733
Time for jacobian without common subexpressions (without cache) : 	     3.657792644
Time for llvm init without common subexpressions (without cache) : 	     2.181998724
opt_level 2
Time for jacobian with common subexpressions (with cache) : 	     0.085927170
Time for llvm init with common subexpressions (with cache) : 	    12.988112252
Time for jacobian without common subexpressions (without cache) : 	     3.720069452
Time for llvm init without common subexpressions (without cache) : 	     2.217352695
opt_level 3
Time for jacobian with common subexpressions (with cache) : 	     0.095672296
Time for llvm init with common subexpressions (with cache) : 	    13.151314117
Time for jacobian without common subexpressions (without cache) : 	     3.726995016
Time for llvm init without common subexpressions (without cache) : 	     2.217865557

I was also trying to compare how much time call takes, but there doesn't seem to be an implementation for this case: I basically need one that returns a list and not a single double. I wonder why it works in symengine.py.

@bjodah (Contributor) commented Sep 13, 2019

There are two signatures for call; one takes two pointers, where the non-const pointer is the "output array" (you need to allocate it yourself, e.g. using malloc or by making a std::vector and passing its .data()).

@ichumuh (Contributor, Author) commented Sep 13, 2019

@bjodah Thanks.

$ ./benchmarks/llvm_diff_cache 
opt_level 0
Time for jacobian with common subexpressions: 	     0.108219366
Time for llvm init with common subexpressions: 	     4.474706773
Time for llvm call with common subexpressions: 	     0.000230163
Time for jacobian without common subexpressions: 	     3.747388515
Time for llvm init without common subexpressions: 	     1.289677737
Time for llvm call without common subexpressions: 	     0.000236636
opt_level 1
Time for jacobian with common subexpressions: 	     0.090473821
Time for llvm init with common subexpressions: 	    13.098013865
Time for llvm call with common subexpressions: 	     0.000173625
Time for jacobian without common subexpressions: 	     3.674321327
Time for llvm init without common subexpressions: 	     2.199979134
Time for llvm call without common subexpressions: 	     0.000197330
opt_level 2
Time for jacobian with common subexpressions: 	     0.087502084
Time for llvm init with common subexpressions: 	    13.013300478
Time for llvm call with common subexpressions: 	     0.000143014
Time for jacobian without common subexpressions: 	     3.663375264
Time for llvm init without common subexpressions: 	     2.195645444
Time for llvm call without common subexpressions: 	     0.000290689
opt_level 3
Time for jacobian with common subexpressions: 	     0.087973188
Time for llvm init with common subexpressions: 	    13.013416809
Time for llvm call with common subexpressions: 	     0.000143401
Time for jacobian without common subexpressions: 	     3.704199721
Time for llvm init without common subexpressions: 	     2.204054393
Time for llvm call without common subexpressions: 	     0.000178715

@ichumuh (Contributor, Author) commented Sep 13, 2019

with LambdaRealDoubleVisitor:

$ ./benchmarks/llvm_diff_cache 
Time for jacobian with common subexpressions: 	     0.112163930
Time for llvm init with common subexpressions: 	     3.560271043
Time for llvm call with common subexpressions: 	     0.002026663
Time for jacobian without common subexpressions: 	     3.660975644
Time for llvm init without common subexpressions: 	     0.336053675
Time for llvm call without common subexpressions: 	     0.000473094

@bjodah (Contributor) commented Sep 13, 2019

Great.

So it looks like opt_level=0 is a win for your use case here? (Am I reading your timings correctly?)

I see the timings for call are in the sub-millisecond range. You said you did "hundreds" of evaluations for each init? Perhaps time a loop of a few hundred calls then? (It would also make the comparison with the LambdaRealDoubleVisitor fairer.)

@ichumuh (Contributor, Author) commented Sep 13, 2019

$ ./benchmarks/llvm_diff_cache 
opt_level 0
Time for llvm init with common subexpressions: 	     4.452341642
Time for 500 llvm calls with common subexpressions: 	     0.056452646
Time for llvm init without common subexpressions: 	     1.291209508
Time for 500 llvm calls without common subexpressions: 	     0.078078626
opt_level 1
Time for llvm init with common subexpressions: 	    12.994182719
Time for 500 llvm calls with common subexpressions: 	     0.043232439
Time for llvm init without common subexpressions: 	     2.167738079
Time for 500 llvm calls without common subexpressions: 	     0.062677866
opt_level 2
Time for llvm init with common subexpressions: 	    12.942163120
Time for 500 llvm calls with common subexpressions: 	     0.043223322
Time for llvm init without common subexpressions: 	     2.175001043
Time for 500 llvm calls without common subexpressions: 	     0.061535840
opt_level 3
Time for llvm init with common subexpressions: 	    12.931606667
Time for 500 llvm calls with common subexpressions: 	     0.045553398
Time for llvm init without common subexpressions: 	     2.229260492
Time for 500 llvm calls without common subexpressions: 	     0.061938380
=================================
Time for lambda init with common subexpressions: 	     3.681221311
Time for 500 lambda calls with common subexpressions: 	     0.547503499
Time for lambda init without common subexpressions: 	     0.331109800
Time for 500 lambda calls without common subexpressions: 	     0.118502574

> So it looks like opt_level=0 is a win for your use case here? (am I reading your timings correctly?)

Yes! Exposing opt_level in the python wrappers would be great.

lkeegan added a commit to lkeegan/symengine that referenced this pull request Mar 21, 2021
lkeegan added a commit to lkeegan/symengine that referenced this pull request Mar 22, 2021
lkeegan added a commit to lkeegan/symengine that referenced this pull request Mar 22, 2021
  - move expressions to separate visitor_expressions.h header
  - add SYMENGINE_BENCHMARK_VISITORS macro to generate benchmarks for all expressions

add visitor_init benchmark to benchmark init() for
  - LambdaRealDoubleVisitor
  - LLVMDoubleVisitor
  - LLVMFloatVisitor

add more expressions
  - two large expressions based on symengine#1612
  - one expression copied from the llvm tests
lkeegan added a commit to lkeegan/symengine that referenced this pull request Mar 23, 2021
  - LambdaRealDoubleVisitor
  - LLVMDoubleVisitor
  - LLVMFloatVisitor

refactor visitor benchmarks
  - move expressions to visitor_expressions.h
  - add SYMENGINE_BENCHMARK_VISITORS macro to generate benchmarks for all expressions

add more expressions
  - two large expressions based on symengine#1612
  - one expression with many intrinsic functions copied from test_lambda_double.cpp
lkeegan added a commit to lkeegan/symengine that referenced this pull request Apr 11, 2021
5 participants