
LLVMDoubleVisitor init is slow, added example as benchmark #1612

Open · wants to merge 2 commits into master
Conversation

@ichumuh (Contributor) commented Sep 11, 2019

Hi guys,

Similar to diff on expressions with common subexpressions, LLVM init is also very slow.
Since you managed to make diff super fast for my use case, I was wondering if the same would be possible for LLVM init?
The benchmark is an example, similar to diff_cache.
My output is

$ ./benchmarks/llvm_diff_cache 
Time for jacobian with common subexpressions (with cache) : 	     0.108437747
Time for llvm init with common subexpressions             : 	    12.805364401
Time for jacobian without common subexpressions (without cache) : 	     3.684057946
Time for llvm init without common subexpressions                : 	     2.156445590

Thanks!

@isuruf-bot commented Sep 11, 2019

Hi,

I've run clang-format and found that the code formatting is good.

Thanks for fixing the formatting.

@certik (Contributor) commented Sep 11, 2019

@ichumuh thanks for writing a test case. What is the difference in the LLVM init? Is the only difference that in the first case you give it a Jacobian that was computed using a cache, and in the second case one that was not? Or is there another difference?

Why would it be 6x slower? Any ideas?

@ichumuh (Contributor, Author) commented Sep 11, 2019

I'm only using the cache in the first test and no cache in the second because it's the faster way to calculate the Jacobian; it doesn't influence the LLVM init. My goal was to highlight the similarity to #1592.
There it was also the case that a lot of common subexpressions slowed diff down. Maybe it's something similar?

@certik (Contributor) commented Sep 11, 2019

What I don't understand is that just by switching to use a cache to compute the Jacobian it somehow makes the LLVM init slower. It seems the LLVM init sees exactly the same expression in both cases. Although in the first case, some subexpressions are reused (have higher reference count).

We just need to debug this to see what is happening.

@isuruf (Member) commented Sep 11, 2019

@certik, the first 2 timings are for one expression and the last 2 are for a different expression. The llvm timings shouldn't have had "with cache" or "without cache" in the description.

@isuruf (Member) commented Sep 11, 2019

@ichumuh, can you run callgrind on just the llvmvisitor call and post the functions that take the most time?

@ichumuh (Contributor, Author) commented Sep 11, 2019

So I'm kind of a noob with callgrind.
I did this in my code (I didn't get --toggle-collect=<function> to work):

[...]
    CALLGRIND_START_INSTRUMENTATION;
    v3.init(v, {
        J.get(0,0), 
        J.get(0,1),
        J.get(0,2),
        J.get(0,3),
        J.get(0,4),
        J.get(0,5),
        J.get(0,6),
        J.get(0,7),
        J.get(0,8),
        J.get(0,9)}, symbolic_cse, opt_level);
    CALLGRIND_STOP_INSTRUMENTATION;
[...]

and ran
valgrind --tool=callgrind --instr-atstart=no ./benchmarks/llvm_diff_cache
I didn't know how to interpret the output, but Google told me how to turn it into a png.
I guess this means that llvm::MCJIT::finalizeObject() takes the most time?
(attached: callgrind call graph rendered as a png)

@isuruf (Member) commented Sep 11, 2019

I usually use kcachegrind to interpret the results.

@isuruf (Member) commented Sep 11, 2019

Looking at the png, I don't think there's much we can do. You can try reducing the optimization level, but that would affect the runtime.

@bjodah (Contributor) commented Sep 11, 2019

We're currently using essentially "-O3" here:

.setOptLevel(llvm::CodeGenOpt::Level::Aggressive)

We should probably use our "opt_level" option to set this accordingly. Perhaps also expose opt_level in the Python bindings.

But as Isuru points out, it will have an effect on runtime. (If one really wanted, one could probably build a two-tier backend to do some latency hiding by compiling an llvm callback in one thread and using a LambdaDoubleVisitor in the main thread until the compilation is done.)

@certik (Contributor) commented Sep 11, 2019

Alternatively we can precompute the Lambda and save the optimized LLVM IR, or perhaps even the compiled object code, so that one can precompute it (that will take some time), but then the second time it can get loaded immediately.

@isuruf (Member) commented Sep 12, 2019

> Alternatively we can precompute the Lambda and save the optimized LLVM IR, or perhaps even the compiled object code, so that one can precompute it (that will take some time), but then the second time it can get loaded immediately.

This is already possible with the dumps and loads functions in LLVMVisitor.

@ichumuh (Contributor, Author) commented Sep 12, 2019

> You can try reducing the optimization level, but that would affect the runtime.

Reducing the optimization level does not significantly influence the runtime.
opt_level=2

$ ./benchmarks/llvm_diff_cache 
Time for jacobian with common subexpressions (with cache) : 	     0.110041993
Time for llvm init with common subexpressions (with cache) : 	    12.312735768

opt_level=1

$ ./benchmarks/llvm_diff_cache 
Time for jacobian with common subexpressions (with cache) : 	     0.109979281
Time for llvm init with common subexpressions (with cache) : 	    12.704904674

opt_level=0

$ ./benchmarks/llvm_diff_cache 
Time for jacobian with common subexpressions (with cache) : 	     0.112378036
Time for llvm init with common subexpressions (with cache) : 	    19.536180906

> This is already possible with the dumps and loads functions in LLVMVisitor.

That's what I'm currently doing. So it seems like there is not much that can be done. Thanks for looking into it. You can close the PR if you'd like.

@bjodah (Contributor) commented Sep 12, 2019

@ichumuh I'm just curious: if you change LLVMDoubleVisitor to LambdaDoubleVisitor, what do the timings look like then? (How many evaluations do you perform compared to init?)

(also: I wouldn't expect opt_level to help in its current state, I suspect we need to have it influence llvm::CodeGenOpt::Level::* as I alluded to above)

@ichumuh (Contributor, Author) commented Sep 12, 2019

@bjodah with LambdaRealDoubleVisitor I get

Time for jacobian with common subexpressions (with cache) : 	     0.107010267
Time for llvm init with common subexpressions (with cache) : 	     3.321364823
Time for jacobian without common subexpressions (without cache) : 	     3.547528868
Time for llvm init without common subexpressions (without cache) : 	     0.325697398

But the resulting function takes too much time to evaluate.
I did a comparison a while back if you are interested: SemRoCo/giskardpy#5 (comment)
Well, it's been two years; at least the jacobian times are much better now, I don't know about the rest.
I'm doing constraint-based motion control: I have some expressions describing motion, compute the Jacobian, and evaluate that function at a high frequency. Each function is evaluated at least a few hundred times.

@isuruf (Member) commented Sep 12, 2019

@ichumuh, can you try isuruf@ac7b9bb with the opt_level?

@ichumuh (Contributor, Author) commented Sep 13, 2019

(edit: case 2, "without common subexpressions (without cache)", now also uses llvm with the opt level)

$ ./benchmarks/llvm_diff_cache 
opt_level 0
Time for jacobian with common subexpressions (with cache) : 	     0.102599152
Time for llvm init with common subexpressions (with cache) : 	     4.445941176
Time for jacobian without common subexpressions (without cache) : 	     3.636012695
Time for llvm init without common subexpressions (without cache) : 	     1.287224548
opt_level 1
Time for jacobian with common subexpressions (with cache) : 	     0.085668137
Time for llvm init with common subexpressions (with cache) : 	    12.942262733
Time for jacobian without common subexpressions (without cache) : 	     3.657792644
Time for llvm init without common subexpressions (without cache) : 	     2.181998724
opt_level 2
Time for jacobian with common subexpressions (with cache) : 	     0.085927170
Time for llvm init with common subexpressions (with cache) : 	    12.988112252
Time for jacobian without common subexpressions (without cache) : 	     3.720069452
Time for llvm init without common subexpressions (without cache) : 	     2.217352695
opt_level 3
Time for jacobian with common subexpressions (with cache) : 	     0.095672296
Time for llvm init with common subexpressions (with cache) : 	    13.151314117
Time for jacobian without common subexpressions (without cache) : 	     3.726995016
Time for llvm init without common subexpressions (without cache) : 	     2.217865557

I was also trying to compare how much time call takes, but there doesn't seem to be an implementation for this case: I basically need one that returns a list and not a single double. I wonder why it works in symengine.py.

@bjodah (Contributor) commented Sep 13, 2019

There are two signatures for call; one takes two pointers, where the non-const pointer is the "output array" (you need to allocate it yourself, e.g. using malloc or by making a std::vector and passing its .data()).

@ichumuh (Contributor, Author) commented Sep 13, 2019

@bjodah Thanks.

$ ./benchmarks/llvm_diff_cache 
opt_level 0
Time for jacobian with common subexpressions: 	     0.108219366
Time for llvm init with common subexpressions: 	     4.474706773
Time for llvm call with common subexpressions: 	     0.000230163
Time for jacobian without common subexpressions: 	     3.747388515
Time for llvm init without common subexpressions: 	     1.289677737
Time for llvm call without common subexpressions: 	     0.000236636
opt_level 1
Time for jacobian with common subexpressions: 	     0.090473821
Time for llvm init with common subexpressions: 	    13.098013865
Time for llvm call with common subexpressions: 	     0.000173625
Time for jacobian without common subexpressions: 	     3.674321327
Time for llvm init without common subexpressions: 	     2.199979134
Time for llvm call without common subexpressions: 	     0.000197330
opt_level 2
Time for jacobian with common subexpressions: 	     0.087502084
Time for llvm init with common subexpressions: 	    13.013300478
Time for llvm call with common subexpressions: 	     0.000143014
Time for jacobian without common subexpressions: 	     3.663375264
Time for llvm init without common subexpressions: 	     2.195645444
Time for llvm call without common subexpressions: 	     0.000290689
opt_level 3
Time for jacobian with common subexpressions: 	     0.087973188
Time for llvm init with common subexpressions: 	    13.013416809
Time for llvm call with common subexpressions: 	     0.000143401
Time for jacobian without common subexpressions: 	     3.704199721
Time for llvm init without common subexpressions: 	     2.204054393
Time for llvm call without common subexpressions: 	     0.000178715

@ichumuh (Contributor, Author) commented Sep 13, 2019

with LambdaRealDoubleVisitor:

$ ./benchmarks/llvm_diff_cache 
Time for jacobian with common subexpressions: 	     0.112163930
Time for llvm init with common subexpressions: 	     3.560271043
Time for llvm call with common subexpressions: 	     0.002026663
Time for jacobian without common subexpressions: 	     3.660975644
Time for llvm init without common subexpressions: 	     0.336053675
Time for llvm call without common subexpressions: 	     0.000473094

@bjodah (Contributor) commented Sep 13, 2019

Great.

So it looks like opt_level=0 is a win for your use case here? (Am I reading your timings correctly?)

I see the timings for call are in the sub-millisecond range. You said you did "hundreds" of evaluations for each init? Perhaps time a loop of a few hundred calls then? (It would also make the comparison with the LambdaRealDoubleVisitor fairer.)

@ichumuh (Contributor, Author) commented Sep 13, 2019

$ ./benchmarks/llvm_diff_cache 
opt_level 0
Time for llvm init with common subexpressions: 	     4.452341642
Time for 500 llvm calls with common subexpressions: 	     0.056452646
Time for llvm init without common subexpressions: 	     1.291209508
Time for 500 llvm calls without common subexpressions: 	     0.078078626
opt_level 1
Time for llvm init with common subexpressions: 	    12.994182719
Time for 500 llvm calls with common subexpressions: 	     0.043232439
Time for llvm init without common subexpressions: 	     2.167738079
Time for 500 llvm calls without common subexpressions: 	     0.062677866
opt_level 2
Time for llvm init with common subexpressions: 	    12.942163120
Time for 500 llvm calls with common subexpressions: 	     0.043223322
Time for llvm init without common subexpressions: 	     2.175001043
Time for 500 llvm calls without common subexpressions: 	     0.061535840
opt_level 3
Time for llvm init with common subexpressions: 	    12.931606667
Time for 500 llvm calls with common subexpressions: 	     0.045553398
Time for llvm init without common subexpressions: 	     2.229260492
Time for 500 llvm calls without common subexpressions: 	     0.061938380
=================================
Time for lambda init with common subexpressions: 	     3.681221311
Time for 500 lambda calls with common subexpressions: 	     0.547503499
Time for lambda init without common subexpressions: 	     0.331109800
Time for 500 lambda calls without common subexpressions: 	     0.118502574

> So it looks like opt_level=0 is a win for your use case here? (am I reading your timings correctly?)

Yes! Exposing opt_level in the python wrappers would be great.

lkeegan added a commit to lkeegan/symengine that referenced this pull request Mar 21, 2021
lkeegan added a commit to lkeegan/symengine that referenced this pull request Mar 22, 2021
lkeegan added a commit to lkeegan/symengine that referenced this pull request Mar 22, 2021
  - move expressions to separate visitor_expressions.h header
  - add SYMENGINE_BENCHMARK_VISITORS macro to generate benchmarks for all expressions

add visitor_init benchmark to benchmark init() for
  - LambdaRealDoubleVisitor
  - LLVMDoubleVisitor
  - LLVMFloatVisitor

add more expressions
  - two large expressions based on symengine#1612
  - one expression copied from the llvm tests
lkeegan added a commit to lkeegan/symengine that referenced this pull request Mar 23, 2021
  - LambdaRealDoubleVisitor
  - LLVMDoubleVisitor
  - LLVMFloatVisitor

refactor visitor benchmarks
  - move expressions to visitor_expressions.h
  - add SYMENGINE_BENCHMARK_VISITORS macro to generate benchmarks for all expressions

add more expressions
  - two large expressions based on symengine#1612
  - one expression with many intrinsic functions copied from test_lambda_double.cpp
lkeegan added a commit to lkeegan/symengine that referenced this pull request Apr 11, 2021
5 participants