Best method to compute gradients, Jacobians and Hessians #25933
I work on solving nonlinear optimization problems. The state of the art method to compute derivatives is automatic differentiation, which is provided by libraries like JAX, Aesara, CasADi, etc. Symbolic expressions can be converted into functions via lambdify. Within lambdify, options can be passed so that the generated function is evaluated using JAX or Aesara. From the point of view of computational efficiency, which of these approaches is better?

A. Differentiate symbolically with SymPy and lambdify the resulting derivative expressions.
   Remark: symbolic differentiation is much slower than automatic differentiation, so in principle I expect this to be the slowest.
B. Lambdify the expression with the JAX (or Aesara) backend and apply that library's automatic differentiation to the generated function.
C. Convert the expression into CasADi's compute graph and use CasADi's automatic differentiation.
Secondly, in which situations can I benefit from using SymPy over the above mentioned approaches?
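For concreteness, here is a minimal sketch of how I understand A and B for a scalar function (the expression and the `modules='jax'` option are just illustrative):

```python
# Minimal sketch contrasting approach A and approach B for f(x) = sin(x**2).
import jax
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x**2)

# A: differentiate symbolically first, then generate a numerical function.
dfdx_sym = sp.lambdify(x, f.diff(x), modules='jax')

# B: generate the numerical function first, then let JAX differentiate it.
dfdx_ad = jax.grad(sp.lambdify(x, f, modules='jax'))

print(dfdx_sym(1.0), dfdx_ad(1.0))  # both give 2*cos(1) ~= 1.0806
```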
---
If you are concerned with the performance of numerical evaluation, then your A will most likely provide the fastest result. If you are concerned with the performance of generating the numerical functions, then A may well be slower than B and C.

Automatic differentiation is only "state of the art" in the sense that it allows you to differentiate more arbitrary numerical code. But there is a performance cost for using these more generalized methods. I show in this recent blog post a solution to a nonlinear optimization problem where a careful implementation that follows your "A" gives very high performance numerical evaluation: https://mechmotum.github.io/blog/czi-sympy-wrapup.html. I doubt any of the alternative methods you mention above can beat it.

opty solves optimal control problems using your method A, and we have introduced a very fast symbolic differentiation algorithm there. pycollo also solves optimal control problems but does what you describe in "C" and relies on CasADi's compute graph; however, CasADi currently can't manage such a large compute graph, so SymPy is the only option.

The bottom line is that you have to define your performance metrics very precisely and then try A, B, and C for specific problems. opty supports JAX and pycollo supports CasADi, so you can do some comparisons easily with those two tools. For small expression trees there may be little difference, but for large ones my bet is on method A for the highest numerical evaluation performance that retains maximum accuracy.
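For reference, method A in its simplest form looks something like the sketch below (the expressions are placeholders, not the opty or pycollo implementations):

```python
# Minimal sketch of method A: symbolic Jacobian/Hessian, then code generation.
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
F = sp.Matrix([x1**2 + sp.sin(x2), sp.exp(x1)*x2])  # example vector function

J = F.jacobian([x1, x2])        # symbolic Jacobian, computed once
H = sp.hessian(F[0], [x1, x2])  # symbolic Hessian of the first component

# Generate numerical functions from the symbolic derivatives.
J_num = sp.lambdify((x1, x2), J, modules='numpy')
H_num = sp.lambdify((x1, x2), H, modules='numpy')

print(J_num(1.0, 2.0))
print(H_num(1.0, 2.0))
```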
---
The problem with lambdify is that it "compiles" to Python code, and Python code is not fast for this kind of floating point evaluation. (Actually, with PyPy this sort of thing is a lot faster, so that might be worth trying.) You either want to generate C code as @moorepants says, or you should use something that compiles to machine code in memory like numba or llvmlite. You might want to try SymPy's llvmjitcode module. First:

```python
In [1]: import numpy as np
   ...: from sympy import sin, symbols
   ...: from sympy.printing.llvmjitcode import llvm_callable as lambdify_llvm
   ...: x = symbols('x')
In [2]: N = 200
In [3]: # sympy: build the nested expression sin(sin(...sin(x)...)), N deep
   ...: expr = x
   ...: for i in range(N):
   ...:     expr = sin(expr)
   ...:
In [4]: %time ed = expr.diff(x)
CPU times: user 8.05 s, sys: 2.17 s, total: 10.2 s
Wall time: 10.3 s
In [5]: f = lambdify_llvm([x], ed)
In [6]: %time f(1)
CPU times: user 1.34 ms, sys: 225 µs, total: 1.56 ms
Wall time: 1.58 ms
Out[6]: 0.0013500947626989782
In [7]: %time f(1)
CPU times: user 1.54 ms, sys: 0 ns, total: 1.54 ms
Wall time: 1.56 ms
Out[7]: 0.0013500947626989782
In [9]: %timeit f(1)
425 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [10]: vals = list(map(float, np.linspace(1, 10, 100)))  # evaluation points, matching the sessions below
In [11]: %timeit [f(v) for v in vals]
43.1 ms ± 71.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Also here are timings for protosym using llvmlite:

```python
In [1]: import numpy as np
   ...: from protosym.simplecas import sin, x, lambdify
In [2]: N = 200
In [3]: expr = x
   ...: for i in range(N):
   ...:     expr = sin(expr)
   ...:
In [4]: %time ed = expr.diff(x)
CPU times: user 24.1 ms, sys: 15 µs, total: 24.1 ms
Wall time: 22.6 ms
In [6]: %time f = lambdify([x], ed) # uses llvm
CPU times: user 71.9 ms, sys: 7.44 ms, total: 79.3 ms
Wall time: 77.4 ms
In [7]: vals = list(map(float, np.linspace(1, 10, 100)))
In [8]: %time f(1)
CPU times: user 65 µs, sys: 9 µs, total: 74 µs
Wall time: 89.6 µs
Out[8]: 0.0013500947626989782
In [9]: %timeit [f(v) for v in vals]
593 µs ± 753 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

Also SymEngine plus llvmlite:

```python
In [10]: from symengine import sin, lambdify, symbols
In [11]: x = symbols('x')
In [12]: expr = x
    ...: for i in range(N):
    ...:     expr = sin(expr)
    ...:
In [13]: %time ed = expr.diff(x)
CPU times: user 13.4 ms, sys: 23 µs, total: 13.4 ms
Wall time: 16.2 ms
In [14]: %time f = lambdify([x], [ed]) # uses llvm
CPU times: user 67.7 ms, sys: 38.2 ms, total: 106 ms
Wall time: 337 ms
In [15]: %time f(1)
CPU times: user 203 µs, sys: 16 µs, total: 219 µs
Wall time: 234 µs
Out[15]: array(0.00135009)
In [16]: %timeit [f(v) for v in vals]
1.71 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
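For completeness, the numba route mentioned at the start works roughly like this (a minimal sketch, not from the sessions above; it assumes numba is installed and uses a small stand-in expression so JIT compilation stays quick):

```python
# Sketch of the numba route: lambdify to a plain NumPy function, then
# JIT-compile that function to machine code. Assumes numba is installed.
import numba
import sympy as sp

x = sp.symbols('x')
ed = sp.sin(sp.sin(sp.sin(x))).diff(x)       # small stand-in expression

f_py = sp.lambdify(x, ed, modules='numpy')   # "compiles" to Python/NumPy
f = numba.njit(f_py)                         # compiles that to machine code

print(f(1.0))  # first call triggers JIT compilation; later calls are fast
```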