
Generating Jacobians as Tensors #80

Open
optstat opened this issue Oct 9, 2022 · 12 comments

Comments

@optstat

optstat commented Oct 9, 2022

There are use cases, e.g. in optimal control, where the trajectories and associated gradients and Jacobians need to be calculated in parallel. A simple example would be generating optimal trajectories in phase space using some statistical objective for uncertain initial conditions.

Other AD frameworks such as Google JAX do a poor job of calculating Jacobians explicitly, but a very good job of vectorizing the calculations in parallel on a GPU.

Is there a way to generate tensors (e.g. PyTorch tensors) as the output of CppAD instead of Jacobian matrices? This would allow for the efficient calculation of explicit Jacobians on a GPU without much effort. I don't think this would be a huge change. I am not asking you to do this, but if you could point out which files need to be modified I can tackle it.

@bradbell
Collaborator

I am not sure what you are asking.
Is this question about calculating using CppAD or using CppADCodeGen?
Are you asking how derivatives can be calculated in parallel?
When you ask about tensors, do you mean derivatives that have more than two dimensions (a matrix has two dimensions)?

@optstat
Author

optstat commented Oct 11, 2022

Sorry I didn't explain very well.

Suppose we have a function y = f(x) where y \in R^m and x \in R^n. The Jacobian, or sensitivity matrix, is then in R^{m x n}. So far so good.

Now suppose we want to evaluate the function for N independent samples. We can then write x \in R^{N x n} and y \in R^{N x m} for the same function f. In MATLAB parlance this is y[i,:] = f(x[i,:]). In other words, we expand this into a batch dimension, so a given batch index in y depends only on the exact same batch index in x. I call this construct a tensor.

Now further suppose this function can be executed in a data-parallel fashion, and moreover the function is AD compatible. This is faster than executing the function in a thread-parallel fashion because of the typically wide SIMD capabilities of modern CPUs and GPUs. Currently in CppADCodeGen, to calculate the Jacobian in parallel we have to run it in its own memory space, e.g. using OpenMP or TBB. If the generated Jacobian and all the associated intermediate elements could instead be emitted as tensors during code generation, that is, the Jacobian is now in R^{N x m x n}, then we could use an off-the-shelf framework such as PyTorch to calculate the Jacobian in a data-parallel manner on a GPU. That would greatly speed up the computation.

I hope this makes sense.

@bradbell
Collaborator

As I understand it, you are asking for the derivative computation to be vectorized?

I think this may be possible in CppAD by defining your own Base type, much the way CppADCodeGen does.

This Base type would implement element-wise operations on vectors: e.g.,
z = x * y
would implement z[j] = x[j] * y[j]. Then you could vectorize the derivative computations.

This would not be an easy task. You can find out about defining your own Base type at
https://coin-or.github.io/CppAD/doc/base_require.htm

There is one operation there that would not fit in, namely the Integer function. I think this is mainly used by VecAD operations
https://coin-or.github.io/CppAD/doc/vecad.htm

@optstat
Author

optstat commented Oct 11, 2022

Thank you. I think this is definitely a good approach. I will take a look at the documentation you pointed out and see how easy it is to implement.

@bradbell
Collaborator

Start small, just do a minimal extension that only implements a few operations like +, -, *, /.
Make all the other operations generate an assertion.

@joaoleal
Owner

CppADCodeGen can already be used for parallel executions but it has to be done manually using different threads. This will likely produce faster executions than using CppAD directly with vector operations. A single thread using CppADCodeGen can already produce results that can be more than 10 times faster than regular CppAD.

If you would like to benefit from running on different devices (e.g., GPUs), I would recommend choosing a technology (such as OpenCL or CUDA) and modifying the LanguageC source generator so that it supports this.
Perhaps you can generate the C code, identify what needs to be changed, and then update LanguageC to generate this.

@bradbell
Collaborator

I agree that the best speeds would require converting to compiled code using CppADCodeGen or CppAD JIT
https://coin-or.github.io/CppAD/doc/example_jit.htm

The CppAD JIT does not support vector base types (at this time) so you would still have to do as suggested above.

@optstat
Author

optstat commented Nov 22, 2022

Sorry for my late response. I had to go brush up on my computer architecture. I did a lot of numerical experiments. Here is what I discovered. Compilers will not vectorize functions with many statements. They will skip them to save optimization time. However if you split each generated statement into its own function then the statements can be very easily vectorized.
For example, something like this would vectorize over M samples:

void calc_jac_elem(std::vector<double>& x, int M, std::vector<double>& v, std::vector<double>& jac) {
    for (int i = 0; i < M; i++)
        jac[i * 7] =
            v[11623*i+612] + v[11623*i+591] * v[11623*i+594] + v[11623*i+618] * v[11623*i+587] + v[11623*i+592] * v[11623*i+582] + v[11623*i+571] * v[11623*i+574] + v[11623*i+564] * v[11623*i+567];
}

I think this would be a lot of work to do; however, if you point me to the right place, I can attempt this for a small problem. I believe this will result in a tremendous speedup for some interesting problems in control. The bottom line is that every generated statement needs to be wrapped in a function for this to help. It cannot be inlined.

@joaoleal
Owner

joaoleal commented Nov 26, 2022

There are several avenues that you can take.

If what you need is to compute derivative information for several independent variable vectors then the best would be to parallelize at the highest level possible to avoid the overhead of handling/synchronizing several threads.

If you would like to parallelize the evaluation of derivative information for a single independent variable vector then you can try to use the existing support. See https://github.com/joaoleal/CppADCodeGen/tree/master/test/cppad/cg/model/threadpool
You can choose between using OpenMP or a custom threadpool library. From my experience this is only worth it if your model is very complex.
The parallelization is done at the level of execution of each forward or reverse pass through the model.

To be able to create loops with CppADCodeGen, you can try to use the pattern/loop detection options but this does not work with openMP/threadpool.
This also requires each equation to have the exact same pattern. For instance:

for (i = 0; i < n; ++i)
   y[i+1] = y[i] + alpha

would not work.
See https://github.com/joaoleal/CppADCodeGen/blob/master/example/patterns.cpp

You can also try to place the common code in an atomic function and then call it multiple times (it will reduce the compilation time but I am not sure what will happen to the execution time).
See the wiki: https://github.com/joaoleal/CppADCodeGen/wiki/AtomicFunctions
See also https://github.com/joaoleal/CppADCodeGen/blob/master/test/cppad/cg/model/dynamiclib/dynamic_atomic.cpp
There is also an example here (but the atomic function is not compiled):
https://github.com/joaoleal/CppADCodeGen/blob/master/example/atomic.cpp

You can also try to mess around with the compiler options (e.g., use -O3) with compiler.setCompileFlags.

You could also try to combine multiple of these approaches.

@bradbell
Collaborator

However if you split each generated statement into its own function then the statements can be very easily vectorized.
For example something like this would vectorize over M samples
void calc_jac_elem(std::vector<double>& x, int M, std::vector<double>& v, std::vector<double>& jac) { for (int i = 0; i < M; i++) jac[i * 7] = v[11623*i+612] + v[11623*i+591] * v[11623*i+594] + v[11623*i+618] * v[11623*i+587] + v[11623*i+592] * v[11623*i+582] + v[11623*i+571] * v[11623*i+574] + v[11623*i+564] * v[11623*i+567]; }

I think that the BLAS linear algebra package does this for you
https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms

@joaoleal
Owner

Also consider using Eigen for BLAS operations: https://eigen.tuxfamily.org/index.php?title=Main_Page

@bradbell
Collaborator

@optstat see
coin-or/CppAD#185
