Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loss of performance in Crocoddyl parallelization #1182

Open
TheoMF opened this issue Nov 14, 2023 · 4 comments
Open

Loss of performance in Crocoddyl parallelization #1182

TheoMF opened this issue Nov 14, 2023 · 4 comments

Comments

@TheoMF
Copy link

TheoMF commented Nov 14, 2023

Crocoddyl parallelization seems to face issues.
When measuring execution time while parallelizing the function calcdiff of the class IntegratedActionModelEuler on the robot Talos, we can observe a decrease in performance. On the repository https://gitlab.laas.fr/gsaurel/croco-benchs, we have plots of it that shows the results on multiple computers :

This is what the execution time looks like on multiple machines as we increase the number of cores. Now if we plot the execution time multiplied by the number of cores, if the parallelization was successfull, we would have something constant when using more cores. Actually, the more we use cores, the higher the multiplication is :

The increase of this multiplication shows the loss of performance.

I am writing this issue in the prospect of solving this problem. I am currently investigating the causes of it.
I think I have found one of the cause of this issue : the function calcdiff in the class CostModelSum. I can open a PR to discuss about it. If we agree, I could open a PR to discuss this first step with more details.
I'm also looking at the calcdiff in the class IntegratedActionModelEuler, there may be a problem of aliasing that also decrease the performance.

@TheoMF TheoMF added the bug Something isn't working label Nov 14, 2023
@cmastalli
Copy link
Member

Hi @TheoMF, thanks for sharing these plots!

At the moment, I can't claim that there is an issue. For instance, I could explain the computation increase in the second plot with thread synchronization. Could you provide more explanatory details on why you believe there is an issue? I am also keen to hear more about what your findings are regarding the CostModelSum.

On another note, improvements in the CostModelSum might be OK. However, I am keen to know more, especially about other potential efforts. It is worth mentioning that we (internally) have already substantial improvements in the Crocoddyl API for both: frontends and backends.

As part of our research work, we are extending Crocoddyl features, which I can't talk about now. However, these new features required us to revisit Crocoddyl API. In short, some classes will be deprecated when realising this code.

@cmastalli cmastalli added efficiency and removed bug Something isn't working labels Nov 14, 2023
@cmastalli
Copy link
Member

Btw, CostModelSum shares memory across the differential action model. This might be an interesting remark here.

@nmansard
Copy link
Collaborator

We clearly see that the thread parallelization does not allow to linearly divide the computational cost of calcdiff. There is no sync between the tasks, so no reason for that beyond a minimal loss. Here, it is much more significant.

First, we would like to agree on the loss. If it is not evident, we can provide more analysis.

Second, we would like to fix it. @TheoMF obtained better results with replacing the map of CostSum by a vector. Once we agree on the initial observation, we can share the results of improvement.

@cmastalli
Copy link
Member

We clearly see that the thread parallelization does not allow to linearly divide the computational cost of calcdiff. There is no sync between the tasks, so no reason for that beyond a minimal loss. Here, it is much more significant.

First, we would like to agree on the loss. If it is not evident, we can provide more analysis.

I understand this, but my question is the follows: have you observed the same increase in the computation cost per thread when the number of OC nodes is increased?

If I understand correctly these results, you have a bench with 100 nodes, right? Could it be possible to analyse this for a different number of nodes?

It would be also useful to do it for different systems: quadrotors, manipulators and quadrupeds. For instance, the parallelization in quadrotors (or low-dimensional systems) is much more expensive than the non-parallelization case.

In short, a few extra results will show a better picture of what is happening, I believe.

Second, we would like to fix it. @TheoMF obtained better results with replacing the map of CostSum by a vector. Once we agree on the initial observation, we can share the results of improvement.

I am with you on looking to fix this issue if exists. When replacing the map with a vector internally, are you planning to have an internal mechanism to identify cost functions? That would be important for having the flexibility to specialise references or activate/deactivate costs. This is something quite used in our MPC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants