
ParallelSGD does not work as a drop-in replacement for SGD and it is not compatible with the rest of the code (does not compile) #2731

Closed
zsogitbe opened this issue Nov 23, 2020 · 5 comments

Comments

@zsogitbe

Issue description

ParallelSGD does not work as a drop-in replacement for SGD and it is not compatible with the rest of the code (it does not compile). SGD uses only 1 thread and is therefore slow. It would be interesting to have faster optimization if ParallelSGD worked. It seems to me that ParallelSGD is not fully worked out. What are the reasons for this? Is it not a good algorithm? If the ParallelSGD algorithm is not good, would it be possible to speed up the SGD optimizer with parallel processing in some way?

Steps to reproduce

Take, for example, a simple neural network example that uses SGD and replace the SGD optimizer with ParallelSGD (this needs a modified function...).

Expected behavior

The code should compile well and the optimization should be much faster.

Actual behavior

The code does not compile: there are several errors about the wrong number of parameters in several places (e.g. in the Evaluate function).

@rcurtin
Member

rcurtin commented Dec 1, 2020

This looks like an issue that should be opened in the ensmallen repository. Nonetheless, the documentation for ParallelSGD points out that the API required of a function being optimized with ParallelSGD is slightly different from that for regular separable differentiable functions, and it suggests changes you can make to your function's implementation that should allow the use of ParallelSGD.
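For reference, here is a rough sketch of the general shape of that API, as I understand the sparse separable differentiable function requirements; this is my own illustration rather than a copy of the docs, so the exact signatures there may differ slightly.

```cpp
// Rough sketch of the function API shape that ParallelSGD expects, based on
// my reading of the ensmallen documentation for sparse separable
// differentiable functions.  Double-check the docs for exact signatures.
#include <armadillo>

class ExampleSparseSeparableFunction
{
 public:
  // Number of separable objective terms f_i(x).
  size_t NumFunctions();

  // Evaluate f_begin(x) + ... + f_{begin + batchSize - 1}(x).
  double Evaluate(const arma::mat& coordinates,
                  const size_t begin,
                  const size_t batchSize);

  // Store the gradient of that same batch into `gradient`.  ParallelSGD
  // benefits when this gradient is sparse (e.g. arma::sp_mat).
  template<typename GradType>
  void Gradient(const arma::mat& coordinates,
                const size_t begin,
                GradType& gradient,
                const size_t batchSize);
};
```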

The Hogwild! algorithm is most performant when the gradients of the objective function are sparse. That may not be the case for a neural network in general, but it can work well for, e.g., a high-dimensional sparse logistic regression problem. Check out the paper on the algorithm for more details.

Please include a minimal reproducible example with bug reports (so that we can compile it directly), as well as the exact errors that are being encountered. Although you provided some directions on how to reproduce the issue, it would take a while to write the code and debug it, and there would be no guarantee that we would even see the same issue you are reporting. 👍

@zsogitbe
Author

zsogitbe commented Dec 1, 2020

Thank you for your answer Ryan!

Would you recommend ParallelSGD for recurrent neural networks?

Please find an example project attached. This is a slightly modified former version of the RNN electricity consumption example. I have dropped in ParallelSGD instead of SGD. Things I have removed are marked with '//@-psgd' and things I have added are marked with '//@+psgd'. Only a very few things have changed. It would be interesting to see whether it compiles and whether it works.
LSTMTimeSeriesUnivariatePSGD.zip

@rcurtin
Member

rcurtin commented Dec 1, 2020

I wouldn't; I don't expect an RNN (unless very specifically constructed) to have sparse gradients. There may be other parallel SGD variants that could work for dense data and gradients, but honestly I think the best level of single-node parallelism for neural networks is not at the optimizer level but at the linear algebra level. So if you are already using OpenBLAS, the large linear algebra operations should already be using multiple cores.
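(For what it's worth, a quick way to confirm that BLAS-level parallelism is active is to time a large dense multiply while varying the thread count; the snippet below is just an illustration of mine, assuming Armadillo is linked against a multi-threaded OpenBLAS, whose thread count is controlled by the OPENBLAS_NUM_THREADS or OMP_NUM_THREADS environment variables.)

```cpp
// Illustrative check that multi-threaded BLAS is in use: run once with
// OPENBLAS_NUM_THREADS=1 and once with OPENBLAS_NUM_THREADS=4 and compare.
#include <armadillo>
#include <iostream>

int main()
{
  const size_t n = 3000;
  arma::mat a(n, n, arma::fill::randu);
  arma::mat b(n, n, arma::fill::randu);

  arma::wall_clock timer;
  timer.tic();
  arma::mat c = a * b;  // This GEMM is where the OpenBLAS threads get used.
  std::cout << "Multiply took " << timer.toc() << "s." << std::endl;

  return 0;
}
```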

Let me try out the example code you sent...

@rcurtin
Member

rcurtin commented Dec 1, 2020

Ok, I see what's going on here. The issue isn't actually anything with ParallelSGD; it's that the RNN class does not implement the sparse separable differentiable function API required by ParallelSGD. As I mentioned earlier, it's not likely that an RNN will be able to make effective use of the Hogwild algorithm because the gradients will, in general, be fully dense.

In order to fix this issue, the RNN::Evaluate() and RNN::Gradient() methods would need to be adapted such that they could take an arbitrary matrix type as input. However, that's quite an undertaking and given the reasons above I don't think it's worthwhile to do that right now.
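Just to give a rough idea of the kind of change meant here (this is purely a sketch, not the actual mlpack RNN signatures):

```cpp
// Purely illustrative: the general shape of templatizing the evaluation and
// gradient methods on the matrix and gradient types, so that e.g. a sparse
// gradient type could be used.  Not the real RNN::Evaluate()/Gradient().
template<typename MatType>
double Evaluate(const MatType& parameters,
                const size_t begin,
                const size_t batchSize);

template<typename MatType, typename GradType>
void Gradient(const MatType& parameters,
              const size_t begin,
              GradType& gradient,  // e.g. arma::mat or arma::sp_mat.
              const size_t batchSize);
```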

At the same time, it's worth pointing out that #290 is an issue that has been open for a long time with the intention of fully templatizing mlpack's algorithms to work with any matrix type. If/when that is done, RNNs will work with Hogwild, but like I mentioned, I don't think it would yield a noticeable speedup even if it did work. 👍

@zsogitbe
Author

zsogitbe commented Dec 1, 2020

OK! I understand. I will close this issue.

@zsogitbe zsogitbe closed this as completed Dec 1, 2020